Where Is the Collaborative Filtering?

Posted by Christopher Smith Thu, 31 Jan 2008 05:39:00 GMT

Rob Malda recently discussed why Digg, reddit, etc. all stink. He’s bang on the money, but it brings to mind the thing that has been driving me batty about these news sites: what’s going wrong with the collaborative filtering?

In theory, collaborative filtering algorithms should work like this: lots of people label different bits of a dataset based on their tastes. The collaborative filtering algorithm chews through all the labels in the dataset and then predicts how you would label other bits of the dataset, based on how the people whose labels most closely resemble yours have labeled them. When it comes to news, this should mean the engine selects news items based on what is interesting to other people who usually find the same news interesting as you do. My experience on reddit is that this somehow means that no matter how many Ron Paul articles I rate as totally uninteresting, it still seems to find new Ron Paul articles which reddit believes I will find completely fascinating.
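To make that description concrete, here is a minimal sketch of user-based collaborative filtering in Python. Everything in it is an assumption for illustration: the `votes` data, the user names, the +1/-1 label encoding, and the choice of cosine similarity. It is not how reddit or any other site actually works.

```python
from math import sqrt

# Hypothetical labels: +1 = interesting, -1 = uninteresting.
votes = {
    "me":    {"a": 1, "b": -1, "c": 1},
    "alice": {"a": 1, "b": -1, "c": 1, "d": 1, "e": -1},
    "bob":   {"a": -1, "b": 1, "c": -1, "d": -1, "e": 1},
}

def similarity(u, v):
    """Cosine similarity over the items both users have labeled."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(votes, user, item):
    """Similarity-weighted average of other users' labels for `item`."""
    num = den = 0.0
    for other, labels in votes.items():
        if other == user or item not in labels:
            continue
        w = similarity(votes[user], labels)
        num += w * labels[item]
        den += abs(w)
    return num / den if den else 0.0

# "alice" labels like me and "bob" labels the opposite way, so the
# prediction for me follows alice's labels on the items I have not seen.
print(predict(votes, "me", "d"), predict(votes, "me", "e"))
```

The point of the weighting is that a prediction depends on who voted, not just on how many votes an item received: someone who always disagrees with me pushes my predicted label the other way.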

Now, I’m well aware that creative people will find ways to game the system, but frankly, proper collaborative filtering should make it really hard to game the system so overwhelmingly. If you spam the system much, the other people most similar to you start labeling your spam the opposite of how you have, and very quickly your recommendations don’t impact them any more. The only way you can get back in the game is to create a new account and quickly label a whole bunch of data to get into a trusted position again. This gets increasingly difficult as a community matures, as there will be more and more labeled data for at least the older accounts. I’ve been on reddit for ages, faithfully labeling articles most every day, and the recommended page is still completely useless.
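A toy illustration of that decay, again in Python and again with made-up data: a hypothetical spammer account first mirrors my labels to look trustworthy, then starts voting up spam that I vote down. The `agreement` function is a simple stand-in I am assuming for whatever similarity measure a real system would use.

```python
def agreement(u, v):
    """Mean product of +1/-1 labels over shared items, in [-1, 1]."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    return sum(u[i] * v[i] for i in shared) / len(shared)

# Hypothetical history: the spammer starts out by copying my labels.
me = {"n1": 1, "n2": 1, "n3": -1, "n4": 1}
spammer = dict(me)

weights = [agreement(me, spammer)]  # full trust to begin with
for k in range(1, 5):
    spam = f"spam{k}"
    spammer[spam] = 1   # the spammer promotes their own item
    me[spam] = -1       # I label it uninteresting
    weights.append(agreement(me, spammer))

# Each disagreement lowers the weight the spammer's labels carry for me.
print([round(w, 2) for w in weights])
```

In this sketch, four spam submissions are enough to cancel out the four labels the account used to build trust, at which point its votes stop influencing my recommendations at all.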

I’ve seen comments from the folks behind reddit that describe their collaborative filtering algorithm as “not working well when there are wide divergences between segments of the community”, which makes me think it isn’t collaborative filtering at all, but rather a machine learning algorithm that is trying to predict an overall “interesting” score, with a karma system to boost the weights of redditors’ labels. When you read the FAQ, this is how it all seems to work.
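For contrast, here is a sketch of the kind of system I am guessing at: a single karma-weighted “interesting” score per item. This is purely my speculation rendered as Python. The `global_score` function, the karma values, and the user names are all hypothetical and not taken from reddit’s code or FAQ.

```python
def global_score(item_votes, karma):
    """Karma-weighted average of +1/-1 votes; identical for every reader."""
    total = sum(karma.get(user, 1.0) for user in item_votes)
    if total == 0:
        return 0.0
    weighted = sum(karma.get(user, 1.0) * vote
                   for user, vote in item_votes.items())
    return weighted / total

# Hypothetical karma: three high-karma fans and one low-karma skeptic.
karma = {"fan1": 5.0, "fan2": 5.0, "fan3": 5.0, "me": 1.0}
item_votes = {"fan1": 1, "fan2": 1, "fan3": 1, "me": -1}

score = global_score(item_votes, karma)
print(score)  # (15 - 1) / 16 = 0.875
```

Notice that nothing in the score depends on who is reading. My down vote shaves a little off the total, but the item is still ranked as highly interesting for me, which is exactly the behavior I keep seeing.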

So is that the deal? Is the term just horribly abused, and has nobody bothered to put together a proper collaborative filtering news site? Or is there some inherent problem with the concept that I’m missing?

15 Minutes of Fame and The Joel Effect

Posted by Christopher Smith Wed, 13 Sep 2006 16:45:00 GMT

Last night I experienced my fifteen minutes of ‘net fame. I had submitted my article rebutting one of Joel Spolsky’s comments on Ruby’s performance to both reddit and digg. I watched how my submission performed while it was on the “new” list of each, and it didn’t seem to garner much excitement. On reddit the first person to vote on it voted it down (my best guess is that they were sick of Joel articles), and on digg I got one digg and then dropped from view. Ah well, I already had strong evidence indicating that most of what I say isn’t of interest to anyone, so this was just further confirmation.

Just before I went to sleep, I discovered the performance problems I was having with Typo, so I tweaked things and watched the logs for a bit, and that was when I noticed that my article was getting a lot of hits, and they weren’t all coming from ‘bots (which is what my logs usually look like). Sure enough, some generous souls had voted me up on reddit (digg never bumped me up beyond the initial digg, which is an indictment of either reddit or digg, depending on your point of view ;-). In fact, I was in the top 4 on the hot list on both the main page and the programming subreddit for a brief while (actually, checking now, I’m still in the top 5 on the programming subreddit). As of this morning I’ve had about 800 page views of the article from the non-bot world, which, by comparison to the usual attention I get, makes me famous.

But that’s not the interesting part.

The interesting part is that when I looked at the rankings, particularly on the programming subreddit, it seemed that anti-Joel articles were all over the hot list. In just the top 5 there were my article, one titled Coding Horror: Has Spolsky Jumped the Shark?, one titled Why Joel Is Wrong to say that Duck-typed Languages Cannot Optimize Down To a Single Jump (interestingly the title on the web page makes no mention of Joel), and another by DHH titled Outsourcing the performance-intensive functions.

So, when I said last night that Joel had kicked up a lot of dust, I was perhaps understating it.

Joel has been criticised for running his blog as a big publicity and recruitment engine for his company (to which the rather logical retort would be: “how unusual that an entrepreneur would leverage whatever assets they have, including fame, to help their company!”). His postings almost always get bumped up high on reddit, and his articles seem to get linked from all over the net. So, there is a strong incentive for him to keep posting to his blog.

Observing reddit, though, you have to wonder if this produces an opposing force in the blogosphere: critical incentive. See, by taking on Joel, you get a fair bit of traffic (since I was the Johnny-come-lately to the party, I suspect some of the other posts have drawn significantly more traffic), you make reddit’s hot list, etc. At this point it’s just a theory, which I’ve dubbed “The Joel Effect”, but one can definitely observe that Joel bashing has become a bit of a sport in the programmer blogosphere.

Now, before anyone comes up with an Underpants Gnome business plan that looks like:

  1. Critique Joel Repeatedly
  2. ?
  3. Profit!

I would like to point out that out of all the page views I got, I got zero AdSense clicks. I checked, and the AdSense ads were highly relevant to programmers, so I think it’s fair to say that programmers filter out sponsored links significantly more so than your average ‘net visitor. I probably could have made more money writing about some totally unsubstantiated rumor about a celebrity, or by listing out the “ten tips for finding the best mortgage”. So, you’re going to have to fill in step 2 with something fairly creative in order to make it to step 3.