John Resig - The Netflix Prize

The Netflix Prize

I don’t think I could possibly be any more giddy about something than I am about The Netflix Prize.

In short: Netflix’s vote prediction algorithm has a deviation of 0.95 stars when predicting your vote for a movie. If you can do 10% better, they’ll give you $1 million.
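To make that target concrete, here is a minimal sketch in Python. It assumes the "deviation" is a root-mean-square error (RMSE) between predicted and actual star ratings. The function name `rmse` and the toy ratings are illustrative only and are not taken from the Netflix data.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Figure quoted in the post; a 10% improvement is the prize threshold.
baseline_deviation = 0.95
target_deviation = baseline_deviation * (1 - 0.10)

# Toy ratings (made up for illustration, not real Netflix votes).
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4, 4, 1, 5]
toy_score = rmse(predicted, actual)

print(f"target deviation: {target_deviation:.3f}")
print(f"toy RMSE: {toy_score:.3f}")
```

Under that assumption, a 10% improvement on 0.95 means getting the deviation down to roughly 0.855 stars.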

That’s awesome and all, but what’s really awesome is their amazing training dataset. This is every data miner’s wet dream: 100,000,000 votes, 17,000 movies, ~~250,000~~ 500,000 users.

They have two tests that you can run: One against your known data, and one that you’ll submit to Netflix. As far as I can tell, your standing (aka, your current deviation) is made public. The lower your number, the higher your rank. Every year that the algo isn’t improved by 10%, $50,000 is paid out to the current leader.

Another thing that I find to be interesting: Netflix gets the score that they do without assuming anything about the movie titles, genre, actors, etc. They just do straight number crunching. I’m impressed.
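As an illustration of what "straight number crunching" on ratings alone can look like, here is a rough Python sketch of a simple baseline predictor. It is not Netflix's algorithm. It assumes ratings arrive as `(user_id, movie_id, stars)` tuples on a 1 to 5 scale, and the names `fit_baseline` and `predict` are made up for this example.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """Fit global mean, per-movie offsets, and per-user offsets.

    ratings: list of (user_id, movie_id, stars) tuples (assumed format).
    """
    global_mean = sum(stars for _, _, stars in ratings) / len(ratings)

    # How far each movie's ratings sit above or below the global mean.
    movie_residuals = defaultdict(list)
    for _, movie, stars in ratings:
        movie_residuals[movie].append(stars - global_mean)
    movie_bias = {m: sum(r) / len(r) for m, r in movie_residuals.items()}

    # How far each user rates above or below what the movie offset predicts.
    user_residuals = defaultdict(list)
    for user, movie, stars in ratings:
        user_residuals[user].append(stars - global_mean - movie_bias[movie])
    user_bias = {u: sum(r) / len(r) for u, r in user_residuals.items()}

    return global_mean, movie_bias, user_bias


def predict(model, user, movie):
    """Predict a rating; unseen users or movies fall back to a zero offset."""
    global_mean, movie_bias, user_bias = model
    guess = global_mean + movie_bias.get(movie, 0.0) + user_bias.get(user, 0.0)
    return min(5.0, max(1.0, guess))  # clamp to the assumed 1-5 star scale


# Toy data (made up): no titles, genres, or actors, just votes.
toy_ratings = [
    ("ann", "m1", 5), ("ann", "m2", 3),
    ("bob", "m1", 4), ("bob", "m2", 2),
    ("cat", "m1", 5),
]
model = fit_baseline(toy_ratings)
print(round(predict(model, "cat", "m2"), 2))
```

The point of the sketch is only that user and movie IDs plus votes already carry usable signal: some movies are rated higher than average, and some users rate more generously than others.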

I’ve already got some techniques that I wanna try. I’ve got a feeling that I’m overly optimistic at this point, and that I’m going to be highly disappointed when I see my first score. But first, I have to generate my test bed and get to work. This is so cool.

I don’t know what it is with me and large, nicely formatted datasets, but I don’t think there’s anything that can get me more excited.

Posted: October 4th, 2006


13 Comments

Dr Nic (October 4, 2006 at 4:57 am)

Open source projects around the world will suffer as we all chase the elusive $50k per year prize :)
Rydal Williams (October 4, 2006 at 2:22 pm)

I feel you! I’ve already joined. Not only will I have the opportunity to try something cool, but I also get a humongous anonymous dataset that may become useful one of these days. The problem with movie ratings is not so much the algorithm as the psychological side of it; it’s really not easy to predict the mind.

Some of the Netflix Prize coders are taking some weird stuff into consideration, such as movie endings, but it does get complicated. For example, I may simply rate a movie 5 because it’s my favorite actor or I was in the right mood for the right movie. Now how do you calculate that? What happens when my mood changes? I’m probably taking it too far!
Doyle (October 4, 2006 at 2:42 pm)

It’s actually for 480,189 users. Even better!
Jeff Daly (October 4, 2006 at 4:06 pm)

Reading your post on the Netflix Prize describes exactly how I felt upon learning about the contest. It sounds like fun!

If you haven’t already heard of it, you may also like The Hutter Prize where the goal is to write a piece of data compression software that can “create a compressed version (self-extracting archive) of the 100MB file enwik8 [the first 100MB of Wikipedia] of less than 18MB.” Oh man, that’s the sort of thing I can’t wait to get home from work so I can play with.

The Hutter Prize is worth 50_000 € (which google says is 63_705 U.S. dollars).
Bob Aman (October 4, 2006 at 5:17 pm)

I was pretty excited about this too when I saw it last night. But I’m most excited about the data rather than the prize thing. Very cool.
Zack Gilbert (October 4, 2006 at 9:40 pm)

Best of luck, John. I anxiously await hearing of your winning the prize. ;)
Blair Allen Stark (October 7, 2006 at 10:27 am)

Have any of you actually looked at the data? With what is given it is impossible to make a reasonable prediction.

All you get is the title of 17,000 movies and a set of user rankings.

You don’t get any cast/crew/genre info.

How are you going to make any predictions???
John Resig (October 7, 2006 at 12:36 pm)

“With what is given it is impossible to make a reasonable prediction.”

Hardly. If you’ll notice in the README for this contest, Netflix doesn’t even use the title of the movie to make this analysis – only straight statistical analysis. This means that they probably do different types of clustering or density based analysis. That just means that their analysis gets that much better once they introduce other variables (such as actors or directors) – but we’re not competing at that level, which is comforting ;-)
SteveC (December 7, 2006 at 8:23 pm)

I would guess that including actors, directors, etc. is not that relevant to finding clusters. If users do segment their ratings by these characteristics, it will become apparent through clustering techniques without this information. That’s to say, if a cluster exists you can find it with just the ratings data; once you find it you can give it a name, e.g. “John Wayne movies” or “hand drawn animation”. The same goes for movie endings. There is little point in trying to guess what the genres are beforehand. I am sure you would not guess “gay male drama with an unrequited love and at least one death” as a genre, but trust me, it’s there.
Pancho Lopez (December 9, 2006 at 1:59 pm)

Has anyone thought about the theoretical limit of the problem?
It is plausible that the task is actually impossible given the characteristics of the data, due to the inherent difference in human criteria.
Proving that would be super-cool!
Pattern Excavator (July 2, 2007 at 9:15 am)

I am chronicling my adventures on the Netflix Prize. If you are interested visit http://patternexcavator.wordpress.com/
