Thursday, September 27, 2007

Thoughts on KDD Cup 2007 Task 2

Originally posted to an earlier version of this blog on 30 April 2007.

The next several entries in the Netflix Data thread will examine questions about the timing of ratings. These will be important to anyone attempting Task 2 of the 2007 KDD Cup data mining competition.

This task is to predict the number of additional ratings that users from the Netflix Prize training dataset will give to the movies in that dataset during 2006. Ratings in the training data run from 11 November 1999 through 31 December 2005, so the task requires looking into the future. In other words, it requires a predictive model of some kind.

To build a predictive model, we divide the past into the distant past and the recent past. If we can find rules or patterns in a training set from the distant past that explain what occurred in a validation set from the recent past, there is some hope that applying those rules to the recent past will produce valid predictions about the future.
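
As a concrete illustration, here is a minimal sketch of such a split in Python. The tuple layout and the split date are hypothetical choices made for illustration, not anything prescribed by the task.

    from datetime import date

    # Assumed record layout: (user_id, movie_id, rating, rating_date),
    # with rating_date a datetime.date. The split date is illustrative.
    SPLIT_DATE = date(2005, 1, 1)

    def temporal_split(ratings):
        """Divide ratings into a 'distant past' training set and a
        'recent past' validation set around SPLIT_DATE."""
        train = [r for r in ratings if r[3] < SPLIT_DATE]
        valid = [r for r in ratings if r[3] >= SPLIT_DATE]
        return train, valid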

Creating a validation set for predictive modeling is not as simple as it first appears. It is not sufficient simply to partition the training data at some arbitrary date; we must also take into account all the known ways that the recent past differs from the future. In this particular case, it is important to note that:

  • Past ratings--including those from the recent past--are made by a mix of new and old raters. New raters appear throughout the five-year observation period. The 2006 ratings in the KDD Cup qualifying set, by contrast, will all be made by old raters, since the KDD Cup raters are a subset of the Netflix Contest raters. If new raters behave differently than old raters, the presence of ratings by new raters in the training and validation sets is problematic (a sketch after this list shows one way to filter them out).
  • Past ratings cover a mixture of old and new movies. New movies (and old movies newly released to DVD) appear throughout the five-year observation window. The KDD Cup qualifying set contains only old movies by definition, since the movies in the qualifying set are a subset of the movies in the Netflix Contest observation window. If new movies are rated more (or less) frequently than old movies, the presence of new movies in the training and validation sets is problematic.
  • Only active raters rate movies. The rating habits of individual subscribers change over time. The raters in the training data are all active raters in the sense that each has rated at least one movie; otherwise they would not have a row in the training set. Over time, some of these raters will get tired of rating movies, or switch to Blockbuster, or die, or move to Brazil. Understanding the rate at which active raters become inactive is central to the task.
  • As shown in an earlier post, the overall rating rate changes over time. In particular, it was declining at the end of 2005. If this can be explained by the changing mix of new and old raters and/or the changing mix of new and old movies, it is not a problem. If it is due to some exogenous effect that may or may not have continued during 2006, it is problematic.
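
To make these points concrete, here is a hedged sketch, continuing the tuple layout assumed earlier, of how a validation target might be built to resemble the qualifying set: only recent-past ratings made by old raters of old movies are counted. A simple per-month tally is included for eyeballing the overall rating rate. The helper names are mine, not part of the contest.

    from collections import Counter

    def qualifying_style_counts(train, valid):
        """Count recent-past ratings per movie, keeping only ratings
        made by raters, and of movies, already seen in the distant
        past, so the target resembles the KDD Cup qualifying set."""
        old_raters = {u for (u, m, r, d) in train}
        old_movies = {m for (u, m, r, d) in train}
        return Counter(m for (u, m, r, d) in valid
                       if u in old_raters and m in old_movies)

    def monthly_rating_rate(ratings):
        """Tally ratings per calendar month -- a quick diagnostic for
        the declining overall rating rate noted in the last point."""
        return sorted(Counter((d.year, d.month)
                              for (u, m, r, d) in ratings).items())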

The next few posts will examine these points.
