Originally posted to a previous version of this blog on 30 April 2007.
As mentioned in a previous post, raters may behave differently when rating new movies than they do when rating old movies. The first step towards investigating that hypothesis is to come up with definitions for "old" and "new."
One possible definition for "new" is that the time between the release date and the rating date is less than some constant. There are several problems with implementing this definition in the Netflix data. One is that, according to the organizers, the release year may sometimes refer to the release of the movie in theaters and in other cases to the release of the movie on DVD. The distribution of release years makes it seem likely that it is usually the theatrical release. Another problem is that although we know rating dates to the day, we know only the year of release, so the elapsed time from release to rating is uncertain by up to a year.
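To make the year-only problem concrete, here is a minimal sketch of how the elapsed time could be bounded rather than measured. The function names and the 365-day threshold are illustrative assumptions, not anything from the Netflix data or the organizers; the sketch also assumes a rating does not precede the release.

```python
from datetime import date

def elapsed_days_bounds(release_year, rating_date):
    """Return (min_days, max_days) between release and rating.

    Only the release year is known, so the true release date lies
    somewhere between 1 January and 31 December of that year.
    """
    earliest_release = date(release_year, 1, 1)
    latest_release = date(release_year, 12, 31)
    max_days = (rating_date - earliest_release).days
    # Assumption: a rating does not precede the release, so the
    # lower bound is clamped at zero.
    min_days = max((rating_date - latest_release).days, 0)
    return min_days, max_days

def classify(release_year, rating_date, threshold_days=365):
    """Label a rating 'new', 'old', or 'uncertain' (threshold is hypothetical)."""
    lo, hi = elapsed_days_bounds(release_year, rating_date)
    if hi < threshold_days:
        return "new"        # new even if released on 1 January
    if lo >= threshold_days:
        return "old"        # old even if released on 31 December
    return "uncertain"      # the answer depends on the unknown release day
```

With a one-year threshold, a rating made on 1 March 2005 for a movie with release year 2004 comes out as "uncertain": the elapsed time could be anywhere from 60 to 425 days.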
Another possible definition for "new" is that the time between the rating date and the date the movie was first available for rating on the Netflix site is smaller than some constant. This has the advantage that we can measure it to the day, but suffers from the problem that it does not distinguish a movie that was recently in theaters from the DVD release of one that has been in the studio's back catalog for decades.
Whether a movie is new or old, rating behavior may be different when it first becomes available for rating. There may, for instance, be pent-up demand to express opinions about certain movies. A good approximation to when a movie first became available for rating is its earliest rating date. Of course, this only works for movies that became available for rating during the observation period. Any movies that were already available from Netflix before Armistice Day of 1999 will first appear in our data on or shortly after that date. These observations are interval censored. That is, we know only that they became available for rating sometime between 1 January of their release year and 11 November 1999.
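The earliest-rating approximation and the censoring rule can be sketched roughly as follows. This assumes the ratings are available as (movie_id, rating_date) pairs; the function names and the seven-day tolerance for "on or shortly after" are hypothetical choices made for illustration.

```python
from datetime import date

OBSERVATION_START = date(1999, 11, 11)  # start of the observation period
GRACE_DAYS = 7  # hypothetical tolerance for "on or shortly after"

def earliest_rating_dates(ratings):
    """Map movie_id -> earliest rating date from (movie_id, rating_date) pairs."""
    earliest = {}
    for movie_id, rating_date in ratings:
        if movie_id not in earliest or rating_date < earliest[movie_id]:
            earliest[movie_id] = rating_date
    return earliest

def availability_interval(release_year, first_rating):
    """Return (lower, upper, censored) bounds on the availability date."""
    if (first_rating - OBSERVATION_START).days <= GRACE_DAYS:
        # Censored: the movie became available sometime between
        # 1 January of its release year and the start of observation.
        return date(release_year, 1, 1), OBSERVATION_START, True
    # Not censored: the earliest rating date is used as the approximation.
    return first_rating, first_rating, False
```

For example, a movie with release year 1985 whose first observed rating falls on 11 November 1999 would get the interval from 1 January 1985 to 11 November 1999 and be flagged as censored.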
The following charts explore the effect of this censoring:
The first chart plots the number of movies having an earliest rating date on each day of the observation period. Only movies with a release year of 2000 or later are included so none of the data is censored. Although the chart is quite spiky, there is nothing special about the first few days, and the overall distribution is similar to the overall distribution of rating counts.
The second chart shows the distribution of earliest observed ratings for movies with release years prior to 2000. There are two interesting things to note. First, the large spike on the left represents movies that were already available for rating prior to the observation period. Second, although all these movies have release dates prior to 2000, the earliest rating dates are distributed across the five-year window similarly to the new releases. This suggests that old movies were becoming available on Netflix at a fairly steady rate across the observation period. At some point in the future there will be no old movies left to release, so every newly available movie will be a new release. Such shifts in the mixture over time can have a dramatic effect on predictions of future behavior.
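The counts behind charts like these could be produced along the following lines. This is only a sketch: it assumes a mapping from movie ID to release year and another from movie ID to earliest rating date, and the function name is made up for illustration.

```python
from collections import Counter
from datetime import date

def first_rating_histograms(release_years, earliest, cutoff_year=2000):
    """Count movies per earliest rating date, split into old and new cohorts.

    release_years: dict of movie_id -> release year
    earliest:      dict of movie_id -> earliest rating date
    Movies released in cutoff_year or later count as new.
    """
    old, new = Counter(), Counter()
    for movie_id, first_date in earliest.items():
        cohort = new if release_years[movie_id] >= cutoff_year else old
        cohort[first_date] += 1
    return old, new
```

In the old-movie counter, the entry for the first day of the observation period would correspond to the spike on the left of the second chart.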
Earliest ratings for new and old movies shown together on the same scale: