Thursday, September 27, 2007

Netflix rating activity over time

This post was salvaged from the wreckage of a former version of the Data Miners blog. It was part of a series of postings on the data used for the Netflix competition. The original posting date was 12 May 2007. At that time, the 2007 KDD Cup competition was active. One of the tasks was to predict how many ratings would be made in 2006 by a group of raters observed through 2005 rating a group of movies that were all released during or before the observation period. Although the KDD contest has come and gone, the subject of how rating activity changes over time remains interesting.

In this post, I begin to look at what happens to a movie's propensity to be rated as a function of time since first availability. To avoid having to deal with the issue of interval censoring that I brought up in a previous post, I restrict my attention to movies that were first rated after 01 February 2000, well into the observation period. This first chart shows the raw, unadjusted count of ratings for all movies first rated after 01 February 2000 by days since first rating.

Raw count of ratings by movie age (days since first rating)

It is tempting to dive right in and start interpreting this raw data (what in the world is going on at day 1003, for example?), but that temptation can lead to erroneous conclusions. There are two important things to keep in mind:
  1. The number of raters is not constant over time. As the customer base grows, we can expect more rating activity, so the absolute number of ratings a movie receives over time may not decay as quickly as it would for a fixed population. During the observation period, the number of raters grew substantially. For the KDD Cup task, on the other hand, we will be looking at the activity of a shrinking subpopulation. (Some customers who were active in the observation period are bound to quit, die, move away, or lose interest in making ratings; their replacements do not show up in our data.)

  2. As the days since first rating number gets higher and higher, our data on rating behavior is based on a smaller and smaller number of movies. By definition, all the many thousands of movies in our sample experienced day 0. All but the few first rated on the last day of the observation period also experienced day 1. By the time we get to day 2000, only a handful of movies released in February of 2000 remain in the sample. Small numbers mean high variance. Also, as the time since first rating gets longer, the actual dates of first rating come from a narrower and narrower window, exposing us to seasonal effects that are averaged away for "younger" movies.

Change in rating activity over the observation period

In the next chart, I have adjusted the raw number of ratings received for each day since a movie was first rated to take into account the overall level of rating activity on that day. Since the Nth day since first rating comes on a different calendar date for each movie, this adjustment is made at the level of the rating transaction table before summarization. The adjustment is to count each rating as 1 over the total number of ratings that day.
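As a concrete sketch of this adjustment, here is a small Python translation with made-up transactions (the actual rating table is not shown in this post): each rating adds 1 to the raw count for its movie's age, and 1 divided by that calendar day's total rating volume to the adjusted count.

```python
from collections import Counter, defaultdict

# Hypothetical rating transactions: (movie_id, rating_day, first_rating_day),
# with calendar dates reduced to integer day numbers for simplicity.
ratings = [
    ("A", 100, 100), ("A", 101, 100), ("B", 101, 101),
    ("B", 102, 101), ("C", 102, 102),
]

# Overall rating volume on each calendar day.
daily_total = Counter(day for _, day, _ in ratings)

raw = defaultdict(int)         # unadjusted rating count by movie age
adjusted = defaultdict(float)  # each rating weighted by 1 / that day's volume

for movie, day, first_day in ratings:
    age = day - first_day      # days since the movie was first rated
    raw[age] += 1
    adjusted[age] += 1 / daily_total[day]

print(dict(raw))       # {0: 3, 1: 2}
print(dict(adjusted))  # {0: 2.0, 1: 1.0}
```

Because the weighting happens transaction by transaction, movies whose Nth day falls on a busy calendar date contribute less to the adjusted series, which is exactly the intent.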

Rating intensity over time adjusted for number of raters

The query that produced the data for both the unadjusted and adjusted charts appeared here as an image, which has not survived.

One interesting feature is the high number of ratings on the first day of availability for rating. I suspect this represents people who see movies in theaters and are impatient to rate them when they first become available on Netflix. After that initial burst, I suspect that rating volume is largely a function of the number of discs in circulation for a newly released DVD. This appears to ramp up quickly. After a peak at around 230 days, a movie's rating activity goes into decline.

Comparison of adjusted and unadjusted rating intensity

What of the effect of the declining number of movies for large values of days since first rating? The following chart shows the number of movies with earliest rating dates after 01 February 2000 that experienced each number of days since first rating.

As can be seen on the chart below, movies became available for rating throughout the period from 02 February 2000 through the end of 2005.

Movies newly available for rating by date

*How many movies of each age were available to be rated?;
data atrisk(keep=daysout available);
set releases;
retain available;
daysout = _N_ - 1;
if daysout = 0 then available = 15405;
else available = available - decrease;  /* DECREASE is read in from RELEASES */
run;

*Rating activity by days available adjusted for size of pool;
proc sql;
create table ratetenure_adj as
select l.daysout, l.ratings,
l.ratings/r.available as rpm,
l.adj_ratings/r.available as arpm
from ratetenure3 l, atrisk r
where l.daysout = r.daysout and r.available > 100
order by l.daysout;
quit;

The code above is in SAS. For the most part, I restrict myself to PROC SQL which is close enough to regular SQL that I do not bother to provide any explanation. This data step code is admittedly a bit odd, however. Had I followed my usual practice of storing my data in J arrays, things like subtracting a cumulative sum from an initial constant would be trivially accomplished using a scan expression such as c-+/\releases.

At the moment, however, the data is sitting in SAS tables. Given a table RELEASES containing the number of movies that became available on each date, in reverse chronological order, I want to know how many movies were available for rating at each "age," where age is defined as days since the movie was first rated. A total of 15,402 movies became available for rating between 02 February 2000 and the end of 2005. Any one of these movies could have been rated on its day 0 (which I am calling the "release date," although in fact it is just the earliest rating date, which might or might not be the same thing). The last day on which any movies were released was 09 December 2005, so I consider that the end of the ratings window. The two movies released that day were not available to be rated at age 1 day because they did not reach age 1 within the window. All the rest of the movies were available to be rated at age 1. All movies released before the last two days of the window were available to be rated on both day 0 and day 1, and so forth.
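The same computation can be sketched in Python with a made-up RELEASES vector, using a cumulative sum much as the J scan expression would:

```python
from itertools import accumulate

# Hypothetical counts of movies newly available for rating on each day,
# in reverse chronological order (last release day first), like RELEASES.
releases = [2, 5, 3, 7]
total = sum(releases)  # 17 movies in all

# Movies available at age k are all those NOT released in the last k days
# of the window: total minus a shifted cumulative sum (J: c - +/\ releases).
cum = list(accumulate(releases))
available = [total] + [total - c for c in cum[:-1]]

print(available)  # [17, 15, 10, 7]
```

In this toy example the 2 movies released on the last day experience only age 0, so the at-risk pool drops from 17 to 15 at age 1, and so on down the list.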

Number of movies available for rating by age

In the chart below, the blue line shows absolute ratings per movie by movie "age." The red line shows ratings per movie by movie age adjusted for the overall growth in rating activity.

Rating intensity by age of movie
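The two lines amount to dividing each age's raw and adjusted totals by the number of movies still in the sample at that age, dropping ages where the pool is too small, as the PROC SQL step above does with its available > 100 filter. A Python sketch with made-up per-age totals:

```python
# Hypothetical per-age totals: raw ratings, growth-adjusted ratings, and
# the number of movies still available at each age (as in ATRISK).
ratings   = {0: 900, 1: 550, 2: 30}
adjusted  = {0: 0.90, 1: 0.55, 2: 0.03}
available = {0: 120, 1: 110, 2: 40}

# Ratings per movie (rpm) and adjusted ratings per movie (arpm), keeping
# only ages where more than 100 movies remain at risk.
rpm  = {age: n / available[age] for age, n in ratings.items() if available[age] > 100}
arpm = {age: a / available[age] for age, a in adjusted.items() if available[age] > 100}

print(rpm)  # {0: 7.5, 1: 5.0} -- age 2 is dropped: only 40 movies remain
```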

A plea for comments
I find it quite surprising that the large spike in the number of ratings that is visible on the calendar time line seems to survive translation to the days-since-first-rating time line. My initial hypothesis about spikes on the calendar time line is that they represent glitches in the system where, for example, several days' worth of rating activity got posted all at once. That sort of thing should disappear when moved to the days-since-first-rating time line, since the movies rated on a particular date are all at different ages. A spike on the days-since-first-rating time line suggests either that all the extra rating activity surrounded a single movie (one that happened to have been available for rating for 1003 days when the event occurred) or that something else is going on that I haven't thought of. If anyone out there knows what is going on, please post a comment.

1 comment:

  1. I think what you're seeing is exogenous shocks to the data, where some external event is causing a lot of people to rate the movie (maybe a sequel comes out, an actor/director dies, the movie wins an award, etc.) This would be a similar sort of thing to what's happening to Michael Jackson's albums right now - their sales are experiencing exogenous shocks due to his death.
    Am I understanding your calculations/data correctly? It's hard to tell, because your graphs aren't displaying.

    I realize you posted this two years ago, so I don't know whether you're still interested in thinking about this data. I found this post by googling "netflix ratings over time," because I'm doing research on the dataset myself (I'm looking for herding effects, where users' ratings are affected somehow by the average rating displayed at the point in time when they go to rate the movie.)

