Thursday, September 27, 2007

Outliers for number of movies rated

Originally posted to a previous version of this blog on 18 April 2007.

My raters signature has a column for the number of movies each subscriber has rated. In the J code below, this column is called n_ratings.

+/ n_ratings >/ 1000 5000 10000

In English, this compares each subscriber's number of ratings with each of the three values, creating a table with one row per subscriber and three columns. The table contains 1 where the number of ratings is greater than the corresponding value and 0 where it is less than or equal to it. These 1's and 0's are then summed down each column, giving the number of subscribers above each threshold. The result vector is 13100 43 5.
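For readers who do not read J, here is a rough Python translation of the same idea, run on a tiny made-up list rather than the real data. The function name count_above and the toy values are assumptions for illustration only; the J expression above is what produced the actual counts.

```python
def count_above(n_ratings, thresholds):
    """Count how many entries of n_ratings are strictly greater than each threshold."""
    # Build the comparison table: one row per subscriber, one column per threshold.
    # This mirrors the J outer comparison  n_ratings >/ thresholds.
    table = [[1 if n > t else 0 for t in thresholds] for n in n_ratings]
    # Sum down each column, mirroring the leading  +/  in the J expression.
    return [sum(row[j] for row in table) for j in range(len(thresholds))]


# Toy data (hypothetical, not the Netflix Prize numbers).
toy_n_ratings = [500, 1200, 6000, 15000, 1000]
toy_counts = count_above(toy_n_ratings, [1000, 5000, 10000])
print(toy_counts)  # [3, 2, 1]
```

Note that the subscriber with exactly 1000 ratings is not counted in the first column, because the comparison is strictly greater than.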

The 13,100 people who rated more than a thousand movies are presumably legitimate movie buffs who have seen, and have opinions on, a lot of movies. Rating more than 10,000 movies, as five subscribers did, does not seem like the expected behavior of a single human. Could these be the collective opinions of an organization? Or automatic ratings generated by a computer program? I don't know. What I do know is that such outliers should be treated with care. One concern is that for movies rated by very few subscribers, these outliers will dominate the ratings.
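One way to check that concern would be to measure, for each movie, what fraction of its ratings come from heavy raters. The sketch below is only an illustration in Python: the (subscriber, movie) pair layout, the function name heavy_rater_share, and the toy data are all assumptions, not the format of my actual tables.

```python
from collections import Counter


def heavy_rater_share(ratings, threshold):
    """Return {movie: fraction of its ratings from subscribers with more than threshold ratings}.

    ratings is an iterable of (subscriber_id, movie_id) pairs (a hypothetical layout).
    """
    ratings = list(ratings)
    # Number of ratings per subscriber, the analogue of n_ratings.
    per_subscriber = Counter(subscriber for subscriber, _ in ratings)
    heavy = {s for s, n in per_subscriber.items() if n > threshold}
    # Total ratings per movie, and ratings per movie that come from heavy raters.
    total = Counter(movie for _, movie in ratings)
    from_heavy = Counter(movie for subscriber, movie in ratings if subscriber in heavy)
    return {movie: from_heavy[movie] / total[movie] for movie in total}


# Toy data: subscriber "a" is the only one with more than 2 ratings.
toy_ratings = [("a", "m1"), ("a", "m2"), ("a", "m3"),
               ("b", "m1"), ("c", "m1"), ("d", "m3")]
shares = heavy_rater_share(toy_ratings, threshold=2)
```

In the toy data, movie m2 has a single rating and it comes from the heavy rater, so its share is 1.0, which is exactly the situation described above.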

There has been some discussion of this issue on the Netflix Prize Forum.

