Saturday, October 1, 2011

The Average Hotel Does Not Get The Average Rating

The millions of travelers who review hotels, restaurants, and other attractions on TripAdvisor also supply a numeric rating by clicking one of five circles ranging from 1 for "terrible" to 5 for "excellent." On the whole, travelers are pretty kind. The average review rating for hotels and other lodgings is over 3.9. The median score is 4, and since that middle review is lost somewhere in a huge pile of 4-ratings, well over half of hotel reviews give a 4 or 5 rating.

So with such kind reviewers, most hotels must have a rating over 4 and hoteliers must all love us, right? Actually, no. The average of all hotels' average ratings is 3.6. Here's why: some large, frequently-reviewed hotels have thousands of reviews. It is hardly surprising that the Bellagio in Las Vegas has about 250 times more reviews than, say, the Cambridge Gateway Inn, an unloved motel in Cambridge, Massachusetts. It may or may not be surprising that these oft-reviewed properties tend to be well-liked by our reviewers. Surprising or not, it's true: the hotels with the most reviews have a higher average rating than the long tail of hotels, motels, B&Bs, and inns with only a handful of reviews each.
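
To make the two averages concrete, here is a minimal Python sketch. The hotel names and ratings are invented for illustration and are not TripAdvisor data; the only point is that counting each review once and counting each hotel once can give different answers when the most-reviewed hotel is also well liked.

```python
from statistics import mean

# Hypothetical ratings, invented for illustration only.
reviews = {
    "big_resort": [5, 4, 5, 4, 5, 4, 4, 5],  # many reviews, well liked
    "small_motel": [2, 3],                   # few reviews
    "roadside_inn": [3],                     # a single review
}

# Review-level average: every review counts once,
# so the heavily reviewed hotel dominates.
all_ratings = [r for ratings in reviews.values() for r in ratings]
review_average = mean(all_ratings)

# Hotel-level average: every hotel counts once,
# regardless of how many reviews it has.
hotel_means = {name: mean(ratings) for name, ratings in reviews.items()}
hotel_average = mean(list(hotel_means.values()))

print(f"average review rating: {review_average:.2f}")
print(f"average hotel rating:  {hotel_average:.2f}")
```

In this toy data the average review rating is 4.0 while the average hotel rating is about 3.33, the same direction of gap as the 3.9 versus 3.6 described above.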

The chart compares the distribution of user review scores with the distribution of hotel average scores.

For the curious, here are the top 10 hotels on TripAdvisor by number of reviews:

Luxor Las Vegas
Majestic Colonial Punta Cana
Bellagio Las Vegas
MGM Grand Hotel and Casino
Excellence Punta Cana
Flamingo Hotel & Casino
Venetian Resort Hotel Casino
Hotel Pennsylvania New York
Excalibur Hotel & Casino
Treasure Island - TI Hotel & Casino

Not all of these are beloved by TripAdvisor users. The Hotel Pennsylvania drags the average down since it receives more ones than any other score. Despite that, as a group these hotels have a higher-than-average score. The moral of the story is that you can't extrapolate from one level of aggregation to another without knowing how much weight to give each unit. In the last US presidential election, the average state voted Republican, but the average voter voted Democratic.
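
The same effect can be sketched with units of different sizes. The three units below and their numbers are hypothetical, chosen only to show how a majority of units and a majority of individuals can point in opposite directions; they are not election results.

```python
# Hypothetical units: (name, number of voters, share voting for party A).
# All values are invented for illustration.
units = [
    ("small_1", 100, 0.60),
    ("small_2", 100, 0.60),
    ("large", 1000, 0.40),
]

# Unit level: each unit counts once.
units_for_a = sum(1 for _, _, share in units if share > 0.5)

# Individual level: each voter counts once,
# so every unit is weighted by its size.
total_voters = sum(n for _, n, _ in units)
votes_for_a = sum(n * share for _, n, share in units)
overall_share = votes_for_a / total_voters

print(f"{units_for_a} of {len(units)} units favor A")
print(f"{overall_share:.1%} of voters favor A")
```

Two of the three units favor A, yet fewer than half of the voters do, because the one large unit carries most of the weight.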


  1. I didn’t have space to discuss which sorts of brands are more or less threatened by reviews sites, but it’s worth pointing out that consumer product brands like hotels and motels are safe.

  2. Hi Michael,

    great post, thx for sharing with us!



  3. Interesting. Your point that “you can't extrapolate from one level of aggregation to another without knowing how much weight to give each unit” is well taken. I wish we could spread the word; it might change the way news outlets report polling results. I wonder, though, about the implications of performing these analyses on ordinal data. Is the distance between a 1 and a 2 really the same as between a 4 and a 5? If 1 through 3 were collapsed into 2 buckets, would that remove some of the disparity?

    Interesting analysis nonetheless, and the fact that the data is ordinal doesn’t affect the validity of your point about extrapolation. I’m just curious whether you have looked at the ordinal issue, and, if so, what you have found.

  4. One place where I see the same problem is in "top x" rankings from a graduating class. That ranking is not always a good indicator of actual performance, since each class is weighted differently. This is both a blessing and a curse, since a top performer might do better among more aggressive classmates, or might fall behind the pack. School systems need to adjust their metrics.

  5. It seems that the solution to this problem is to compute averages weighted by bucket size and/or variance. In statistics, this is the same idea as correcting for non-independent data points. For instance, if I'm fitting a linear model over a set of observations of human activity, and multiple observations may be derived from the same individual, then I may want a random effect in my model to account for individuals as a grouping factor. In the case of hotel scores, if you were actually interested in the average hotel score across all hotels, then you would certainly want to consider the hotel as a grouping factor that impacts individual ratings. One simple solution is to use an average of averages, but you can get more fancy by taking into account things like variance within each bucket.

  6. This is an interesting observation. I would be curious to see if the trend holds when you look at more specific groups of hotels. For example, if I look at just the hotels in a specific city, do the most reviewed hotels still get the higher ratings? I could see it going the other way if a particular city as a whole really does have crummy hotels compared to cities like Las Vegas.

  7. This is an interesting observation. I would be curious to see if this trend holds when you examine more specific groups of hotels. For example, if you took all the hotels in a particular city, do the most reviewed hotels always get the higher ratings? I could see it going the other way if a particular city really did have crummy hotels as a whole compared to cities like Las Vegas. I could also imagine a cute little BnB getting higher reviews in a small town, even though its size prevents it from having as many reviews as a big hotel in that same city.

    1. It is certainly not always true that the hotels with the most reviews have higher ratings, but it is true often enough to be true on average.

  8. Thank you, I have been looking for information about this subject matter for ages and yours is the best I have found so far. The time and hard work put into this is very remarkable. This is a very educational blog. Good luck on the upcoming entry you put into it.

    hotel londra centro

  9. It seems like a lot of statistics have similar aggregation issues. Without including the size of the “pie” that the statistics come from, it is hard to tell what we can learn from the statistics. Take for example the idea that the richest people have a larger portion of wealth than they did 30 years ago. If what we care about is the standard of living of poor people, this fact doesn’t actually tell us anything, even though it seems related. To figure out whether poor people are better off today than 30 years ago, we need to include how much the overall economy grew to get an accurate picture. Without both the proportions and the total size, we cannot answer the questions we want to answer. In the case of hotels, we really want to know whether or not we will be pleased with our stay. The data from star ratings is probably not enough to make this decision.

  10. The moral of your post is great advice to any data scientist and eloquently stated.

    I think a good rule of thumb is to be wary of aggregation metrics that penetrate elements of scope, explicit or implicit, from the context in which the data were originally provided. For example, when a user offers her ratings on TripAdvisor, she does it for a specific hotel, rather than submitting a rating for all hotels. As such, calculating the average of user review scores across hotels does not make much sense. Doing so removes the implicit scope from the original data, since there exist hotels which the user has not rated. As the post shows, such an average does not take into account proper weighting, which would have served to alleviate the bias caused by the implicit scope. In contrast, if TripAdvisor were architected in such a way that each user offering ratings had to rate every hotel in the database, then the scope of the original data would encompass all hotels. An aggregate statistic across hotels in this latter case would then not violate the scope and would account for the weighting issue naturally.

    We can more easily recognize sources of bias in our aggregation if we consider both the data we have and the data we do not have as our “complete dataset”. To find the average rating of a single hotel, we would normally use all the ratings contained in the database for that hotel. However, with this paradigm we also consider the missing ratings from users who have not rated the hotel. These ratings are necessarily null and cannot be replaced with hard numbers without some type of justification or explicit assumption. This exercise is helpful because it forces us to recognize the biases of our aggregation metrics and explicitly tackle them before obtaining the desired metric. Such an approach can easily be applied in practice. For example, starting with a table of registered users and performing a left join with the ratings for a given hotel would reveal any null entries and essentially provide us with our theoretical complete dataset.

    Since we cannot always fill in missing data or obtain the theoretical complete dataset to remove bias, we have to at least ensure that the original data cannot affect measurements outside their scope. The average hotel score is a great example of this. Ratings are hotel specific so they must be aggregated at that level first. The effect of our “missing ratings” is that hotels which have been rated often are given more control over the metric than they should have (as reflected by the blue curve). The average hotel score metric avoids this effect by prohibiting raw (unnormalized) data from crossing the lines between hotels. While aggregation metrics for a complete dataset (one with ratings for each hotel from every user) would yield preferable results and avoid many aggregation pitfalls altogether, obtaining such a dataset is often infeasible and we must make do with what we have by respecting scope.
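
The left join described in the last comment can be sketched with Python's built-in `sqlite3` module. The table names, column names, and rows here are hypothetical and stand in for whatever schema a real review site uses; the sketch only shows how joining from the users table exposes the missing ratings as nulls.

```python
import sqlite3

# In-memory database with hypothetical tables and rows (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY);
    CREATE TABLE ratings (user_id INTEGER, hotel TEXT, rating INTEGER);
    INSERT INTO users VALUES (1), (2), (3), (4);
    INSERT INTO ratings VALUES
        (1, 'hotel_a', 5),
        (2, 'hotel_a', 4),
        (3, 'hotel_b', 2);
""")

# Start from every registered user and left join the ratings for one hotel.
# Users who never rated that hotel come back with rating = NULL (None).
rows = conn.execute(
    """
    SELECT u.user_id, r.rating
    FROM users AS u
    LEFT JOIN ratings AS r
      ON r.user_id = u.user_id AND r.hotel = ?
    ORDER BY u.user_id
    """,
    ("hotel_a",),
).fetchall()

observed = [rating for _, rating in rows if rating is not None]
missing = [user_id for user_id, rating in rows if rating is None]

print("rows:", rows)
print("mean of observed ratings:", sum(observed) / len(observed))
print("users with no rating for this hotel:", missing)
```

The mean is computed over the observed ratings only. The null entries are left as they are, since replacing them with numbers would require an explicit assumption about how those users would have rated the hotel.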

