Tuesday, February 14, 2012

Using Matched Pairs to Test for Cannibalization

When a company introduces a new product into the same market served by an existing one, it is possible that the new product will achieve success at the expense of the first. For example, when Netflix introduced movie downloading, it knew it would put a dent in DVD subscriptions. This is called cannibalization. Here at TripAdvisor, I recently did a study to determine whether this was occurring on our sites.
A good methodology to use for this kind of study is matched pairs. This allows you to isolate the effects of a single variable while controlling for many others. The idea is simple: To measure the effect of a treatment, you take pairs of subjects who are similar in every way and give the treatment to one, but not the other. In medical studies, twins come in handy for this purpose. 

To simplify slightly, at TripAdvisor we have two ways to generate revenue from the millions of travelers who come to one of our sites to read reviews: they can click on link A to be taken to an on-line travel agency, which pays us for the referral, or they can click on link B to be taken directly to the site of a hotel that has subscribed to our business listing product. So the question is “Does the presence of link B have an effect on the number of clicks received by link A?” To answer this question, each property with a business listing is paired with a “twin” that does not have a business listing. The result is two cohorts with extremely similar distributions of average daily rate, number of reviews, amount of traffic on the review page, number of rooms, and everything else I could think of that might influence clicks on link A. Since the only consistent difference between the cohorts is the presence or absence of link B, any statistically significant difference in link-A clicks can be attributed to the presence of the business listing.

Why not just compare a random sample of hotels with links A and B with a random sample of hotels with only link A? Such a comparison would be very flattering to link B; on average, hotels with a business listings subscription perform better than those without one on all kinds of metrics, including clicks on link A. This is not surprising. Business listings do not appeal to all properties equally, nor have they been marketed with equal vigor in all markets and market segments. Such a study cannot distinguish between a difference caused by link B and one that is merely correlated with link B. For example, perhaps link B appeals more to hotels in high-traffic destinations and those same properties also attract more clicks of all kinds.

Why not do a longitudinal study? The goal would be to compare the click rate before and after link B goes live on a hotel’s review page. The problem with this approach is that though the change in click rate is easy to measure, it is hard to interpret. The quantity of clicks varies over time for all sorts of reasons that have nothing to do with the presence or absence of a business listing. In addition to seasonality, there is trend: The ever-increasing number of TripAdvisor users means that clicks will tend to increase over time. Add to that the effects of marketing campaigns, competition, changing exchange rates, and political factors, and there is a lot of noise obscuring whatever signal is in the data. A cross-sectional study controls for all that, because both members of a pair are observed over the same period.

How is similarity measured?  The matched pairs methodology calls for each subscriber to be paired with the non-subscriber most similar to it. For this study, there is a list of features that must match exactly and another list of features which, as a group, must be “pretty close.” The exact match features are categorical. The pretty close features are numeric.

Exact match features
·         Same business listing pricing slice.
·         Same geographic region.
·         Same category (Hotel, B&B, Specialty Lodging).
·         Same chain status (a Hilton can match a Marriott, but neither can match an independent property).
·         Matching properties are both on the first page of listings for a destination or both on some other page.
·         Presence or absence of reviews supplied by our users.
Hotels that match on all of the above are candidates for matching. A hotel’s actual match is its closest neighbor as determined by the “pretty close” features. The exact match features control for many variables that are not mentioned explicitly. For example, the price charged for a business listing depends on the popularity of the destination and the size of the property, so hotels in the same pricing slice are similar in size and traffic. Matching on geography controls for currency, climate, language, and much else.
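
As a rough illustration of the candidate step (not the code used for the study), the exact match features can be treated as a composite key, with non-subscribers grouped by that key so each subscriber only looks at its own group. The field names below, such as price_slice and on_first_page, are hypothetical; the post does not describe the underlying data schema.

```python
from collections import defaultdict

# Hypothetical field names standing in for the six exact match features.
EXACT_KEYS = (
    "price_slice", "region", "category",
    "is_chain", "on_first_page", "has_reviews",
)

def exact_key(prop):
    """Composite key built from the categorical exact match features."""
    return tuple(prop[k] for k in EXACT_KEYS)

def candidate_pools(properties):
    """Group non-subscribers by their exact match key."""
    pools = defaultdict(list)
    for prop in properties:
        if not prop["subscriber"]:
            pools[exact_key(prop)].append(prop)
    return pools

def candidates_for(subscriber, pools):
    """Non-subscribers that agree with the subscriber on every exact match feature."""
    return pools.get(exact_key(subscriber), [])
```

Storing chain status as a boolean reflects the rule that any chain can match any other chain but not an independent property. A subscriber whose key has no non-subscribers simply gets an empty candidate list and cannot be paired.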

Pretty close features
·         Average daily rate.
·         Number of rooms.
·         Popularity ranking.
·         Review page views.

The values of these features place each property at a point in a four-dimensional space so it is easy to calculate the Euclidean distance between any pair of properties. The closest candidate by Euclidean distance is picked as the match. Because the features are all measured on different scales, they must first be standardized to make distance along one dimension comparable to distance along any other.
A few pairs are so well matched that, according to this measure, they are distance 0 from each other.
The hotels on the left have business listings. The ones on the right are their twins without business listings. Podere Perelli and Agriturismo il Borghetto are twins because each has 12 rooms, each got exactly 72 page views during the observation period, and each is seventh on its page.
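
The distance calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the study's actual code: standardization is taken to mean z-scores (subtract the mean, divide by the standard deviation) computed over all properties, the feature names are hypothetical, and nothing is assumed about whether one non-subscriber may serve as the twin of several subscribers.

```python
import math
import statistics

# Hypothetical names for the four "pretty close" features.
FEATURES = ("adr", "rooms", "rank", "page_views")

def standardize(rows):
    """Map each property id to its vector of z-scored features."""
    scale = {}
    for f in FEATURES:
        values = [r[f] for r in rows]
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)
        # A constant feature carries no information; avoid dividing by zero.
        scale[f] = (mean, sd if sd > 0 else 1.0)
    return {
        r["id"]: tuple((r[f] - scale[f][0]) / scale[f][1] for f in FEATURES)
        for r in rows
    }

def closest_twin(subscriber_id, candidate_ids, z):
    """Return (twin_id, distance) for the nearest candidate in standardized space."""
    twin = min(candidate_ids, key=lambda c: math.dist(z[subscriber_id], z[c]))
    return twin, math.dist(z[subscriber_id], z[twin])
```

Two properties with identical values on all four features end up at distance 0, which is the situation described for the best-matched pairs. Without the standardization step, a feature measured in large units such as page views would dominate the distance.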
The results
Deciding on the distance metric and creating the matched pairs was most of the work. Once I had the pairings, I loaded 36,000 closely matched pairs into JMP, a data exploration and analysis tool that includes a matched pairs module.

In the diamond-shaped chart, the horizontal axis represents increasing number of clicks on link A (“commerce clicks” in the figure). To the left, where the number of clicks is low, there are some dots below the red line indicating pairs where the non-subscriber got more link-A clicks, but as the number of clicks increases, the business listings subscriber nearly always wins.
In conclusion, after controlling for differences due to geography, traffic, popularity, hotel category, number of rooms, presence or absence of reviews, appearance on page one, and average daily rate, we counted the number of clicks each twin received during a fixed observation period. There was a statistically significant difference in the number of clicks on link A. The average number of clicks for business listing subscribers was 597.49. The average number for non-subscribers was 411.69. This is good news for our subscribing hoteliers: In addition to the traffic we drive directly to their sites, they see increased indirect traffic as well.
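
For readers without JMP, a paired t-test is one common way to test a difference like this in matched-pairs data; the post does not say exactly which test the matched pairs module reported, so the sketch below is an assumption about the general technique, not a reproduction of the analysis. The click counts in the example are made up for illustration and are not study data.

```python
import math
import statistics

def paired_t(subscriber_clicks, twin_clicks):
    """Return (mean difference, t statistic) for paired observations."""
    if len(subscriber_clicks) != len(twin_clicks):
        raise ValueError("each subscriber needs exactly one twin")
    # Work on within-pair differences so that pair-level effects cancel out.
    diffs = [s - t for s, t in zip(subscriber_clicks, twin_clicks)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff, mean_diff / std_err

# Made-up click counts for five pairs, for illustration only.
subscribers = [120, 45, 300, 80, 610]
twins = [100, 50, 240, 60, 500]
mean_diff, t_stat = paired_t(subscribers, twins)
print(mean_diff, t_stat)
```

With tens of thousands of pairs, the t statistic is effectively compared against a standard normal distribution, so an absolute value above roughly 1.96 corresponds to significance at the two-sided 5% level.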


  1. It occurs to me that testing for cannibalization might be done in different ways for different applications. Your approach for this scenario makes sense: you wouldn't want to test using a longitudinal study because hotels change over time, and you have data to map hotels together to do the comparison. However, let's say that a library wanted to find out if enhancements to its online offerings were cannibalizing its physical offerings. A longitudinal study would make sense in that case. Assuming you had the data, you would be able to determine physical usage of library resources before the online resources were offered and then their usage after. The only thing that has changed is the presence of online resources.

  2. So the hotels with business links received more clicks than those without the business link. Does this mean they had more clicks on their own (hotel) website? I have a few questions about what you did.

    Question 1: Clearly counting clicks is a good metric of how well your website directs traffic to the hotels’ websites. However, is there a way you can incorporate how many of those web leads turned into reservations? I would like to think that the business link is more beneficial than your results show. It would seem to me that the users who click on link A are less likely to make a reservation than those who take the business link to the hotel’s website because there are more options under link A (isn’t link A a link to a travel agency that includes the hotel as one option?). Instead of using just clicks on the link, I think that a better metric would be number of reservations made through link A vs. through the business link. Thoughts?

    Question 2: It appears that the business link doesn’t suggest cannibalization of the hotels’ business but rather of yours. What I mean is that the hotels’ rates are the same under link A or link B, and consequently there is no cannibalization, as there isn’t a new product being offered. Is the TripAdvisor website set up with some pages having only link A and others having both link A and link B? If so, it appears that the cannibalization is affecting TripAdvisor instead of the hotels. I say this because introducing link B would theoretically affect the traffic to link A. If this is the case, then the online travel agencies would be losing ‘business,’ i.e. traffic. It would be interesting to see the difference in number of clicks on link A while there is a link B versus when there isn’t a business link.

    1. For Q1, I do not have any directly observable data on conversion rates. It certainly seems reasonable to assume that arriving at the hotel's own reservation page is more likely to result in a reservation at that hotel than arriving at the on-line travel agency page that features many hotels, but I do not have proof.

      For Q2, my study was only concerned with cannibalization of TripAdvisor's own traffic. The question was "Are we losing cost-per-click advertising revenue by offering the subscription-based business listing?"

