Sunday, March 30, 2014

Doing the Right Thing: Are your measures correct?

"A lot of good analysis is wasted doing the wrong thing."
Anyone who has worked with data on business problems is probably aware of this adage.  And this past week, I was reminded once again of this fact while analyzing a marketing program.  This example is striking because the difference between doing the "right" thing and the "almost-right" thing ended up being more than a factor of 10 -- a really big variance on a financial calculation.

Some background.  One of my clients does a lot of prospecting on the web.  They have various campaigns to increase leads to their web site.  These campaigns cost money.  Is it worth it to invest in a particular program?

This seems easy enough to answer, assuming the incoming leads are coded with their source (and they seem to be).  Just look at the leads coming in.  Compare them to the customers who sign up.  And the rest, as they say, is just arithmetic.

Let's say that a customer who signs up on the web has an estimated value of $300.  And, we can all agree on this number because it is the Finance Number.  No need to argue with that.

The first estimate for the number of leads brought in was around 160, produced by the Business Intelligence Group.  With an estimated value of $300, the pilot program was generating long term revenue of $48,000 -- much more than the cost of the program.  No brainer here.  The program worked!  Expand the program!  Promote the manager!

The second estimate for the number of leads brought in was 12.  With an estimated value of $300, the pilot was generating $3,600 in long term revenue -- way less than the cost of the program.  Well, we might as well burn the cash and roast marshmallows over the flame.  No promotion here.  Know any good recruiters?
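The arithmetic behind both estimates is the same; only the lead count changes. A minimal sketch (the $300 value and the lead counts come from the post; the function name is my own):

```python
# The Finance Number: estimated long-term value of one web signup.
VALUE_PER_CUSTOMER = 300

def long_term_revenue(lead_count, value=VALUE_PER_CUSTOMER):
    """Estimated long-term revenue attributed to a campaign."""
    return lead_count * value

print(long_term_revenue(160))  # BI Group's estimate -> 48000
print(long_term_revenue(12))   # corrected estimate  -> 3600
```

With identical inputs to the multiplication, the entire factor-of-13 swing comes from which leads you count, not from the arithmetic.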

Both these estimates used the same data sources.  The difference was in the understanding of how the "visitor experience" is represented in the data.

For instance, a visitor has come to the site 300 times in the past.  The 301st visit was through the new marketing program.  Then two weeks later, on the 320th visit, magic happens and the visitor becomes a customer.  Is the lead responsible for the acquisition?  This problem is called channel attribution.  If the customer had signed up on the same visit when s/he clicked through as a lead, then yes, you could attribute all or most of the value to that marketing program.  But two weeks and 20 visits later?  Not likely.  The lead was already interested.
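One common way to make this judgment mechanical is an attribution window: credit the campaign only if the signup happens within some short period of the campaign-tagged visit. A hypothetical sketch, assuming a 7-day window (the window length, function, and dates are illustrative, not the client's actual rule):

```python
from datetime import date, timedelta

# Assumed attribution rule: the signup must fall within 7 days of the
# campaign-tagged visit for the campaign to get credit.
ATTRIBUTION_WINDOW = timedelta(days=7)

def credit_campaign(lead_visit, signup, window=ATTRIBUTION_WINDOW):
    """True if the signup is close enough to the lead visit to credit it."""
    return lead_visit <= signup <= lead_visit + window

# The visitor from the example: lead visit, then signup two weeks later.
print(credit_campaign(date(2014, 3, 1), date(2014, 3, 15)))  # -> False
print(credit_campaign(date(2014, 3, 1), date(2014, 3, 4)))   # -> True
```

Real attribution models get far more elaborate (first-touch, last-touch, fractional credit across channels), but even a crude window would have flagged the two-weeks-later conversion as suspect.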

A more serious problem occurs through the complexities of web visits.  If a visitor is not logged in, there is no perfect way to track him or her (or "it" if it were a dog).  Of course, this company uses cookies and browser caches and tries really, really hard to keep track of visitors over time.  But the visitor cannot be identified as a customer until s/he has logged in.  So, I may be a real customer, but happen to be trying out a new browser on my machine.  Or, I visit from an airport lounge and don't log in.  Or some other anonymous visit.  This seems like a bona fide lead when arriving through the marketing program.

And then . . .  the visitor keeps using the new browser (or whatever).  And then later, s/he decides to log in.  At that point, the visitor is identified as a customer.  And, more importantly, the VisitorId associated with the visitor is now tied to a customer account.  But that doesn't mean that the lead created the customer.  The login merely identified an existing customer.

Guess what?  This happened more times than you might imagine.  In many, many cases, the 160 "customers" generated by the leads had been customers for months and years prior to this marketing campaign.  It doesn't make sense to attribute their value to the campaign.
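The fix that separates the 12 genuine leads from the 160 is, at heart, a date comparison: a lead only counts if its account was created after the campaign began. A minimal sketch, assuming hypothetical record and field names (the real check would run against the lead and account tables):

```python
from datetime import date

# Assumed campaign launch date, for illustration only.
CAMPAIGN_START = date(2014, 2, 1)

def genuinely_new(lead, campaign_start=CAMPAIGN_START):
    """A lead counts only if its account postdates the campaign launch."""
    return lead["account_created"] >= campaign_start

leads = [
    {"visitor_id": "v301", "account_created": date(2014, 2, 10)},  # new signup
    {"visitor_id": "v017", "account_created": date(2011, 6, 3)},   # old customer
]
new_leads = [lead for lead in leads if genuinely_new(lead)]
print(len(new_leads))  # -> 1: the pre-existing customer is excluded
```

This is exactly the kind of filter that becomes obvious once you have the row-level list of leads and account numbers in hand, and invisible when all you see is an aggregate count.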

The moral of this story:  it is important to understand the data and, more importantly, to understand what the data is telling you about the real world.  Sometimes, in our eagerness to get answers, we miss very important details.

As a final note, we found the problem through a very simple request.  Instead of just believing the number 160 in the report generated by the Business Intelligence Group, we insisted on the list of leads and account numbers created by the program.  With the list in-hand, the problems were fairly obvious.

1. This is a fantastic data mining example of a principle I believe extends beyond the domain of just data mining – solving the problem isn't the problem, understanding the problem is the problem. As a computer scientist myself, I can sympathize with the desire to just blindly follow the spec, and hash out an answer to the problem as stated. That, however, is a pitfall for the unwary. The key to the process of correct problem solving begins and ends with problem formulation. A proper understanding of what it is you're really trying to do is essential to the design of a correct solution. Are you solving the problem as stated, or are you solving the problem you really have? (In this context, are you performing arithmetic with the given numbers - which were flawed in this case - or are you solving the actual problem you had, determining the cost-benefit ratio of the ad campaigns?)

The ability to solve the problem you really have, and not just the one that's stated in the text, is what I believe we refer to when we talk about “thinking outside the box”. That point is well illustrated with an example from one Laura Gilchrist, a teacher. She tasked her students with writing their “first name in as many ways as [they could]”. They proceeded to perform the task as they perceived it: simply rewriting their names in different styles. After being prompted by their teacher, “I notice you're all still sitting in your chairs writing your names,” however, the students realized that the bounds they had perceived to be inherent in the problem were instead artificially imposed, and the classroom quickly became a flurry of activity.

This all raises the question, however: assuming there is a real problem hidden behind our stated one, how do we go about discovering what our real problem is? This is the precise idea behind a powerful technique originally used at Toyota known as the “5 whys”. The idea is to iteratively ask “why” of the problem statement in order to explore the true cause-and-effect relationships that underlie the problem to be solved. As I lack the expertise to discuss this process in any greater detail, I would point the interested reader to its Wikipedia page, which is both informative and serves as a good starting ground for further information on this and related techniques.

1. I agree with the comment above--more than just understanding the data and what the data tells us about the world, we must understand the underlying problem which we are trying to solve.

Data mining and other learning problems typically seek to build a representation, or model, of an unknown system. To do this, these problems require three things: data (as described in the article), a class of models from which we can choose a solution, and a metric by which we can judge which of the models in the class of models best represents the data.

Once a problem is formulated with the above elements, we can then choose a solution, such as a data mining algorithm, through which we build the model. This solution, however, is meaningless unless we, as the post suggests, have an understanding of how to map the solution back into the real world.

In brief, we need to know more than just how the data relates to the world. We need to have a solid understanding of the problem we are seeking to solve and how the entire data mining process (data collection, formulation, and solution generation) maps to this problem and its respective solution.
