Wednesday, April 30, 2014

Multi-Armed Bandits and Testing Online Marketing Campaigns

What is the connection between testing marketing campaigns and Las Vegas?

The Multi-Armed Bandit problem is a statistical problem that seeks to optimize a gambler playing multiple slot machines.  Each slot machine has an unknown probability of paying out, and the goal is to maximize the winnings over time.  Slot machines, by the way, are sometimes called "one-armed bandits", because -- on average -- you leave them with less money than you arrived with.

What is appealing about the Multi-Armed Bandit problem is that it is solvable.  The basic "learn-and-proceed" algorithm works quite well.  That is, for a certain period of time, play all the machines.  Then continue to play the winning'est machine.  Variations on this approach are also optimal, under a wide range of assumptions.

This is very appealing.  Imagine that you have a web site that makes money -- it actually doesn't matter whether that is through selling ads, selling merchandize, or getting subscribers.  You want to optimize the web site, and you can -- literally -- think of millions of different combinations of things.  You can change the layout of the home page, you can change the language, the colors, the images, and so on.  What works best?

The Multi-Armed Bandit solution would seem to provide a path to site optimization.  Start with a few variations.  Treat each variation as a separate "slot machine".  Each time the page is about to be loaded, peer into the historical data on how well each version has done.  Then choose the right version accordingly.  Keep track of how good the page is.  And, one of the optimization algorithms for the Multi-Armed Bandit problem can not only determine which page to load but guarantee optimal outcomes over time.  Truly impressive.

The problem is when this hits the real world.  In the real world, the slot machines are not all equal.  And, the probabilities may be more complicated.  For instance, the message "Visit us while you sip your morning cup of coffee" might work better in the morning than in the afternoon (I'm making this up).  Or the message, "Escape from the winter blues" might not work as well in Florida in January as in Minnesota.

So, this is the first problem.  In the web world, we do know something about even anonymous visitors.  We have a good idea of the time of day where they are.  We have a good idea of their "gross" geography.  We have a good idea of the device and web browser they are using.  What works in one place at one time on one device, may not work in another.

This makes the testing problem much harder.  If we know a handful of categories in advance, we do separate multi-armed bandit problems for each.  For instance:

  • US-based, web-based, weekday working hours.
  • US-based, mobile, weekday working hours.
And so on, through the 12 combinations suggested by these groups.

When the groups are not predetermined, it is a harder problem to solve.  This is often approached using off-line analysis, if there is adequate test data.

Another issue is that some messages may be targeted at a specific population.  For instance, some ip addresses have registered as .EDU indicating educational institutions.  How do you handle market testing for these groups?

If there is only one group, then you can set up separate Multi-Armed Bandit-type tests for EDU and for non-EDU.  But this might get more complicated.  For instance, some messages might be targeted for web versus mobile.  And now you have overlap -- because EDU users are in both groups.

Yet another problem is that determining the outcome may take time.  If the outcome is as simple as clicking on an ad, then that is known within a few minutes.  If the outcome is purchasing a particular, well, not everyone puts something in a cart and immediately checks out.  You might want a day lag.  If the outcome is a longer term measurement (did site engagement go up? time to next visit?), then even longer periods are needed.  This can result in a very slow feed back loop for the Multi-Armed Bandit.  After all, a slot machine returns its winnings (or losings) very quickly.

What is the real solution to such adaptive marketing online?  The real solution requires a comprehensive solution.  This solution has to understand "natural" divisions online (such as organization type) and the fact that a given message may have differential effects in different areas.  This solution has to take into account that some message are targeted to specific visitors or customers (if they can be known), so pure randomization may not be possible.  The solution has to understand the time dimension of the business, hopefully incorporating faster measures in the short-term, even if these are imperfect, along with validation of the measures in the longer term.

The Mutli-Armed bandit is a powerful idea.  Unfortunately, the solution to this problem is only one part of an online marketing solution.

Tuesday, April 8, 2014

Why I don't recommend MS Access

Recently, I was asked:
Why do you not recommend Access to use? Just curious. Read on page xi of your intro in Data Analysis Using SQL and Excel. 
Just beginning a class in SQL and bought your text. 
Thanks, Mort
This is a very fair question and one that I try to avoid answering.  Preferences on software can unleash religious passions.  Access is a powerful environment for doing development work in a Microsoft environment, and it does support a version of SQL.

All databases deviate from ANSI-standard SQL, but MS Access deviates significantly (in my opinion) both from the standard and from other databases.  From the functionality perspective, it lacks stored procedures, functions, and triggers.  The "database" itself is a file engine and doesn't have built-in transaction support or ACID-compliance, although the Jet/ACE engine (typically used) does have this functionality.

From the perspective of learning SQL, I simply find that the SQL that Access supports is "too different" for my liking.  Here are ten important differences:
  • Access does not support the case statement, so conditional logic is done with the non-standard iif() function.
  • Access requires parentheses around each pair-wise join, resulting in a proliferation of nesting in from clauses that only serves to confuse people learning SQL.
  • Access join syntax requires the inner for inner join.  While it may be a good idea to use inner for clarify, it is often omitted in practice (in other databases).
  • Access does not support full outer join.
  • Access does not allow union or union all in subqueries.
  • Access requires the as for table aliases.  In most databases, this is optional, and I prefer to only use as for column aliases.  Ironically, the use of as for table aliases is forbidden in Oracle.
  • Access uses double quotes to delimit strings (as opposed to single quotes) and is the only database (to my knowledge) that uses & as a string concatenation operator.
  • Access uses * for the wildcard in like rather than %.
  • Access allows BETWEEN AND .  This is allowed in other databases, but will always evaluate to false.
  • Access does not support window/analytic functions (using the over and partition by clauses).
In addition, many people seem to want to use the wizards to generate SQL and learn from that.  I find the output of the wizard to be an unreadable mess of parentheses and square braces and an impediment to learning to write good SQL.

In short, Access is a useful tool and there are many examples of it working well in practice.  However, the Access query language is sufficiently different from other versions of SQL that I would recommend another free query engine instead.  There are definitely other choices such as SQL Server Express, Postgres, Oracle Express, and MySQL.  Data Analysis Using SQL and Excel uses SQL Server syntax for its examples.