This is a continuation of my previous post on multi-armed bandits. And, I'm guessing there will be at least one more after this.


The Multi-Armed Bandit problem is a seemingly simple problem. A gambler is faced with a row of slot machines, each of which returns different winnings. S/he needs to devise a strategy to find the winningest slot machine as quickly as possible and then just play that one.

Most of the strategies for doing this are based on a greedy-algorithm approach. They are some variation on: choose slot machines randomly (or round-robin) until some threshold has been reached, then keep playing the winningest one. These actually work pretty well. But I am interested in applying basic statistics to this.

Before doing that, let me explain why I am interested. Imagine that I have a web site and I have an ad space to fill. Here are different things I might put there:

- A run-of-network ad that will make some amount of money per impression.
- A click-through ad that will make some amount of money if someone clicks on it.
- A partner ad that will make some amount of money if someone signs up for something.

The Multi-Armed Bandit provides an automated means of testing all three of these at once, along with variations that may, or may not, prove better than business-as-usual. I think of it as automated champion-challenger models.

Here is a "statistical" approach to this problem. Let me assume that there are *N* campaigns being run. Each campaign has a payout distribution, and I can calculate the average payout for each campaign. In the end, I want to choose the campaign that has the largest average payout. Note that I'm making the assumption that the campaigns perform consistently across time and across the visitor population. Those are other issues I discussed earlier; let's focus on the basic problem here.

By the Central Limit Theorem, we know that we can estimate the average based on a sample of data. This estimate of the average has its own average and standard deviation, and the standard deviation (once there are enough samples) gets smaller and smaller, meaning that the estimate of the average is better and better.
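A quick simulation makes the Central Limit Theorem point concrete. The payout distribution below (a click that pays $1 about 10% of the time) is purely illustrative, as is the function name:

```python
import random
import statistics

rng = random.Random(0)

def sample_mean_stderr(n):
    """Estimate the mean payout and its standard error from n simulated plays."""
    payouts = [1.0 if rng.random() < 0.1 else 0.0 for _ in range(n)]  # ~10% pay $1
    mean = statistics.fmean(payouts)
    stderr = statistics.stdev(payouts) / n ** 0.5
    return mean, stderr
```

With more plays, the estimated average settles near the true 0.10, and the standard error shrinks roughly as 1/sqrt(n).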

The idea is then simple. At the beginning, give each campaign the same estimate with a wide confidence interval. The intervals all overlap completely, so the choice of best campaign is random. Initially, we might want to round-robin the data to get some initial values. Relatively quickly, though, we should get estimates for each of the campaigns; these will be inaccurate but they will have wide confidence intervals.

At each iteration, we need to update the average and standard deviation. Fortunately, there are easy incremental algorithms for both, so the historical data does not need to be saved. This article discusses various algorithms for calculating variance, and hence standard deviation.
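The best-known incremental scheme is Welford's algorithm, which updates the mean and the sum of squared deviations one payout at a time. A minimal sketch (the class name is my own):

```python
import math

class RunningStats:
    """Incrementally track mean and variance (Welford's algorithm),
    so historical payouts never need to be stored."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance; needs at least two observations."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def stderr(self):
        """Standard error of the estimated mean."""
        return math.sqrt(self.variance() / self.n) if self.n > 1 else float("inf")
```

Each campaign keeps one `RunningStats` object, updated with every observed payout.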

The question is: if we have multiple averages and standard errors, how do we choose the appropriate campaign at each step? We can run a fast simulation to get the best campaign. For each campaign, generate a random number based on the estimated average and standard error. Choose the campaign that has the largest number.
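This per-step simulation can be sketched in a few lines: draw one value from each campaign's estimated normal distribution and play the campaign with the largest draw. The example estimates are made-up numbers:

```python
import random

def choose_campaign(estimates, rng=random):
    """Draw one value from Normal(mean, stderr) for each campaign
    and return the index of the largest draw."""
    draws = [rng.gauss(mean, stderr) for mean, stderr in estimates]
    return max(range(len(draws)), key=lambda i: draws[i])

# (estimated mean payout, standard error) per campaign -- illustrative values
estimates = [(0.50, 0.05), (0.40, 0.05), (0.55, 0.05)]
```

Because the intervals overlap, the second-best campaign still wins a share of the draws, which is exactly the exploration we want early on.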

What happens over time is that the campaign with the best payout should become more and more confident, as well as having the highest average. Its confidence interval will shift away from the others, further increasing the odds of that campaign being chosen. This is a positive feedback mechanism. Note that I am using the term "confidence interval" as an aid to visualizing what is happening; this method is not actually using any p-values generated from the confidence interval.

One nice feature about this method is that it can adapt to the chosen solution getting worse. If so, the average will decrease (but not the standard error) and other campaigns might be chosen. Getting this to work involves a bit more effort, because you probably want to keep the sample size fixed -- otherwise the learning rate would be too small.
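One way to keep the effective sample size fixed, as suggested above, is to weight observations exponentially so old payouts fade out. This is my own sketch, not part of the original method; the decay rate `alpha` is an illustrative choice:

```python
class DecayingStats:
    """Exponentially weighted mean and variance: old payouts fade out,
    keeping the effective sample size at roughly 1/alpha."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.mean = None
        self.var = 0.0

    def update(self, x):
        if self.mean is None:  # first observation seeds the mean
            self.mean = x
            return
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
```

If a campaign's payout drifts downward, the weighted mean follows it within roughly 1/alpha observations instead of being anchored by the entire history.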

A note about distributions. This solution depends on the distribution of the *sample average*, not the distribution of the original payout. The *sample average* should (in the limit) have a normal distribution, characterized by the average and standard error. This is not a statement about the original data distribution, only about the average. And, in the end, we want to choose the campaign that has the best average. This is handy, because the three examples that I gave earlier are very different: one has a constant (but low) payout and the other two are biased toward zero payouts.

I do believe that this method will produce reasonable results in practice. However, it does bring up subtle issues about how the underlying distributions of the payouts affect the averages. On the surface, it seems pretty sound, and it should work pretty well in practice.
