Tuesday, August 26, 2014

An Achievement on Stack Overflow

Last Friday (August 22nd), I achieved a milestone on Stack Overflow, a computing question-and-answer site, by hitting 200,000 points.  So far this year, I have also earned more points than anyone, a testament to my obsession with the site (you can see yearly user rankings here).  As for overall points, I will never match the leader, Jon Skeet, who had a head start of many years.

My answers are almost exclusively related to SQL and various databases.   The site is highly geared toward "tools" questions, so there are few general analysis questions.

So, this post shares some of my thoughts and my history on the site.

Clearly, I have helped a lot of people on Stack Overflow, around the world.  The most rewarding part is the thank-yous from real people working on real problems.  On several occasions, I have helped speed up code by more than 99% -- turning hours of drudgery into seconds or minutes of response time.

But answering questions has helped me too:
  • My technical knowledge of databases has greatly improved, particularly the peculiarities (strengths and weaknesses) of each database engine.
  • I have learned patience for people who are confused by concepts in SQL.
  • I have (hopefully) learned how to explain concepts to people with different levels of competence.
  • I have learned how to figure out (sometimes) what someone is really asking.
  • I have a strong appreciation for what SQL can do and what SQL cannot do.
  • It has definitely increased the number of hits when I egosurf.
A few months after starting, I stopped down voting questions and answers.  "Down voting" is usually seen as "not-nice", making the other person defensive, confused, and perhaps angry.   A lesson for real life:  being mean (by down voting) is not nearly so useful as offering constructive criticism (by commenting).


This all started in January, 2012 (a bit over two and a half years ago).  The reason was simple:  I was writing a system called The Netting Machine for the Lehman Brothers Estate and it was stretching my knowledge of SQL Server.  One particular problem involved dynamic queries.  Google kept directing me to the same question on Stack Overflow.  The best answer there was close to what I needed, but not quite -- it was only half-way there.  The third time I landed on the page, I added my own answer.  This was actually for quite selfish reasons:  the next time Google took me there, I wanted to see the full answer.

Lo and behold, my answer was accepted and up voted.  When the OP ("original poster" -- Stack Overflow lingo for the person asking the question) accepted my answer, s/he had to unaccept another.  That answer was by Aaron Bertrand, a SQL Server guru whose name I recognized from his instructive blog posts.  Aaron commented about the "unaccept".  In the back of my mind, I thought:  if Aaron thinks this is important, then there must be something to it.  Ultimately, I can blame Aaron (whom I have not yet met in person) for getting me hooked.

For a few months, I sporadically answered questions.  Then, in the first week of May, my Mom's younger brother passed away.  That meant lots of time hanging around family, planning the funeral, and the like.  Answering questions on Stack Overflow turned out to be a good way to get away from things.  So, I became more serious about it.

Stack Overflow draws you in not only with points but with badges and privileges.  Each time I logged in, the system "thanked" me for my participation with more points, more badges, and more privileges.   This continued.  One day (probably in June), I hit the daily reputation cap of 200 points from upvotes (you also get points when an answer is accepted or someone offers a bounty, and those don't count against the cap).  One week, I hit 1,000 points.  One month, 5,000 points.  As an individual who is mesmerized by numbers, I noticed these things.

Last summer, I hit 100,000 points in September and slowed down.  I figured that six figures was enough, and I had other, more interesting things to do -- a trip to Myanmar, my sister's wedding, our apartment in Miami, classes to teach (San Francisco, Amsterdam, Chicago, San Antonio, Orlando) and so on.

I didn't start 2014 with the intention of spending too much of my time on the site.  But three things happened in January.  The first was a fever with a rash on my face.  It kept me home with not-enough to do.  So, I answered questions on Stack Overflow.  Then, I had an attack of gout.  That kept me home with not-enough to do.  And finally, the weather in January in New York was, well, wintery -- lots of cold and lots of snow.  More reasons to stay home and answer questions.

By the end of January, I was the top scorer for the month.  "Hey, if I can do it in January, let's see what happens in February."  I had thought I might relent in April:   we flew to Greece and spent two nights on Mount Athos.  Mount Athos is a peninsula in northern Greece, devoted to twenty Orthodox monasteries -- and nothing else.  It is inhabited by a few thousand monks living medieval lifestyles.  The only way to visit is as a "pilgrim", staying at a monastery.  An incredible experience.  No internet.  But, I was able to make up the point deficit on Stack Overflow.

This year, each month that passes is another month where I seem to be the top point-gatherer on Stack Overflow.  At this point, I might as well make it to the end of the year.  I don't know if I will, but I do hope to help a few other people and to learn more about databases and how people are using them.




Tuesday, August 19, 2014

Gordon Linoff and Michael Berry Interviewed for JMP's Analytically Speaking Series

On August 13th, Anne Milley of JMP interviewed us for the webcast series, Analytically Speaking. The full program is available for viewing here. We discussed a range of topics including the upcoming supplement to Data Mining Techniques (3rd Edition). The supplement will contain hands-on exercises to accompany the book.

Saturday, May 31, 2014

Loading IP Test Data Into Postgres

Recently, I was trolling around the internet looking for some IP address data to play with.  Fortunately, I stumbled across MaxMind's Geolite Database, which is available for free.    All I have to do is include this notice:

This product includes GeoLite2 data created by MaxMind, available from http://www.maxmind.com.

That's easy enough.  The next question is how to get this into my local Postgres database.  A bit over a year ago, I very happily gave up on my Dell computer and opted for a Mac (obviously a matter of personal preference).  One downside to a Mac is that SQL Server doesn't run on it.  Happily, Postgres does, and it is extremely easy to install by going to postgresapp.com.   An interface similar enough to SQL Server Management Studio (called pgadmin3) is almost as easy to install by going here.

So, the next problem is getting the MaxMind data into Postgres.  Getting the two tables into Postgres is easy, using the copy command.  The challenge is IPV6 versus IPV4 addresses.  The data is in IPV6 format, with a subnet mask to represent ranges.  Most of us who work with IP addresses are familiar with IPV4 addresses.  These are 32 bits and look something like this:  173.194.121.17 (this happens to be an address for www.google.com, obtained by running ping www.google.com in a terminal window).  Alas, the data from MaxMind uses IPV6 values rather than IPV4.

In IPV6, the above would look like: ::ffff:173.194.121.17 (to be honest, this is a hybrid format for representing IPV4 addresses in IPV6 address space).  And the situation is a bit worse, because these records contain address ranges.  So the address range is really:  ::ffff:173.194.0.0/112.

The "112" is called a subnet mask.  And, IPV4 also uses them.  In IPV4, they represent the initial range of digits, so they range from 1-32, with a number like "24" being very common.  "112" doesn't make sense in a 32-bit addressing scheme.   To fast forward, the "112" subnet mask for IPV6 corresponds to 16 in the IPV4 world.  This means that the first 16 bits are for the main network and the last 16 bits are for the subnet.  That is, the addresses range from 173.194.0.0 to 173.194.255.255.  The relationship between the subnet mask for IPV6 and IPV4 is easy to express:  the IPV4 subnet mask is the IPV6 subnet mask minus 96.

I have to credit this blog for helping me understand this, even though it doesn't give the exact formula.  Here, I am going to shamelessly reproduce a figure from that blog (along with its original attribution):
[Figure: layout of an IPV6 address -- courtesy of ls-a.org]

This figure says the following.  The last 64 bits are new to IPV6, so they can be automatically subtracted out of the subnet mask.  Then, the first 32 bits also seem to be new, so they can also be subtracted out.  That totals 96 bits in the new version that are not in the old version.  To be honest, I am not 100% positive about my interpretation.  But it does seem to work.  Google does indeed own exactly this address range.
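
To make the arithmetic concrete, here is a small one-off query in Postgres (this is just my own sanity check, not part of the MaxMind load).  The substr(..., 8) strips the seven-character '::ffff:' prefix, and the mask becomes 112 - 96 = 16:

with block as (
    select (substr('::ffff:173.194.0.0', 8) || '/' || (112 - 96))::inet as ipv4_network
)
select ipv4_network,                                                         -- 173.194.0.0/16
       host(ipv4_network) as range_start,                                    -- 173.194.0.0
       host(host(ipv4_network)::inet | hostmask(ipv4_network)) as range_end  -- 173.194.255.255
from block;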

The Postgres code for creating and loading the tables then goes as follows:

create table ipcity_staging (
    network_start_ip varchar(255),
    network_mask_length int,
    geoname_id int,
    registered_country_geoname_id int,
    represented_country_geoname_id int,
    postal_code varchar(255),
    latitude decimal(15, 10),
    longitude decimal(15, 10),
    is_anonymous_proxy int,
    is_satellite_provider int
);

copy public.ipcity_staging
    from '...data/MaxMind IP/GeoLite2-City-CSV_20140401/GeoLite2-City-Blocks.csv'
    with CSV HEADER;

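-- Final lookup table: one row per network block, with the range endpoints
-- stored both as dotted-quad strings and as integers.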
create table ipcity (
    IPCityId serial not null,
    IPStart int not null,
    IPEnd int not null,
    IPStartStr varchar(255) not null,
    IPEndStr varchar(255) not null,
    GeoNameId int,
    GeoNameId_RegisteredCountry int,
    GeoNameId_RepresentedCountry int,
    PostalCode varchar(255),
    Latitude decimal(15, 10),
    Longitude decimal(15, 10),
    IsAnonymousProxy int,
    IsSatelliteProvider int,
    unique (IPStart, IPEnd),
    unique (IPStartStr, IPEndStr)
);

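-- Populate ipcity from the staging table, keeping only the IPV4-mapped ('::ffff:') rows.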
insert into ipcity(IPStart, IPEnd, IPStartStr, IPEndStr, GeoNameId, GeoNameId_RegisteredCountry, GeoNameId_RepresentedCountry,
                   PostalCode, Latitude, Longitude, IsAnonymousProxy, IsSatelliteProvider
                  ) 
    select IPStart, IPEnd, IPStartStr, IPEndStr, GeoName_Id, registered_country_geoname_id, represented_country_geoname_id,
           Postal_Code, Latitude, Longitude, Is_Anonymous_Proxy, Is_Satellite_Provider
    from (select -- convert the dotted-quad start/end strings into integer keys
                ((split_part(IPStartStr, '.', 1)::int << 24) +
                 (split_part(IPStartStr, '.', 2)::int << 16) +
                 (split_part(IPStartStr, '.', 3)::int << 8) +
                 (split_part(IPStartStr, '.', 4)::int)
                ) as IPStart,
                ((split_part(IPEndStr, '.', 1)::int << 24) +
                 (split_part(IPEndStr, '.', 2)::int << 16) +
                 (split_part(IPEndStr, '.', 3)::int << 8) +
                 (split_part(IPEndStr, '.', 4)::int)
                ) as IPEnd,
                st.*
          from (select st.*,
                       host(inet (substr(network_start_ip, 8) || '/' || network_mask_length - 96)) as IPStartStr,
                       host(inet(host(inet (substr(network_start_ip, 8) || '/' || network_mask_length - 96) )) |
                            hostmask(inet (substr(network_start_ip, 8) || '/' || network_mask_length - 96))
                           ) as IPEndStr
                from ipcity_staging st 
                where network_start_ip like '::ffff:%'
               ) st
         ) st;



Sunday, May 18, 2014

Armed Bandits: A Statistical Approach

This is a continuation of my previous post on multi-armed bandits.  And, I'm guessing there will be at least one more after this.

The Multi-Armed Bandit problem is a seemingly simple problem.  A gambler is faced with a row of slot machines, each of which pays out at a different rate.  S/he needs to devise a strategy to find the winningest slot machine as quickly as possible and then just play that one.

Most of the strategies for doing this are based on a greedy-algorithm approach.  They are some variation on:  randomly (or round-robin) choose slot machines until some threshold has been reached.  Then continue playing the winningest one.  These actually work pretty well.  But I am interested in applying basic statistics to this.

Before doing that, let me explain why I am interested.  Imagine that I have a web site and I have an ad space to fill.  Here are different things I might put there:

  • A run-of-network ad that will make some amount of money per impression.
  • A click-through ad that will make some amount of money if someone clicks on it.
  • A partner ad that will make some amount of money if someone signs up for something.
The Multi-Armed Bandit provides an automated means of testing all three of these at once, along with variations that may, or may not, prove better than business-as-usual.   I think of it as automated champion-challenger models.

Here is a "statistical" approach to this problem.  Let me assume that there are N campaigns being run.   Each campaign has a payout distribution.  I can calculate the average payout for each campaign.  In the end, I want to choose the campaign that has the largest average payout.  Note that I'm make assumptions here that the the campaigns perform consistently across time and across the visitor population.  Those are other issues I discussed earlier.  Let's focus on the basic problem here.


By the Central Limit Theorem, we know that we can estimate the average based on a sample of data.  This estimate of the average has its own average and standard deviation (the standard error), which (once there are enough samples) gets smaller and smaller, meaning that the estimate of the average gets better and better.
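
To put a formula behind "smaller and smaller":  if a campaign has $n$ observed payouts with sample mean $\bar{x}$ and sample standard deviation $s$, the spread of the estimate is the standard error,

$$\mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}},$$

which shrinks like $1/\sqrt{n}$ as impressions accumulate.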

The idea is then simple.  At the beginning, give each campaign the same estimate with a wide confidence interval.  The intervals all overlap completely, so the choice of best campaign is random.  Initially, we might want to round-robin the data to get some initial values.  Relatively quickly, though, we should get estimates for each of the campaigns; these will be inaccurate, but their wide confidence intervals will reflect that.

At each iteration, we need to update the average and standard deviation.  Fortunately, there are easy incremental algorithms for both, so all the historical data does not need to be saved.  This article discusses various algorithms for calculating variance, and hence standard deviation.
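
One of the incremental schemes described there (Welford's method) keeps just three numbers per campaign -- the count $n$, the running mean $\bar{x}_n$, and a running sum of squared deviations $M_n$ -- and updates them with each new payout $x_n$:

$$\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}, \qquad M_n = M_{n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n), \qquad s_n^2 = \frac{M_n}{n-1}.$$

The standard error needed below is then $s_n / \sqrt{n}$.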

The question is:  if we have multiple averages and standard errors, how do we choose the appropriate campaign at each step?  We can run a fast simulation to get the best campaign.  For each campaign, generate a random number based on the estimated average and standard error.  Choose the campaign that has the largest number.
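
Here is a minimal sketch of that selection step in Postgres, since that is where the data usually lives for me.  The campaign_stats table (campaign_id, avg_payout, stderr_payout) is hypothetical -- it would be maintained with the incremental updates above -- and the normal draw is simulated from random() with the Box-Muller transform:

-- Draw one value per campaign from Normal(avg_payout, stderr_payout),
-- then serve the campaign with the largest draw.
select campaign_id
from (select campaign_id,
             avg_payout + stderr_payout *
                 sqrt(-2 * ln(1 - random())) * cos(2 * pi() * random()) as draw
      from campaign_stats
     ) c
order by draw desc
limit 1;

Because the draw is repeated on every impression, a campaign with a wide confidence interval still wins some of the time, which is exactly the exploration behavior described above.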

What happens over time is that the campaign with the best payout should become more and more confident, as well as having the highest average.   Its confidence interval will shift away from the others, further increasing the odds of that campaign being chosen.  This is a positive feedback mechanism.  Note that I am using the term "confidence interval" as an aid to visualizing what is happening; this method is not actually using any p-values generated from the confidence interval.

One nice feature about this method is that it can adapt to the chosen solution getting worse.  If so, the average will decrease (but not the standard error) and other campaigns might be chosen.  Getting this to work involves a bit more effort, because you probably want to keep the sample size fixed -- otherwise the learning rate would be too small.

A note about distributions.  This solution depends on the distribution of the sample average, not the distribution of the original payout.  The sample average should (in the limit) have a normal distribution, characterized by the average and standard error.  This is not a statement about the original data distribution, only about the average.  And, in the end, we want to choose the campaign that has the best average.  This is handy, because the three examples that I gave earlier are very different.  One has a constant (but low) payout and the other two are biased toward zero payouts.

I do believe that this method will produce reasonable results in practice.  However, it does bring up subtle issues about how the underlying distributions of the payouts affect the averages.  On the surface, it seems pretty sound, and it should work pretty well in practice.


Wednesday, April 30, 2014

Multi-Armed Bandits and Testing Online Marketing Campaigns

What is the connection between testing marketing campaigns and Las Vegas?

The Multi-Armed Bandit problem is a statistical problem that seeks to optimize a gambler playing multiple slot machines.  Each slot machine has an unknown probability of paying out, and the goal is to maximize the winnings over time.  Slot machines, by the way, are sometimes called "one-armed bandits", because -- on average -- you leave them with less money than you arrived with.

What is appealing about the Multi-Armed Bandit problem is that it is solvable.  The basic "learn-and-proceed" algorithm works quite well.  That is, for a certain period of time, play all the machines.  Then continue to play the winningest machine.  Variations on this approach are also optimal, under a wide range of assumptions.

This is very appealing.  Imagine that you have a web site that makes money -- it actually doesn't matter whether that is through selling ads, selling merchandise, or getting subscribers.  You want to optimize the web site, and you can -- literally -- think of millions of different combinations of things.  You can change the layout of the home page, you can change the language, the colors, the images, and so on.  What works best?

The Multi-Armed Bandit solution would seem to provide a path to site optimization.  Start with a few variations.  Treat each variation as a separate "slot machine".  Each time the page is about to be loaded, peer into the historical data on how well each version has done.  Then choose the right version accordingly.  Keep track of how good the page is.  And, one of the optimization algorithms for the Multi-Armed Bandit problem can not only determine which page to load but guarantee optimal outcomes over time.  Truly impressive.

The problem is when this hits the real world.  In the real world, the slot machines are not all equal.  And, the probabilities may be more complicated.  For instance, the message "Visit us while you sip your morning cup of coffee" might work better in the morning than in the afternoon (I'm making this up).  Or the message, "Escape from the winter blues" might not work as well in Florida in January as in Minnesota.

So, this is the first problem.  In the web world, we do know something about even anonymous visitors.  We have a good idea of the local time of day where they are.  We have a good idea of their "gross" geography.  We have a good idea of the device and web browser they are using.  What works in one place, at one time, on one device may not work in another.

This makes the testing problem much harder.  If we know a handful of categories in advance, we can run separate multi-armed bandit problems for each.  For instance:

  • US-based, web-based, weekday working hours.
  • US-based, mobile, weekday working hours.
And so on, through the 12 combinations suggested by these groups.

When the groups are not predetermined, it is a harder problem to solve.  This is often approached using off-line analysis, if there is adequate test data.

Another issue is that some messages may be targeted at a specific population.  For instance, some IP addresses are registered to .EDU domains, indicating educational institutions.  How do you handle market testing for these groups?

If there is only one such group, then you can set up separate Multi-Armed Bandit-type tests for EDU and for non-EDU.  But this might get more complicated.  For instance, some messages might be targeted for web versus mobile.  And now you have overlap -- because EDU users are in both groups.

Yet another problem is that determining the outcome may take time.  If the outcome is as simple as clicking on an ad, then that is known within a few minutes.  If the outcome is purchasing a particular item, well, not everyone puts something in a cart and immediately checks out.  You might want a day lag.  If the outcome is a longer-term measurement (did site engagement go up? time to next visit?), then even longer periods are needed.  This can result in a very slow feedback loop for the Multi-Armed Bandit.  After all, a slot machine returns its winnings (or losses) very quickly.

What is the real solution to such adaptive marketing online?  It requires a comprehensive approach.  This approach has to understand "natural" divisions online (such as organization type) and the fact that a given message may have differential effects in different areas.  It has to take into account that some messages are targeted to specific visitors or customers (if they can be known), so pure randomization may not be possible.  And it has to understand the time dimension of the business, hopefully incorporating faster measures in the short term, even if these are imperfect, along with validation of the measures in the longer term.

The Multi-Armed Bandit is a powerful idea.  Unfortunately, the solution to this problem is only one part of an online marketing solution.






Tuesday, April 8, 2014

Why I don't recommend MS Access

Recently, I was asked:
Why do you not recommend Access to use? Just curious. Read on page xi of your intro in Data Analysis Using SQL and Excel. 
Just beginning a class in SQL and bought your text. 
Thanks, Mort
This is a very fair question and one that I try to avoid answering.  Preferences on software can unleash religious passions.  Access is a powerful environment for doing development work in a Microsoft environment, and it does support a version of SQL.

All databases deviate from ANSI-standard SQL, but MS Access deviates significantly (in my opinion) both from the standard and from other databases.  From the functionality perspective, it lacks stored procedures, functions, and triggers.  The "database" itself is a file engine and doesn't have built-in transaction support or ACID-compliance, although the Jet/ACE engine (typically used) does have this functionality.

From the perspective of learning SQL, I simply find that the SQL that Access supports is "too different" for my liking.  Here are ten important differences:
  • Access does not support the case statement, so conditional logic is done with the non-standard iif() function (see the short example after this list).
  • Access requires parentheses around each pair-wise join, resulting in a proliferation of nesting in from clauses that only serves to confuse people learning SQL.
  • Access join syntax requires the inner for inner join.  While it may be a good idea to use inner for clarity, it is often omitted in practice (in other databases).
  • Access does not support full outer join.
  • Access does not allow union or union all in subqueries.
  • Access requires the as for table aliases.  In most databases, this is optional, and I prefer to only use as for column aliases.  Ironically, the use of as for table aliases is forbidden in Oracle.
  • Access uses double quotes to delimit strings (as opposed to single quotes) and is the only database (to my knowledge) that uses & as a string concatenation operator.
  • Access uses * for the wildcard in like rather than %.
  • Access allows BETWEEN AND .  This is allowed in other databases, but will always evaluate to false.
  • Access does not support window/analytic functions (using the over and partition by clauses).
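To give a flavor of the first difference, here is the same piece of conditional logic written both ways, using a made-up orders table (the Access version is from memory, so treat it as a sketch):

-- Standard SQL (SQL Server, Postgres, Oracle, MySQL):
select orderid,
       (case when amount >= 100 then 'big' else 'small' end) as order_size
from orders;

-- MS Access: no case expression, so iif() (with double-quoted strings) instead:
SELECT orderid, IIf(amount >= 100, "big", "small") AS order_size
FROM orders;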
In addition, many people seem to want to use the wizards to generate SQL and learn from that.  I find the output of these wizards to be an unreadable mess of parentheses and square brackets, and an impediment to learning to write good SQL.

In short, Access is a useful tool and there are many examples of it working well in practice.  However, the Access query language is sufficiently different from other versions of SQL that I would recommend another free query engine instead.  There are definitely other choices such as SQL Server Express, Postgres, Oracle Express, and MySQL.  Data Analysis Using SQL and Excel uses SQL Server syntax for its examples.

Sunday, March 30, 2014

Doing the Right Thing: Are your measures correct?

"A lot of good analysis is wasted doing the wrong thing."
Anyone who has worked with data on business problems is probably aware of this adage.  And this past week, I was reminded once again of this fact while analyzing a marketing program.  This example is so striking because the difference between doing the "right" thing and the "almost-right" thing ended up being more than a factor of 10 -- a really big variance on a financial calculation.

Some background.  One of my clients does a lot of prospecting on the web.  They have various campaigns to increase leads to their web site.  These campaigns cost money.  Is it worth it to invest in a particular program?

This seems easy enough to answer, assuming the incoming leads are coded with their source (and they seem to be).  Just look at the leads coming in.  Compare them to the customers who sign up.  And the rest, as they say, is just arithmetic.

Let's say that a customer who signs up on the web has an estimated value of $300.  And, we can all agree on this number because it is the Finance Number.  No need to argue with that.

The first estimate for the number of leads brought in was around 160, produced by the Business Intelligence Group.  With an estimated value of $300, the pilot program was generating long term revenue of $48,000 -- much more than the cost of the program.  No brainer here.  The program worked! Expand the program!  Promote the manager!

The second estimate for the number of leads brought in was 12.  With an estimated value of $300, the pilot was generating $3,600 in long term revenue -- way less than the cost of the program.  Well, we might as well burn the cash and roast marshmallows over the flame.  No promotion here.  Know any good recruiters?

Both these estimates used the same data sources.  The difference was in the understanding of how the "visitor experience" is represented in the data.

For instance, a visitor has come to the site 300 times in the past.  The 301st visit was through the new marketing program.  Then two weeks later, on the 320th visit, magic happens and the visitor becomes a customer.  Is the lead responsible for the acquisition?  This problem is called channel attribution.  If the customer had signed up when s/he first clicked through as a lead, then yes, you could attribute all or most of the value to that marketing program.  But two weeks and 20 visits later?  Not likely.  The visitor was already interested.

A more serious problem occurs through the complexities of web visits.  If a visitor is not logged in, there is no perfect way to track him or her (or "it" if it were a dog).  Of course, this company uses cookies and browser caches and tries really, really hard to keep track of visitors over time.  But the visitor cannot be identified as a customer until s/he has logged in.  So, I may be a real customer, but happen to be trying out a new browser on my machine.  Or, I visit from an airport lounge and don't log in.  Or some other anonymous visit.  This seems like a bona fide lead when arriving through the marketing program.

And then . . .  the visitor keeps using the new browser (or whatever).  And then later, s/he decides to log in.  At that point, the visitor is identified as a customer.  And, more importantly, the VisitorId associated with that visitor is now tied to a customer.  But that doesn't mean that the lead created the customer.  The act of logging in merely identified an existing customer.

Guess what?  This happened more times than you might imagine.  In many, many cases, the 160 "customers" generated by the leads had been customers for months or years prior to this marketing campaign.  It doesn't make sense to attribute their value to the campaign.

The moral of this story:  it is important to understand the data and more importantly, to understand what the data is telling you about the real world.  Sometimes in our eagerness to get answers we might miss very important details.

As a final note, we found the problem through a very simple request.  Instead of just believing the number 160 in the report generated by the Business Intelligence Group, we insisted on the list of leads and account numbers created by the program.  With the list in-hand, the problems were fairly obvious.