Tuesday, April 8, 2014

Why I don't recommend MS Access

Recently, I was asked:
Why do you not recommend Access to use? Just curious. Read on page xi of your intro in Data Analysis Using SQL and Excel. 
Just beginning a class in SQL and bought your text. 
Thanks, Mort
This is a very fair question and one that I try to avoid answering.  Preferences on software can unleash religious passions.  Access is a powerful environment for doing development work in a Microsoft environment, and it does support a version of SQL.

All databases deviate from ANSI-standard SQL, but MS Access deviates significantly (in my opinion) both from the standard and from other databases.  From the functionality perspective, it lacks stored procedures, functions, and triggers.  The "database" itself is a file engine and doesn't have built-in transaction support or ACID-compliance, although the Jet/ACE engine (typically used) does have this functionality.

From the perspective of learning SQL, I simply find that the SQL that Access supports is "too different" for my liking.  Here are ten important differences:
  • Access does not support the case statement, so conditional logic is done with the non-standard iif() function.
  • Access requires parentheses around each pair-wise join, resulting in a proliferation of nesting in from clauses that only serves to confuse people learning SQL.
  • Access join syntax requires the inner for inner join.  While it may be a good idea to use inner for clarity, it is often omitted in practice (in other databases).
  • Access does not support full outer join.
  • Access does not allow union or union all in subqueries.
  • Access requires the as for table aliases.  In most databases, this is optional, and I prefer to only use as for column aliases.  Ironically, the use of as for table aliases is forbidden in Oracle.
  • Access uses double quotes to delimit strings (as opposed to single quotes) and is the only database (to my knowledge) that uses & as a string concatenation operator.
  • Access uses * for the wildcard in like rather than %.
  • Access allows BETWEEN high AND low (that is, with the larger bound first).  This is allowed in other databases, but will always evaluate to false.
  • Access does not support window/analytic functions (using the over and partition by clauses).
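For comparison, here is a minimal sketch of the more standard forms that several of the bullets above refer to. It runs against SQLite through Python's built-in sqlite3 module, which is an assumption made only to keep the example self-contained; Access itself is not involved, and the column aliases are made up for illustration.

```python
import sqlite3

# In-memory SQLite database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

row = conn.execute("""
    select case when 5 > 3 then 'big' else 'small' end as size_label,  -- case instead of iif()
           'abc' || 'def' as joined,                 -- single quotes, || for concatenation
           'database' like 'data%' as starts_with,   -- % as the like wildcard
           5 between 1 and 10 as in_range,           -- smaller bound first
           5 between 10 and 1 as reversed_range      -- larger bound first: false
""").fetchone()

print(row)
```

SQLite reports booleans as 1 and 0, so the last two columns show that between only matches when the smaller bound comes first.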
In addition, many people seem to want to use the wizards to generate SQL and learn from that.  I find the output of the wizard to be an unreadable mess of parentheses and square brackets and an impediment to learning to write good SQL.

In short, Access is a useful tool and there are many examples of it working well in practice.  However, the Access query language is sufficiently different from other versions of SQL that I would recommend another free query engine instead.  There are definitely other choices such as SQL Server Express, Postgres, Oracle Express, and MySQL.  Data Analysis Using SQL and Excel uses SQL Server syntax for its examples.

Sunday, March 30, 2014

Doing the Right Thing: Are your measures correct?

"A lot of good analysis is wasted doing the wrong thing."
Anyone who has worked with data on business problems is probably aware of this adage.  And this past week, I was reminded once again of this fact while analyzing a marketing program.  This example is so striking because the difference between doing the "right" thing and the "almost-right" thing ended up being more than a factor of 10 -- a really big variance on a financial calculation.

Some background.  One of my clients does a lot of prospecting on the web.  They have various campaigns to increase leads to their web site.  These campaigns cost money.  Is it worth it to invest in a particular program?

This seems easy enough to answer, assuming the incoming leads are coded with their source (and they seem to be).  Just look at the leads coming in.  Compare them to the customers who sign up.  And the rest, as they say, is just arithmetic.

Let's say that a customer who signs up on the web has an estimated value of $300.  And, we can all agree on this number because it is the Finance Number.  No need to argue with that.

The first estimate for the number of leads brought in was around 160, produced by the Business Intelligence Group.  With an estimated value of $300, the pilot program was generating long term revenue of $48,000 -- much more than the cost of the program.  No brainer here.  The program worked! Expand the program!  Promote the manager!

The second estimate for the number of leads brought in was 12.  With an estimated value of $300, the pilot was generating $3,600 in long term revenue -- way less than the cost of the program.  Well, we might as well burn the cash and roast marshmallows over the flame.  No promotion here.  Know any good recruiters?

Both these estimates used the same data sources.  The difference was in the understanding of how the "visitor experience" is represented in the data.

For instance, a visitor has come to the site 300 times in the past.  The 301st visit was through the new marketing program.  Then two weeks later on the 320th visit, magic happens and the visitor becomes a customer.  Is the lead responsible for the acquisition?  This problem is called channel attribution.  If the customer had signed up when s/he clicked as a lead then yes, you could attribute all or most value to that marketing program.  But two weeks and 20 visits later?  Not likely.  The lead was already interested.

A more serious problem occurs through the complexities of web visits.  If a visitor is not logged in, there is no perfect way to track him or her (or "it" if it were a dog).  Of course, this company uses cookies and browser caches and tries really, really hard to keep track of visitors over time.  But the visitor cannot be identified as a customer until s/he has logged in.  So, I may be a real customer, but happen to be trying out a new browser on my machine.  Or, I visit from an airport lounge and don't log in.  Or some other anonymous visit.  This seems like a bona fide lead when arriving through the marketing program.

And then . . .  the visitor keeps using the new browser (or whatever).  And then later, s/he decides to log in.  At that point, the visitor is identified as a customer.  And, more importantly, the VisitorId associated with the visitor is now a customer.  But that doesn't mean that the lead created the customer.  The logging in merely identified an existing customer.

Guess what?  This happened more times than you might imagine.  In many, many cases, the 160 "customers" generated by the leads had been customers for months and years prior to this marketing campaign.  It doesn't make sense to attribute their value to the campaign.
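The gap between the two estimates can be sketched on toy data. Everything in the example below is hypothetical: the Leads and Customers tables, their columns, and the five rows are invented for illustration and are not the client's schema or numbers. Only the $300 value per customer comes from the discussion above. The SQL is run through Python's built-in sqlite3 module to keep the sketch self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: one row per lead, one row per known customer.
    create table Leads (VisitorId integer, LeadDate text);
    create table Customers (VisitorId integer, SignupDate text);

    insert into Leads values
        (1, '2014-03-01'), (2, '2014-03-01'), (3, '2014-03-01'),
        (4, '2014-03-01'), (5, '2014-03-01');

    -- Visitors 1 and 3 were customers long before the campaign.
    insert into Customers values
        (1, '2012-05-10'), (2, '2014-03-01'),
        (3, '2013-11-20'), (4, '2014-03-15');
""")

# "Almost-right": any lead whose VisitorId is now a customer counts.
naive_count = conn.execute("""
    select count(*)
    from Leads l join
         Customers c
         on l.VisitorId = c.VisitorId
""").fetchone()[0]

# Closer to "right": only customers who signed up on or after the lead date.
new_count = conn.execute("""
    select count(*)
    from Leads l join
         Customers c
         on l.VisitorId = c.VisitorId
    where c.SignupDate >= l.LeadDate
""").fetchone()[0]

value_per_customer = 300  # the Finance Number from the text
naive_value = naive_count * value_per_customer
new_value = new_count * value_per_customer
print(naive_count, naive_value, new_count, new_value)
```

Even this stricter count is generous: it still credits the campaign with a signup that came two weeks after the lead, which is the channel attribution question raised earlier.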

The moral of this story:  it is important to understand the data and more importantly, to understand what the data is telling you about the real world.  Sometimes in our eagerness to get answers we might miss very important details.

As a final note, we found the problem through a very simple request.  Instead of just believing the number 160 in the report generated by the Business Intelligence Group, we insisted on the list of leads and account numbers created by the program.  With the list in hand, the problems were fairly obvious.


Tuesday, March 25, 2014

Three SQL Constructs You Can Forget About

SQL is a very powerful language, which could, of course, be made even more powerful and useful.  This post discusses three features of the language -- ANSI standard features -- that seem not only unnecessary but downright detrimental.  That is, they seem to cause much more confusion than they provide in functionality.  And, in all these cases, it would be easy to work around their absence.

Although it would be nice to remove these from the language itself, that is unlikely to happen.  However, they can be de-prioritized for people learning SQL.  These constructs are easy to work around and are less functional than their alternatives.  When learning SQL, these should be learned later in the process.

(1)  INSERT . . . VALUES()

The first construct is the use of VALUES with INSERT, as in:

insert into t(col1)
    values(1);

In almost every database, this is easily replaced with:

insert into t(col1)
    select 1;
In some databases, you might have to add a from dual or from sys.dummy to make this work.
And, in every respect except one, the INSERT . . . SELECT method is better.   For instance, you can add a WHERE clause to be sure that the value doesn't already exist:
insert into t(col1)
    select 1
    where not exists (select 1 from t2 where t2.col1 = 1);

Or, you can readily add other values, from this or another table:
insert into t(col1, col2)
    select 1, (select count(*) from t2);
Trying to fit this into a VALUES statement just causes syntax errors and confusion.

And, you can use UNION ALL to add multiple rows at the same time.
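Here is a minimal sketch of these INSERT . . . SELECT variations, run on SQLite through Python's built-in sqlite3 module (an assumption made so the example is self-contained). The tables t and t2 follow the snippets above; their column types and the sample values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (col1 integer, col2 integer)")
conn.execute("create table t2 (col1 integer)")

# UNION ALL inserts several rows in one statement.
conn.execute("""
    insert into t2(col1)
        select 10 union all
        select 20 union all
        select 30
""")

# Each row pulls a value from another table via a scalar subquery.
conn.execute("""
    insert into t(col1, col2)
        select 1, (select count(*) from t2) union all
        select 2, (select count(*) from t2)
""")

# Insert only when the value is not already present.
insert_if_missing = """
    insert into t(col1, col2)
        select ?, 0
        where not exists (select 1 from t where t.col1 = ?)
"""
for value in (1, 3):  # 1 already exists, 3 does not
    conn.execute(insert_if_missing, (value, value))

rows = conn.execute("select col1, col2 from t order by col1").fetchall()
print(rows)
```

SQLite accepts a SELECT with no FROM clause; as noted above, some other databases need a from dual or similar.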

The VALUES statement has exactly one advantage and that is the fact that it is standard.  The same code will work in multiple databases.  That seems very minor.  It would be better if the standard had a way of using SELECT to return a row without a table.


(2)  SELECT DISTINCT

The next unnecessary construct is SELECT DISTINCT.   First, this is easily replaced with GROUP BY.  So:
select distinct a, b, c
from t;
is the same as:
select a, b, c
from t
group by a, b, c;

What makes the GROUP BY better?   Primarily the fact that you can have a HAVING clause.

The first problem is that SELECT DISTINCT is sometimes understood to mean "Get me all the rows that are distinct" rather than "Get me the distinct values from all the rows."  Actually, that first interpretation makes a lot of sense, even if it is wrong.  Not only is there no danger of confusion with the GROUP BY, but including HAVING COUNT(*) = 1 actually solves the first problem.    There is no way to do that with SELECT DISTINCT.
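The two readings are easy to tell apart on a tiny table. The sketch below assumes SQLite via Python's built-in sqlite3 module, and the three rows in t are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (a integer, b integer, c integer)")
conn.execute("""
    insert into t(a, b, c)
        select 1, 1, 1 union all
        select 1, 1, 1 union all   -- exact duplicate of the first row
        select 2, 2, 2
""")

# "Get me the distinct values from all the rows."
distinct_rows = conn.execute(
    "select distinct a, b, c from t order by a, b, c").fetchall()

# The same result using GROUP BY.
grouped_rows = conn.execute(
    "select a, b, c from t group by a, b, c order by a, b, c").fetchall()

# "Get me all the rows that are distinct": rows that occur exactly once.
unique_rows = conn.execute("""
    select a, b, c
    from t
    group by a, b, c
    having count(*) = 1
    order by a, b, c
""").fetchall()

print(distinct_rows, grouped_rows, unique_rows)
```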

The second problem is perhaps more dangerous.  Have you ever seen someone write this?
select distinct(a), b, c
from t;
Here, the DISTINCT seems to be used like a function.   The intention is "Get me distinct values of a along with arbitrary values of b and c".  Of course, this is exactly the same with or without the parentheses.  DISTINCT is not a function.  This usage is so prevalent that Postgres introduced the DISTINCT ON syntax to support it.

What advantages does SELECT DISTINCT have?  The syntax is shorter and you don't have to repeat the column names in a GROUP BY clause.    In a world of cut-and-paste, copying the column to GROUP BY is negligible effort.   And, it does allow SELECT DISTINCT *.   However that is a construct that I wouldn't miss at all.


(3)  COUNT(column)

Finally, there is the COUNT aggregation function with a column as an argument.  Just to be clear, I have no problem with COUNT(DISTINCT column) or COUNT(*) or COUNT(1).

No doubt, the designers of SQL were obsessed with NULL values (and despite the obsession, they still didn't get it right).   Wouldn't everyone in the world (who uses SQL) want to count the number of non-NULL values in a column?  What else could COUNT(column) mean?

Well, in many contexts, people probably think it means COUNT(DISTINCT column).  Consider the following query:
select c.country, count(c.CustomerId), count(o.OrderId)
from Customers c join
     Orders o
     on c.CustomerId = o.CustomerId
group by c.country;

Many people might write this code, just like this, with the intention of getting the number of customers and the number of orders in each country.  How sad when they learn that these are the same!  There are no repeat purchasers anywhere.  (COUNT(DISTINCT c.CustomerId) fixes this problem.)

Without COUNT(column), such confusion would be a non-issue.

And, if you wanted to count non-NULL values?  Why not do it explicitly, so you can remember what the query is supposed to be doing:
select sum(case when a is not null then 1 else 0 end)
Yes, this takes a bit more typing but the query is much clearer on what it is doing.  It would be much shorter if all databases supported the "boolean" is an "integer" shortcut:
select sum(a is not null)
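A small sketch makes the difference between these counts concrete. It assumes SQLite via Python's built-in sqlite3 module; the Customers and Orders rows and the one-column table v are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Customers (CustomerId integer, country text);
    create table Orders (OrderId integer, CustomerId integer);
    create table v (a integer);

    insert into Customers values (1, 'US'), (2, 'US'), (3, 'FR');
    -- Customer 1 is a repeat purchaser.
    insert into Orders values (10, 1), (11, 1), (12, 2), (13, 3);
    insert into v(a) select 1 union all select null union all select 3;
""")

# count(column) counts joined rows, so it matches the order count;
# count(distinct column) is what gives the number of customers.
by_country = conn.execute("""
    select c.country,
           count(c.CustomerId) as CountCustomerId,
           count(distinct c.CustomerId) as NumCustomers,
           count(o.OrderId) as NumOrders
    from Customers c join
         Orders o
         on c.CustomerId = o.CustomerId
    group by c.country
    order by c.country
""").fetchall()

# Three ways to count non-NULL values; SQLite treats booleans as integers.
nonnull = conn.execute("""
    select count(a),
           sum(case when a is not null then 1 else 0 end),
           sum(a is not null)
    from v
""").fetchone()

print(by_country, nonnull)
```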

(4) ,

What is a list of three things without a fourth to cap it off?  Just don't use a comma in the FROM clause.  Explicit join syntax is more expressive and clearer in every case.  The , can be replaced by CROSS JOIN.

Thursday, March 20, 2014

Big Data and SQL

I happen to think that SQL is a very viable option for analyzing big data.  I was thinking about this when I read a book review recently:
For instance, Siegel reports, people who buy small felt pads that adhere to the bottom of chair legs (to protect the floor) are more likely than others to be good credit risks.
For some people, results like this conjure up magic.  PhDs in white coats bustling around, surrounded by acres of machines humming away pondering this imponderable problem (or is that the air conditioning making the noise).  In fact, something like this is readily calculated from a normal decision support database containing historical data.

So, how hard is it to write the SQL?

The place to start is to rephrase the question.  Let's ask it as:
For all products purchased by customers in 2013, what is the non-payment rate for the first three months of 2014?
Note that this is carefully phrased as a "before" and "after" problem.  Although that does not guarantee causality, it does help.

Next, assume that we have the following tables:

  • Customers
  • Orders
  • OrderProducts
  • Invoices (monthly, with a flag to indicate non-payment)
The following query gets all the products from 2013:

select op.ProductId, count(*) as NumProducts,
       count(distinct o.CustomerId) as NumCustomers
from Orders o join
     OrderProducts op
     on o.OrderId = op.OrderId
where o.OrderDate >= '2013-01-01' and
      o.OrderDate < '2014-01-01'
group by op.ProductId;


The following gets all customers who didn't pay in the first three months of 2014.  This might look something like:

select i.CustomerId
from Invoices i
where i.InvoiceDate >= '2014-01-01' and
      i.InvoiceDate < '2014-04-01' and
      i.NotPaid = 1;

These can then easily be combined to get a list of products, by the proportion of customers who did not pay:

select pc.ProductId, count(*) as NumCustomers,
       count(np.CustomerId) as numNotPaid,
       count(np.CustomerId)*1.0 / count(*) as NonPayRate
from (select op.ProductId, o.CustomerId
      from Orders o join
           OrderProducts op
           on o.OrderId = op.OrderId
      where o.OrderDate >= '2013-01-01' and
            o.OrderDate < '2014-01-01'
      group by op.ProductId, o.CustomerId
     ) pc left outer join
     (select i.CustomerId
      from Invoices i
      where i.InvoiceDate >= '2014-01-01' and
            i.InvoiceDate < '2014-04-01' and
            i.NotPaid = 1
      group by i.CustomerId
     ) np
     on pc.CustomerId = np.CustomerId
group by pc.ProductId
order by NonPayRate desc;
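To check the logic, the combined query can be run on a handful of made-up rows. The sketch below assumes SQLite via Python's built-in sqlite3 module; the column types, product names, and data are invented. It takes CustomerId from Orders, counts non-payers from the np side of the left join, and reduces np to one row per customer so that several unpaid invoices do not inflate the counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Orders (OrderId integer, CustomerId integer, OrderDate text);
    create table OrderProducts (OrderId integer, ProductId text);
    create table Invoices (CustomerId integer, InvoiceDate text, NotPaid integer);

    insert into Orders values
        (1, 100, '2013-02-01'), (2, 101, '2013-03-01'),
        (3, 102, '2013-06-01'), (4, 100, '2013-07-01'),
        (5, 103, '2012-12-31');           -- outside 2013, ignored
    insert into OrderProducts values
        (1, 'felt pads'), (2, 'felt pads'),
        (3, 'lamp'), (4, 'lamp'), (5, 'lamp');
    insert into Invoices values
        (100, '2014-01-15', 1), (100, '2014-02-15', 1),  -- two unpaid invoices
        (101, '2014-01-15', 0),
        (102, '2014-03-01', 1),
        (103, '2014-01-15', 1);
""")

rows = conn.execute("""
    select pc.ProductId, count(*) as NumCustomers,
           count(np.CustomerId) as NumNotPaid,
           count(np.CustomerId)*1.0 / count(*) as NonPayRate
    from (select op.ProductId, o.CustomerId
          from Orders o join
               OrderProducts op
               on o.OrderId = op.OrderId
          where o.OrderDate >= '2013-01-01' and
                o.OrderDate < '2014-01-01'
          group by op.ProductId, o.CustomerId
         ) pc left outer join
         (select i.CustomerId
          from Invoices i
          where i.InvoiceDate >= '2014-01-01' and
                i.InvoiceDate < '2014-04-01' and
                i.NotPaid = 1
          group by i.CustomerId
         ) np
         on pc.CustomerId = np.CustomerId
    group by pc.ProductId
    order by NonPayRate desc
""").fetchall()

print(rows)
```

In this toy data, half of the felt pad buyers and all of the lamp buyers miss a payment, so the lamp sorts first.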

This isn't particularly complex SQL.  Instead, we can think about what is really important.  The first is being willing to ask the question.  I think a major constraint in business is that managers and executives are hesitant to ask questions.  They don't have a sense of what is "easy" to answer and what is "hard".  They also fear getting different answers from different people.

The second is the interpretation.  The statement that people who want to protect their furniture are better credit risks has a nice warm and fuzzy quality:  people who care about their belongings also care about their credit.  Perhaps other factors are at work.  People buy new furniture and want to protect it because they have access to cash or credit -- they may simply be richer than other people at least for a period of time.  Or, felt pads may only be sold in areas where people tend to own their homes, so there is a store-bias in the merchandizing.  Or, customers who buy these small items may be paying in cash and never make larger purchases that might measure credit risk.

To understand what is really happening would require further analysis.  To get started just takes asking some insightful questions.






Tuesday, March 11, 2014

Heuristics in Analytics

Last week, a book -- a real, hard-cover paper-paged book -- arrived in the mail with the title:  Heuristics in Analytics:  A Practical Perspective of What Influences Our Analytic World.  The book wasn't a total surprise, because I had read some of the drafts a few months ago.  One of the authors, Fiona McNeill, is an old friend and the other, Carlos, is a newer friend.

What impressed me about the book is its focus on the heuristic (understanding) side of analytics rather than the algorithmic or mathematical side of the subject.  Many books that attempt to avoid technical detail end up resembling political sound-bites:  any substance is as lost as the figures in a Jackson Pollock painting.  You can peel away the layers, and still nothing shows up except eventually for a blank canvas.

A key part of their approach is putting analytics in the right context.  Their case studies do a good job of explaining how the modeling process fits into the business process.  So, a case study on collections discusses different models that might be used, answering questions such as:

  • How long until someone will pay a delinquent bill?
  • How much money can likely be recovered?
This particular example goes through multiple steps around the business process, including financial calculations on how much the modeling is actually worth.  It also goes through multiple types of models, such as a segmentation model (based on Kohonen networks) and the differences -- from the business perspective -- of the different segments.  Baked into the discussion is how to use such models and how to interpret the results.

In such a fashion, the book covers most of the common data mining techniques, along with special chapters devoted to graph analysis.  This is particularly timely, because graphs are a very good way to express relationships in the real world.

I do wish that the data used for some of the examples in the book were available.  They would make some very interesting examples.


Saturday, March 1, 2014

Lines and Circles and Logistic Regression

Euclidean geometry, formalized in Euclid's Elements about 2,300 years ago, is in many ways a study of lines and circles.  One might think that after more than two millennia, we have moved beyond such basic shapes particularly in a realm such as data mining.  I don't think that is so true.

One of the overlooked aspects of logistic regression is how it is fundamentally looking for a line (or a plane or a hyperplane in multiple dimensions).  When most people learn about logistic regression, they start with an understanding of the sinuous curve associated with it (you can check out the Wikipedia page, for instance).  Something like this in one dimension:


Or like this in two dimensions:



These types of pictures suggest that logistic regression is sinuous and curvaceous.  They are actually misleading.  Although the curve is sinuous and curvaceous, what is important is the boundary between the high values and the low values.  This separation boundary is typically a line or hyperplane; it is where the value of the logistic regression is 50%.  Or, assuming that the form of the regression is:

logit(x) = f(x) = a*x + b

Then it is where the f(x) is set to 0.  What does this look like?   A logistic regression divides the space into two parts, one part to the "left" (or "above") the line/hyperplane and one part to the "right" (or "below").  A given line just splits the plane into two parts:


In this case, the light grey would be "0" (that is, less than 50%) and the blue "1" (that is, greater than 50%).  The boundary is where the logistic function takes on the value 50%.
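The relationship between the linear function and the 50% boundary can be checked numerically. The sketch below is plain Python with the intercept written as b; the coefficients and test points are made up for illustration and do not come from any fitted model.

```python
import math

def logistic(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical two-dimensional model: f(x) = a . x + b
a = (2.0, -1.0)
b = 0.5

def f(x):
    return sum(ai * xi for ai, xi in zip(a, x)) + b

def predict(x):
    return logistic(f(x))

on_boundary = (1.0, 2.5)   # 2*1.0 - 1*2.5 + 0.5 = 0
one_side = (2.0, 1.0)      # f = 3.5, so the prediction is above 50%
other_side = (-1.0, 2.0)   # f = -3.5, so the prediction is below 50%

for point in (on_boundary, one_side, other_side):
    print(point, f(point), round(predict(point), 4))
```

Every point with f(x) = 0 gets exactly 50%, and the sign of f(x) decides which side of the line a point falls on, however sinuous the curve looks.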

Note that this is true even when you build a logistic regression on sparse data.  For instance, if your original data has about 5% 1s and 95% 0s, the average value of the resulting model on the input data will be about 5%.  However, somewhere in the input space, the logistic regression will take on the value of 50%, even if there is no data there.  Even if the interpretation of a point in that area of the data space is nonsensical (the customer spends a million dollars a year and hasn't made a purchase in 270 years, or whatever).  The line does exist, separating the 0s from the 1s, even when all the data is on one side of that line.

What difference does this make?  Logistic regression models are often very powerful.  More advanced techniques, such as decision trees, neural networks, and support-vector machines, offer incremental improvement, and often not very much.  And often, that improvement can be baked back into a logistic regression model by adding one or more derived variables.

What is happening is that the input variables (dimensions) for the logistic regression are chosen very carefully.  In a real situation (as opposed to the models one might build in a class), much thought and care has gone into the choice of variables and how they are combined to form derived variables.  As a result, the data has been stretched and folded in such a way that different classification values tend to be on different "sides" of the input space.

This manipulation of the inputs helps not only logistic regression but almost any technique.  Although the names are fancier and the computing power way more advanced, very powerful techniques rely on geometries studied 2,300 years ago in the ancient world.




Wednesday, February 26, 2014

Taking a Random Sample on Amazon Redshift

Recently, I was approached by Vicky, whom I'm working with at a client, to help with a particular problem.  She wanted to calculate page view summaries for a random sample of visitors from a table containing about a billion page views.  This is a common problem, especially as data gets larger and larger.  Note that the sample itself is based on visitors, so a simple random sample is not sufficient.  We needed a sample of visitors and then all the pages for each visitor.

Along the way, I learned some interesting things about Redshift, taking random samples, and working with parallel and columnar databases.

For those not familiar with it, Redshift is an Amazon cloud data store that uses ParAccel, a parallel columnar database based on Postgres (an older version of Postgres).  Postgres is a standard-enough relational database, used by several database vendors as the basis of their products.

Columnar databases have interesting performance characteristics, because the database stores each column separately from other columns.  Although generally bad performance-wise for ACID-compliant transactions (if you don't know what ACID is, then you don't need to know), columnar databases are good for analysis.

However, your intuition about how things work may not apply.  A seemingly simple query such as this:

select *
from PageViews
limit 10;

takes a relatively long time (several minutes) because all the columns have to be read independently.  On the other hand, a query such as:

select min(BrowserId), max(BrowserId)
from PageViews;

goes quite fast (a few seconds), because only one column has to be read into memory.  The more columns the query reads, the slower it is -- other things being equal.

Back to the random sample.  A typical way of getting this type of random sample is to first find the reduced set of visitors and then join them back to the full page views.   This sounds cumbersome, but the strategy actually works well on many databases.  Applied to the query we were working with, the resulting query looks something like:

select pv.BrowserId  -- plus the page view summary columns
from (select distinct BrowserId
      from PageViews
      order by random()
      limit 100000
     ) list join
     PageViews pv
     on list.BrowserId = pv.BrowserId
group by pv.BrowserId;

This is a reasonable and standard approach to reduce the processing overhead.  The subquery list produces all the BrowserIds and then sorts them randomly (courtesy of the random() function).  The limit clause then takes a sample of one hundred thousand (out of many tens of millions).  The join would normally use an indexed key, so it should go pretty fast.  On Redshift, the subquery to get list performs relatively well.  But the entire query did not finish (our queries time out after 15-30 minutes). We experimented with several variations, to no avail.

What finally worked?  Well, a much simpler query and this surprised us.  The following returned in just a few minutes:

select BrowserId  -- plus the page view summary columns
from PageViews pv
group by BrowserId
order by random()
limit 100000;

In other words, doing the full aggregation on all the data and then doing the sorting is actually faster than trying to speed up the aggregation by working on a subset of the data.

I've been working with parallel databases for over twenty years.  I understand why this works better than trying to first reduce the size of the data.  Nevertheless, I am surprised.  My intuition about what works well in databases can be inverted when using parallel and columnar databases.

One of Vicky's requirements was for a repeatable random sample.  That means that we can get exactly the same sample when running the same query again.  The random() function does not provide the repeatability.  In theory, by setting the seed, it should.  In practice, this did not seem to work.  I suspect that aspects of load balancing in the parallel environment cause problems.

Fortunately, Postgres supports the md5() function.  This is a hash function that converts a perfectly readable string into a long string containing hexadecimal digits.  These digits have the property that two similar strings produce very different results, so this is a good way to randomize strings.  It is not perfect, because two BrowserIds could have the same hash value, so they would always be included or excluded together.  But, we don't need perfection; we are not trying to land a little Curiosity lander in a small landing zone on a planet tens of millions of miles away.

The final form of the query was essentially:

select BrowserId  -- plus the page view summary columns
from PageViews pv
group by BrowserId
order by md5('seed' || BrowserId)
limit 100000;

The constant "seed" allows us to get a different, repeatable sample when necessary.  And Vicky can extract her sample in just a few minutes, whenever she wants to.
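The repeatable-sample idea can be tried on a small scale without Redshift. The sketch below assumes SQLite via Python's built-in sqlite3 module. SQLite has no md5() of its own, so one is registered from hashlib. The table contents, the Url column, the NumPageViews summary, and the sample_visitors helper are all invented for illustration.

```python
import hashlib
import sqlite3

def md5_hex(value):
    """Stand-in for the database md5() function: hex digest of the text."""
    return hashlib.md5(str(value).encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
conn.create_function("md5", 1, md5_hex)

conn.execute("create table PageViews (BrowserId integer, Url text)")
# 50 hypothetical visitors with 3 page views each.
conn.executemany(
    "insert into PageViews(BrowserId, Url) values (?, ?)",
    [(browser_id, "/page%d" % page)
     for browser_id in range(1, 51)
     for page in range(3)],
)

def sample_visitors(seed, n):
    """Summarize every visitor, then keep the n with the smallest hashes."""
    sql = """
        select BrowserId, count(*) as NumPageViews
        from PageViews pv
        group by BrowserId
        order by md5(? || BrowserId)
        limit ?
    """
    return conn.execute(sql, (seed, n)).fetchall()

first = sample_visitors("seed", 10)
second = sample_visitors("seed", 10)   # same seed, same sample
other = sample_visitors("another seed", 10)
print(first == second, first, other)
```

Because the ordering depends only on the seed string and the BrowserId, rerunning with the same seed returns the same visitors, and each sampled visitor keeps all of his or her page views in the summary.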