Thursday, March 20, 2014

Big Data and SQL

I happen to think that SQL is a very viable option for analyzing big data.  I was thinking about this when I a book review recently:
For instance, Siegel reports, people who buy small felt pads that adhere to the bottom of chair legs (to protect the floor) are more likely than others to be good credit risks.
For some people, results like this conjure up magic.  PhDs in white coats bustling around, surrounded by acres of machines humming away pondering this imponderable problem (or is that the air conditioning making the noise).  In fact, something like this is readily calculated from a normal decision support database containing historical data.

So, how hard is it to write the SQL?

The place to start is to rephrase the question.  Let's ask it as:
For all products purchased by customers in 2013, what is the non-payment rate for the first three months of 2014?
Note that this is carefully phrased as a "before" and "after" problem.  Although that does not guarantee causality, it does help.

Next, assume that we have the following tables:

• Customers
• Orders
• OrderProducts
• Invoices (monthly, with a flag to indicate non-payment)
The following query gets all the products from 2013:

`select op.ProductId, count(*) as NumProducts,`
`       count(distinct o.CustomerId) as NumCustomers`
`from Orders o join`
`     OrderProducts op`
`     on o.OrderId = op.OrderId`
`where o.OrderDate >= '2013-01-01' and`
`      o.OrderDate < '2014-01-01'`
`group by op.ProductId;`

The following gets all customers who didn't pay in the first three months of 2014.  This might look something like:

``` select i.CustomerId```
`from Invoices i`
`where i.InvoiceDate >= '2014-01-01' and`
`      i.InvoiceDate < '2014-04-01' and`
`      i.NotPaid = 1;`

These can then easily be combined to get a list of products, by the proportion of customers who did not pay:

``` select ProductId, count(*) as NumCustomers,```
`       count(pc.CustomerId) as numNotPaid,`
`       count(*)*1.0 / count(pc.CustomerId) as NonPayRate`
`from (select op.ProductId, op.CustomerId`
`      from Orders o join`
`           OrderProducts op`
`           on o.OrderId = op.OrderId      where o.OrderDate >= '2013-01-01' and            o.OrderDate < '2014-01-01'      group by op.ProductId, op.CustomerId`
`     ) pc left outer join`
`     (select i.CustomerId`
`      from Invoices i`
`      where i.InvoiceDate >= '2014-01-01' and`
`            i.InvoiceDate < '2014-04-01' and`
`            i.NotPaid = 1`
`     ) np`
`     on pc.CustomerId = np.CustomerId`
`group by pc.ProductId`
`order by NonPayRate desc;`

This isn't a particularly complex SQL.  Instead, we can think about what is really important.  The first is being willing to ask the question.  I think a major constraint in business is that managers and executives are hesitant to ask questions.  They don't have a sense of what is "easy" to answer and what is "hard".  They also fear getting different answers from different people.

The second is the interpretation.  The statement that people who want to protect their furniture are better credit risks has a nice warm and fuzzy quality:  people who care about their belongings also care about their credit.  Perhaps other factors are at work.  People buy new furniture and want to protect it because they have access to cash or credit -- they may simply be richer than other people at least for a period of time.  Or, felt pads may only be sold in areas where people tend to own their homes, so there is a store-bias in the merchandizing.  Or, customers who buy these small items may be paying in cash and never make larger purchases that might measure credit risk.

To understand what is really happening would require further analysis.  To get started just takes asking some insightful questions.

1. This query:

select i.CustomerId
from Invoices i
where i.InvoiceDate >= '2014-01-01' and
i.InvoiceDate < '2014-04-01' and
i.NotPaid = 1;

gives an empty result set.