This post explains how to implement a multidimensional chi-square test using SQL queries by calculating the chi-square value.

For the purpose of demonstrating this, I will use data derived from the companion web site for Data Analysis Using SQL and Excel. The following query produces data with three dimensions:

`CREATE TABLE d3 as`

..SELECT paymenttype, MONTH(orderdate) as mon,

.........LEFT(zipcode, 1) as zip1, COUNT(*) as cnt

..FROM orders

..GROUP BY 1, 2, 3

The table d3 simply contains three dimensions: the payment type, the month of the order date, and the first digit of the zip code. These dimensions are for illustration purposes.

The formula for the expected values is ratio of the following quantities:

- The product of the sum of the counts along each dimension.
- The total sum of the counts to the power of the number of dimensions minus 1.

`SELECT paymenttype, mon, zip1,`

.......(dim1.cnt * dim2.cnt * dim3.cnt)/(dimall.cnt*dimall.cnt) as expected

FROM (SELECT paymenttype, SUM(cnt) as cnt

......FROM d3

......GROUP BY paymenttype) dim1 CROSS JOIN

.....(SELECT mon, SUM(cnt) as cnt

......FROM d3

......GROUP BY mon) dim2 CROSS JOIN

.....(SELECT zip1, SUM(cnt) as cnt

......FROM d3

......GROUP BY zip1) dim3 CROSS JOIN

.....(SELECT SUM(cnt) as cnt

......FROM d3) dimall

This query consists of four subqueries, one for each dimension and one for the total count. Each subquery calculates the appropriate sums along one (or no) dimensions. The results themselves are combined using CROSS JOIN, to ensure that the query returns results for all possible combinations of dimensions -- even those combinations that do not appear in the original data.

This latter point is an important point. Expected values are produced even for combinations not in the original data.

The previous query calculates the expected values. However, the chi-square calculation requires a bit more work. One approach is to join the above query to the original table, using a LEFT OUTER JOIN to ensure that no expected values are missing. The following approach uses simple JOINs and assumes that the original table has all combinations of the dimensions.

`SELECT paymenttype, mon, zip1, expected, dev,`

.......dev*dev/expected as chi_square

FROM (SELECT d3.paymenttype, d3.mon, d3.zip1,

.............(dim1.cnt * dim2.cnt * dim3.cnt)/(dimall.cnt*dimall.cnt) as expected,

.............d3.cnt-(dim1.cnt * dim2.cnt * dim3.cnt)/(dimall.cnt*dimall.cnt) as dev

......FROM d3 JOIN

.....(SELECT paymenttype, SUM(cnt) as cnt

......FROM d3

......GROUP BY paymenttype) dim1

.....ON d3.paymenttype = dim1.paymenttype JOIN

.....(SELECT mon, SUM(cnt) as cnt

......FROM d3

......GROUP BY mon) dim2

.....ON d3.mon = dim2.mon JOIN

.....(SELECT zip1, SUM(cnt) as cnt

......FROM d3

......GROUP BY zip1) dim3

.....ON d3.zip1 = dim3.zip1 CROSS JOIN

.....(SELECT SUM(cnt) as cnt

......FROM d3) dimall) a

This query joins in each of the subtotals along the dimensions, rather than using the CROSS JOIN to create all combinations. I suspect that in many databases, this approach has a more efficient execution plan (particularly if there are indexes on the dimensions). Note that the overall total is included using CROSS JOIN. I find this a convenient way to include constants in queries.

This query produces the chi-square value for each cell. The overall chi-square is the sum of these values. To interpret this value, we need the number of degrees of freedom, which is the product of the number of different values on each dimension minus one:

`SELECT (COUNT(DISTINCT paymenttype) - 1)*`

.......(COUNT(DISTINCT mon) - 1) *

.......(COUNT(DISTINCT zip1) - 1) as dof

FROM d3

Interpreting the value itself requires going outside the world of SQL, since there is no function that converts the chi-square value into a p-value within SQL. However, Excel does have such a function, CHIDIST().

It should be obvious how to extend these queries for larger numbers of dimensions. As discussed earlier, though, the chi-square test becomes less useful in multiple dimensions, especially since there need to be counts for all combinations of dimensions for best results (the heuristic rule is a minimum expected value of 5 in all cells). Nevertheless, doing the calculation in multiple dimensions is not difficult, and most of the work can be accomplished using basic SQL queries.

## No comments:

## Post a Comment

Your comment will appear when it has been reviewed by the moderators.