## Friday, November 13, 2009

### From Item Sets to Association Rules Using Chi-Square

In Data Analysis Using SQL and Excel, I introduce the chi-square metric for evaluating association rules. This posting extends that discussion, to explain how to use the chi-square metric for generating rules.

An association rule is of the form: left-hand-side --> right-hand-side (or, alternatively, LHS --> RHS). The left hand side consists of one or more items, and the right-hand side consists of a single item. A typical example of an association rule is "graham crackers plus chocolate bars implies marshmallows", which may readers will recognize that the recipe for a childhood delight called smores.

Association rules are not only useful for retail analysis. They are also useful for web analysis, where we are trying to track the parts of a web page where people go. I have also seen them used in financial services and direct marketing.

The key to understanding how the chi-square metric fits in is to put the data into a contingency table. For this discussion, let's consider that we have a rule of the form LHS --> RHS, where each side consists of one item. In the following table, the numbers A, B, C, D represent counts:

 RHS-present RHS-absent LHS-present A B LHS-absent C D

A is the count where the LHS and RHS items are both present. B has the LHS item but not the RHS item, and so on. Different rules have different contingency tables. We choose the best one using the chi-square metric (described in Chapter 3 of the above book). This tells us how unusual these counts are. In other words, the chi-square metric is a measure of how unlikely the counts that are measured are due to a random split of the data.

Once we get a contingency table, though, we still do not have a rule. A contingency table really has four different rules:
• LHS --> RHS
• LHS --> not RHS
• not LHS --> RHS
• not LHS --> not RHS
(Or, another way of saying this is that these rules all generate the same contingency table.) How can we choose which rule is the best one?

In this case, we'll choose the rule based on how much better they do than just guessing. This is called the lift or improvement for a rule. So, the rule LHS --> RHS is correct for A/(A+B) of the records: the LHS is true for A+B records, and for A of these, the rule is true.

Overall, simply guessing that RHS is true would be correct for (A+C)/(A+B+C+D) of the records. The ratio of these is the lift for the rule. A lift greater than 1 indicates that the rule does better than guessing; a lift less than 1 indicates that guessing is better.

The following are the ratios for the four possible rules in the table:
• (A/(A+B))/((A+C)/(A+B+C+D))
• (B/(A+B))/((A+C)/(A+B+C+D))
• (C/(C+D))/((B+D)/(A+B+C+D))
• (D/(C+D))/((B+D)/(A+B+C+D))
When choosing among these, choose the one with highest lift.

The process for choosing rules is to choose the item sets based on the highest chi-square value. And then to choose the rules using the best lift.

This works well for rules with a single item on each side. What do we do for more complicated rules, particularly ones with more items in the left hand side? One method would be to extend the chi-square test to multiple dimensions. I am not a fan of the multidimensional chi-square test, as I've explained in another blog.

In this case, we just consider rules with a single item on the RHS side. So, if an item set has four items, a, b, c, and d, then we would consider only the rules:
• a+b+c --> d
• a+b+d --> c
• a+c+d --> b
• b+c+d --> a
We are ignoring possibilities such as a+b-->c+d.

Each of these rules can now be evaluated using the chi-square metric, and then the best rule chosen using the lift of the rule.

1. Q: Since you say there are 4 rules per contingency table (2x2), and a ChiSq applies to the table (same for all 4 rules). Does that mean if the ChiSq test is significant, we can also say "all 4 of the 'lifts' for the 4 rules" are statistically significant?

I think the other way of asking this is, if I find a Rule with lift = 1.4, can I say this lift is "significant" (not random) if I know the associated ChiSq is significant?

Thx.

2. 