Saturday, November 24, 2007

Constructing a Model Set for Recurring Events

In the previous post, I answered a question about how to set up a model set for binary churn. It is fairly common for data miners to find ways to express almost any problem as a binary outcome, since binary outcome problems are easily approached with familiar tools such as logistic regression and decision trees. The context for the question suggests an alternate approach, however. The event of interest was the purchase of refill pages for a calendar/planner. This is an example of a recurring event. Other examples include:
• Visits to a web page.
• Purchases of additional minutes for a pre-paid phone plan.
• Subscription renewals.
• Repeat purchases.
• Pregnancies.
• Incarcerations.
• Posts to a blog.
All of these are examples of counting processes. A counting process is one where each time an event occurs it increments a total count. The event frequency is governed by an intensity function which is a function of time and other covariates, much like the hazard function in survival analysis for non-recurring events. The intensity function can be estimated empirically, or it may be fit by a parametric or semi-parametric model using, for example, the SAS PHREG procedure. Either way, the data must first be transformed from the way it was probably recorded--dated transactions--to a form suitable for the required calculations.

Consider customers making multiple purchases during an observation window. Each time a customer makes a purchase, a transaction record is created. When we recast this data as a table in the counting process style, each customer contributes several rows: a row for the interval from time 0 (which may be the time of the initial purchase) to the next purchase, a row for the interval preceding each subsequent purchase, and a row for the interval between the final observed purchase and the end of the observation period.

Depending on the style of analysis used, each event may be seen as starting a new time 0 with the number of previous events as a covariate, or each event may be modeled separately with a customer only becoming part of the at-risk pool for event n after experiencing event n-1.
Either way, it is important to include the final censored time period. This period does not correspond to any transaction, but customers are "at risk" for another purchase during that period.

My approach to creating the table is to first create the table without the censored observations, which is reasonably straightforward. Each of these rows contains a flag indicating it is a complete, uncensored observation. Next I create just the censored observations by creating an observation going from the latest observed purchase to the end of the observation period (in this case, 22May2006). The censored rows can then be appended to the uncensored rows. These could, of course, be turned into subqueries in order to avoid creating the temporary tables.
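As a concrete illustration of this construction, here is a minimal sketch. The table and column names (`transactions`, `customer_id`, `trans_date`) are hypothetical, not from the original post; only the cutoff date comes from the text above.

```sql
-- Sketch only: transactions(customer_id, trans_date) is an assumed schema,
-- and '2006-05-22' is the end of the observation window mentioned in the post.

-- Uncensored rows: one per observed purchase, with the interval start taken
-- as the previous purchase (NULL period_start marks the first purchase, time 0).
SELECT customer_id,
       LAG(trans_date) OVER (PARTITION BY customer_id
                             ORDER BY trans_date) AS period_start,
       trans_date AS period_end,
       1 AS is_event                  -- complete, uncensored observation

FROM transactions

UNION ALL

-- Censored rows: one per customer, from the latest observed purchase to the
-- end of the observation period. The customer is still "at risk" here.
SELECT customer_id,
       MAX(trans_date) AS period_start,
       CAST('2006-05-22' AS DATE) AS period_end,
       0 AS is_event                  -- censored: no purchase observed
FROM transactions
GROUP BY customer_id;
```

In a production setting the two halves of the `UNION ALL` would likely be materialized as the uncensored and censored temporary tables described above, or left as subqueries.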

This fully expanded version of the data is what is referred to as the counting process style of input. In a realistic situation where there might be millions of customers, it makes more sense to group by tenure, so that there is one row per tenure showing how many customers made a purchase at that tenure and how many customers reached that tenure and so could have made a purchase. This is the data needed to estimate the intensity function.
In Gordon Linoff's book, Data Analysis Using SQL and Excel, he provides sample code for making a related, but different table using the data available on the book's companion page. I reproduce it here for reference.

The code uses the DATEDIFF function to subtract a household's first order date from all its other order dates to put things on the tenure timeline. It then counts the number of second (or third, or fourth, . . .) purchases that happen at each tenure. This query does not track the population at risk, so it is not the actual intensity function, but it nevertheless gives a nice visual image of the way intensity peaks at yearly intervals as many customers make regular annual purchases, just as the purchasers of calendars in the previous posting did.
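The book's query itself did not survive in this copy of the post. A hedged reconstruction along the lines just described, using the `orders` and `customer` tables from the companion data (this is a sketch, not Gordon's exact code), might be:

```sql
-- Reconstruction, not the book's verbatim code: for each tenure, measured in
-- days since a household's first order, count the repeat purchases occurring
-- at that tenure.
SELECT DATEDIFF(dd, h.mindate, o.orderdate) AS days,
       COUNT(*) AS numorders
FROM orders o JOIN
     customer c
     ON o.customerid = c.customerid JOIN
     (SELECT c.householdid, MIN(o.orderdate) AS mindate
      FROM orders o JOIN
           customer c
           ON o.customerid = c.customerid
      GROUP BY c.householdid) h
     ON c.householdid = h.householdid
WHERE o.orderdate > h.mindate          -- repeat purchases only, not the first
GROUP BY DATEDIFF(dd, h.mindate, o.orderdate)
ORDER BY days;
```

Plotting `numorders` against `days` shows the yearly peaks described above, even though the at-risk population is not tracked.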

1. Hi Gordon, Michael,
Let me add a comment seven years later ;)
First of all, I'd like to thank you both for your incredible book, 'Data Analysis Using SQL and Excel'.

For me, it is the most practical reference I have read in my life.

I would also like to share an SQL script that I adapted from Chapter 8, 'Customer Purchases and Other Repeated Events', section 'Time to Next Event'.

You split the process between an SQL script and an Excel spreadsheet with formulas; I have combined the whole model into a single query:

2. SELECT
-- S1..S6: survival curves; _1_S1.._1_S6: cumulative probability of next purchase
S.days
,S.h1
,S.acum_h1
,s.Previous_h1

--,sum(S.Previous_h1) over(order by days ) as ascG

,CASE WHEN (days=0 ) THEN CAST(1 AS DECIMAL(16,6)) ELSE (1-sum(S.Previous_h1) over(order by days )) end as S1
,CASE WHEN (days=0 ) THEN CAST(1 AS DECIMAL(16,6)) ELSE (1-sum(S.Previous_h2) over(order by days )) end as S2
,CASE WHEN (days=0 ) THEN CAST(1 AS DECIMAL(16,6)) ELSE (1-sum(S.Previous_h3) over(order by days )) end as S3
,CASE WHEN (days=0 ) THEN CAST(1 AS DECIMAL(16,6)) ELSE (1-sum(S.Previous_h4) over(order by days )) end as S4
,CASE WHEN (days=0 ) THEN CAST(1 AS DECIMAL(16,6)) ELSE (1-sum(S.Previous_h5) over(order by days )) end as S5
,CASE WHEN (days=0 ) THEN CAST(1 AS DECIMAL(16,6)) ELSE (1-sum(S.Previous_h6) over(order by days )) end as S6

,1-(CASE WHEN (days=0 ) THEN CAST(1 AS DECIMAL(16,6)) ELSE (1-sum(S.Previous_h1) over(order by days )) end) as _1_S1
,1-(CASE WHEN (days=0 ) THEN CAST(1 AS DECIMAL(16,6)) ELSE (1-sum(S.Previous_h2) over(order by days )) end) as _1_S2
,1-(CASE WHEN (days=0 ) THEN CAST(1 AS DECIMAL(16,6)) ELSE (1-sum(S.Previous_h3) over(order by days )) end) as _1_S3
,1-(CASE WHEN (days=0 ) THEN CAST(1 AS DECIMAL(16,6)) ELSE (1-sum(S.Previous_h4) over(order by days )) end) as _1_S4
,1-(CASE WHEN (days=0 ) THEN CAST(1 AS DECIMAL(16,6)) ELSE (1-sum(S.Previous_h5) over(order by days )) end) as _1_S5
,1-(CASE WHEN (days=0 ) THEN CAST(1 AS DECIMAL(16,6)) ELSE (1-sum(S.Previous_h6) over(order by days )) end)as _1_S6

FROM
(

SELECT
h.*
--h1
,LEAD(h1, 1,0) OVER (ORDER BY days desc) AS Previous_h1
--,h2
,LEAD(h2, 1,0) OVER (ORDER BY days desc) AS Previous_h2
--,h3
,LEAD(h3, 1,0) OVER (ORDER BY days desc) AS Previous_h3
--,h4
,LEAD(h4, 1,0) OVER (ORDER BY days desc) AS Previous_h4
--,h5
,LEAD(h5, 1,0) OVER (ORDER BY days desc) AS Previous_h5
--,h6
,LEAD(h6, 1,0) OVER (ORDER BY days desc) AS Previous_h6
,sum(h1) over(order by days ) as Acum_h1
,sum(h2) over(order by days ) as Acum_h2
,sum(h3) over(order by days ) as Acum_h3
,sum(h4) over(order by days ) as Acum_h4
,sum(h5) over(order by days ) as Acum_h5
,sum(h6) over(order by days ) as Acum_h6
FROM
(
select
b.*
,case when Total <>0 then cast(cast(b.hasnext as decimal(14,6))/cast(Total as decimal(14,6)) as decimal(14,6)) else 0 end as h1
,case when Total_ord1 <>0 then cast(cast(b.hasnext1 as decimal(14,6))/cast(Total_ord1 as decimal(14,6)) as decimal(14,6)) else 0 end as h2
,case when Total_ord2 <>0 then cast(cast(b.hasnext2 as decimal(14,6))/cast(Total_ord2 as decimal(14,6)) as decimal(14,6)) else 0 end as h3
,case when Total_ord3 <>0 then cast(cast(b.hasnext3 as decimal(14,6))/cast(Total_ord3 as decimal(14,6)) as decimal(14,6)) else 0 end as h4
,case when Total_ord4 <>0 then cast(cast(b.hasnext4 as decimal(14,6))/cast(Total_ord4 as decimal(14,6)) as decimal(14,6)) else 0 end as h5
,case when Total_ord5 <>0 then cast(cast(b.hasnext5 as decimal(14,6))/cast(Total_ord5 as decimal(14,6)) as decimal(14,6)) else 0 end as h6

from

(
SELECT
A.*
,SUM(contador) OVER( order by days desc) AS Total
--POP:
,SUM(ord1) OVER( order by days desc) AS Total_ord1
,SUM(ord2) OVER( order by days desc) AS Total_ord2
,SUM(ord3) OVER( order by days desc) AS Total_ord3
,SUM(ord4) OVER( order by days desc) AS Total_ord4
,SUM(ord5) OVER( order by days desc) AS Total_ord5
--h
--,case when SUM(ord1) OVER( order by days desc)<>0 then cast(cast(A.hasnext as decimal(14,4))/cast(SUM(ord1) OVER( order by ord1 desc) as decimal(14,4)) as decimal(14,4)) else 'N/A' end as h1
FROM
(

SELECT DATEDIFF(dd, firstdate, ISNULL(nextdate, DATEADD(DAY, 568, GETDATE()))) as days,
COUNT(*) as contador, -- assumed fix: the outer query sums contador (at-risk count per tenure), but the original posting never defined it
SUM(CASE WHEN numbef = 1 THEN 1 ELSE 0 END) as ord1,
SUM(CASE WHEN numbef = 2 THEN 1 ELSE 0 END) as ord2,
SUM(CASE WHEN numbef = 3 THEN 1 ELSE 0 END) as ord3,
SUM(CASE WHEN numbef = 4 THEN 1 ELSE 0 END) as ord4,
SUM(CASE WHEN numbef = 5 THEN 1 ELSE 0 END) as ord5,
SUM(hasnext) as hasnext,
SUM(CASE WHEN numbef = 1 THEN hasnext ELSE 0 END) as hasnext1,
SUM(CASE WHEN numbef = 2 THEN hasnext ELSE 0 END) as hasnext2,
SUM(CASE WHEN numbef = 3 THEN hasnext ELSE 0 END) as hasnext3,
SUM(CASE WHEN numbef = 4 THEN hasnext ELSE 0 END) as hasnext4,
SUM(CASE WHEN numbef = 5 THEN hasnext ELSE 0 END) as hasnext5
FROM
(
SELECT o1.householdid, o1.orderdate as firstdate,
SUM(CASE WHEN o2.orderdate < o1.orderdate THEN 1 ELSE 0 END) as numbef,
MIN(CASE WHEN o2.orderdate > o1.orderdate THEN o2.orderdate END) as nextdate,
COUNT(*) as numords,
MAX(CASE WHEN o2.orderdate > o1.orderdate THEN 1 ELSE 0 END) as hasnext
FROM
(
SELECT
c.householdid, o.*
FROM
orders o
JOIN
customer c
ON
o.customerid = c.customerid
) o1
LEFT OUTER JOIN
(
SELECT
c.householdid, o.*
FROM
orders o
JOIN
customer c
ON
o.customerid = c.customerid
) o2
ON
o1.householdid = o2.householdid
GROUP BY
o1.householdid, o1.orderdate
) a
GROUP BY
DATEDIFF(dd, firstdate, ISNULL(nextdate, DATEADD(DAY, 568, GETDATE())))
) A

) b

) h

) S
order by 1

