Dear Data Miners,
I am trying to build a churn model to predict WHEN customers will become paying members. Process:
1. Person comes to our web site.
2. They register for free to use the site.
3. If they want more access to the site and more features, they pay us.
What issues should I consider when setting a cut date, which is the first step towards censoring the data?
For a classic churn model, we want to know when someone will stop paying us and leave our phone company. We censor those whose final status past our censor point we don't know.
I want to know when they will pay us, and to censor those for whom I don't know whether they will pay us in the future.
Is the cut date choice arbitrary or is there some sampling rule?
Your example is a time-to-event model that does not represent churn. There are many such examples in business (and this is something discussed in Data Analysis Using SQL and Excel in a bit of depth).
Think of your situation as two different time-to-event problems:
(1) A person visits the web site. What happens next? Does the person return to the web site or register? This is a time-to-event problem, and the analysis can provide information on customer registrations, particularly the lag between the initial visit and the registration.
(2) A person registers for free. How long until that person buys something? This can provide insight into which registered visitors become paying customers, and when.
Once you have broken the problem into these pieces, imagining the customer signature is easier. For the first problem, the customer signature is a picture of customers when they initially visit (or at each pre-registration visit, for a time-to-next-event problem). The "prediction" columns are the date of the registration (or, for the time-to-next-event version, the date of the next visit and whether it involves a registration).
The second component is a picture of the customer when they first register, and the prediction columns are when (and whether) the customer ever pays for anything. In this case, it is very important to treat this as a time-to-event problem, because older registrations have had more opportunity to pay for something and the analysis needs to take this into account.
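One common way to account for the different amounts of exposure is a Kaplan-Meier survival estimate, where "survival" here means "has not paid yet." The following is only a minimal sketch of that technique, not something prescribed above: the function name `kaplan_meier` and the sample numbers are made up for illustration, and each duration is assumed to be the number of days from registration to either the first payment or the end of the data.

```python
from collections import Counter


def kaplan_meier(durations, paid):
    """Return [(t, S(t))], where S(t) is the estimated share of registrants
    who have still not paid t days after registering."""
    events = Counter(d for d, p in zip(durations, paid) if p)
    exits = Counter(durations)  # everyone leaves the risk set at their duration
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        if events[t]:
            # Share of those still being observed at day t who pay on day t.
            survival *= 1.0 - events[t] / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # payers and censored registrants both drop out
    return curve


# Hypothetical data: days observed, and whether the period ended in a payment.
durations = [5, 10, 10, 20, 30, 40]
paid = [True, True, False, True, False, False]

curve = kaplan_meier(durations, paid)
for t, s in curve:
    print(f"day {t}: {s:.3f} not yet paid")
```

In this made-up sample, half of the registrants have paid, but the estimate of the share paying within 20 days is somewhat higher than one half, because the registrant censored at day 10 was not observed long enough to count fully against the rate.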
As for the censor date, it is the most recent date of the data. So, if you have data through the end of yesterday, then that is the censor date. For instance, for the second component of the analysis, customers who registered before yesterday but never paid would have their outcomes censored (these customers have not paid yet but they may pay in the future).
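As a rough illustration of how the censor date turns raw records into time-to-event data, here is a small sketch. The function name `to_duration_and_event`, the dates, and the record layout (a registration date plus an optional first-payment date) are all assumptions made for the example.

```python
from datetime import date


def to_duration_and_event(registered_on, paid_on, censor_date):
    """Return (days_observed, paid_flag) for one registrant."""
    if paid_on is not None and paid_on <= censor_date:
        # Outcome is known: the customer paid within the data window.
        return (paid_on - registered_on).days, True
    # Outcome is censored: no payment yet as of the most recent data.
    return (censor_date - registered_on).days, False


censor_date = date(2024, 3, 31)  # hypothetical: the last day covered by the data

# Hypothetical (registration date, first payment date or None) records.
rows = [
    (date(2024, 1, 1), date(2024, 1, 15)),
    (date(2024, 2, 1), None),
    (date(2024, 3, 30), None),
]

observations = [to_duration_and_event(r, p, censor_date) for r, p in rows]
print(observations)
```

The two registrants without a payment are kept in the data rather than dropped; they simply contribute the number of days they have been observed so far, with the flag marking the outcome as not yet known.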