Saturday, April 25, 2009

When There Is Not Enough Data

I have a dataset where the target (a continuous variable) has to be estimated. However, in the given dataset, values for the target are present for only 2% of the records, while the remaining 98% are empty. I need to score the dataset and supply values for the target for all 2,500 records. Can I take the 2%, replicate it several times, and use that dataset to build a model? The ASE is too high if I use the 2% data alone. Any suggestions on how to handle this, please?


The short answer to your question is "Yes, you can replicate the 2% and use it to build a model." BUT DO NOT DO THIS! Just because a tool or technique is possible to implement does not mean that it is a good idea. Replicating observations "confuses" models, often by making the model appear overconfident in its results.

Given the way that ASE (average squared error) is calculated, I don't think that replicating data is going to change the value. Think of replication as putting a weight or frequency on each observation instead of copying it. When the weights are all the same, they cancel out of the ASE formula: the sum of squared errors and the number of observations both scale by the same factor.
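A tiny sketch (with invented numbers) makes the point concrete: replicating every observation multiplies both the sum of squared errors and the count by the same factor, so the ASE does not move.

```python
# Demonstration, with made-up actuals and predictions, that replicating
# every observation leaves the average squared error unchanged.
actual    = [10.0, 12.0, 9.0, 15.0]   # hypothetical target values
predicted = [11.0, 10.0, 9.5, 14.0]   # hypothetical model scores

def ase(actual, predicted):
    """Average squared error: sum of squared residuals divided by n."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

original = ase(actual, predicted)
# Replicate every observation 50 times, as proposed in the question.
replicated = ase(actual * 50, predicted * 50)

print(original, replicated)  # the two values are identical
```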

What does change is the apparent confidence in the model. If you are doing a regression and looking at the regression coefficients, each has a confidence interval. Replicating the data makes those intervals narrower, but the narrowing is spurious, because the replicated data contains no more information than the original data.
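A rough illustration of the effect, using the textbook formula for the standard error of the slope in simple least-squares regression (the data here are invented): replicating the observations 50 times shrinks the standard error, and hence the confidence interval, even though nothing new has been learned.

```python
import math

def slope_stderr(x, y):
    """Standard error of the slope in simple least-squares regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2) / sxx)   # sqrt of s^2 / Sxx

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical predictor
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical target

se_original   = slope_stderr(x, y)
se_replicated = slope_stderr(x * 50, y * 50)

print(se_original, se_replicated)  # the replicated stderr is much smaller
```

The shrinkage is a pure artifact of the inflated observation count, not of any reduction in uncertainty.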

The problem that you are facing is that the modeling technique you are using is simply not powerful enough to represent the 50 observations that you have. Perhaps a different modeling technique would work better, although you are working with a small amount of data. For instance, perhaps some sort of nearest neighbor approach would work well and be easy to implement.
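For what it's worth, a nearest neighbor scorer really is only a few lines of code. The sketch below (the feature vectors, targets, and the choice of Euclidean distance are all illustrative assumptions, not a prescription) averages the targets of the k closest labeled records:

```python
import math

def knn_predict(train_rows, train_targets, query, k=3):
    """Score a query as the average target of its k nearest labeled rows
    (Euclidean distance over the feature vectors)."""
    by_distance = sorted(
        range(len(train_rows)),
        key=lambda i: math.dist(train_rows[i], query))
    nearest = by_distance[:k]
    return sum(train_targets[i] for i in nearest) / k

# Hypothetical labeled 2%: feature vectors and their known target values.
labeled = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (6.0, 9.0)]
targets = [100.0, 110.0, 300.0, 320.0]

print(knn_predict(labeled, targets, (1.2, 2.1), k=2))  # prints 105.0
```

With only 50 labeled records, the brute-force search over all of them is perfectly adequate; no index structure is needed.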

You do not say why you are using ASE (average squared error) as the preferred measure of model fitness. I can speculate that you are trying to predict a number, perhaps using a regression. One challenge is that the numbers being predicted often fall into a particular range (such as positive numbers for dollar values or ranging between 0 and 1 for a percentage). However, regressions produce numbers that run the gamut of values. In this case, transforming the target variable can sometimes improve results.
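As a sketch of what such a transformation might look like for a strictly positive target such as dollar values (the numbers are made up, and the log transform is just one common choice):

```python
import math

# Model log(target) instead of the target itself; exponentiating the
# prediction then guarantees a positive dollar value.
targets = [120.0, 45.0, 980.0, 12.0]          # hypothetical dollar amounts
log_targets = [math.log(t) for t in targets]  # fit the regression on these

raw_prediction = 3.5                          # hypothetical output on the log scale
dollar_prediction = math.exp(raw_prediction)  # back-transform: always positive
print(dollar_prediction)
```

For a target constrained between 0 and 1, a logit transform plays the analogous role.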

In our class on data mining (Data Mining Techniques: Theory and Practice), Michael and I introduce the idea of oversampling rare data using weights in order to get a balanced model set. For instance, if you were predicting whether someone was in the 2% group, you might give each of them a weight of 49 and all the unknowns a weight of 1. The result would be a balanced model set. However, we strongly advise that the maximum weight be 1. So, the weights would be 1/49 for the common cases and 1 for the rare ones. For regressions, this is important because it prevents any coefficient from having a too-narrow confidence interval.
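With the numbers from the question (50 labeled records out of 2,500), the capped weights work out like this:

```python
# 2% of 2,500 records are labeled: 50 rare cases, 2,450 common ones.
n_rare, n_common = 50, 2450

# Balanced weights with a maximum weight of 1: the rare cases keep
# weight 1, and the common cases are down-weighted so that each group
# carries the same total weight.
w_rare   = 1.0
w_common = n_rare / n_common   # = 1/49

total_rare   = n_rare * w_rare
total_common = n_common * w_common
print(total_rare, total_common)  # both groups sum to 50: a balanced model set
```

Because no weight exceeds 1, the effective sample size never exceeds the real one, which is what keeps the confidence intervals honest.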

Monday, April 13, 2009

Customer-Centric Forecasting White Paper Available

In our consulting practice, we work with many subscription-based businesses including newspapers, mobile phone companies, and software-as-a-service providers. All of these companies need to forecast future subscriber levels. With production support from SAS, I have recently written a white paper describing our approach to creating such forecasts. Very briefly, the central idea is that the subscriber population is a constantly changing mix of customer segments based on geography, acquisition channel, product mix, subscription type, payment type, demographic characteristics, and the like. Each of these segments has a different survival curve. Overall subscriber numbers come from aggregating planned additions and forecast losses at the segment level. Managers can simulate the effects of alternative acquisition strategies by changing assumptions about the characteristics of future subscribers and watching how the forecast changes. The paper is available on our web site. I will also be presenting a keynote talk on customer-centric forecasting on July 1st at the A2009 conference in Copenhagen.
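To give a flavor of the idea (the segments, survival rates, and planned additions below are entirely invented), a one-period forecast is just a sum over segments of survivors plus planned acquisitions:

```python
# Sketch of the segment-level forecast: apply each segment's survival
# rate to its current subscriber count, add planned acquisitions, and
# aggregate across segments for the overall subscriber number.
segments = {
    # name: (current subscribers, one-period survival rate, planned additions)
    "metro/online":    (10_000, 0.95, 1_200),
    "metro/telesales": ( 4_000, 0.90,   300),
    "rural/online":    ( 2_500, 0.97,   150),
}

forecast = sum(subs * survival + added
               for subs, survival, added in segments.values())
print(round(forecast))  # prints 17175
```

Changing the assumed mix of future acquisitions simply changes the `added` entries, which is what lets managers simulate alternative acquisition strategies.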

Friday, April 10, 2009

Rexer Analytics Data Mining Survey

Karl Rexer of Rexer Analytics asked us to alert our readers that their annual survey of data miners is ongoing and will be available for a few more days. Click on the title to be taken to the survey page.

Wednesday, April 8, 2009

MapReduce, Hadoop, Everything Old Is New Again

One of the pleasures of aging is watching younger generations discover pleasures one remembers discovering long ago--sex, porcini, the Beatles. Occasionally though, it is frustrating to see old ideas rediscovered as new ones. I am especially prone to that sort of frustration when the new idea is one I once championed unsuccessfully. Recently, I've been feeling as though I was always a Beatles fan but until recently all my friends preferred Herman's Hermits. Of course, I'm glad to see them coming around to my point of view, but still . . .

What brings these feelings up is all the excitement around MapReduce. It's nice to see a parallel programming paradigm that separates the description of the mapping from the description of the function to be applied, but at the same time, it seems a bit underwhelming. You see, I literally grew up with the parallel programming language APL. In the late 60's and early 70's my father worked at IBM's Yorktown Heights research center in the group that developed APL, and I learned to program in that language at the age of 12. In 1982 I went to Analogic Corporation to work on an array processor implementation of APL. In 1986, while still at Analogic, I read Danny Hillis's book The Connection Machine and realized that he had designed the real APL Machine. I decided I wanted to work at the company that was building Danny's machine. I was hired by Guy Steele, who was then in charge of the software group at Thinking Machines. In the interview, all we talked about was APL. The more I learned about the Connection Machine's SIMD architecture, the more perfect a fit it seemed for APL or an APL-like language in which hypercubes of data may be partitioned into subcubes of any rank so that arbitrary functions can be applied to them. In APL and its descendents such as J, reduction is just one of a rich family of ways that the results of applying a function to various data partitions can be glued together to form a result. I described this approach to parallel programming in a paper published in ACM SIGPLAN Notices in 1990, but as far as I know, no one ever read it. (You can, though. It is available here.)

My dream of implementing APL on the Connection Machine gradually faded in the face of commercial reality. The early Connection Machine customers, having already been forced to learn Lisp, were not exactly clamouring for another esoteric language; they wanted Fortran. And Fortran is what I ended up working on. As you can tell, I still have regrets. If we'd implemented a true parallel APL back then, no one would have to invent MapReduce today.