Wednesday, September 25, 2013

For Predictive Modeling, Big Data Is No Big Deal

That is what I will be speaking about when I give a keynote talk at the Predictive Analytics World conference on Monday, September 30th in Boston.
For one thing, data has always been big. Big is a relative concept, and data has always been big relative to the computational power, storage capacity, and I/O bandwidth available to process it. I now spend less time worrying about data size than I did in 1980. For another, data size as measured in bytes may or may not matter, depending on what you want to do with the data. If your problem can be expressed as a completely data-parallel algorithm, you can process any amount of data in constant time simply by adding more processors and disks.
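To make the data-parallel point concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything from the talk: the names `split`, `count_matches`, and `parallel_count`, the toy predicate, and the use of a thread pool as a stand-in for separate processors and disks. What matters is the shape of the computation: each chunk is processed without reference to any other chunk, and only the small partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, n_parts):
    """Cut the records into at most n_parts contiguous chunks."""
    size = max(1, -(-len(records) // n_parts))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_matches(chunk):
    """Work done on one chunk; it never looks at any other chunk."""
    # Hypothetical predicate: count the records divisible by 7.
    return sum(1 for r in chunk if r % 7 == 0)


def parallel_count(records, n_workers):
    """Process every chunk independently, then combine the partial counts."""
    chunks = split(records, n_workers)
    # Threads stand in for independent machines here. In CPython they do not
    # speed up CPU-bound work; the point is the structure, not the timing.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_matches, chunks))
    return sum(partials)


if __name__ == "__main__":
    records = list(range(1_000_000))
    print(parallel_count(records, n_workers=8))
```

Because no chunk depends on another, doubling the data and doubling the workers leaves the work per worker unchanged. That is the sense in which elapsed time can stay flat as the data grows, at least in this idealized picture that ignores the cost of splitting the data and combining the results.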
This session looks at various ways that size can be measured, such as the number of nodes and edges in a social network graph, the number of records, the number of bytes, or the number of distinct outcomes, and at how the importance of size varies by task. I will pay particular attention to the importance or unimportance of data size to predictive analytics and conclude that for this application, data is powerfully predictive, whether big or relatively small. For predictive modeling, you soon reach a point where doubling the size of the training data has no measurable effect on your favorite measure of model goodness. Once you pass that point, there is no reason to increase your sample size. In short, big data is no big deal.
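The plateau described above can be seen in a toy experiment. The sketch below is an illustration under stated assumptions, not a result from the talk: the data are synthetic (two classes, each with one Gaussian feature), the model is a single threshold placed midway between the class means, accuracy on a fixed test set is the measure of model goodness, and the helper names are made up for this example.

```python
import random


def make_data(n, rng):
    """Synthetic two-class data: one Gaussian feature per class (an assumption)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(1.0 if label else -1.0, 1.0)
        data.append((x, label))
    return data


def fit_threshold(train):
    """Fit a one-parameter model: the midpoint between the two class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    if not xs0 or not xs1:
        return 0.0  # fall back if a tiny sample happens to miss one class
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2.0


def accuracy(threshold, test):
    """Fraction of test records whose class the threshold predicts correctly."""
    correct = sum((x > threshold) == bool(y) for x, y in test)
    return correct / len(test)


def learning_curve(sizes, seed=0):
    """Accuracy on one fixed test set as the training sample grows."""
    rng = random.Random(seed)
    test = make_data(20000, rng)
    pool = make_data(max(sizes), rng)
    return [(n, accuracy(fit_threshold(pool[:n]), test)) for n in sizes]


if __name__ == "__main__":
    sizes = [2 ** k for k in range(4, 15)]  # 16, 32, ..., 16384
    for n, acc in learning_curve(sizes):
        print(f"{n:6d} training records  accuracy {acc:.4f}")
```

With this setup, accuracy should settle near the best that any threshold can achieve on these two overlapping classes, after which further doublings of the training sample change it very little. Real models with many parameters tend to level off later than a one-parameter threshold does, but the curve has the same general shape.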