Monday, September 15, 2014

Data Mining versus (?) Data Science

Two of my favorite answers to the question "What is a data scientist?" are:
  • "A data scientist is a statistician who lives in San Francisco"
  • "A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician."
(These have both circulated around the web, although the second is attributed to Josh Wills.)

Last week, I had the pleasure of speaking at a Data Science Summit for Microsoft.  The summit had a lot of beautiful stuff -- notably, Jer Thorp (a "digital artist"), who specializes in amazing graphics and the graphical representation of data.  It had inspiring stuff, such as Jake Porway, the founder of DataKind, an organization that provides (volunteer) data science services to non-profits around the world.  And it had useful stuff, such as presentations by clients and some discussions of new products.

The term Data Science has always left me a bit perplexed.  Once upon a time, I remember having to ftp source code from academic sites in order to run specific algorithms.  Or, alternatively, we had to program the algorithms ourselves.  The advent of useful tools really made an old dictum true:  "In an analysis project, you spend 20% of the time understanding the problem, 80% of the time massaging the data, and the rest of the time doing modeling."  That wasn't true in the days when we had to develop our own code.  It comes close to being true now.

I find the focus on programming in data science to be problematic.  For me personally, at least, programming is a distraction from understanding data.  The issue isn't a personal aversion to coding.  The issue is that programming often requires very careful attention to detail to get things to work just right.  Data analysis, on the other hand, requires a higher-level view: understanding the data and getting it to solve real-world problems.  Keeping both the low-level and the high-level in focus at the same time is very difficult to pull off.

In any case, I come to the conclusion that Data Science is just another term in a long line of terms.  Whether called statistics or customer analytics or data mining or analytics or data science, the goal is the same.  Computers have been and are gathering incredible amounts of data about people, businesses, markets, economies, needs, desires, and solutions -- there will always be people who take up the challenge of transforming the data into solutions.