Thursday, June 15, 2017

Download a Chapter of Data Mining Techniques (3rd Edition) for Free

As seen on KDnuggets, you may now download Chapter 19, Derived Variables: Making the Data Mean More, for free, thanks to our friends at JMP. This chapter is one of my personal favorites because it is about the part of data mining I find most enjoyable -- thinking of ways to expose more of the information hidden in a data set so that predictive algorithms are able to make use of it.

Wednesday, May 31, 2017

JMP Publishes Exercises to Accompany Data Mining Techniques (3rd Edition)

I am pleased to announce that the JMP division of SAS has made available a set of exercises I developed for use with Data Mining Techniques. I am grateful to my employer, Tripadvisor, for supplying the data used for these exercises. The exercises and data sets are free and available for download at https://www.jmp.com/en_us/academic/data-mining-techniques.html. I hope these exercises will be useful for professors using Data Mining Techniques as a textbook and for people studying on their own.

Here is a bit more explanation from the introduction to the exercise document:

The third edition of Data Mining Techniques by Gordon S. Linoff and Michael J. A. Berry contains many stories about applying these techniques to real business data to achieve real business goals, but until now there has not been an easy way for students to follow along getting their hands dirty and learning from experience. This supplement fills that gap by providing suggested exercises along with the data to be mined.


We have always shied away from including exercises in our books for several reasons.  One is that software evolves so rapidly that the exercises can quickly seem out-of-date.  Another is that we wanted the freedom to talk about techniques and variations without worrying about whether they were supported by this or that software package.  Finally, the foundations of data mining do not require learning a specific package and we did not want software to distract readers from the principal focus of the book. On the other hand, there is no substitute for hands-on work with real data for truly mastering data mining techniques. The solution is to publish a separate document with an update cycle decoupled from that of Data Mining Techniques.

Sunday, May 31, 2015

Connecting Python to Postgres . . . An Arduous Task

I've decided that I want to learn a bit more about Python.  So, I downloaded Python onto my Mac running OS 10.8.x and decided to connect it to my local Postgres database.  Several days later, I have succeeded! Along the way, I may have learned a thing or two that would help others.  Hence, this description of my experience and my solutions.

Mac OS X comes with Python 2.7.  However, I figured that since I like getting parentheses errors when I print, I would go with the latest and greatest:  Python 3.4.  For those who do not know, Python changed some of its syntactic conventions between versions 2 and 3.  A very notable one is that the commonly used print command is now a function and requires parentheses.  Previously, print was a statement and was written without parentheses.
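For example, a trivial illustration of the difference:

# Python 2: print is a statement, written without parentheses
#   print "Hello, world"

# Python 3: print is a function, so parentheses are required
print("Hello, world")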

An overview of the steps is basically:
  • Download Python 3.4 (easy).  Postgres was already installed.
  • Attempt to install psycopg2, the package that links Python to Postgres.
  • Attempt to install Xcode.
  • Attempt to install OS X 10.10.
  • Attempt to install psycopg2 again.
  • Finally, some success!
The script that I was attempting to run is rather easy:

#!/Applications/Python 3.4
#
# Small script to show PostgreSQL and Pyscopg together
#

import psycopg2
import sys

try:
    conn = psycopg2.connect("dbname='' user='' host='localhost' password=''")
    cur = conn.cursor()
    cur.execute("select count(*) from products")
    print(cur.fetchone())
    cur.close()
    conn.close()
except:
    print("Unexpected Error:", sys.exc_info()[0])

It counts the number of products in a table used for my book Data Analysis Using SQL and Excel.

What I encountered:

Getting and installing the latest version of Python is really easy.  Just go here.  (If the link doesn't work, it is easy to find with Google.)

Getting psycopg2 is really easy.  Just go here.  (If the link doesn't work, it is easy to find with Google.)

Installing psycopg2 should also be easy:

pip3 install psycopg2

But, this is where the trouble began.  The simplicity of installing Python lulls you into thinking "all of this should work really easily".  For some reason, this installation requires the ability to compile code.  Fortunately, Apple offers a free package that has this and many other capabilities, located right here.  Alas, this download did not work for the Mac OS version I had.  And, Apple didn't seem to offer a version compatible with slightly out-of-date operating systems.

So, I bit the bullet and decided to upgrade the operating system.  After all, my Mac has been asking about this for a long time.  I did some preliminary research on potential problems.  So far, the only issue with upgrading to Mac OS 10.10 is that the OS forgot my network passwords.  That is a pain, because I don't really remember all of them, but a bearable pain.  An upgrade site is here.

Upgrading the operating system should be free.  I had problems with my Apple ID and had to verify my billing information, so I think it is free.  Eventually it all came through -- 5+ Gbytes worth.  Even on Verizon FiOS, that took several hours.  I remembered to change my Energy Saver options so the computer would never turn off.  I don't know if this is necessary, but who wants to risk a partial download, even if the download can be restarted?

Once I downloaded the software (and had a nice dinner with a bit of wine, returning home late in the evening), I thought merely rebooting the computer would result in the new OS.  Nope.  I finally realized that I had to go back to the Mac App Store and click the "update" button.  Then, after a longish reboot (half an hour), the new operating system came up, and I was pretty soundly asleep by then.

With the new operating system, I could then load Xcode.  That is another 2.5 Gbytes.  It too needs to be installed (I'm getting better at this).  I don't think the machine needed to be rebooted.

So, back to the simple code:

pip3 install psycopg2

And it failed, this time with an error message saying that it could not find setup.py.  I searched my disk and I searched the web.  No such luck, although this seems like an important file.  Finally, I came across this very useful post, suggesting:

sudo PATH=$PATH:/Applications/Postgres.app/Contents/Versions/9.3/bin pip install psycopg2

I skeptically checked the path to my Postgres installation and realized that I needed pip3 rather than pip, changing the command to:

sudo PATH=$PATH:/Applications/Postgres.app/Contents/Versions/9.3/bin pip3 install psycopg2

And it worked!  Note that the "2" in psycopg2 refers to the version of this package, not to the version of Python.  As I write this, there is no psycopg3.

Note: I have upvoted this answer on Stack Overflow, as one should for super-helpful answers.

Then, I went back to Python to run my script.  This time, instead of saying that psycopg2 was not installed, it gave an error saying the connection failed.  Oh, what happiness!  A simple database error.

So, I started Postgres with:

/Applications/Postgres.app/Contents/Versions/9.3/bin/pg_ctl -D /Users/gordonlinoff/Library/Application\ Support/Postgres/var-9.3 start

And I fixed the database name, user name, and password in the connection script.  And, lo and behold, I got my answer:  4040.
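The working connection call looked something like the following (the database name, user, and password here are placeholders, not my actual values):

import psycopg2

# placeholder credentials -- substitute your own database name, user, and password
conn = psycopg2.connect("dbname='mydb' user='me' host='localhost' password='secret'")
cur = conn.cursor()
cur.execute("select count(*) from products")
print(cur.fetchone())   # a one-element tuple, e.g. (4040,)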

Monday, September 15, 2014

Data Mining versus (?) Data Science

Two of my favorite answers to the question "What is a data scientist?" are:
  • "A data scientist is a statistician who lives in San Francisco"
  • "A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician."
(These have both circulated around the web, although the second is attributed to Josh Wills.)

Last week, I had the pleasure of speaking at a Data Science Summit for Microsoft.  The summit had a lot of beautiful stuff -- notably, Jer Thorp (a "digital artist") who specializes in amazing graphics and the graphical representation of data.  It had inspiring stuff, such as Jake Porway, the founder of DataKind, an organization that provides (volunteer) data science services to non-profits around the world.  And, it had useful stuff, such as presentations by clients and some discussions of new products.

The term Data Science has always left me a bit perplexed.  Once upon a time, I remember having to ftp source code from academic sites in order to run specific algorithms.  Or, alternatively, we had to program the algorithms ourselves.  The advent of useful tools really made an old dictum true:  "In an analysis project, you spend 20% of the time understanding the problem, 80% of the time massaging the data, and the rest of the time doing modeling."  That wasn't true in the days when we had to develop our own code.  It comes close to being true now.

I find the focus on programming in data science to be problematic.  For me personally, at least, programming is a distraction from understanding data.  The issue isn't a personal aversion to coding.  The issue is that programming requires very careful attention to detail to get things to work just right, while data analysis requires a higher-level view: understanding the data and getting it to solve real-world problems.  Keeping the low-level and the high-level focus at the same time is very difficult to pull off.

In any case, I have come to the conclusion that Data Science is just another term in a long line of terms.  Whether called statistics or customer analytics or data mining or analytics or data science, the goal is the same.  Computers have been and are gathering incredible amounts of data about people, businesses, markets, economies, needs, desires, and solutions -- and there will always be people who take up the challenge of transforming that data into solutions.

Tuesday, August 26, 2014

An Achievement on Stack Overflow

Last Friday (August 22nd), I achieved a milestone on Stack Overflow, a computing question-and-answer site, by hitting 200,000 points.  So far this year, I have also earned more points than anyone else, a testament to my obsession with the site (you can see yearly user rankings here).  As for overall points, I will never match the leader, Jon Skeet, who had a head start of many years.

My answers are almost exclusively related to SQL and various databases.  The site is highly geared toward "tools" questions, so there are few general analysis questions.

So, this post shares some of my thoughts and my history on the site.

Clearly, I have helped a lot of people on Stack Overflow, around the world.  The most rewarding part is the thank-yous from real people working on real problems.  On several occasions, I have helped speed up code by more than 99% -- turning hours of drudgery into seconds or minutes of response time.

But answering questions has helped me too:
  • My technical knowledge of databases has greatly improved, particularly the peculiarities (strengths and weaknesses) of each database engine.
  • I have learned patience for people who are confused by concepts in SQL.
  • I have (hopefully) learned how to explain concepts to people with different levels of competence.
  • I have learned how to figure out (sometimes) what someone is really asking.
  • I have a strong appreciation for what SQL can do and what SQL cannot do.
  • It has definitely increased the number of hits when I egosurf.
A few months after starting, I stopped down voting questions and answers.  "Down voting" is usually seen as "not-nice", making the other person defensive, confused, and perhaps angry.   A lesson for real life:  being mean (by down voting) is not nearly so useful as offering constructive criticism (by commenting).


This all started in January, 2012 (a bit over two and a half years ago).  The reason was simple:  I was writing a system called The Netting Machine for the Lehman Brothers Estate and it was stretching my knowledge of SQL Server.  One particular problem involved dynamic queries.  Google kept directing me to the same question on Stack Overflow.  The best answer was close to what I needed, but not quite.  It was only half-way there.  The third time I landed on the page, I added my own answer.  This was actually for quite selfish reasons:  the next time Google took me there, I wanted to see the full answer.

Lo and behold, my answer was accepted and upvoted.  When the OP ("original poster" -- Stack Overflow lingo for the person asking the question) accepted my answer, s/he had to unaccept another.  That answer was by Aaron Bertrand, a SQL Server guru whose name I recognized from his instructive blog posts.  Aaron commented about the "unaccept".  In the back of my mind, I thought: if Aaron thinks this is important, then there must be something to it.  Ultimately, I can blame Aaron (whom I have not yet met in person) for getting me hooked.

For a few months, I sporadically answered questions.  Then, in the first week of May, my Mom's younger brother passed away.  That meant lots of time hanging around family, planning the funeral, and the like.  Answering questions on Stack Overflow turned out to be a good way to get away from things.  So, I became more intent.

Stack Overflow draws you in not only with points but also with badges and privileges.  Each time I logged in, the system "thanked" me for my participation with more points, more badges, and more privileges.  This continued.  One day (probably in June), I hit the daily reputation cap of 200 points from upvotes (you also get points when an answer is accepted or someone offers a bounty).  One week, I hit 1,000 points.  One month, 5,000 points.  As someone who is mesmerized by numbers, I noticed these things.

Last year, I hit 100,000 points in September and then slowed down.  I figured that six figures was enough, and I had other, more interesting things to do -- a trip to Myanmar, my sister's wedding, our apartment in Miami, classes to teach (San Francisco, Amsterdam, Chicago, San Antonio, Orlando), and so on.

I didn't start 2014 with the intention of spending too much of my time on the site.  But three things happened in January.  The first was a fever with a rash on my face.  It kept me home with not enough to do.  So, I answered questions on Stack Overflow.  Then, I had an attack of gout.  That kept me home with not enough to do.  And finally, the weather in January in New York was, well, wintery -- lots of cold and lots of snow.  More reasons to stay home and answer questions.

By the end of January, I was the top scorer for the month.  "Hey, if I can do it in January, let's see what happens in February."  I had thought of relenting in April:  we flew to Greece and spent two nights on Mount Athos.  Mount Athos is a peninsula in northern Greece, devoted to twenty Orthodox monasteries -- and nothing else.  It is inhabited by a few thousand monks living medieval lifestyles.  The only way to visit is as a "pilgrim", staying at a monastery.  An incredible experience.  No internet.  But, I was able to make up the point deficit on Stack Overflow.

This year, each month that passes is another month where I seem to be the top point-gatherer on Stack Overflow.  At this point, I might as well make it to the end of the year.  I don't know if I will, but I do hope to help a few other people and to learn more about databases and how people are using them.

Tuesday, August 19, 2014

Gordon Linoff and Michael Berry Interviewed for JMP's Analytically Speaking Series

On August 13th, Anne Milley of JMP interviewed us for the webcast series, Analytically Speaking. The full program is available for viewing here. We discussed a range of topics including the upcoming supplement to Data Mining Techniques (3rd Edition). The supplement will contain hands-on exercises to accompany the book.

Saturday, May 31, 2014

Loading IP Test Data Into Postgres

Recently, I was trolling around the internet looking for some IP address data to play with.  Fortunately, I stumbled across MaxMind's Geolite Database, which is available for free.    All I have to do is include this notice:

This product includes GeoLite2 data created by MaxMind, available from <a href="http://www.maxmind.com">http://www.maxmind.com</a>.

That's easy enough.  The next question is how to get this into my local Postgres database.  A bit over a year ago, I very happily gave up on my Dell computer and opted for a Mac.  One downside to a Mac is that SQL Server doesn't run on it (obviously my personal opinion).  Happily, Postgres does and it is extremely easy to install by going to postgresapp.com.   An interface similar enough to SQL Server Management Studio (called pgadmin3) is almost as easy to install by going here.

So, the next problem is getting the MaxMind data into Postgres.  Getting the two tables into Postgres is easy, using the copy command.  The challenge is IPv6 versus IPv4 addresses.  The data is in IPv6 format, with a subnet mask to represent ranges.  Most of us who are familiar with IP addresses are familiar with IPv4 addresses.  These are 32 bits and look something like this:  173.194.121.17 (this happens to be an address for www.google.com, obtained by running ping www.google.com in a terminal window).  Alas, the data from MaxMind uses IPv6 values rather than IPv4.

In IPv6, the above would look like: ::ffff:173.194.121.17 (to be honest, this is a hybrid format for representing IPv4 addresses in IPv6 address space).  And the situation is a bit worse, because these records contain address ranges.  So the address range is really:  ::ffff:173.194.0.0/112.

The "112" is called a subnet mask.  And, IPV4 also uses them.  In IPV4, they represent the initial range of digits, so they range from 1-32, with a number like "24" being very common.  "112" doesn't make sense in a 32-bit addressing scheme.   To fast forward, the "112" subnet mask for IPV6 corresponds to 16 in the IPV4 world.  This means that the first 16 bits are for the main network and the last 16 bits are for the subnet.  That is, the addresses range from 173.194.0.0 to 173.194.255.255.  The relationship between the subnet mask for IPV6 and IPV4 is easy to express:  the IPV4 subnet mask is the IPV6 subnet mask minus 96.

I have to credit this blog for helping me understand this, even though it doesn't give the exact formula.  Here, I am going to shamelessly reproduce a figure from that blog (along with its original attribution):
[Figure: the structure of an IPv6 address, courtesy of ls-a.org]

This figure says the following.  In the hybrid format, the IPv4 address occupies only the last 32 bits of the IPv6 address; the first 96 bits (the rest of the address, including the ::ffff: prefix) are fixed and are never part of the IPv4 network.  Those 96 bits can therefore be subtracted out of the subnet mask.  To be honest, I am not 100% positive about my interpretation.  But it does seem to work.  Google does indeed own exactly this address range.
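The arithmetic is easy to check in Python using the standard ipaddress module (this is just an illustrative sketch of the conversion, separate from the Postgres load below):

import ipaddress

# an IPv4-mapped IPv6 range like the ones in the MaxMind data
net6 = ipaddress.ip_network('::ffff:173.194.0.0/112')

# the IPv4 address occupies the last 32 bits, so the IPv4 prefix length
# is the IPv6 prefix length minus 96
v4_prefix = net6.prefixlen - 96                   # 112 - 96 = 16
v4_address = net6.network_address.ipv4_mapped     # 173.194.0.0

net4 = ipaddress.ip_network('{0}/{1}'.format(v4_address, v4_prefix))
print(net4)                   # 173.194.0.0/16
print(net4[0], net4[-1])      # 173.194.0.0 173.194.255.255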

The Postgres code for creating the table then goes as follows:

create table ipcity_staging (
    network_start_ip varchar(255),
    network_mask_length int,
    geoname_id int,
    registered_country_geoname_id int,
    represented_country_geoname_id int,
    postal_code varchar(255),
    latitude decimal(15, 10),
    longitude decimal(15, 10),
    is_anonymous_proxy int,
    is_satellite_provider int
);

copy public.ipcity_staging
    from '...data/MaxMind IP/GeoLite2-City-CSV_20140401/GeoLite2-City-Blocks.csv'
    with CSV HEADER;

create table ipcity (
    IPCityId serial not null,
    IPStart int not null,
    IPEnd int not null,
    IPStartStr varchar(255) not null,
    IPEndStr varchar(255) not null,
    GeoNameId int,
    GeoNameId_RegisteredCountry int,
    GeoNameId_RepresentedCountry int,
    PostalCode varchar(255),
    Latitude decimal(15, 10),
    Longitude decimal(15, 10),
    IsAnonymousProxy int,
    IsSatelliteProvider int,
    unique (IPStart, IPEnd),
    unique (IPStartStr, IPEndStr)
);

-- Convert each IPv4-mapped row in the staging table (those starting with '::ffff:') into
-- IPv4 start and end addresses, both as dotted-quad strings and as integers, and load them into ipcity.
insert into ipcity(IPStart, IPEnd, IPStartStr, IPEndStr, GeoNameId, GeoNameId_RegisteredCountry, GeoNameId_RepresentedCountry,
                   PostalCode, Latitude, Longitude, IsAnonymousProxy, IsSatelliteProvider
                  ) 
    select IPStart, IPEnd, IPStartStr, IPEndStr, GeoName_Id, registered_country_geoname_id, represented_country_geoname_id,
           Postal_Code, Latitude, Longitude, Is_Anonymous_Proxy, Is_Satellite_Provider
    from (select network_mask_length - 96,
                 hostmask(inet (substr(network_start_ip, 8) || '/' || network_mask_length - 96)) ,
                 inet(host(inet (substr(network_start_ip, 8) || '/' || network_mask_length - 96) )) |
                 hostmask(inet (substr(network_start_ip, 8) || '/' || network_mask_length - 96)
                ) as ipend_inet,
                substr(network_start_ip, 8) || '/' || network_mask_length - 96,
                ((split_part(IPStartStr, '.', 1)::int << 24) +
                 (split_part(IPStartStr, '.', 2)::int << 16) +
                 (split_part(IPStartStr, '.', 3)::int << 8) +
                 (split_part(IPStartStr, '.', 4)::int)
                ) as IPStart,
                ((split_part(IPEndStr, '.', 1)::int << 24) +
                 (split_part(IPEndStr, '.', 2)::int << 16) +
                 (split_part(IPEndStr, '.', 3)::int << 8) +
                 (split_part(IPEndStr, '.', 4)::int)
                ) as IPEnd,
                st.*
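          -- the innermost subquery strips the leading '::ffff:' (substr from position 8), reduces
          -- the prefix length by 96, and derives the IPv4 start and end addresses as dotted strings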
          from (select st.*,
                       host(inet (substr(network_start_ip, 8) || '/' || network_mask_length - 96)) as IPStartStr,
                       host(inet(host(inet (substr(network_start_ip, 8) || '/' || network_mask_length - 96) )) |
                            hostmask(inet (substr(network_start_ip, 8) || '/' || network_mask_length - 96))
                           ) as IPEndStr
                from ipcity_staging st 
                where network_start_ip like '::ffff:%'
               ) st
         ) st;
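
The shift-and-add arithmetic in the subquery is just the usual conversion of a dotted-quad address into a single 32-bit number.  Here is the same calculation sketched in Python (purely for illustration; it is not part of the load):

def ip_to_int(dotted):
    """Convert a dotted-quad IPv4 string such as '173.194.0.0' into an integer."""
    a, b, c, d = (int(part) for part in dotted.split('.'))
    return (a << 24) + (b << 16) + (c << 8) + d

print(ip_to_int('173.194.0.0'))        # 2915172352
print(ip_to_int('173.194.255.255'))    # 2915237887

# Note: the result can exceed 2**31 - 1 (anything at or above 128.0.0.0 does),
# so a 64-bit integer -- bigint in SQL -- is needed to hold every possible address.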