Friday, October 23, 2009

Counting Users From Unique Cookies

Counting people/unique visitors/users at web sites is a challenge, and is something that I've been working on for the past couple of months for the web site of a large media company. The goal is to count the number of distinct users over the course of a month. Counting distinct cookies is easy; the challenge is turning these into human beings. These challenges include:

  • Cookie deletions. A user may manually delete their cookies one or more times during the month.

  • Disallowing persistent cookies. A user may allow session cookies (which last only while the browser is running), but not allow cookies to be committed to disk.

  • Multiple browsers. A single user may use multiple browsers on the same machine during the month. This is particularly true when the user upgrades his or her browser.

  • Multiple machines. A single user may use multiple machines during the month.

And, I have to admit, the data that I'm using has one more problem, which is probably not widespread. The cookies are actually hashed into four bytes. This means that it is theoretically possible for two "real" cookies to have the same hash value. Not only is it theoretically possible, but it happens (although not too frequently).
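With four-byte hashes, collisions follow the birthday problem. A quick sketch makes the point (assuming hash values are uniform over 2^32 buckets; the cookie volume below is made up for illustration, not the actual data):

```python
# Expected number of colliding pairs when hashing n distinct cookies
# uniformly into d = 2^32 buckets (birthday-problem approximation):
#   E[pairs] ~ n * (n - 1) / (2 * d)

def expected_collision_pairs(n, d=2**32):
    return n * (n - 1) / (2 * d)

# A hypothetical month with 10 million distinct cookies:
n = 10_000_000
print(expected_collision_pairs(n))  # roughly 11,600 colliding pairs
```

So at realistic traffic volumes, some hash collisions are not just possible but expected.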

I came across a very good blog by Angie Brown that lays out the assumptions in making the calculation, including a spreadsheet for varying the assumptions. One particularly interesting factoid from the blog is that the number of cookies that appear only once during the month exceeds the number of unique visitors, even under quite reasonable assumptions. Where I am working, one camp believes that the number of unique visitors is approximated by the number of unique cookies.

A white paper by comScore states that the average user has 2.5 unique cookies per month due to cookie deletion. The paper is here, and a PR note about it is here. This paper is widely cited, although it has some serious methodological problems, because its data sources are limited to DoubleClick and Yahoo!.

In particular, Yahoo! is quite clear about its cookie expiration policies (two weeks for users clicking the "keep me logged in for 2 weeks" box and eight hours for Yahoo! mail). I do not believe that this policy has changed significantly in the last few years, although I am not 100% sure.

The white paper from comScore does not mention these facts, which means that most of the cookies that a user has are due to automatic deletion, not user behavior. How many distinct cookies does a user have, due only to the user's behavior?

If I make the following assumptions:

  • The Yahoo! users have an average of 2.5 cookies per month.

  • comScore used the main Yahoo! cookies, and not the Yahoo! mail cookies.

  • All Yahoo! users use the site consistently throughout the month.

  • All Yahoo! users have the "keep me logged in for 2 weeks" box checked.

Then I can estimate the number of cookies per user per machine per month. The average user would have 31/14 = 2.2 cookies per month, strictly due to the automatic deletion. This leaves 0.3 cookies per month due to manual deletion. Of course, the user starts with one cookie. So the average number of cookies per month per user per machine is 1.3.
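The arithmetic behind that estimate can be written out as a short sketch (the numbers come from the assumptions listed above, not from measurement):

```python
# Back-of-the-envelope estimate of cookies per user per machine per month,
# under the assumptions listed above.

DAYS_IN_MONTH = 31
EXPIRATION_DAYS = 14       # the "keep me logged in for 2 weeks" window
OBSERVED_COOKIES = 2.5     # average cookies per user per month (white paper figure)

automatic = DAYS_IN_MONTH / EXPIRATION_DAYS   # ~2.2 cookies from auto-expiration
manual = OBSERVED_COOKIES - automatic         # ~0.3 cookies from manual deletion
per_user = 1 + manual                         # the user starts with one cookie

print(round(automatic, 1), round(manual, 1), round(per_user, 1))  # 2.2 0.3 1.3
```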

By the way, I find this number much more reasonable. I also think that it misses the larger source of overcounting -- users who use more than one machine. Unfortunately, there is no single approach to correcting for this. In the case that I'm working on, we have the advantage that a minority of users are registered, so we can use them as a sample.

PAW conference, privacy issues, déjà vu

I attended the Predictive Analytics World conference this week and I thought it was very successful. Although the conference was fairly small, I heard several interesting presentations and ran into several interesting attendees. In other words, I think the density of interesting people on both sides of the podium was higher than at some larger conferences.

One of the high points for me was a panel discussion on consumer privacy issues. Normally, I find panel discussions a waste of time, but in this case the panel members had clearly given a lot of thought to the issues and had some interesting things to say. The panel consisted of Stephen Baker, a long-time Business Week writer and author of The Numerati (a book I haven't read, but which, I gather, suggests that people like me are using our data mining prowess to rule the world); Jules Polonetsky, currently of the Future of Privacy Forum, and previously Chief Privacy Officer and SVP for Consumer Advocacy at AOL, Chief Privacy Officer and Special Counsel at DoubleClick, and New York City Consumer Affairs Commissioner in the Giuliani administration; and Mikael Hagström, Executive Vice President, EMEA and Asia Pacific for SAS.

I was particularly taken by Jules's idea that companies that use personal information to provide services that would not otherwise be possible should agree on a universal symbol for "smart," something like the easily recognizable symbol for recycling. Instead of (well, I guess it would have to be in addition to) a privacy policy that no one reads, and which is all about how little they know about you and how little use they will make of it, the smart symbol on a web site would be a brag about how well the service provider can leverage your profile to improve your experience. Clicking on it would lead you to the details of what they now know about you, how they plan to use it, and what's in it for you. You would also be offered an opportunity to fill in more blanks and make corrections. Of course, every "smart" site would also have a "dumb" version for users who don't choose to opt in.

This morning, as I was telling Gordon about all this in a phone call, we started discussing some of our own feelings about privacy issues, many of which revolve around the power relationship between us as individuals and the organization wishing to make use of information about us. If the supermarket wants to use my loyalty card data to print coupons for me, I really don't mind. If an insurance company wants to use that same loyalty card data to deny me insurance because I buy too much meat and alcohol, I mind a lot. As I gave that example, I had an overwhelming feeling of déjà vu. Or perhaps it was déjà lu? In fact, it was déjà écrit! I had posted a blog entry on this topic ten years ago, almost to the day. Only there weren't any blogs back then, so attention-seeking consultants wrote columns in magazines instead. This one, which appeared in the October 26, 1999 issue of Intelligent Enterprise, said what I was planning to write today pretty well.

Saturday, October 17, 2009

See you at a conference soon

It appears to be conference season. I'll be taking part in several over the next several weeks and would enjoy meeting any readers who will also be in attendance. These are all business-oriented conferences with a practical rather than academic focus.

First up is Predictive Analytics World in Alexandria, VA October 20-21. I will be speaking on our experience using survival analysis to forecast future subscriber levels for a variety of subscription-based businesses. This will be my first time attending this conference, but the previous iteration in San Francisco got rave reviews.

Next on my calendar is M2009 in Las Vegas, NV October 26-27. This conference is sponsored by SAS, but it is a general data mining conference, not a SAS users group or anything like that. I've been attending since the first one in 1998 (I think) and have watched it grow into the best-attended of the annual data mining conferences. Gordon Linoff and I will be debuting the latest version of our three-day class Data Mining Techniques: Theory and Practice as part of the post-conference training October 28-30. Too bad about the location, but I guess conference space is cheap in Las Vegas.

I only get to spend a partial weekend at home before leaving for TDWI World in Orlando, FL. This is not, strictly speaking, a data mining conference (the DW stands for "data warehouse"), but once people have all that data warehoused, they may want to use it for predictive modeling. I will be teaching a one-day class called Predictive Modeling for Non-Statisticians on Tuesday, November 3. That is election day where I live, so I'd better get my absentee ballot soon.

The following week, it's off to San Jose, CA for the Business Analytics Summit November 12-13. I believe this is the first annual running of this conference. I'll be on a panel discussing data mining and business analytics. Perhaps I will discover what the phrase "business analytics" means! That would come in handy since I will be teaching a class on Marketing Analytics at Boston College's Carroll School of Management next semester. My guess: it means the same as data mining, but some people don't like that word so they've come up with a new one.

Friday, October 16, 2009

SVM with redundant cases

We received the following question from a reader:

I just discovered this blog -- it looks great. I apologize if this question has been asked before -- I tried searching without hits.

I'm just starting with SVMs and have a huge amount of data, mostly in the negative training set (2e8 negative examples, 2e7 positive examples), with relatively few features (eg less than 200). So far I've only tried linear SVM (liblinear) due to the size, with middling success, and want to under-sample at least the negative set to try kernels.

A very basic question. The bulk of the data is quite simple and completely redundant -- meaning many examples of identical feature sets overlapping both positive and negative classes. What differs is the frequency in each class. I think I should be able to remove these redundant samples and simply tell the cost function the frequency of each sample in each class. This would reduce my data by several orders of magnitude.

I have been checking publications on imbalanced data but I haven't found this simple issue addressed. Is there a common technique?

Thanks for any insight. Will start on your archives.

There are really two parts to the question. The first part is a general question about using frequencies to reduce the number of records. This is a fine approach. You can list each distinct record only once along with its frequency, which counts how many times a particular pattern of feature values (including the class assigned to the target) appears.

The second part involves the effect on the SVM algorithm of having many cases with identical features but different assigned classes. That sounded problematic to me, but since I am not an expert on support vector machines, I forwarded your question to someone who is -- Lutz Hamel, author of Knowledge Discovery with Support Vector Machines.
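To make the frequency idea concrete, here is a minimal sketch (the names and toy data are mine, not the reader's) that collapses identical rows into distinct (features, class) patterns with counts:

```python
from collections import Counter

def deduplicate(rows):
    """Collapse identical (features, label) rows into one record plus a frequency.

    rows: iterable of (feature_sequence, label) pairs.
    Returns a dict-like Counter mapping (feature_tuple, label) -> count.
    """
    return Counter((tuple(features), label) for features, label in rows)

# Toy data: three identical positive rows, plus the same pattern once negative.
rows = [((1, 0), +1), ((1, 0), +1), ((1, 0), +1), ((1, 0), -1), ((0, 1), -1)]
freqs = deduplicate(rows)
print(freqs[((1, 0), +1)])  # 3
```

Some SVM implementations accept per-instance weights, in which case the frequency can serve directly as the weight; check your library's documentation rather than assuming support.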

Here is his reply:

I have some fundamental questions about the appropriateness of SVM for this classification problem:

Identical observation feature vectors produce different classification outcomes. If this is truly meaningful then we are asking the SVM to construct a decision plane through a point with some of the examples in this point classified as positive and some as negative. This is not possible. This means one of two things: (a) we have a sampling problem where different observations are mapped onto the same feature vectors. (b) we have a representation problem where the feature vector is not powerful enough to distinguish observations that should be distinguished.

It seems to me that this is not a problem of a simple unbalanced dataset but a problem of encoding and perhaps coming up with derived features that would make this a problem suitable for decision plane based classification algorithms such as SVMs. (is assigning the majority label to points that carry multiple observations an option?)

SVM tries to find a hyperplane that separates your classes. When (as is very common with things such as marketing response data, or default, or fraud, or pretty much any data I ever work with) there are many training cases where identical values of the predictors lead to different outcomes, support vector machines are probably not the best choice. One alternative you could consider is decision trees. So long as there is a statistically significant difference in the distribution of the target classes, a decision tree can make splits. Any frequently occurring pattern of features will form a leaf and, taking the frequencies into account, the proportion of each class in the leaf provides estimates of the probabilities for each class given that pattern.
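That leaf-level probability estimate is easy to illustrate with the frequency representation (a toy sketch with made-up counts, not the reader's data):

```python
from collections import defaultdict

def class_probabilities(freqs):
    """Estimate P(class | feature pattern) from (pattern, class) -> count data.

    This mirrors what a decision tree leaf does when every record with the
    same pattern lands in the same leaf: the frequency-weighted class
    proportions become the probability estimates.
    """
    totals = defaultdict(int)
    for (pattern, label), count in freqs.items():
        totals[pattern] += count
    return {
        (pattern, label): count / totals[pattern]
        for (pattern, label), count in freqs.items()
    }

# Toy counts: pattern (1, 0) appears 30 times positive and 10 times negative.
freqs = {((1, 0), +1): 30, ((1, 0), -1): 10, ((0, 1), -1): 5}
probs = class_probabilities(freqs)
print(probs[((1, 0), +1)])  # 0.75
```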