Many users are probably aware of Shazam, one of the few mobile applications that really seems to live up to the notion "Wow! It's Magic." When you are listening to music, you can run the application (presumably on your phone), click "tag it", and after a few tens of seconds of listening and processing, Shazam will tell you what you are listening to, the artist and other details.
Recently, I found an interesting article describing how it works. A presentation, with more pictures and less text, is available here. Kudos to the company and to the author, Avery Wang, for providing technical detail.
The paper does a very good job explaining the details of the algorithm. My goal here is to describe the algorithm from a higher perspective, because it is an interesting example of a memory-based reasoning algorithm, that is, an algorithm that combines information from "nearest neighbors" to arrive at a prediction.
Assume that we have an excerpt that we want to find in a database of millions of songs. A first approach might be an exhaustive search of the database, comparing the excerpt against every position in every song. This would take a long time.
Alternatively, we can frame the problem as follows: for each song in the database, what is the longest period of time where the excerpt overlaps part of the song? The nearest neighbors are the songs with the longest overlap, and, in general, we would choose the single one with the longest overlap.
The problem is simple to describe, but the real world intrudes quickly. The songs in the database are probably quite clean acoustically. The excerpt, however, is subject to numerous problems: background noise, loss of fidelity due to compression as the excerpt is transmitted, poor (or at least different) equipment for recording the excerpt, and so on.
Fortunately, the world of acoustics has something of a solution for this, called the "frequency domain". This is a map of all the sound frequencies, taken at a periodic interval -- say, one second. If "frequency domain" conjures up memories of things like Fourier Transforms, then you really do understand the subject.
However, for our purposes, it is enough to say that the frequency domain for a song produces a very, very bumpy curve for each second of the song -- each point is the strength of a particular acoustic frequency at that point in the song. The song can be thought of as a collection of all these curves. Taken together, these curves might resemble a map of a very hilly area. This would be a three-dimensional map of the song.
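To make the "one bumpy curve per second" idea concrete, here is a minimal sketch in Python. It is not Shazam's code: the function names, the toy 64-samples-per-second signal, and the use of a slow, naive Fourier transform (a real system would use an FFT library and overlapping windows) are all assumptions chosen to keep the example self-contained.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Strength of each frequency bin for one slice of audio (naive DFT)."""
    n = len(samples)
    mags = []
    for k in range(n // 2):  # for real-valued input the upper bins mirror the lower ones
        total = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        mags.append(abs(total))
    return mags

def spectrogram(samples, slice_len):
    """Cut the signal into consecutive slices and transform each one.

    The result is a list of curves, one per slice: the "hilly map" of the song.
    """
    return [dft_magnitudes(samples[i:i + slice_len])
            for i in range(0, len(samples) - slice_len + 1, slice_len)]

# Toy signal (hypothetical): 64 samples per second, two seconds long,
# a 5 Hz tone in the first second and a 12 Hz tone in the second.
RATE = 64
signal = [math.sin(2 * math.pi * 5 * t / RATE) for t in range(RATE)]
signal += [math.sin(2 * math.pi * 12 * t / RATE) for t in range(RATE)]

curves = spectrogram(signal, RATE)
# With one-second slices, bin k corresponds to k Hz.
loudest = [max(range(len(c)), key=c.__getitem__) for c in curves]
print(loudest)
```

Each entry of `curves` is one of the bumpy curves described above, and `loudest` picks out the tallest bump in each second.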
This map has peaks analogous to the tops of hills (or perhaps the tops of buildings in a city). Taken together, these peaks are called a constellation, and they pretty much uniquely identify the song, regardless of all the problems mentioned above. That is, the constellations are resilient to background noise, loss of fidelity, and so on.
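One simple way to pick out the hilltops is to keep every point that is both strong enough and stronger than all of its immediate neighbors in time and frequency. The sketch below assumes the map is a small grid of numbers (rows are seconds, columns are frequency bins); the function name `find_peaks`, the threshold, and the toy data are illustrative assumptions, not details from the paper.

```python
def find_peaks(grid, min_strength):
    """Return (second, freq_bin) positions that are strictly stronger than
    every neighbouring cell and at least `min_strength`.

    Assumes every row of `grid` has the same length.
    """
    peaks = []
    for t, row in enumerate(grid):
        for f, value in enumerate(row):
            if value < min_strength:
                continue
            neighbours = [
                grid[tt][ff]
                for tt in range(max(0, t - 1), min(len(grid), t + 2))
                for ff in range(max(0, f - 1), min(len(row), f + 2))
                if (tt, ff) != (t, f)
            ]
            if all(value > n for n in neighbours):
                peaks.append((t, f))
    return peaks

# Toy map: rows are seconds, columns are frequency bins.
grid = [
    [1, 2, 1, 0, 0],
    [2, 9, 2, 0, 1],
    [1, 2, 1, 3, 7],
    [0, 0, 1, 2, 3],
]
constellation = find_peaks(grid, min_strength=5)

# Add a little "background noise" to every other cell and look again.
noisy = [[v + 0.4 * ((t + f) % 2) for f, v in enumerate(row)]
         for t, row in enumerate(grid)]
noisy_constellation = find_peaks(noisy, min_strength=5)
print(constellation, noisy_constellation)
```

In this toy case the same two peaks survive the added noise, which is the property that makes constellations useful: low-level noise raises the valleys a little but rarely moves the hilltops.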
Of course, we can do this for the songs in the database in advance. And, we can do this processing for a single excerpt pretty quickly.
So, the problem of finding the song with the longest overlap in seconds with the excerpt is now handled by finding consecutive seconds in a song where the frequency domain peaks match the frequency domain peaks from the excerpt. This is still a daunting problem, because there are so many peaks available. In other words, comparing one excerpt to millions of songs requires comparing the hundreds of peaks in the excerpt to the many, many billions in the database -- very time consuming.
Shazam takes a very clever approach to this problem. The algorithm treats each peak as an anchor, and creates peak-pairs with other peaks "close" to the anchor. Here, "close" means that the other peaks are within a few seconds of the first and not too different in frequency. These peak-pairs are then calculated for both the song and the excerpt. The pairs are used to find sets of anchors that match between each song and the excerpt. Because the algorithm is looking for exact matches, it can use some programming tricks to make things even faster (these are described in the paper).
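The peak-pair idea can be sketched in a few lines of Python. Because pairs are compared for exact equality, they can serve as dictionary keys, which turns matching into a lookup rather than a scan. Everything specific here is an assumption made for illustration: the key layout `(anchor frequency, other frequency, time gap)`, the thresholds `max_dt` and `max_df`, the function names, and the toy peak lists. The paper describes its own encoding.

```python
from collections import defaultdict

def peak_pairs(peaks, max_dt=3, max_df=10):
    """Return ((anchor_freq, other_freq, dt), anchor_time) for every pair of
    peaks where the second follows the anchor closely in time and frequency."""
    pairs = []
    for t1, f1 in peaks:
        for t2, f2 in peaks:
            dt = t2 - t1
            if 0 < dt <= max_dt and abs(f2 - f1) <= max_df:
                pairs.append(((f1, f2, dt), t1))
    return pairs

def build_index(songs):
    """Map each peak-pair key to the (song, anchor_time) places it occurs."""
    index = defaultdict(list)
    for name, peaks in songs.items():
        for key, t in peak_pairs(peaks):
            index[key].append((name, t))
    return index

def matching_anchors(index, excerpt_peaks):
    """For each song, list (song_time, excerpt_time) of exactly matching pairs."""
    hits = defaultdict(list)
    for key, t_excerpt in peak_pairs(excerpt_peaks):
        for name, t_song in index.get(key, []):
            hits[name].append((t_song, t_excerpt))
    return dict(hits)

# Toy constellations: lists of (second, frequency_bin) peaks.
songs = {
    "song_a": [(0, 10), (1, 14), (2, 12), (4, 18), (5, 15)],
    "song_b": [(0, 30), (1, 33), (3, 31), (4, 36)],
}
index = build_index(songs)

# The excerpt is song_a starting at second 2, with its clock reset to 0.
excerpt = [(0, 12), (2, 18), (3, 15)]
hits = matching_anchors(index, excerpt)
print(hits)
```

Note that the key contains only the time gap between the two peaks, not their absolute times, so the excerpt matches wherever in the song it was recorded. In the toy data every match has the same difference between song time and excerpt time, which is what a true match should look like.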
In the end, there is a set of anchors for each song matched by a given excerpt. For each song, these are scanned to find consecutive seconds where the anchors in each second overlap. The longest period of overlap serves as the distance measure between the excerpt and the song (strictly speaking it is a similarity: the longer the overlap, the nearer the neighbor).
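The final scoring step, as described in this post, can be sketched as follows. The sketch assumes that the matching step has already produced, for each candidate song, a list of `(song_second, excerpt_second)` pairs; the names `longest_run` and `best_match` and the toy data are hypothetical, and the paper's own scoring may differ in its details.

```python
def longest_run(seconds):
    """Length of the longest stretch of consecutive integers in `seconds`."""
    present = set(seconds)
    best = 0
    for s in present:
        if s - 1 in present:
            continue  # s is in the middle of a run, not at its start
        length = 1
        while s + length in present:
            length += 1
        best = max(best, length)
    return best

def best_match(hits):
    """Score each song by its longest run of matched seconds; return the winner.

    Assumes `hits` is non-empty.
    """
    scores = {name: longest_run(t_song for t_song, _ in pairs)
              for name, pairs in hits.items()}
    return max(scores, key=scores.get), scores

# Toy matches: (song_second, excerpt_second) for each candidate song.
hits = {
    "song_a": [(40, 0), (41, 1), (42, 2), (43, 3), (90, 5)],
    "song_b": [(7, 0), (9, 2), (10, 3)],
}
winner, scores = best_match(hits)
print(winner, scores)
```

Here `song_a` wins with four consecutive matched seconds; the stray match at second 90 and the gappy matches in `song_b` contribute little, which is how isolated coincidental matches get filtered out.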
The algorithm is quite clever on several different levels. I do think that understanding it at a high level is valuable, especially since it can provide guidance to other very difficult recognition problems. On the other hand, when I use Shazam and it identifies a song, I still think it's magic.