Wednesday, April 8, 2009

MapReduce, Hadoop, Everything Old Is New Again

One of the pleasures of aging is watching younger generations discover pleasures one remembers discovering long ago--sex, porcini, the Beatles. Occasionally, though, it is frustrating to see old ideas rediscovered as new ones. I am especially prone to that sort of frustration when the new idea is one I once championed unsuccessfully. Lately, I've been feeling as though I'd always been a Beatles fan while all my friends preferred Herman's Hermits. Of course, I'm glad to see them coming around to my point of view, but still . . .

What brings these feelings up is all the excitement around MapReduce. It's nice to see a parallel programming paradigm that separates the description of the mapping from the description of the function to be applied, but at the same time, it seems a bit underwhelming.

You see, I literally grew up with the parallel programming language APL. In the late 60's and early 70's my father worked at IBM's Yorktown Heights research center in the group that developed APL, and I learned to program in that language at the age of 12. In 1982 I went to Analogic Corporation to work on an array processor implementation of APL. In 1986, while still at Analogic, I read Danny Hillis's book The Connection Machine and realized that he had designed the real APL Machine. I decided I wanted to work at the company that was building Danny's machine. I was hired by Guy Steele, who was then in charge of the software group at Thinking Machines. In the interview, all we talked about was APL.

The more I learned about the Connection Machine's SIMD architecture, the more perfect a fit it seemed for APL, or for an APL-like language in which hypercubes of data may be partitioned into subcubes of any rank so that arbitrary functions can be applied to them. In APL and its descendants, such as J, reduction is just one of a rich family of ways that the results of applying a function to various data partitions can be glued together to form a result. I described this approach to parallel programming in a paper published in ACM SIGPLAN Notices in 1990, but as far as I know, no one ever read it. (You can, though. It is available here.)

My dream of implementing APL on the Connection Machine gradually faded in the face of commercial reality. The early Connection Machine customers, having already been forced to learn Lisp, were not exactly clamouring for another esoteric language; they wanted Fortran. And Fortran is what I ended up working on. As you can tell, I still have regrets. If we'd implemented a true parallel APL back then, no one would have to invent MapReduce today.
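
To make the comparison concrete, here is a minimal sketch in Python rather than APL, so the symbols don't get in the way. It shows first the word-count example usually used to introduce MapReduce, then the same computation in the apply-then-glue style I have in mind, where keyed reduction is only one member of the glue family. The names are illustrative; this is not Hadoop's API or any particular APL implementation.

```python
from collections import Counter
from functools import reduce
from itertools import accumulate

docs = ["the cat sat", "the cat ran", "a dog sat"]

# MapReduce style: a map phase emits (key, value) pairs, then the
# values sharing a key are grouped and reduced.
pairs = [(word, 1) for doc in docs for word in doc.split()]  # map
by_key = Counter()
for word, n in pairs:                                        # group by key and reduce
    by_key[word] += n

# Apply-then-glue style: apply a function to each partition of the
# data, then choose how to glue the partial results together.
partials = [Counter(doc.split()) for doc in docs]  # apply, one partition each
total = reduce(lambda a, b: a + b, partials)       # glue by reduction (APL's +/)
running = list(accumulate(partials))               # glue by scan (APL's +\), another member of the family

assert total == by_key
```

In a true parallel APL, the per-partition applications would run simultaneously across the array, and the glue could just as easily be a scan, or a reduction along any axis of a multidimensional hypercube, as a keyed sum.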

2 comments:

  1. Great post, Michael.

    Interestingly, I have recently started reading up on MapReduce and was immediately struck both by its similarity to the data parallelism those of us in the TMC languages group were implementing in the Connection Machine compilers (especially in Fortran), and by how limiting the programming model is, sort of like a constrained scatter-gather.

    Yes, APL on the Connection Machine might have obviated the need for the language, but on the other hand, it is refreshing to see that there is still some interest out there in developing highly parallelized applications, especially in the information management and business intelligence space.

  2. Nice post. I actually found your blog through the Neural Networks post, but really enjoyed reading this one. A few days ago, I finished the book "Masterminds of Programming", with interviews of the creators of APL (among others). Very insightful, so I spent a few hours understanding the basic principles of APL (syntax, model and symbols :-) and also found some analogies to the current trends in parallel programming.

