Professors Michael Stonebraker and David DeWitt have written a very interesting piece on relational databases and MapReduce (available here). For those who are not familiar with MapReduce, it is a computational framework introduced by Google, and popularized through the open-source Hadoop implementation backed by Yahoo, for processing large amounts of data in parallel.
The response to this article has, for the most part, been to defend MapReduce. I find this interesting because MapReduce is primarily useful for analytic applications. Both relational databases and MapReduce make it possible to run large analytic tasks in parallel (taking advantage of multiple processors and multiple disks), without learning the details of parallel hardware and software. This makes both of them powerful for analytic purposes.
However, Professors Stonebraker and DeWitt make some points that are either wrong or inconsequential with respect to using databases for complex queries and data warehousing.
(1) They claim that MapReduce lacks support for updates and transactions, implying that these are important for data analysis.
This is not true for complex analytic queries. Although updating data within a database is very important for transactional systems, it is not at all important for analytic purposes and data warehousing. In fact, updates imply certain database features that can be quite detrimental to performance.
Updates imply row-level locking and logging. Both of these are activities that take up CPU and disk resources, but are not necessary for complex queries.
Updates also tend to imply that database pages are only partially filled. This makes it possible to insert new data without splitting pages, which is useful in transactional systems. However, partially filled pages slow down queries that need to read large amounts of data, because the same number of rows is spread over more pages.
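To put a rough number on the cost of partially filled pages, here is a minimal back-of-the-envelope sketch in Python. The row count, the rows-per-page figure, and the 70% fill level are all made-up assumptions for illustration, not measurements from any particular database.

```python
def pages_to_scan(num_rows, rows_per_full_page, fill_percent):
    """Pages a full table scan must read at a given page fill level.

    All inputs are hypothetical; fill_percent is an integer from 1 to 100.
    """
    # Rows actually stored per page when pages are only partly filled.
    rows_per_page = max(1, rows_per_full_page * fill_percent // 100)
    # Ceiling division: a partly used last page still has to be read.
    return -(-num_rows // rows_per_page)

# Assumed example: 100 million rows, 100 rows fit on a completely full page.
full = pages_to_scan(100_000_000, 100, 100)
partial = pages_to_scan(100_000_000, 100, 70)

print(full, partial, round(partial / full, 2))
```

Under these assumptions, leaving 30% of each page free for future inserts means a full scan reads about 43% more pages for exactly the same data.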
Updates also work against vertical partitioning (also called columnar databases), where different columns of data are stored on different pages. Vertical partitioning makes working on wide tables quite feasible, and is one of the tricks used by newer database vendors such as Netezza.
(2) They claim that MapReduce lacks indexing capabilities, implying that indexing is useful for data analysis.
One of the shortcomings of the MapReduce framework in comparison to SQL is that MapReduce does not facilitate joins. However, the major use of indexes for complex queries is for looking up values in smaller reference tables, which can often be done in memory. We can assume that all large tables require full table scans.
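The in-memory lookup described here can be sketched in a few lines of Python. This is only an illustration: the reference table, the order rows, and the function name are hypothetical, and a dictionary stands in for whatever hash structure a real system would use.

```python
# Hypothetical small reference table, held entirely in memory.
region_by_zip = {"10001": "Northeast", "94105": "West", "60601": "Midwest"}

def scan_with_lookup(rows, lookup, default="Unknown"):
    """Scan every row of the large table once, resolving each key in memory."""
    totals = {}
    for zip_code, amount in rows:
        # A dictionary lookup replaces an index probe on the reference table.
        region = lookup.get(zip_code, default)
        totals[region] = totals.get(region, 0) + amount
    return totals

# Stand-in for the large table; in practice this would be streamed from disk.
orders = [("10001", 20), ("94105", 35), ("10001", 5), ("99999", 10)]
totals = scan_with_lookup(orders, region_by_zip)
print(totals)
```

The large table is read once from start to finish, and no index on it is ever consulted. The same pattern works inside a map function, as long as the reference table fits in the memory of each worker.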
(3) MapReduce is incompatible with database tools, such as data mining tools.
The article actually cites Oracle Data Mining (which grew out of the Darwin project developed by Thinking Machines when I was there) and IBM Intelligent Miner. This latter reference is particularly funny, because IBM has withdrawn this product from the market (see here). The article also fails to cite the most common of these tools, Microsoft SQL Server Data Mining, which is common because it is bundled with the database.
However, data mining within databases is not a technology that has taken off. One reason is pricing. Additional applications on database platforms often increase the need for hardware -- and more hardware often implies larger database costs. In any case, networks are quite fast and tools can access data in databases without having to be physically colocated with them. Serious data mining practitioners are usually using other tools, such as SAS, SPSS, S-Plus, or R.
By the way, I am not a convert to MapReduce (my most recent book is called Data Analysis with SQL and Excel). Its major shortcoming is that it is a programming interface, and having to program detracts from solving business problems. SQL, for all its faults, is still much easier for most people to learn than Java or C++, and, if you do want to program, user-defined extensions can be quite beneficial. However, there are some tasks that I would not want to tackle in SQL, such as processing log files, and MapReduce is one scalable option for such processing.
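For readers who have not seen the programming model, here is a minimal single-process sketch in Python of how a log-file task breaks into a map step and a reduce step. It is not the API of Google's system or of Hadoop; the three-field log format and all function names are assumptions made up for the example, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

def map_line(line):
    """Map step: emit (status_code, 1) for each well-formed log line.

    Assumes a hypothetical format: "<client> <path> <status>".
    """
    parts = line.split()
    if len(parts) >= 3:
        yield parts[2], 1

def reduce_counts(key, values):
    """Reduce step: add up all the counts emitted for one key."""
    return key, sum(values)

def run_mapreduce(lines):
    """Group mapped pairs by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in groups.items())

log_lines = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /missing 404",
    "10.0.0.1 /about.html 200",
    "malformed",
]
status_counts = run_mapreduce(log_lines)
print(status_counts)
```

The malformed line is simply skipped by the map step, which is the kind of messy-input handling that is awkward to express in SQL before the data has been loaded into a table.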