Tuesday, April 8, 2008

Databases, MapReduce, and Disks

I just came across a blog posting by Tom White entitled "Disks Have Become Tapes". It is an interesting read, but it makes the following claim: relational databases are limited by the seek speed of disks, whereas MapReduce-based methods take advantage of the streaming capabilities of disks. Hence, MapReduce is better than an RDBMS for various types of processing.

Once again, I read a posting in a blog that seems misguided and gives inaccurate information. My guess is that people learn relational databases from the update/insert perspective and do not understand complex query processing. Alas. I do recommend my book Data Analysis Using SQL and Excel for such folks. Relational databases can indeed take advantage of high-throughput disks.

Of course, the problem is not new. Tom White quotes David DeWitt quoting Jim Gray saying "Disks are the new tapes" (here). And the numbers are impressive. It takes longer to read a high-capacity disk now than it did twenty years ago, because capacity has increased much faster than transfer rates. As for random seeks on the disk, let's not go there. Seek times have hardly improved at all over this time period. Seeking on a disk is like going to Australia in a canoe -- the canoe works well enough to cross a river, so why not an ocean? And, as we all know, RDBMSs use a lot of seeks for queries, so they cannot take advantage of modern disks. MapReduce to the rescue!

Wait, is that common wisdom really true?

It is true that to update or fetch a single row, an RDBMS uses disk seeks to get there (via an index, if one is available). However, this is much faster than the alternative of streaming through the whole table -- even on a fancy MapReduce system with many cheap processors connected to zillions of inexpensive disks.
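A back-of-the-envelope calculation makes the point. The numbers below are illustrative assumptions (roughly period-appropriate figures, not measurements), but the conclusion is robust: a few seeks beat streaming a large table by several orders of magnitude.

```python
# Fetching one row via an index (a few seeks) vs. scanning the whole
# table. All numbers are illustrative assumptions, not benchmarks.

SEEK_TIME_S = 0.010          # ~10 ms per random seek (assumed)
THROUGHPUT_BPS = 100e6       # ~100 MB/s sequential transfer (assumed)
TABLE_BYTES = 100e9          # a 100 GB table

# Index lookup: a handful of seeks down the index plus one page read.
index_fetch_s = 4 * SEEK_TIME_S              # ~0.04 seconds

# Full scan: stream the entire table off disk.
full_scan_s = TABLE_BYTES / THROUGHPUT_BPS   # ~1000 seconds

print(f"index fetch: ~{index_fetch_s:.2f} s")
print(f"full scan:   ~{full_scan_s:.0f} s")
```

Even if a cluster divides that scan across dozens of disks, it is doing vastly more work than the handful of seeks the index needs.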

On a complex query, the situation is a bit more favorable to the RDBMS for several reasons. First, large analytic queries typically read entire tables (or partitions of tables). They do not "take advantage" of indexing, since they read all rows using full table scans.

However, database engines do not read rows; they read pages. Between the query processor and the data is the page manager. Or, as T. S. Eliot wrote in his poem "The Hollow Men" [on an entirely different topic]:

Between the idea
And the reality
Between the motion
And the act
Falls the Shadow

In this case, the shadow is the page manager, a very important but often overlooked component of a database management system.

Table scans read the pages assigned to a table. So, query performance is based on a balance between disk performance (both throughput and latency) and page size. For a database used for analytics, use a big page size. 4k is way too small . . . 128k or even 1Mbyte can be very reasonable (and I have seen systems with even larger page sizes). Also, remember to stuff the pages full. There is no reason to partially fill pages unless the table has updates (which are rare for most data warehouse tables).
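A simple cost model shows why big pages matter. Suppose, pessimistically, that each page costs one seek plus its transfer time (the seek and throughput figures below are illustrative assumptions):

```python
# Rough model of table-scan time as a function of page size: each page
# costs one seek (latency) plus its transfer time. Illustrative numbers.

SEEK_S = 0.005            # assumed average seek + rotational delay
THROUGHPUT_BPS = 100e6    # assumed sequential transfer rate
TABLE_BYTES = 10e9        # a 10 GB table

def scan_seconds(page_bytes):
    pages = TABLE_BYTES / page_bytes
    return pages * (SEEK_S + page_bytes / THROUGHPUT_BPS)

for kb in (4, 128, 1024):
    print(f"{kb:>5}k pages: {scan_seconds(kb * 1024):8.0f} s")
```

With 4k pages the scan is utterly dominated by latency; at 128k or 1Mbyte, most of the time goes to useful transfer. Sequentially allocated pages reduce the per-page seek cost further, but the amortization argument holds either way.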

Databases do a lot of things to improve performance. Probably the most important boost is accidental. Large database tables are typically loaded in bulk, say once-per-day. As a result, the pages are quite likely to be allocated sequentially. Voila! In such cases, the seek time from one page to the next is minimal.

But databases are smarter than that. The second boost is pre-fetching pages that are likely to be needed. Even a not-so-smart database engine can realize when it is doing a full table scan, and the page manager can seek to the next page at the same time that the processor is working on data already in memory. Although the page manager waits for new pages to load, the CPU stays busy processing other data, so there is no effective wait time.

This overlap between CPU cycles and disk is very important for database performance on large queries. And you can see it on a database machine. In a well-balanced system, the CPUs are often quite busy on a large query and the disks are less busy.

Modern RDBMSs have a third capability with respect to complex queries. Much of the work is likely to take place in temporary tables. The page manager typically stores these on sequential pages, optimized for sequential access. In addition, temporary tables store only the columns that they need.

In short, databases optimize their disk access in several ways. They take advantage of high-throughput disks by:
  • using large page sizes to reduce the impact of latency;
  • storing large databases on sequential pages;
  • prefetching pages while the processor works on data already in memory;
  • efficiently storing temporary tables.
At least they are doing something! By the way, the balance between latency and throughput goes back at least to the 1980s when I entered this business. And I suspect that it is a much older concern.

The advantage and disadvantage of the MapReduce approach is that it leaves such optimizations in the hands of the operating system and the programmer. Fortunately, modern computer languages are smart about sequential file I/O, so reading some records and then processing them is normally well optimized.
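Python is a handy illustration of what "the runtime handles it" means: opening a file in binary mode hands back a buffered reader, so iterating record by record issues large sequential reads under the hood rather than one system call per record.

```python
# open() in binary mode returns an io.BufferedReader, so a record-at-a-
# time loop is backed by large sequential reads from an internal buffer.

import io
import os
import tempfile

# Write a small file of newline-delimited records to a temp directory.
path = os.path.join(tempfile.mkdtemp(), "records.txt")
with open(path, "w") as f:
    for i in range(1000):
        f.write(f"record-{i}\n")

count = 0
with open(path, "rb") as f:                 # a BufferedReader by default
    assert isinstance(f, io.BufferedReader)
    for _line in f:                         # served from the read buffer
        count += 1

print(count)   # 1000
```

The same is true of buffered streams in C, Java, and the other languages a MapReduce job would be written in; the point is that the buffering is in the runtime, not in the programmer's code.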

Of course, a programmer can disrupt this by writing temporary or output files to the same disk system being used to read data. Well, actually, disks are also getting smarter. With multiple platters and multiple read heads, modern disks can support multiple seeks to different areas.

A bigger problem arises with complex algorithms. MapReduce does not provide built-in support for joining large tables -- nor even for joining smaller ones. A nested loop join written in MapReduce code can kill the performance of a query. An RDBMS might implement the same join using hash tables that gracefully overflow memory, should that be necessary. An exciting development in a programmer's life is when a hash table in memory gets too big and he or she learns about operating system page faults -- a concern that the database engine takes care of by itself.
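For the in-memory case, here is a minimal sketch of the hash join an optimizer would pick automatically but a MapReduce programmer must hand-roll (the table names and rows are invented for illustration). Build a hash table on the smaller input, then probe it while streaming the larger one: O(n + m) work instead of the O(n * m) of a naive nested loop.

```python
# Minimal in-memory hash join: build on the smaller relation, probe
# with the larger one. Relations are lists of dicts; names are invented.

from collections import defaultdict

def hash_join(small, large, key_small, key_large):
    # Build phase: hash the smaller relation on its join key.
    table = defaultdict(list)
    for row in small:
        table[row[key_small]].append(row)
    # Probe phase: stream the larger relation, emitting matches.
    for row in large:
        for match in table.get(row[key_large], ()):
            yield {**match, **row}

customers = [{"cust_id": 1, "name": "Ann"},
             {"cust_id": 2, "name": "Bob"}]
orders = [{"order_id": 10, "cust_id": 1},
          {"order_id": 11, "cust_id": 2},
          {"order_id": 12, "cust_id": 1}]

joined = list(hash_join(customers, orders, "cust_id", "cust_id"))
print(len(joined))   # 3 joined rows
```

The part this sketch leaves out is exactly what the database engine contributes: when the build side no longer fits in memory, the engine spills partitions to disk and joins them piecewise, instead of letting the operating system thrash.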

As I've mentioned before, RDBMS versus MapReduce is almost a religious battle. MapReduce has capabilities that RDBMSs do not have, and not only because programming languages are more expressive than SQL. The paradigm is strong and capable for certain tasks.

On the other hand, SQL is a comparatively easy language to learn (compared, I mean, to programming for MapReduce), and relational database engines often have decades of experience built into them: partitioning data, choosing join and aggregation algorithms, building temporary tables, keeping processors busy and disks spinning, and so on. In particular, RDBMSs know a trick or two to optimize disk performance and take advantage of modern high-latency, high-throughput disks.

