Tuesday, October 28, 2008

Random Samples in SQL

Hi,

How would recommend getting a random sample from a table in SQL? Thank you!

Adam


This is a good question. Unfortunately, there is not a good answer, because the concept of a random sample does not really exist in relational algebra (which SQL -- to a greater or lesser extent -- is based on). There are, however, ways of to arrive at the solution. This discussion is based partly on the Appendix in Data Analysis Using SQL and Excel.

The basic idea is assume that there is a function that returns a random number, say uniformly between 0 and 1. If such a function exists, the SQL code for a random sample might look like:

....SELECT *
....FROM table t
....WHERE rand() <>

The function rand() does actually exist in many databases, such as IBM UDB, Microsoft SQL, and Mysql.

Does this really work for these databases? That depends on whether rand() is a deterministic or non-deterministic function. A deterministic function is essentially evaluated once, when the query is parsed. If this is the case, then all rows would have the same value, and the query would not return a 10% random sample. It would return either 0 rows or all of them.

Fortunately, for these databases, the designers were smart and rand() is non-deterministic, so the above code works as written.

Oracle has a totally different approach. It supports the SAMPLE clause. Using it, the above query would be written as:

....SELECT *
....FROM table t
....SAMPLE (10)

Another approach in Oracle is to use a pseudo-random number generator and ROWNUM. This approach works in any database that has something similar to ROWNUM.

If you happen to be using SAS proc SQL, then you can do something similar to the first example. The only difference is that the function is RAND('UNIFORM') rather than just RAND().

4 comments:

  1. For Microsoft Sql Server users, in Sql2k5 and up, there is a sample clause similar to Oracle that randomly samples pages on the backend:

    ....SELECT *
    ....FROM table t
    ....TABLESAMPLE (1000 rows)

    Because of how it samples data, you're not always guaranteed to get back exactly 1000 rows (the final count may be off by a few or several hundred), so if the exact number of rows returned is important to you, you can also use the following as a good proxy:

    ....SELECT TOP 1000 *
    ....FROM table t
    ....ORDER BY newid()

    ReplyDelete
  2. TABLESAMPLE does not provide a random sample.

    ReplyDelete
  3. I'm getting an error that TABLESAMPLE only works with local tables...so not with tables on a server?

    ReplyDelete
  4. I'm getting an error that TABLESAMPLE only works with local tables...so not with tables on a server?

    ReplyDelete

Your comment will appear when it has been reviewed by the moderators.