Monday, April 21, 2008

Using SET with Unique to Join Tables in SAS Data Steps

Recently, I have had to write a bunch of SAS code for one of our clients. Although I strive to do as much as possible using proc sql, there are some things that just require a data step. Alas.

When using the data step, I wish I were able call a query directly:

data whereever;
....set (SELECT beautiful things using SQL syntax);
....and so on with the SAS code

However, this is not possible.

A SAS programmer might point out that there are two easy work-arounds. First, you can simply call the query and save it as a SAS data set. Alternatively, you can define a view and access the view from the data step.

I do not like either of these solutions. One reason why I like SQL is that I can combine many different parts of a solution into a single SQL statement -- my SQL queries usually have lots of subqueries. Another reason I like SQL is it reduces the need for clutter -- intermediate files/tables/data sets -- which need to be named and tracked and managed and eventually deleted. I ran out of clever names for such things about fifteen years ago and much prefer having the database do the dirty work of tracking such things. Perhaps this is why I wrote a book on using SQL for data analysis.

So, I give up on the SQL syntax, but I still want to be able to do similar processing. The data step does make it possible to do joins, using a syntax that is almost intuitive (at least for data step code). The advertised syntax looks like:

proc sql;
....create index lookupkey on lookup;

data whereever;
....set master;
....set lookup (keep=lookupkey lookupvalue) key=lookupkey;
....and so on with the SAS code

This example is highly misleading! (So look below for a better version.) But, before explaining the problems and the solution, let me explain how the code works.

The first statement is a proc sql statement that builds an index on the lookup data set using the lookup key column. Real SAS programmers might prefer proc datasets, but I'm not a real SAS programmer. They do the same thing.

The second statement is the data step. The key part of the data step is the second set statement which uses the key= keyword. This keyword says to look up the corresponding value in another data set and fetch the first row where the values match. The "key" itself is an index, which is why I created the index first.

The keep part of the statement is just for efficiency's sake. This says to only keep the two variables that I want, the lookup key (which is needed for the index) and the lookup value. There may be another two hundred columns in the lookup table (er, data set), but these are the only ones that I want.

This basic example is quite deceptive. Indexes in SAS are a lot like indexes in databases. They are both called indexes and both give fast access to rows in a table, based on values in one or more columns. Both can be created in SQL.

However, they are not the same. The above syntax does work under some circumstances, such as when all the lookup keys are in the lookup table and when no two rows in a row in the master table have the same key (or some strange condition like that). Most importantly, I've found that the syntax seems to work on small test data but not on larger sets. This is a most nefarious type of difference. And, there are no warnings or errors.

The problem is that SAS indexes allow duplicates but treat indexes with duplicate keys differently from indexes with unique keys. Even worse, SAS determines this by how the index is created, not by the context. And for me (the database guy) the most frustrating thing is that the default is for the strange behavior instead of the nice clean behavior I'm expecting. I freely admit a bias here.

So we have to explicitly say that the index has no duplicates. In addition, SAS does not have reasonable behavior when there is no match. "Reasonable" behavior would be to set the lookup value to missing and to continue dutifully processing data. Instead, SAS generates an error and puts garbage in the lookup value.

proc sql;
....create unique index lookupkey on lookup;

data whereever;
....set master;
....set lookup (keep=lookupkey lookupvalue) key=lookupkey/unique;
....if (_iorc_ = %sysrc(_dsenom)) then do;
........_ERROR_ = 0;
........lookupvalue = .;
....end;
....and so on with the SAS code

The important change the presence of the unique signifier in both the create index statement and in the set statement. I have found that having it in one place is not sufficient, even when the index actually has no duplicates.

The error handling also tro ubles me. Strange functions called "iorc" are bad enough, even without being preceding by an underscore. Accessing global symbols such as _ERROR should be a sign that something extraordinary is going on. But nothing unusual is happening; the code is just taking into account the fact that the key is not in the lookup table.

In the end, I can use the data step to mimic SQL joins, including left outer joins (by taking into account, by using appropriate indexes and keys. Although I don't particularly like the syntax, I do find this capability very, very useful. The data step I referred to at the beginning of this post has eleven such lookups, and many of the lookup tables have hundreds of thousands or millions of rows.

--gordon

No comments:

Post a Comment

Your comment will appear when it has been reviewed by the moderators.