Tuesday, March 25, 2014

Three SQL Constructs You Can Forget About

SQL is a very powerful language, which could, of course, be made even more powerful and useful.  This post discusses three features of the language -- ANSI standard features -- that seem not only unnecessary but downright detrimental.  That is, they seem to cause much more confusion than they provide in functionality.  And, in all these cases, it would be easy to work around their absence.

Although it would be nice to remove these from the language itself, that is unlikely to happen.  However, they can be de-prioritized for people learning SQL.  These constructs are easy to work around and are less functional than their alternatives.  When learning SQL, these should be learned later in the process.

(1)  INSERT . . . VALUES()

The first construct is the use of VALUES with INSERT, as in:

insert into t(col1)    values(1);

In almost every database, this is easily replaced with:

insert into t(col1)
    select 1;
In some databases, you might have to add a from dual or from sys.dummy to make this work.
And, in every respect except one, the INSERT . . . SELECT method is better.   For instance, you can add a WHERE clause to be sure that the value doesn't already exist:
insert into t(col1) select 1 where not exists (select 1 from table t2 where t2.col1 = t)

Or, you can readily add other values, from this or another table:
insert into t(col1, col2)    select 1, (select count(*) from t2)
Trying to fit this into a VALUES statement just causes syntax errors and confusion.

And, you can use UNION ALL to add multiple rows at the same time.

The VALUES statement has exactly one advantage and that is the fact that it is standard.  The same code will work in multiple databases.  That seems very minor.  It would be better if the standard had a way of using SELECT to return a row without a table.


(2)  SELECT DISTINCT

The next unnecessary construct is SELECT DISTINCT.   First, this is easily replaced with GROUP BY.  So:
select distinct a, b, cfrom t;
is the same as:
select a, b, cfrom tgroup by a, b, c;

What makes the GROUP BY better?   Primarily the fact that you can have a HAVING clause.

So, SELECT DISTINCT is sometimes understood to be:  "Get me all the rows that are distinct".  Rather than, "Get me the distinct values from all the rows."  Actually, that first interpretation makes a lot of sense, even if it is wrong.  Not only is there no danger of confusion with the GROUP BY, but including HAVING COUNT(*) = 1 actually solves the first problem.    No way to do that with SELECT DISTINCT.

The second problem is perhaps more dangerous.  Have you ever seen someone write this?
select distinct(a) b, cfrom t;
Here, the DISTINCT seems to be used like a function.   The intention is "Get me distinct values of a along with arbitrary values of b and c".  Of course, this is exactly the same with or without the parentheses.  DISTINCT is not a function.  This usage is so prevalent that Postgres introduced the DISTINCT ON syntax to support it.

What advantages does SELECT DISTINCT have?  The syntax is shorter and you don't have to repeat the column names in a GROUP BY clause.    In a world of cut-and-paste, copying the column to GROUP BY is negligible effort.   And, it does allow SELECT DISTINCT *.   However that is a construct that I wouldn't miss at all.


(3)  COUNT(column)

Finally, there is the COUNT aggregation function with a column as an argument.  Just to be clear, I have no problem with COUNT(DISTINCT column) or COUNT(*) or COUNT(1).

No doubt, the designed of SQL were obsessed with NULL values (and despite the obsession, they still didn't get it right).   Wouldn't everyone in the world (who uses SQL) want to count the number of non-NULL values in a column?  What else could COUNT(column) mean?

Well, in many contexts, people probably think it means COUNT(DISTINCT column).  Consider the following query:
select c.country, count(c.CustomerId), count(o.OrderId)from Customers c join     Orders o     on c.CustomerId = o.CustomerId;

Many people might write this code, just like this, with the intention of getting the number of customers and the number of orders in each country.  How sad when they learn that these are the same!  There are no repeat purchasers anywhere.  (COUNT(DISTINCT c.CustomerId) fixes this problem.)

Such confusion would be a non-issue.

And, if you wanted to count non-NULL values?  Why not do it explicitly, so you can remember what the query is supposed to be doing:
select sum(case when a is not null then 1 else 0 end)
Yes, this takes a bit more typing but the query is much clearer on what it is doing.  It would be much shorter if all databases supported the "boolean" is an "integer" shortcut:
select sum(a is not null)

(4) ,

What is a list of three things without a fourth to cap it off?  Just don't use a comma in the FROM clause.  Explicit join syntax is more expressive and clearer in every case.  The , can be replaced by CROSS JOIN.

No comments:

Post a Comment

Your comment will appear when it has been reviewed by the moderators.