The SQL Language’s Most Missing Feature

SQL is also awesome in many ways. We can write out the most complex truths and facts and have the database tell us the answer in no time.

But the SQL language is arguably the most beautiful programming language out there. It has so many caveats that people like me get ridiculously rich selling consulting services just to merely explain its semantics. One of the best examples of how twisted the SQL language is, is Alex Bolenok’s article about SQL NULL on Tech.Pro.

Now, one of the biggest criticisms of SQL is its verbosity. Most SQL dialects have virtually no constructs to avoid repetition. It is a common use case in SQL to filter, group, select, and order by the same expression. An example:

SELECT   first_name || ' ' || last_name
FROM     customers
WHERE    first_name || ' ' || last_name LIKE 'L% E%'
GROUP BY first_name || ' ' || last_name
ORDER BY first_name || ' ' || last_name

You might write a different query for the same, sure. But even if you didn’t doesn’t it bother you that there is hardly any way to re-use the first_name || ' ' || last_name concatenation expression?

Well, you can at least reuse it in the ORDER BY clause:

SELECT   first_name || ' ' || last_name AS name
FROM     customers
WHERE    first_name || ' ' || last_name LIKE 'L% E%'
GROUP BY first_name || ' ' || last_name
ORDER BY name

And in MySQL, you can also reuse it in the GROUP BY clause, although we think that this is a bad idea:

SELECT   first_name || ' ' || last_name AS name
FROM     customers
WHERE    first_name || ' ' || last_name LIKE 'L% E%'
GROUP BY name
ORDER BY name

Standard SQL solutions for column reuse

Sure, you could create a derived table:

SELECT   name
FROM (
  SELECT first_name || ' ' || last_name
  FROM   customers
) c(name)
WHERE    name LIKE 'L% E%'
GROUP BY name
ORDER BY name

… or a common table expression, which is basically a reusable derived table, or a local view:

WITH c(name) AS (
  SELECT first_name || ' ' || last_name
  FROM   customers
)
SELECT   name
FROM     c
WHERE    name LIKE 'L% E%'
GROUP BY name
ORDER BY name

But why do I even have to create a table to reuse / rename a simple column expression? Why can’t I just simply use something like a common column expression? Like this?

-- Common column expression that can be appended
-- to any table expression.
SELECT   name
FROM     customers
WITH     name AS first_name || ' ' || last_name
WHERE    name LIKE 'L% E%'
GROUP BY name
ORDER BY name

Note that the common column expression would be scoped to the table source. In other words, it only makes sense in the context of a FROM clause and the tables specified in the FROM clause. In a way, the following two expressions would be exactly equivalent:

-- Proposed syntax
FROM     customers
WITH     name AS first_name || ' ' || last_name

-- Existing syntax
FROM (
  SELECT customers.*,
         first_name || ' ' || last_name AS name
  FROM   customers
) AS customers

The proposed syntax could also be applied to joined tables:

-- Proposed syntax
FROM customers
  AS c
WITH name AS c.first_name || ' ' || c.last_name
JOIN addresses
  AS a
WITH address AS a.street || '\n' || a.city
  ON c.address_id = a.address_id
WITH full_address AS c.name || '\n' || a.address

-- Or alternatively, if you don't need to tightly 
-- scope these common column expressions to their
-- corresponding tables:
FROM customers AS c
JOIN addresses AS a
  ON c.address_id = a.address_id
WITH name AS c.first_name || ' ' || c.last_name,
     address AS a.street || '\n' || a.city,
     full_address AS name || '\n' || address

So in plain English, the new WITH clause could be appended to any type of table expression. With parentheses and comments, the first among the above examples would read:

FROM (
  customers AS c
  -- The "name" column is appended to "c"
  WITH name AS c.first_name || ' ' || c.last_name
)
JOIN (
  addresses AS a
  -- The "address" column is appended to "a"
  WITH address AS a.street || '\n' || a.city
) ON c.address_id = a.address_id
-- The "full_address" column is appended to the
-- above joined table expression
WITH full_address AS c.name || '\n' || a.address

The above syntax would then again be exactly equivalent to this:

FROM (
  SELECT 
    c.*, a.*, 
    c.name || '\n' || a.address AS full_address
  FROM (
    SELECT c.*, 
           c.first_name || ' ' || c.last_name
    FROM customers AS c
  ) c
  JOIN (
    SELECT a.*,
           a.street || '\n' || a.city
    FROM addresses AS a
  ) a
  ON c.address_id = a.address_id
)

SQL’s most missing language feature

Since SQL:1999, we luckily have common table expressions – the WITH clause that can be prepended to any SELECT statement (and to other DML statements in some dialects). Using common table expressions, we can avoid repeating commonly used derived tables.

But such a thing is not possible for columns. The one language feature SQL is in most dire need of are thus common column expressions.

Recursive SQL for Data Normalisation

Recursive SQL can be awesome, although a bit hard to read in its SQL standard beauty. Let’s assume you have some aggregated data with dates and a number of events per date:

|                           DATE | COUNT |
|--------------------------------|-------|
| October, 01 2013 00:00:00+0000 |     2 |
| October, 02 2013 00:00:00+0000 |     1 |
| October, 03 2013 00:00:00+0000 |     3 |
| October, 04 2013 00:00:00+0000 |     4 |
| October, 05 2013 00:00:00+0000 |     2 |
| October, 06 2013 00:00:00+0000 |     0 |
| October, 07 2013 00:00:00+0000 |     2 |

Now let’s assume you want to normalise or “unaggregate” this data, generating “COUNT” records per date. The desired output is this:

|                           DATE | EVENT_NUMBER |
|--------------------------------|--------------|
| October, 01 2013 00:00:00+0000 |            1 |
| October, 01 2013 00:00:00+0000 |            2 |
| October, 02 2013 00:00:00+0000 |            1 |
| October, 03 2013 00:00:00+0000 |            1 |
| October, 03 2013 00:00:00+0000 |            2 |
| October, 03 2013 00:00:00+0000 |            3 |
| October, 04 2013 00:00:00+0000 |            1 |
| October, 04 2013 00:00:00+0000 |            2 |
| October, 04 2013 00:00:00+0000 |            3 |
| October, 04 2013 00:00:00+0000 |            4 |
| October, 05 2013 00:00:00+0000 |            1 |
| October, 05 2013 00:00:00+0000 |            2 |
| October, 07 2013 00:00:00+0000 |            1 |
| October, 07 2013 00:00:00+0000 |            2 |

As you may have noticed, there are no records for those dates with zero events (October 06). With recursive SQL, this is rather simple to achieve.

with recursive

-- Data could also be a regular table containing
-- the actual data
data(date, count) as (
  select date '2013-10-01', 2 union all
  select date '2013-10-02', 1 union all
  select date '2013-10-03', 3 union all
  select date '2013-10-04', 4 union all
  select date '2013-10-05', 2 union all
  select date '2013-10-06', 0 union all
  select date '2013-10-07', 2
),

-- This is the recursive common table expression
-- It starts with all data where count > 0
-- ... and then recurses by subtracting one
recurse(date, count) as (
  select date, count
  from data
  where count > 0
  union all
  select date, count - 1
  from recurse
  where count > 1
)
select date, count event_number from recurse
order by date asc, event_number asc;

See also this SQLFiddle to see the above CTE in action.

Incredibly, Oracle’s CONNECT BY clause doesn’t seem to be an option here. I challenge you to find a better solution, though! For instance, this beautiful solution that works with PostgreSQL:

with recursive
data(date, count) as (
  select date '2013-10-01', 2 union all
  select date '2013-10-02', 1 union all
  select date '2013-10-03', 3 union all
  select date '2013-10-04', 4 union all
  select date '2013-10-05', 2 union all
  select date '2013-10-06', 0 union all
  select date '2013-10-07', 2
)
select date, generate_series(1, count) event_number
from data
where count > 0
order by date asc, event_number asc;

PostgreSQL Syntax is a Mystery Only Exceeded by its Power

I just ran across this rather powerful PostgreSQL example statement from the manual. It reads

Increment the sales count of the salesperson who manages the account for Acme Corporation, and record the whole updated row along with current time in a log table:

WITH upd AS (
  UPDATE employees 
  SET sales_count = sales_count + 1 
  WHERE id = (
    SELECT sales_person 
    FROM accounts 
    WHERE name = 'Acme Corporation'
  )
  RETURNING *
)
INSERT INTO employees_log 
SELECT *, current_timestamp 
FROM upd;

This database keeps amazing me. A single statement allows for updating one table, taking the updated record(s) including all generated values as input for an insert into another table. But at the same time, PostgreSQL does not yet implement the SQL:2003 MERGE statement.

Crazy database…

Emulating the SQL standard derived column list

Derived column lists are a fine feature, if your database supports them. The SQL:2008 standard specifies

   7.6 <table reference> 

    <table reference> ::= 
        <table factor> 
      | <joined table> 

    <table factor> ::= 
        <table primary> [ <sample clause> ] 

    <table primary> ::= 
        <table or query name> [ [ AS ] <correlation name> 
      [ <left paren> <derived column list> <right paren> ] ] 
      | <derived table> [ AS ] <correlation name> 
      [ <left paren> <derived column list> <right paren> ] 
      | [...] 

When you put this in action in an actual query, the above becomes (in Postgres SQL):

SELECT t.a, t.b
FROM (
  SELECT 1, 2
) t(a, b)

In words: You can rename a derived table AND its columns in a single step, by supplying a <derived column list> to your <correlation name> (also known as table alias). The main advantage of this syntax is the fact that you only need to know the degree of your derived table, not the concrete (possibly auto-generated) names of its columns. This is particularly useful when renaming columns from an unnested table or from a table or array function:

SELECT t.a, t.b
FROM unnest(my_table_function()) t(a, b)

Emulating derived column lists

Not all databases support <derived column lists> along with a <correlation name>. But you can always emulate them. The simplest way is to use common table expressions. In Oracle, for instance, you could rewrite the above Postgres SQL statement to this:

-- Postgres
SELECT t.a, t.b
FROM (
  SELECT 1, 2
) t(a, b)

-- Equivalent Oracle query
WITH t(a, b) AS (
  SELECT 1, 2 FROM DUAL
)
SELECT * FROM t

-- Or a bit more verbose, if you really want to hide the
-- common table expression within the derived table
SELECT t.a, t.b
FROM (
  WITH t(a, b) AS (
    SELECT 1, 2 FROM DUAL
  )
  SELECT * FROM t
) t

Note that the CTE solution was given on this Stack Overflow question:
http://stackoverflow.com/q/14127707/521799

If your database supports neither derived column lists, nor common table expressions, you will have to push down the derived column list into the derived table and transform your nested SQL. In MySQL or H2, the above Postgres query would look like this:

-- MySQL, H2, and others
SELECT t.a, t.b
FROM (
  SELECT 1 a, 2 b
) t

It looks simple, but of course you’ll have to be able to actually transform your nested select and other sorts of table references

The solution to rule them all

If the latter isn’t an option for you, here’s one that will always work:

-- All databases
SELECT t.a, t.b
FROM (
  SELECT null a, null b FROM DUAL WHERE 1 = 0
  UNION ALL
  SELECT 1, 2 FROM DUAL
) t

Just concatenate an empty record in front of your actual query, which imposes column names upon your UNION ALL query. Thanks go to Bill’s great answer on Stack Overflow