5 Ways to Better Understand SQL by Adding Optional Parentheses

It appears that our recent beginner SQL articles explaining SQL syntax were quite popular. These include:

Today, I want to tackle the problem of understanding and learning SQL syntax from a new angle.

SQL’s COBOL-esque syntactic heritage

Unlike most C-style languages, SQL syntax is a messy mixture of keywords, special characters, and both with a lot of overloaded meanings. I’d call myself rather experienced with SQL, yet I’m still often unsure about whether something is a keyword or an identifier (and yes, it does matter):

From my past SQL training sessions, however, I have learned that one of the best tools to understand what’s really going on in SQL syntax is to add parentheses at random places. Parentheses can be noisy, but they can definitely help understand the syntactic structure of a statement.

A trivial example from arithmetics, or logic:

a + b * c
a OR b AND c

In almost all languages, operator precedence dictates that the above be equivalent to this:

a + (b * c)
a OR (b AND c)

And already, we’ve introduced some syntactic clarity to readers who might not be aware of operator precedence. While the parentheses are not necessary in these cases, they can help with readability if the expressions are much more complex than the above.

Did you know you can place parentheses at many other locations in SQL? Here’s a list of interesting places, where most people usually omit parentheses. I’m going to use the Sakila database for examples.

1. Row value expressions

In every day language, we don’t call them “row value expressions”. We call them ordinary “column expressions”. For example, we might look for a specific actor by name

SELECT * 
FROM actor
WHERE first_name = 'SUSAN'
AND last_name = 'DAVIS';

We will get:

|actor_id   |first_name|last_name |
|-----------|----------|----------|
|101        |SUSAN     |DAVIS     |
|110        |SUSAN     |DAVIS     |

But what we really did was this:

-- Works in many dialects
SELECT * 
FROM actor
WHERE (first_name) = ('SUSAN')
AND (last_name) = ('DAVIS');

Or also:

-- Works in a few dialects, including e.g. PostgreSQL
SELECT * 
FROM actor
WHERE ROW (first_name) = ROW ('SUSAN')
AND ROW (last_name) = ROW ('DAVIS');

The parentheses are somewhat ambiguous in this case, the ROW constructor makes it more obvious. The SQL standard specifies a “row value expression special case”, i.e. the case where a row value expression (i.e. a tuple expression) is of degree 1. In case of which most people will “colloquially” omit the parentheses, thinking we’re comparing column expressions.

In this example, we could rewrite the query to this:

SELECT * 
FROM actor
WHERE (first_name, last_name) = ('SUSAN', 'DAVIS');

Or, in funky Oracle, double parentheses are required for no good reason on the right hand side.

SELECT * 
FROM actor
WHERE (first_name, last_name) = (('SUSAN', 'DAVIS'));

But I was not going to confuse you, I was going to clarify things ;-)

In fact, once you get a hang of row value expressions, you will quickly want to use them for less trivial cases than avoiding two AND connected predicates. For example, you can use this approach also with the IN predicate (and if you’re daring enough, with the NOT IN predicate, but beware of NULLS):

SELECT * 
FROM actor
WHERE (first_name, last_name) IN (
  ('SUSAN', 'DAVIS'),
  ('NICK' , 'WAHLBERG')
)

Wow! Result:

|actor_id   |first_name|last_name |
|-----------|----------|----------|
|2          |NICK      |WAHLBERG  |
|101        |SUSAN     |DAVIS     |
|110        |SUSAN     |DAVIS     |

And not only that. You can also use the same approach with IN and subqueries:

-- Actors that have the same name as a customer
SELECT *
FROM actor
WHERE (first_name, last_name) IN (
  SELECT first_name, last_name
  FROM customer
)

And indeed:

|actor_id   |first_name|last_name |
|-----------|----------|----------|
|4          |JENNIFER  |DAVIS     |

Yes! The “colloquial” IN predicate that you all wrote without the parentheses around the row value expression, in reality, had a row value expression of degree 1 on the left side, AND on the right side. Increase that degree to 2, and now you need to write a subquery with 2 columns as well.

It’s quite idiomatic, and readable, and nifty. Quite a few dialects support this.

2. JOINs

Joins are very misunderstood, in SQL. Very very misunderstood. One reason is because people confuse them with other set operations, such as UNION, and thus illustrate them with Venn Diagrams. I’ve already covered that. That article explains that JOINs are just (filtered) cartesian products.

The simplest JOIN type is CROSS JOIN. It does not have an ON or USING clause. Much more common, however, is INNER JOIN, or just JOIN (INNER being an optional keyword).

SELECT a.first_name, a.last_name, f.title
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)

We’ve all typed millions of these to obtain:

|first_name|last_name |title                         |
|----------|----------|------------------------------|
|...       |...       |...                           |
|PENELOPE  |GUINESS   |SPLASH GUMP                   |
|PENELOPE  |GUINESS   |VERTIGO NORTHWEST             |
|PENELOPE  |GUINESS   |WESTWARD SEABISCUIT           |
|PENELOPE  |GUINESS   |WIZARD COLDBLOODED            |
|NICK      |WAHLBERG  |ADAPTATION HOLES              |
|NICK      |WAHLBERG  |APACHE DIVINE                 |
|NICK      |WAHLBERG  |BABY HALL                     |
|...       |...       |...                           |

But what are these JOIN things? They are not SELECT clauses! They are operators like + or * or AND or OR.

And like any operator, the messy laws of associativity apply (as all the different join types have the same operator precedence, at least in the SQL standard. Some wonky implementations may disagree).

Luckily, left associativity applies with JOIN (right associativity would be madness). So, our magic parentheses come into play again. Our query from before is really this:

SELECT a.first_name, a.last_name, f.title
FROM (
  actor AS a
    JOIN film_actor AS fa 
      USING (actor_id)
)
JOIN film AS f 
  USING (film_id)

Can you see it? The readability problem here is the USING (or ON) clause. Without it, I could have written:

(actor JOIN film_actor) JOIN film

So, join FILM_ACTOR to ACTOR and then join the FILM table to the product.

Because INNER JOIN is associative (both left and right), we could write the following equivalent expression:

actor JOIN (film_actor JOIN film)

So, join FILM to FILM_ACTOR first, and then join the product to ACTOR. Quite a different beast, logically (the optimiser shouldn’t make a difference in this case).

… or in full SQL

SELECT a.first_name, a.last_name, f.title
FROM actor AS a 
  JOIN (
    film_actor
      JOIN film AS f 
        USING (film_id)
  )
USING (actor_id)

Egh. Don’t be clever. Use this only if you really have to. I find it quite unreadable. Careful formatting is essential here. But it works.

With more than 3 tables, you could build entire JOIN trees. While, in order to stay sane, most people will not actually build JOIN trees (yes, a list is also a tree), it is totally possible to do this confusing thing:

SELECT a.first_name, a.last_name, sum(p.amount)
FROM (
  actor AS a
    JOIN film_actor AS fa 
      USING (actor_id)
)
JOIN (
  film AS f
    JOIN (
      inventory AS i 
        JOIN rental AS r
          USING (inventory_id)
    ) USING (film_id)
) USING (film_id)
JOIN payment AS p 
  USING (rental_id)
GROUP BY a.actor_id

Instead of this nice clean SQL:

SELECT a.first_name, a.last_name, sum(p.amount)
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
JOIN inventory AS i USING (film_id)
JOIN rental AS r USING (inventory_id)
JOIN payment AS p USING (rental_id)
GROUP BY a.actor_id

… to get the revenue of all films we made for any given actor. Something like this:

|first_name|last_name |sum         |
|----------|----------|------------|
|ADAM      |GRANT     |974.19      |
|ADAM      |HOPPER    |1532.21     |
|AL        |GARLAND   |1525.87     |
|ALAN      |DREYFUSS  |1850.29     |
|ALBERT    |JOHANSSON |2202.78     |
|ALBERT    |NOLTE     |2183.75     |

But nevertheless. It is good to know how JOIN works, syntactically, because there will be that edge case where you want to give precedence to an INNER JOIN and then LEFT JOIN the entire product, or something like that. In that case, parentheses are your friend.

3. DISTINCT

There are no parentheses around the “DISTINCT arguments”. Despite a lot of people thinking there are. This is not what some people think it is:

SELECT DISTINCT (actor_id), first_name, last_name
FROM actor

The parentheses can be omitted, and it is the exact same thing.

The other nice thing about SQL syntax is that you never know whether parentheses are part of the language, or part of an expression. In fact, the above (id) is a row value expression again. If you (accidentally) wrapped more than one column in those parentheses:

SELECT DISTINCT (actor_id, first_name), last_name
FROM actor

Then you’d still get the same behaviour (in PostgreSQL, which supports nesting records like that), but the result is not what you wanted:

|row              |last_name|
|-----------------|---------|
|(1,PENELOPE)     |GUINESS  |
|(2,NICK)         |WAHLBERG |
|(3,ED)           |CHASE    |
|(4,JENNIFER)     |DAVIS    |

Funky eh?

4. UNION, INTERSECT, EXCEPT

The nice thing about SQL is that you can never derive a rule from another rule. While all types of JOIN have the same precedence, the set operators do not. Duh. Luckily, most people don’t use INTERSECT or EXCEPT at all, only UNION [ ALL ], so they don’t run into this problem.

In case there is any doubt, INTERSECT has a higher precedence, whereas UNION and EXCEPT have the same precedence (in the SQL standard). I’ll prove it by quoting it. From ISO/IEC 9075-2:2016(E) 7.17 <query expression> (I’ve removed some irrelevant bits)

<query expression body> ::=
  <query term>
| <query expression body> UNION [ ALL ] <query term>
| <query expression body> EXCEPT [ ALL ] <query term>

<query term> ::=
  <query primary>
| <query term> INTERSECT [ ALL ] <query primary>

See. INTERSECT has a higher precedence. Let’s try it (in PostgreSQL):

SELECT 2 AS a, 3 AS b
UNION
SELECT 1 AS a, 2 AS b
INTERSECT
SELECT 1 AS a, 2 AS b

We get:

|a          |b          |
|-----------|-----------|
|1          |2          |
|2          |3          |

So, what we’ve done effectively (and the parenthesis are totally allowed, because <query term> allows it), is this

SELECT 2 AS a, 3 AS b
UNION
(
  SELECT 1 AS a, 2 AS b
  INTERSECT
  SELECT 1 AS a, 2 AS b
)

Now, for the life of me, I really don’t know how to nicely format parenthesised set operations in SQL. It looks awkward. But here we are. The above is a totally different thing than this:

(
  SELECT 2 AS a, 3 AS b
  UNION
  SELECT 1 AS a, 2 AS b
)
INTERSECT
SELECT 1 AS a, 2 AS b

If we do the union first, and then the intersection, the only resulting record is:

|a          |b          |
|-----------|-----------|
|1          |2          |

A similar effect can be seen when combining UNION and UNION ALL, which have the same precedence and are … left associative. Lucky again. So, this query here:

SELECT 2 AS a, 3 AS b
UNION
SELECT 1 AS a, 2 AS b
UNION ALL
SELECT 1 AS a, 2 AS b

Has a UNION that does not get to remove any duplicates, and then a UNION ALL that adds a duplicate. It is equivalent to this (remember, left associative):

(
  SELECT 2 AS a, 3 AS b
  UNION
  SELECT 1 AS a, 2 AS b
)
UNION ALL
SELECT 1 AS a, 2 AS b

The result is:

|a          |b          |
|-----------|-----------|
|1          |2          |
|2          |3          |
|1          |2          |

It is quite a different beast from this one:

SELECT 2 AS a, 3 AS b
UNION
(
  SELECT 1 AS a, 2 AS b
  UNION ALL
  SELECT 1 AS a, 2 AS b
)

Now, we’ve used parentheses to force the query to do the UNION ALL operator first (producing duplicates), but then, the UNION removes them agian. The result is now:

|a          |b          |
|-----------|-----------|
|1          |2          |
|2          |3          |

Bonus: What about the ORDER BY clause

Ah, yes. The miracles of SQL syntax. In theory (and in the SQL standard), there is an optional ORDER BY clause over all set operations. E.g. like this:

SELECT 2 AS a, 3 AS b
UNION
SELECT 1 AS a, 2 AS b
INTERSECT
SELECT 1 AS a, 2 AS b
ORDER BY a DESC

We’re getting:

|a          |b          |
|-----------|-----------|
|2          |3          |
|1          |2          |

Think of it this way:

(
  SELECT 2 AS a, 3 AS b
  UNION
  SELECT 1 AS a, 2 AS b
  INTERSECT
  SELECT 1 AS a, 2 AS b
)
ORDER BY a DESC

Some dialects may or may not allow you to actually place those parentheses. E.g. PostgreSQL does.

5. Subqueries

When you write subqueries, or specifically:

  • Scalar subqueries (subqueries in SELECT or WHERE, etc.)
  • Derived tables (subqueries in FROM)

… then you must put them in parentheses. For example, a correlated (scalar) subquery:

SELECT 
  first_name, 
  last_name, (
    -- Correlated subquery here
    SELECT count(*) 
    FROM film_actor AS fa 
    WHERE fa.actor_id = a.actor_id
  ) c
FROM actor AS a

These parentheses are not optional. Neither are those around the derived table:

SELECT 
  first_name, 
  last_name, 
  c
FROM actor AS a
JOIN (
  -- Derived table here
  SELECT actor_id, count(*) AS c
  FROM film_actor
  GROUP BY actor_id
) fa USING (actor_id)

Conclusion

SQL (and its syntax) is a device whose mystery is only exceeded by its power. With parentheses (as with other syntactic tokens), you can never really know if they are part of the language, of some clause, or part of an expression.

With some operators, parentheses help better understand the syntactic structure of the language.

  • Row value expressions of degree 1 have optional parentheses. We never write them, but we could have!
  • JOINs are really trees, not lists. We can nest them arbitrarily to get associativity under control.
  • DISTINCT doesn’t have parentheses. Don’t be fooled by your coworkers’ coding style!
  • Set operations have two levels of precedence: 1) INTERSECT and 2) UNION and EXCEPT (assuming standards compliance).
  • Subqueries (outside of set operations) have mandatory parentheses

I’m curious about your own greatest SQL syntax confusions. Please leave them in the comments.

How SQL DISTINCT and ORDER BY are Related

One of the things that confuse SQL users all the time is how DISTINCT and ORDER BY are related in a SQL query.

The Basics

Running some queries against the Sakila database, most people quickly understand:

SELECT DISTINCT length FROM film

This returns results in an arbitrary order, because the database can (and might apply hashing rather than ordering to remove duplicates):

length |
-------|
129    |
106    |
120    |
171    |
138    |
80     |
...

Most people also understand:

SELECT length FROM film ORDER BY length

This will give us duplicates, but in order:

length |
-------|
46     |
46     |
46     |
46     |
46     |
47     |
47     |
47     |
47     |
47     |
47     |
47     |
48     |
...

And, of course, we can combine the two:

SELECT DISTINCT length FROM film ORDER BY length

Resulting in…

length |
-------|
46     |
47     |
48     |
49     |
50     |
51     |
52     |
53     |
54     |
55     |
56     |
...

Then why doesn’t this work?

Maybe somewhat intuitively, we may want to order the lengths differently, e.g. by title:

SELECT DISTINCT length FROM film ORDER BY title

Most databases fail this query with an exception like Oracle’s:

ORA-01791: not a SELECTed expression

At first sight, this seems funny, because this works after all:

SELECT length FROM film ORDER BY title

Yielding:

length |
-------|
86     |
48     |
50     |
117    |
130    |
...

We could add the title to illustrate the ordering

length |title                       |
-------|----------------------------|
86     |ACADEMY DINOSAUR            |
48     |ACE GOLDFINGER              |
50     |ADAPTATION HOLES            |
117    |AFFAIR PREJUDICE            |
130    |AFRICAN EGG                 |

So, how are these different?

We have to rewind and check out the logical order of SQL operations (as opposed to the syntactic order). And always remember, this is the logical order, not the actual order executed by the optimiser.

When we write something like this:

SELECT DISTINCT length FROM film ORDER BY length

The logical order of operations is:

  • FROM clause, loading the FILM table
  • SELECT clause, projecting the LENGTH column
  • DISTINCT clause, removing distinct tuples (with projected LENGTH columns)
  • ORDER BY clause, ordering by the LENGTH column

If we look at this step by step, we have:

Step 1: SELECT * FROM film

The intermediary data set is something like:

film_id |title                       |length | ...
--------|----------------------------|-------| ...
1       |ACADEMY DINOSAUR            |86     | ...
2       |ACE GOLDFINGER              |48     | ...
3       |ADAPTATION HOLES            |50     | ...
4       |AFFAIR PREJUDICE            |117    | ...
5       |AFRICAN EGG                 |130    | ...
...     |...                         |...    | ...

Step 2: SELECT length …

The intermediary data set is something like:

length |
-------|
86     |
48     |
50     |
117    |
130    |
...
86     | <-- duplicate

Step 3: SELECT DISTINCT length …

Now we’re getting a new random order (due to hashing) and no duplicates anymore:

length |
-------|
129    |
106    |
120    |
171    |
138    |
...

Step 4: … ORDER BY length

And we’re getting:

length |
-------|
46     |
47     |
48     |
49     |
50     |
...

It seems obvious.

So why did this work?

Remember, this query worked:

SELECT length FROM film ORDER BY title

Even if after projecting the LENGTH column, it seems as though it is no longer available for sorting, it really is, according to the SQL standard and to common sense. There is a concept called extended sort key columns in the SQL standard, which means the above query has a slightly different order of operations (apart from the fact that there is no DISTINCT operation):

  • FROM clause, loading the FILM table
  • SELECT clause, projecting the LENGTH column from the select list and the TITLE from the extended sort key columns
  • ORDER BY clause, ordering by the TITLE column
  • SELECT clause (implicit), projecting only the LENGTH column, discarding the TITLE column

Again, this is what happens logically. Database optimisers may choose other ways to implement this. By example:

Step 1: SELECT * FROM film

Same as before

film_id |title                       |length | ...
--------|----------------------------|-------| ...
1       |ACADEMY DINOSAUR            |86     | ...
2       |ACE GOLDFINGER              |48     | ...
3       |ADAPTATION HOLES            |50     | ...
4       |AFFAIR PREJUDICE            |117    | ...
5       |AFRICAN EGG                 |130    | ...
...     |...                         |...    | ...

Step 2: SELECT length, title…

We get that synthetic extended sort key column TITLE along with the LENGTH column that we requested

length |title                       |
-------|----------------------------|
86     |ACADEMY DINOSAUR            |
114    |ALABAMA DEVIL               |
50     |ADAPTATION HOLES            |
117    |AFFAIR PREJUDICE            |
168    |ANTITRUST TOMATOES          |
...

Step 3: … ORDER BY title

… we can now order by that column

length |title                       |
-------|----------------------------|
86     |ACADEMY DINOSAUR            |
48     |ACE GOLDFINGER              |
50     |ADAPTATION HOLES            |
117    |AFFAIR PREJUDICE            |
130    |AFRICAN EGG                 |
...

Step 4: SELECT length

… and finally discard it, because we never wanted it

length |
-------|
86     |
48     |
50     |
117    |
130    |

So why can’t we use DISTINCT?

If we try to run this:

SELECT DISTINCT length FROM film ORDER BY title

We would get an additional DISTINCT operation in our logical set of operations:

  • FROM clause, loading the FILM table
  • SELECT clause, projecting the LENGTH column from the select list and the TITLE from the extended sort key columns
  • DISTINCT clause, removing duplicate (LENGTH, TITLE) values… Ooops
  • ORDER BY clause, ordering by the TITLE column
  • SELECT clause (implicit), projecting only the LENGTH column, discarding the TITLE column

The problem is, since we have synthetically added the extended sort key column TITLE to the projection in order to be able to ORDER BY it, DISTINCT wouldn’t have the same semantics anymore as can be seen here:

SELECT count(*)
FROM (
  SELECT DISTINCT length FROM film
) t;

SELECT count(*)
FROM (
  SELECT DISTINCT length, title FROM film
) t;

Yielding

140
1000

All titles are distinct. There is no way this query can be executed reasonably. Either DISTINCT doesn’t work (because the added extended sort key column changes its semantics), or ORDER BY doesn’t work (because after DISTINCT we can no longer access the extended sort key column).

A more constructed example. T contains this data:

CREATE TABLE t (a INT, b INT);
INSERT INTO t VALUES (1, 1);
INSERT INTO t VALUES (1, 2);
INSERT INTO t VALUES (2, 3);
INSERT INTO t VALUES (1, 4);
INSERT INTO t VALUES (2, 5);
A   B
-----
1   1
1   2
2   3
1   4
2   5

What would this query produce?

SELECT DISTINCT a FROM t ORDER BY b;

Clearly, we should only get 2 rows with values 1, 2, because of DISTINCT a:

A 
--
1
2

Now, how do we order these by B? There are 3 values of B associated A = 1 and 2 values of B associated with A = 2:

A   B
------------------
1   Any of 1, 2, 4
2   Any of 3, 5

Should we get 1, 2 or 2, 1 as a result? Impossible to tell.

But there are some exceptions

The way I read the SQL standard, the following exception should be possible. The SQL standard ISO/IEC 9075-2:2016(E), 7.17 <query expression>, Syntax Rules 28) d) i) 6) references the “Left normal form derivation”. But I may be reading this wrong, see also a discussion on the PostgreSQL mailing list:
https://www.postgresql.org/message-id/20030819103859.L69440-100000%40megazone.bigpanda.com

In any case, it still makes sense to me. For instance, we can form expressions on the columns in the select list. This is totally fine in MySQL (strict mode) and Oracle:

SELECT DISTINCT length 
FROM film 
ORDER BY mod(length, 10), length;

It will produce

length |
-------|
50     |
60     |
70     |
80     |
90     |
100    |
110    |
120    |
130    |
140    |
150    |
160    |
170    |
180    |
51     |
61     |
71     |

PostgreSQL doesn’t allow this because the expression MOD(LENGTH, 10) is not in the select list. How to interpret this? We’re looking again at the order of SQL operations:

  • FROM clause, loading the FILM table
  • SELECT clause, projecting the LENGTH column from the select list. MOD(LENGTH, 10) does not have to be put in the extended sort key columns, because it can be fully derived from the select list.
  • DISTINCT clause, removing duplicate LENGTH values … all fine, because we don’t have the verboten extended sort key columns
  • ORDER BY clause, ordering by the mod(LENGTH, 10), LENGTH columns. Totally fine, because we can derive all of these order by expressions from expressions in the select list

Makes sense, right?

Back to our constructed table T:

A   B
-----
1   1
1   2
2   3
1   4
2   5

We are allowed to write:

SELECT DISTINCT a, b FROM t ORDER BY a - b;

We would get:

A   B
-----
1   4
2   5
2   3
1   2
1   1

Again, the order by expressions can be derived completely from the select list. This also works in Oracle:

SELECT DISTINCT a - b FROM t ORDER BY abs(a - b);

The select list contains a column A - B, so we can derive any ORDER BY expression from it. But these don’t work:

SELECT DISTINCT a - b FROM t ORDER BY a;
SELECT DISTINCT a - b FROM t ORDER BY b;
SELECT DISTINCT a - b FROM t ORDER BY b - a;

It is easy to build an intuition for why these don’t work. Clearly, the data set we want is:

A - B  A             B             B - A
------------------------------------------
-3     Any of 1, 2   Any of 4, 5   3
-1     Any of 2, 1   Any of 3, 2   1
 0     Any of 1      Any of 1      0

Now, how are we supposed to order these by A, B or B - A? It looks as though we should be able to sort by B - A in this case. We could derive a complicated transformation of expressions that can be reasonably transformed into each other, such as A - B = -(B - A), but this simply isn’t practical. The expression in the projection is A - B, and that’s the only expression we can re-use in the ORDER BY. For example, we could even do this in Oracle:

SELECT DISTINCT a - b FROM t ORDER BY abs((a - b) + (a - b));

Or start using aliases:

SELECT DISTINCT a - b AS x FROM t ORDER BY abs(x + x);

Conclusion

The SQL language is quirky. This is mostly because the syntactical order of operations doesn’t match the logical order of operations. The syntax is meant to be human readable (remember Structured English Query Language?) but when reasoning about a SQL statement, we would often like to directly write down the logical order of operations.

In this article, we haven’t even touched the implications of adding

  • GROUP BY
  • TOP / LIMIT / FETCH
  • UNION

Which add more fun rules to what’s possible and what isn’t. Our previous article on the true logical order of SQL operations explains this completely.

Need more explanation? Check this out.

Should I Put That Table Alias or Not?

Infrequent SQL developers often get confused about when to put parentheses and/or aliases on derived tables. There has been this recent Reddit discussion about the subject, where user Elmhurstlol was wondering why they needed to provide an alias to the derived table (the subselect with the UNION) in the following query:

SELECT AVG(price) AS AVG_PRICE
FROM (
  SELECT price from product a JOIN pc b
  ON a.model=b.model AND maker='A'
  UNION ALL
  SELECT price from product a JOIN laptop c
  ON a.model=c.model and maker='A'
) hello

The question really was about why the "hello" table alias was necessary, because often it seems not to be required.

Here’s what the SQL standard states

If in doubt, it is often useful to consider the SQL standard about the rationale behind some syntax elements. In this case, let’s consider the freely available SQL 1992 standard text (for simplicity), and see how it specifies table references:

<table reference> ::=
    <table name> [ [ AS ] <correlation name>
        [ <left paren> <derived column list> 
          <right paren> ] ]
  | <derived table> [ AS ] <correlation name>
        [ <left paren> <derived column list>
          <right paren> ]
  | <joined table>

<derived table> ::= <table subquery>

<table subquery> ::= <subquery>

<subquery> ::= <left paren> <query expression>
               <right paren>

(for more information about the awesome and completely underused derived column list feature, read this article here)

The essence of the above syntax specification is this:

  • A derived table MUST always be aliased
  • The AS keyword is optional, for improved readability
  • The parentheses MUST always be put around subqueries

Following these rules, you’ll be pretty safe in most SQL dialects. Here are some deviations to the above, though:

  • Some dialects allow for unaliased derived tables. However, this is still a bad idea, because you will not be able to unambiguously qualify a column from such a derived table

Takeaway

Always provide a meaningful alias to your derived tables. As simple as that.

MySQL Bad Idea #666

MySQL… We’ve blogged about MySQL before. Many times. We’ve shown bad ideas implemented in MySQL here:

But this beats everything. Check out this Stack Overflow question. It reads: “Why Oracle does not support ‘group by 1,2,3’?”. At first, I thought this user might have been confused because SQL allows for referencing columns by (1-based!) index in ORDER BY clauses:

SELECT first_name, last_name
FROM customers
ORDER BY 1, 2

The above is equivalent to ORDER BY first_name, last_name. The indexes 1, 2 refer to columns from the projection. This might be useful every now and then to avoid repeating complex column expressions, although it is probably a bit risky as you can change ordering semantics when adding a column to the SELECT clause.

But this user wanted to use the same syntax for the GROUP BY clause. And this actually works in MySQL! Check out the following query:

SELECT a, b
FROM (
  SELECT 'a' a, 'b' b, 'c' c UNION ALL
  SELECT 'a'  , 'c'  , 'c'   UNION ALL
  SELECT 'a'  , 'b'  , 'd'
) t
GROUP BY 1, 2
ORDER BY 2

See an example on SQLFiddle

The above yields…

| A | B |
|---|---|
| a | b |
| a | c |

But what would this even mean? According to our in-depth explanation of SQL clauses, the projection (SELECT clause) is logically evaluated after the GROUP BY clause. In other words, the columns defined in the SELECT clause are not yet in scope of the GROUP BY clause. Hence, the only reasonable semantics of column indexes would be the index from the table source t. But that’s not the case. Check out this alternative query:

SELECT a, b
FROM (
  SELECT 'a' a, 'b' b, 'c' c UNION ALL
  SELECT 'a'  , 'c'  , 'c'   UNION ALL
  SELECT 'a'  , 'b'  , 'd'
) t
GROUP BY 1, 2
ORDER BY 2

See an example on SQLFiddle

This now yields:

| B | C |
|---|---|
| b | c |
| c | c |
| b | d |

And it’s (probably?) the expected behaviour in MySQL as the documentation states:

Columns selected for output can be referred to in ORDER BY and GROUP BY clauses using column names, column aliases, or column positions. Column positions are integers and begin with 1

The documentation actually treats GROUP BY in a very similar fashion as ORDER BY. For instance, it is possile to specify the ordering direction using GROUP BY only:

MySQL extends the GROUP BY clause so that you can also specify ASC and DESC after columns named in the clause:

SELECT a, COUNT(b) 
FROM test_table GROUP BY a DESC;

While we cannot figure out a reasonable edge-case that breaks this feature, we still think that there is something fishy about it. The fact that the SELECT clause is logically evaluated after the table source (FROM, WHERE, GROUP BY, HAVING), allowing the GROUP BY clause to reference it seems to lead to a weird understanding of SQL.

On the other hand, SQL is a very verbose language with little support for declaring reusable objects, short of common table expressions and the WINDOW clause. It is actually a bit surprising that the SQL standards folks would support this WINDOW clause to declare reusable window frames before introducing much more usable “common column expressions”, e.g:

-- Common column/table expressions:
WITH x AS CASE t1.a 
          WHEN 1 THEN 'a'
          WHEN 2 THEN 'b'
                 ELSE 'c'
          END, 
     y AS SOME_FUNCTION(t2.a, t2.b)
SELECT x, NVL(y, x)
FROM t1 JOIN t2 ON t1.id = t2.id
GROUP BY x, y
ORDER BY x DESC, y ASC

With common column expressions, reusing column expressions is independent of the SELECT clause itself. In other words, you can reuse column expressions in JOIN clauses, WHERE clauses, GROUP BY clauses, HAVING clauses, etc. without having to actually SELECT them.

So, to be fair with MySQL, while this feature is a non-feature in its current form, it provides a workaround for SQL’s verbosity.

PostgreSQL 9.3 Released!

A great database has just gotten better. PostgreSQL 9.3 has been released today. While improved reliability and availability is certainly quite a thrilling addition, from the jOOQ perspective, the most interesting features are new SQL syntax elements. These include:

  • Better support for JSON
  • LATERAL JOIN (or as SQL:1999 calls it: lateral derived table)
  • Materialised views
  • Updatable views

See all the improvements for yourself:
https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.3