It appears that our recent beginner SQL articles explaining SQL syntax were quite popular. These include:
Today, I want to tackle the problem of understanding and learning SQL syntax from a new angle.
SQL’s COBOL-esque syntactic heritage
Unlike most C-style languages, SQL syntax is a messy mixture of keywords, special characters, and both with a lot of overloaded meanings. I’d call myself rather experienced with SQL, yet I’m still often unsure about whether something is a keyword or an identifier (and yes, it does matter):
From my past SQL training sessions, however, I have learned that one of the best tools to understand what’s really going on in SQL syntax is to add parentheses at random places. Parentheses can be noisy, but they can definitely help understand the syntactic structure of a statement.
A trivial example from arithmetics, or logic:
In almost all languages, operator precedence dictates that the above be equivalent to this:
a + (b * c)
a OR (b AND c)
And already, we’ve introduced some syntactic clarity to readers who might not be aware of operator precedence. While the parentheses are not necessary in these cases, they can help with readability if the expressions are much more complex than the above.
Did you know you can place parentheses at many other locations in SQL? Here’s a list of interesting places, where most people usually omit parentheses. I’m going to use the
Sakila database for examples.
In every day language, we don’t call them “row value expressions”. We call them ordinary “column expressions”. For example, we might look for a specific actor by name
SELECT *
FROM actor
WHERE first_name = 'SUSAN'
AND last_name = 'DAVIS';
We will get:
|actor_id |first_name|last_name |
|-----------|----------|----------|
|101 |SUSAN |DAVIS |
|110 |SUSAN |DAVIS |
But what we really did was this:
-- Works in many dialects
SELECT *
FROM actor
WHERE (first_name) = ('SUSAN')
AND (last_name) = ('DAVIS');
Or also:
-- Works in a few dialects, including e.g. PostgreSQL
SELECT *
FROM actor
WHERE ROW (first_name) = ROW ('SUSAN')
AND ROW (last_name) = ROW ('DAVIS');
The parentheses are somewhat ambiguous in this case, the
ROW
constructor makes it more obvious. The SQL standard specifies a “row value expression special case”, i.e. the case where a row value expression (i.e. a tuple expression) is of degree 1. In case of which most people will “colloquially” omit the parentheses, thinking we’re comparing column expressions.
In this example, we could rewrite the query to this:
SELECT *
FROM actor
WHERE (first_name, last_name) = ('SUSAN', 'DAVIS');
Or, in funky Oracle, double parentheses are required for no good reason on the right hand side.
SELECT *
FROM actor
WHERE (first_name, last_name) = (('SUSAN', 'DAVIS'));
But I was not going to confuse you, I was going to clarify things ;-)
In fact, once you get a hang of row value expressions, you will quickly want to use them for less trivial cases than avoiding two
AND
connected predicates. For example, you can use this approach also with the
IN
predicate (and if you’re daring enough, with the
NOT IN
predicate, but beware of
NULLS
):
SELECT *
FROM actor
WHERE (first_name, last_name) IN (
('SUSAN', 'DAVIS'),
('NICK' , 'WAHLBERG')
)
Wow! Result:
|actor_id |first_name|last_name |
|-----------|----------|----------|
|2 |NICK |WAHLBERG |
|101 |SUSAN |DAVIS |
|110 |SUSAN |DAVIS |
And not only that. You can also use the same approach with
IN
and subqueries:
-- Actors that have the same name as a customer
SELECT *
FROM actor
WHERE (first_name, last_name) IN (
SELECT first_name, last_name
FROM customer
)
And indeed:
|actor_id |first_name|last_name |
|-----------|----------|----------|
|4 |JENNIFER |DAVIS |
Yes! The “colloquial”
IN
predicate that you all wrote without the parentheses around the row value expression, in reality, had a row value expression of degree 1 on the left side,
AND on the right side. Increase that degree to 2, and now you need to write a subquery with 2 columns as well.
It’s quite idiomatic, and readable, and nifty. Quite a few dialects support this.
Joins are very misunderstood, in SQL. Very very misunderstood. One reason is because people confuse them with other set operations, such as
UNION
,
and thus illustrate them with Venn Diagrams. I’ve already covered that. That article explains that JOINs are just (filtered) cartesian products.
The simplest JOIN type is
CROSS JOIN
. It does not have an
ON
or
USING
clause. Much more common, however, is
INNER JOIN
, or just
JOIN
(
INNER
being an optional keyword).
SELECT a.first_name, a.last_name, f.title
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
We’ve all typed millions of these to obtain:
|first_name|last_name |title |
|----------|----------|------------------------------|
|... |... |... |
|PENELOPE |GUINESS |SPLASH GUMP |
|PENELOPE |GUINESS |VERTIGO NORTHWEST |
|PENELOPE |GUINESS |WESTWARD SEABISCUIT |
|PENELOPE |GUINESS |WIZARD COLDBLOODED |
|NICK |WAHLBERG |ADAPTATION HOLES |
|NICK |WAHLBERG |APACHE DIVINE |
|NICK |WAHLBERG |BABY HALL |
|... |... |... |
But what are these
JOIN
things? They are
not SELECT
clauses! They are operators like
+
or
*
or
AND
or
OR
.
And like any operator, the messy laws of
associativity apply (as all the different join types have the same operator precedence, at least in the SQL standard. Some wonky implementations may disagree).
Luckily, left associativity applies with
JOIN
(right associativity would be madness). So, our magic parentheses come into play again. Our query from before is really this:
SELECT a.first_name, a.last_name, f.title
FROM (
actor AS a
JOIN film_actor AS fa
USING (actor_id)
)
JOIN film AS f
USING (film_id)
Can you see it? The readability problem here is the
USING
(or
ON
) clause. Without it, I could have written:
(actor JOIN film_actor) JOIN film
So, join
FILM_ACTOR
to
ACTOR
and then join the
FILM
table to the product.
Because
INNER JOIN
is associative (both left and right), we could write the following equivalent expression:
actor JOIN (film_actor JOIN film)
So, join
FILM
to
FILM_ACTOR
first, and then join the product to
ACTOR
. Quite a different beast, logically (the optimiser shouldn’t make a difference in this case).
… or in full SQL
SELECT a.first_name, a.last_name, f.title
FROM actor AS a
JOIN (
film_actor
JOIN film AS f
USING (film_id)
)
USING (actor_id)
Egh. Don’t be clever. Use this only if you really have to. I find it quite unreadable. Careful formatting is essential here. But it works.
With more than 3 tables, you could build entire
JOIN
trees. While, in order to stay sane, most people will not actually build
JOIN
trees (yes, a list is also a tree), it is totally possible to do this confusing thing:
SELECT a.first_name, a.last_name, sum(p.amount)
FROM (
actor AS a
JOIN film_actor AS fa
USING (actor_id)
)
JOIN (
film AS f
JOIN (
inventory AS i
JOIN rental AS r
USING (inventory_id)
) USING (film_id)
) USING (film_id)
JOIN payment AS p
USING (rental_id)
GROUP BY a.actor_id
Instead of this nice clean SQL:
SELECT a.first_name, a.last_name, sum(p.amount)
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
JOIN inventory AS i USING (film_id)
JOIN rental AS r USING (inventory_id)
JOIN payment AS p USING (rental_id)
GROUP BY a.actor_id
… to get the revenue of all films we made for any given actor. Something like this:
|first_name|last_name |sum |
|----------|----------|------------|
|ADAM |GRANT |974.19 |
|ADAM |HOPPER |1532.21 |
|AL |GARLAND |1525.87 |
|ALAN |DREYFUSS |1850.29 |
|ALBERT |JOHANSSON |2202.78 |
|ALBERT |NOLTE |2183.75 |
But nevertheless. It is good to know how
JOIN
works, syntactically, because there
will be that edge case where you want to give precedence to an
INNER JOIN
and then
LEFT JOIN
the entire product, or something like that. In that case, parentheses are your friend.
There are no parentheses around the “
DISTINCT
arguments”. Despite a lot of people thinking there are. This is not what some people think it is:
SELECT DISTINCT (actor_id), first_name, last_name
FROM actor
The parentheses can be omitted, and it is the exact same thing.
The other nice thing about SQL syntax is that you never know whether parentheses are part of the language, or part of an expression. In fact, the above
(id)
is a row value expression again. If you (accidentally) wrapped more than one column in those parentheses:
SELECT DISTINCT (actor_id, first_name), last_name
FROM actor
Then you’d still get the same behaviour (in PostgreSQL, which supports nesting records like that), but the result is not what you wanted:
|row |last_name|
|-----------------|---------|
|(1,PENELOPE) |GUINESS |
|(2,NICK) |WAHLBERG |
|(3,ED) |CHASE |
|(4,JENNIFER) |DAVIS |
Funky eh?
The nice thing about SQL is that you can never derive a rule from another rule. While all types of
JOIN
have the same precedence, the set operators do not. Duh. Luckily, most people don’t use
INTERSECT or EXCEPT at all, only
UNION [ ALL ]
, so they don’t run into this problem.
In case there is any doubt,
INTERSECT
has a higher precedence, whereas
UNION
and
EXCEPT
have the same precedence (in the SQL standard). I’ll prove it by quoting it. From ISO/IEC 9075-2:2016(E) 7.17 <query expression> (I’ve removed some irrelevant bits)
<query expression body> ::=
<query term>
| <query expression body> UNION [ ALL ] <query term>
| <query expression body> EXCEPT [ ALL ] <query term>
<query term> ::=
<query primary>
| <query term> INTERSECT [ ALL ] <query primary>
See.
INTERSECT
has a higher precedence. Let’s try it (in PostgreSQL):
SELECT 2 AS a, 3 AS b
UNION
SELECT 1 AS a, 2 AS b
INTERSECT
SELECT 1 AS a, 2 AS b
We get:
|a |b |
|-----------|-----------|
|1 |2 |
|2 |3 |
So, what we’ve done effectively (and the parenthesis are totally allowed, because
<query term>
allows it), is this
SELECT 2 AS a, 3 AS b
UNION
(
SELECT 1 AS a, 2 AS b
INTERSECT
SELECT 1 AS a, 2 AS b
)
Now, for the life of me, I really don’t know how to nicely format parenthesised set operations in SQL. It looks awkward. But here we are. The above is a
totally different thing than this:
(
SELECT 2 AS a, 3 AS b
UNION
SELECT 1 AS a, 2 AS b
)
INTERSECT
SELECT 1 AS a, 2 AS b
If we do the union first, and then the intersection, the only resulting record is:
|a |b |
|-----------|-----------|
|1 |2 |
A similar effect can be seen when combining
UNION
and
UNION ALL
, which have the same precedence and are … left associative. Lucky again. So, this query here:
SELECT 2 AS a, 3 AS b
UNION
SELECT 1 AS a, 2 AS b
UNION ALL
SELECT 1 AS a, 2 AS b
Has a
UNION
that does not get to remove any duplicates, and then a
UNION ALL
that adds a duplicate. It is equivalent to this (remember, left associative):
(
SELECT 2 AS a, 3 AS b
UNION
SELECT 1 AS a, 2 AS b
)
UNION ALL
SELECT 1 AS a, 2 AS b
The result is:
|a |b |
|-----------|-----------|
|1 |2 |
|2 |3 |
|1 |2 |
It is quite a different beast from this one:
SELECT 2 AS a, 3 AS b
UNION
(
SELECT 1 AS a, 2 AS b
UNION ALL
SELECT 1 AS a, 2 AS b
)
Now, we’ve used parentheses to force the query to do the
UNION ALL
operator first (producing duplicates), but then, the
UNION
removes them agian. The result is now:
|a |b |
|-----------|-----------|
|1 |2 |
|2 |3 |
Bonus: What about the ORDER BY clause
Ah, yes. The miracles of SQL syntax.
In theory (and in the SQL standard), there is an optional
ORDER BY
clause over
all set operations. E.g. like this:
SELECT 2 AS a, 3 AS b
UNION
SELECT 1 AS a, 2 AS b
INTERSECT
SELECT 1 AS a, 2 AS b
ORDER BY a DESC
We’re getting:
|a |b |
|-----------|-----------|
|2 |3 |
|1 |2 |
Think of it this way:
(
SELECT 2 AS a, 3 AS b
UNION
SELECT 1 AS a, 2 AS b
INTERSECT
SELECT 1 AS a, 2 AS b
)
ORDER BY a DESC
Some dialects may or may not allow you to actually place those parentheses. E.g. PostgreSQL does.
When you write subqueries, or specifically:
- Scalar subqueries (subqueries in
SELECT
or WHERE
, etc.)
- Derived tables (subqueries in
FROM
)
… then you
must put them in parentheses. For example, a correlated (scalar) subquery:
SELECT
first_name,
last_name, (
-- Correlated subquery here
SELECT count(*)
FROM film_actor AS fa
WHERE fa.actor_id = a.actor_id
) c
FROM actor AS a
These parentheses are not optional. Neither are those around the derived table:
SELECT
first_name,
last_name,
c
FROM actor AS a
JOIN (
-- Derived table here
SELECT actor_id, count(*) AS c
FROM film_actor
GROUP BY actor_id
) fa USING (actor_id)
SQL (and its syntax) is a device whose mystery is only exceeded by its power. With parentheses (as with other syntactic tokens), you can never really know if they are part of the language, of some clause, or part of an expression.
With some operators, parentheses help better understand the syntactic structure of the language.
- Row value expressions of degree 1 have optional parentheses. We never write them, but we could have!
- JOINs are really trees, not lists. We can nest them arbitrarily to get associativity under control.
- DISTINCT doesn’t have parentheses. Don’t be fooled by your coworkers’ coding style!
- Set operations have two levels of precedence: 1)
INTERSECT
and 2) UNION
and EXCEPT
(assuming standards compliance).
- Subqueries (outside of set operations) have mandatory parentheses
I’m curious about your own greatest SQL syntax confusions. Please leave them in the comments.
Like this:
Like Loading...
Parantheses, semicolons and indentations. We write code for humans alter all. If there is even slight chance for misinterpretation it is better to be explicit like the Python rule: EIBTI(Explicit Is Better Than Implicit).
Well, there’s no absolute rule. I mean, every part of an expression could be assigned to a local variable for maximum explicitness. Clearly, this is exaggerated…
“I’m curious about your own greatest SQL syntax confusions.”
Some performance changes when wrapping with additional brackets/subquery:
1) SQL Server: Optimization level based on condition: 1=1 is always true condition and could be skiped by query optimizer:
2) Oracle: Scalar subquery caching https://oracle-base.com/articles/misc/efficient-function-calls-from-sql#scalar-subquery-caching-revisited
vs
3) Not really brackets but reading from non-existing schema:
https://stackoverflow.com/questions/49540736/weird-result-of-using-cte/49542750#49542750
Strange case and I would really want to know the rationale behind it.
4) Lateral column alias reference and undeterministic functions behavior – the idea was really tempting, it turns it was inlining
https://stackoverflow.com/questions/59334543/amazon-redshift-lateral-column-alias-reference
Thanks for the comments! Interesting bits.
1) is one of those weird things which I’ve observed also in this blog post: https://blog.jooq.org/2017/09/28/10-cool-sql-optimisations-that-do-not-depend-on-the-cost-model. It looks relatively trivial to optimise “statically” prior to passing the query to the cost based optimiser. But perhaps, for historic reasons, there’s some assumption about scalar subqueries to never be inlined, which users have come to rely upon. Similar to 2) (Also on this blog: https://blog.jooq.org/2011/09/02/oracle-scalar-subquery-caching/). In fact, jOOQ is “abusing” 1) a few times, e.g. when jOOQ users do not provide any
ORDER BY
clause, but anORDER BY
clause is mandatory in some context (e.g. when usingOFFSET .. FETCH
in SQL Server. When writing this:jOOQ generates this for SQL Server:
(because
ORDER BY 0
wouldn’t work)3) and 4) are very puzzling indeed. Very obscure.
Most confusing recent SQL moment, and as a reply on the ORDER BY in unions. The aliases you give on the first subquery in the union are used in the result (at least in Oracle). But when you use these aliases in the ORDER BY, the alias must be consistently used in all subqueries.
So the following query is OK.
The next query is not OK
Hmm, are you sure? I’ve run into this problem, but in my case, the penultimate union subquery needed to provide column aliases, not the first one. Even more random! https://stackoverflow.com/q/25387951/521799