The SQL language is very intuitive. Until it isn’t.
Over the years, a lot of people have criticised the SQL language for a variety of reasons. For instance: IDEs cannot easily guess what auto completion options to offer, because as long as you don’t specify the FROM
clause, there are no tables in scope (yet):
-- Don't you wish this would be completed to first_name? SELECT first_na... -- Aaah, now it works: SELECT first_na... FROM customer
These things are weird, because the lexical order of operations does not match the logical order of operations. We humans may sometimes (often) intuitively understand this ordering difference. E.g. we know that we’re about to select from the customer table. But the IDE doesn’t know this.
GROUP BY contributes the most confusion
When a junior developer / SQL beginner starts working with SQL, quite quickly, they will find out about aggregation and GROUP BY
. And they’ll quickly write things like:
SELECT count(*) FROM customer
Yay, we have 200 customers!
And then:
SELECT count(*) FROM customer WHERE first_name = 'Steve'
Wow, 90 of them are called Steve! Interesting. Let’s find out how many we have per name…
SELECT first_name, count(*) FROM customer GROUP BY first_name
Ahaa!
FIRST_NAME COUNT ------------------ Steve 90 Jane 80 Joe 20 Janet 10
Very nice. But are they all the same? Let’s check out the last name, too
SELECT first_name, last_name, count(*) FROM customer GROUP BY first_name
Oops!
ORA-00979: not a GROUP BY expression
Jeez, what does it mean? (note, unfortunately, MySQL users that do not use the STRICT mode will still get a result here with arbitrary last names!, so a new MySQL user won’t understand their mistake)
How do you easily explain this to a SQL newbie? It seems obvious to “pros”, but is it really obvious? Is it obvious enough that you can explain it easily to a junior? Think about it. Why are each one of these statements semantically correct or wrong?
-- Wrong SELECT first_name, count(*) FROM customer WHERE count(*) > 1 GROUP BY first_name -- Correct SELECT first_name, count(*) FROM customer GROUP BY first_name HAVING count(*) > 1 -- Correct SELECT first_name, count(*) FROM customer GROUP BY first_name ORDER BY count(*) DESC -- Wrong SELECT first_name, last_name, count(*) FROM customer GROUP BY first_name -- Correct SELECT first_name, MAX(last_name), count(*) FROM customer GROUP BY first_name -- Wrong SELECT first_name || ' ' || last_name, count(*) FROM customer GROUP BY first_name -- Correct SELECT first_name || ' ' || MAX(last_name), count(*) FROM customer GROUP BY first_name -- Correct SELECT MAX(first_name || ' ' || last_name), count(*) FROM customer GROUP BY first_name
The problem is syntax related
The SQL syntax works in a similar way like the English language. It is a command. We start commands with verbs. The verb is SELECT
(or INSERT
, UPDATE
, DELETE
, CREATE
, DROP
, etc. etc.)
Unfortunately, human language is incredibly ill-suited for the much more formal world of programming. While it offers some consolation to new users (possibly non-programmers) who are absolute beginners, it just makes stuff hard for everyone else. All the different SQL clauses have extremely complex interdependencies. For instance:
- In the presence of a
GROUP BY
clause, only expressions built fromGROUP BY
expressions (or functional dependencies thereof), or aggregate functions can be used inHAVING
,SELECT
, andORDER BY
clauses. - For simplicity reasons, let’s not even talk about
GROUPING SETS
- In fact, there are even a few cases in which
GROUP BY
is implied. E.g. if you write a “naked”HAVING
clause - A single aggregate function in the
SELECT
clause (in the absence ofGROUP BY
) will force aggregation into a single row - In fact, this can also be implied by putting that aggregate function in
ORDER BY
(for whatever reason) - You can
ORDER BY
quite a few expressions that reference any columns from theFROM
clause withoutSELECT
ing them. But that’s no longer true if you writeSELECT DISTINCT
- …
The list is endless. If you’re interested, you can read the SQL standard documents and check out how many weird and complicated inter-dependencies there exist between the many clauses of the SELECT
statement.
Can this ever be understood?
Luckily, yes! There’s a simple trick, which I’m always explaining to the delegates that visit my SQL Masterclass. The lexical (syntactical) order of SQL operations (clauses) does not correspond at all to the logical order of operations (although, sometimes, they do coincidentally). Thanks to modern optimisers, the order also doesn’t correspond to the actual order of operations, so we really have: syntactical -> logical -> actual order, but let’s leave that aside for now.
The logical order of operations is the following (for “simplicity” I’m leaving out vendor specific things like CONNECT BY
, MODEL
, MATCH_RECOGNIZE
, PIVOT
, UNPIVOT
and all the others):
FROM
: This is actually the first thing that happens, logically. Before anything else, we’re loading all the rows from all the tables and join them. Before you scream and get mad: Again, this is what happens first logically, not actually. The optimiser will very probably not do this operation first, that would be silly, but access some index based on theWHERE
clause. But again, logically, this happens first. Also: all theJOIN
clauses are actually part of thisFROM
clause.JOIN
is an operator in relational algebra. Just like+
and-
are operators in arithmetics. It is not an independent clause, likeSELECT
orFROM
WHERE
: Once we have loaded all the rows from the tables above, we can now throw them away again usingWHERE
GROUP BY
: If you want, you can take the rows that remain afterWHERE
and put them in groups or buckets, where each group contains the same value for theGROUP BY
expression (and all the other rows are put in a list for that group). In Java, you would get something like:Map<String, List<Row>>
. If you do specify aGROUP BY
clause, then your actual rows contain only the group columns, no longer the remaining columns, which are now in that list. Those columns in the list are only visible to aggregate functions that can operate upon that list. See below.- aggregations: This is important to understand. No matter where you put your aggregate function syntactically (i.e. in the
SELECT
clause, or in theORDER BY
clause), this here is the step where aggregate functions are calculated. Right afterGROUP BY
. (remember: logically. Clever databases may have calculated them before, actually). This explains why you cannot put an aggregate function in theWHERE
clause, because its value cannot be accessed yet. TheWHERE
clause logically happens before the aggregation step. Aggregate functions can access columns that you have put in “this list” for each group, above. After aggregation, “this list” will disappear and no longer be available. If you don’t have aGROUP BY
clause, there will just be one big group without any key, containing all the rows. HAVING
: … but now you can access aggregation function values. For instance, you can check thatcount(*) > 1
in theHAVING
clause. BecauseHAVING
is afterGROUP BY
(or impliesGROUP BY
), we can no longer access columns or expressions that were notGROUP BY
columns.WINDOW
: If you’re using the awesome window function feature, this is the step where they’re all calculated. Only now. And the cool thing is, because we have already calculated (logically!) all the aggregate functions, we can nest aggregate functions in window functions. It’s thus perfectly fine to write things likesum(count(*)) OVER ()
orrow_number() OVER (ORDER BY count(*))
. Window functions being logically calculated only now also explains why you can put them only in theSELECT
orORDER BY
clauses. They’re not available to theWHERE
clause, which happened before. Note that PostgreSQL and Sybase SQL Anywhere have an actualWINDOW
clause!SELECT
: Finally. We can now use all the rows that are produced from the above clauses and create new rows / tuples from them usingSELECT
. We can access all the window functions that we’ve calculated, all the aggregate functions that we’ve calculated, all the grouping columns that we’ve specified, or if we didn’t group/aggregate, we can use all the columns from ourFROM
clause. Remember: Even if it looks like we’re aggregating stuff inside ofSELECT
, this has happened long ago, and the sweet sweetcount(*)
function is nothing more than a reference to the result.DISTINCT
: Yes!DISTINCT
happens afterSELECT
, even if it is put before yourSELECT
column list, syntax-wise. But think about it. It makes perfect sense. How else can we remove distinct rows, if we don’t know all the rows (and their columns) yet?UNION, INTERSECT, EXCEPT
: This is a no-brainer. AUNION
is an operator that connects two subqueries. Everything we’ve talked about thus far was a subquery. The output of a union is a new query containing the same row types (i.e. same columns) as the first subquery. Usually. Because in wacko Oracle, the penultimate subquery is the right one to define the column name. Oracle database, the syntactic troll ;)ORDER BY
: It makes total sense to postpone the decision of ordering a result until the end, because all other operations might use hashmaps, internally, so any intermediate order might be lost again. So we can now order the result. Normally, you can access a lot of rows from theORDER BY
clause, including rows (or expressions) that you did notSELECT
. But when you specifiedDISTINCT
, before, you can no longer order by rows / expressions that were not selected. Why? Because the ordering would be quite undefined.OFFSET
: Don’t use offsetLIMIT, FETCH, TOP
: Now, sane databases put theLIMIT
(MySQL, PostgreSQL) orFETCH
(DB2, Oracle 12c, SQL Server 2012) clause at the very end, syntactically. In the old days, Sybase and SQL Server thought it would be a good idea to haveTOP
as a keyword inSELECT
. As if the correct ordering ofSELECT DISTINCT
wasn’t already confusing enough.
There, we have it. It makes total sense. And if you ever want to do something that is not in the “right order”, the simplest trick is always to resort to a derived table. E.g. when you want to group on a window function:
-- Doesn't work, cannot put window functions in GROUP BY SELECT ntile(4) ORDER BY (age) AS bucket, MIN(age), MAX(age) FROM customer GROUP BY ntile(4) ORDER BY (age) -- Works: SELECT bucket, MIN(age), MAX(age) FROM ( SELECT age, ntile(4) ORDER BY (age) AS bucket FROM customer ) c GROUP BY bucket
Why does it work? Because:
- In the derived table,
FROM
happens first, and then theWINDOW
is calculated, then the bucket isSELECT
ed. - The outer
SELECT
can now treat the result of this window function calculation like any ordinary table in theFROM
clause, thenGROUP BY
an ordinary column, then aggregate, thenSELECT
Let’s review our original examples with an explanation why they work or why they don’t.
-- Wrong: Because aggregate functions are calculated -- *after* GROUP BY, and WHERE is applied *before* GROUP BY SELECT first_name, count(*) FROM customer WHERE count(*) > 1 GROUP BY first_name -- logical order -- available columns after operation ------------------------------------------------------------- FROM customer -- customer.* WHERE ??? > 1 -- customer.* (count not yet available!) GROUP BY first_name -- first_name (customer.* for aggs only) <aggregate> count(*) -- first_name, count SELECT first_name, count -- first_name, count -- Correct: Because aggregate functions are calculated -- *after* GROUP BY but *before* HAVING, so they're -- available to the HAVING clause. SELECT first_name, count(*) FROM customer GROUP BY first_name HAVING count(*) > 1 -- logical order -- available columns after operation ------------------------------------------------------------- FROM customer -- customer.* GROUP BY first_name -- first_name (customer.* for aggs only) <aggregate> count(*) -- first_name, count HAVING count > 1 -- first_name, count SELECT first_name, count -- first_name, count -- Correct: Both SELECT and ORDER BY are applied *after* -- the aggregation step, so aggregate function results are -- available SELECT first_name, count(*) FROM customer GROUP BY first_name ORDER BY count(*) DESC -- logical order -- available columns after operation ------------------------------------------------------------- FROM customer -- customer.* GROUP BY first_name -- first_name (customer.* for aggs only) <aggregate> count(*) -- first_name, count SELECT first_name, count -- first_name, count ORDER BY count DESC -- first_name, count -- Wrong: Because the GROUP BY clause creates groups of -- first names, and all the remaining customer columns -- are aggregated into a list, which is only visiblbe to -- aggregate functions SELECT first_name, last_name, count(*) FROM customer GROUP BY first_name -- logical order -- available columns after operation ----------------------------------------------------------------- FROM customer -- customer.* GROUP BY first_name -- first_name (customer.* for aggs only) <aggregate> count(*) -- first_name, count -- first_name, count (last_name removed) SELECT first_name, ???, count -- Correct: Because now, we're using an aggregate function -- to access one of the columns that have been put into that -- list of columns that are otherwise no longer available -- after the GROUP BY clause SELECT first_name, MAX(last_name), count(*) FROM customer GROUP BY first_name -- logical order -- available columns after operation ----------------------------------------------------------------- FROM customer -- customer.* GROUP BY first_name -- first_name (customer.* for aggs only) -- first_name, max, count <aggregate> MAX(last_name), count(*) -- first_name, max, count SELECT first_name, max, count -- Wrong: Because we still cannot access the last name column -- which is in that list after the GROUP BY clause. SELECT first_name || ' ' || last_name, count(*) FROM customer GROUP BY first_name -- logical order -- available columns after operation ----------------------------------------------------------------- FROM customer -- customer.* GROUP BY first_name -- first_name (customer.* for aggs only) <aggregate> count(*) -- first_name, count -- first_name, count (last_name removed) SELECT first_name || ' ' || ???, count -- Correct: Because we can access the last name column from -- aggregate functions, which can see that list SELECT first_name || ' ' || MAX(last_name), count(*) FROM customer GROUP BY first_name -- logical order -- available columns after operation ----------------------------------------------------------------- FROM customer -- customer.* GROUP BY first_name -- first_name (customer.* for aggs only) -- first_name, max, count <aggregate> MAX(last_name), count(*) -- first_name, max, count (no last_name) SELECT first_name || ' ' || max, count -- Correct: Because both GROUP BY columns and aggregated -- columns are available to aggregate functions SELECT MAX(first_name || ' ' || last_name), count(*) FROM customer GROUP BY first_name -- logical order -- available columns after operation ----------------------------------------------------------------- FROM customer -- customer.* GROUP BY first_name -- first_name (customer.* for aggs only) -- first_name, max, count <aggregate> MAX(first_name || ' ' || last_name), count(*) SELECT max, count -- first_name, max, count
Always think about the logical order of operations
If you’re not a frequent SQL writer, the syntax can indeed be confusing. Especially GROUP BY
and aggregations “infect” the rest of the entire SELECT
clause, and things get really weird. When confronted with this weirdness, we have two options:
- Get mad and scream at the SQL language designers
- Accept our fate, close our eyes, forget about the snytax and remember the logical operations order
I generally recommend the latter, because then things start making a lot more sense, including the beautiful cumulative daily revenue calculation below, which nests the daily revenue (SUM(amount)
aggregate function) inside of the cumulative revenue (SUM(...) OVER (...)
window function):
SELECT payment_date, SUM(SUM(amount)) OVER (ORDER BY payment_date) AS revenue FROM payment GROUP BY payment_date
… because aggregations logically happen before window functions.
Caveat: ORDER BY clause
There are some caveats around the ORDER BY
clause, which might be contributing to further confusion. By default, continue assuming that the logical order of operations is correct. But then, there are some special cases, in particular:
- In the absence of a
DISTINCT
clause - In the absence of set operations like
UNION
You can reference expressions in ORDER BY
, which are not projected by SELECT
. The following query is perfectly fine in most databases:
SELECT first_name, last_name FROM actor ORDER BY actor_id
There’s a “virtual” / implicit ACTOR_ID
projection, as if we had written:
SELECT first_name, last_name, actor_id FROM actor ORDER BY actor_id
But then, removed the ACTOR_ID
column again from the result. This is very convenient, although it might lead to some confusion about the semantics and the order of operations. Specifically, you cannot use e.g. DISTINCT
in such a situation. The following query is invalid:
SELECT DISTINCT first_name, last_name FROM actor ORDER BY actor_id -- Oops
Because, what if there are two actors by the same name but with very different IDs? The ordering would now be undefined.
With set operations, it is even more clear why this isn’t permitted:
SELECT first_name, last_name FROM actor UNION SELECT first_name, last_name FROM customer ORDER BY actor_id -- Oops
In this case, the ACTOR_ID
column isn’t present on the CUSTOMER
table, so the query makes no sense at all.
Further reading
Want to learn more? We also have these articles for you to read: