Site icon Java, SQL and jOOQ.

A Beginner’s Guide to the True Order of SQL Operations

The SQL language is very intuitive. Until it isn’t. Over the years, a lot of people have criticised the SQL language for a variety of reasons. For instance: IDEs cannot easily guess what auto completion options to offer, because as long as you don’t specify the FROM clause, there are no tables in scope (yet):

-- Don't you wish this would be completed to first_name?
SELECT first_na...

-- Aaah, now it works:
SELECT first_na...
FROM customer

These things are weird, because the lexical order of operations does not match the logical order of operations. We humans may sometimes (often) intuitively understand this ordering difference. E.g. we know that we’re about to select from the customer table. But the IDE doesn’t know this.

GROUP BY contributes the most confusion

When a junior developer / SQL beginner starts working with SQL, quite quickly, they will find out about aggregation and GROUP BY. And they’ll quickly write things like:

SELECT count(*)
FROM customer

Yay, we have 200 customers! And then:

SELECT count(*)
FROM customer
WHERE first_name = 'Steve'

Wow, 90 of them are called Steve! Interesting. Let’s find out how many we have per name…

SELECT first_name, count(*)
FROM customer
GROUP BY first_name

Ahaa!
FIRST_NAME   COUNT
------------------
Steve        90
Jane         80
Joe          20
Janet        10
Very nice. But are they all the same? Let’s check out the last name, too

SELECT first_name, last_name, count(*)
FROM customer
GROUP BY first_name

Oops!
ORA-00979: not a GROUP BY expression
Jeez, what does it mean? (note, unfortunately, MySQL users that do not use the STRICT mode will still get a result here with arbitrary last names!, so a new MySQL user won’t understand their mistake) How do you easily explain this to a SQL newbie? It seems obvious to “pros”, but is it really obvious? Is it obvious enough that you can explain it easily to a junior? Think about it. Why are each one of these statements semantically correct or wrong?

-- Wrong
SELECT first_name, count(*)
FROM customer
WHERE count(*) > 1
GROUP BY first_name

-- Correct
SELECT first_name, count(*)
FROM customer
GROUP BY first_name
HAVING count(*) > 1

-- Correct
SELECT first_name, count(*)
FROM customer
GROUP BY first_name
ORDER BY count(*) DESC

-- Wrong
SELECT first_name, last_name, count(*)
FROM customer
GROUP BY first_name

-- Correct
SELECT first_name, MAX(last_name), count(*)
FROM customer
GROUP BY first_name

-- Wrong
SELECT first_name || ' ' || last_name, count(*)
FROM customer
GROUP BY first_name

-- Correct
SELECT first_name || ' ' || MAX(last_name), count(*)
FROM customer
GROUP BY first_name

-- Correct
SELECT MAX(first_name || ' ' || last_name), count(*)
FROM customer
GROUP BY first_name

The problem is syntax related

The SQL syntax works in a similar way like the English language. It is a command. We start commands with verbs. The verb is SELECT (or INSERT, UPDATE, DELETE, CREATE, DROP, etc. etc.) Unfortunately, human language is incredibly ill-suited for the much more formal world of programming. While it offers some consolation to new users (possibly non-programmers) who are absolute beginners, it just makes stuff hard for everyone else. All the different SQL clauses have extremely complex interdependencies. For instance: The list is endless. If you’re interested, you can read the SQL standard documents and check out how many weird and complicated inter-dependencies there exist between the many clauses of the SELECT statement.

Can this ever be understood?

Luckily, yes! There’s a simple trick, which I’m always explaining to the delegates that visit my SQL Masterclass. The lexical (syntactical) order of SQL operations (clauses) does not correspond at all to the logical order of operations (although, sometimes, they do coincidentally). Thanks to modern optimisers, the order also doesn’t correspond to the actual order of operations, so we really have: syntactical -> logical -> actual order, but let’s leave that aside for now. The logical order of operations is the following (for “simplicity” I’m leaving out vendor specific things like CONNECT BY, MODEL, MATCH_RECOGNIZE, PIVOT, UNPIVOT and all the others): There, we have it. It makes total sense. And if you ever want to do something that is not in the “right order”, the simplest trick is always to resort to a derived table. E.g. when you want to group on a window function:

-- Doesn't work, cannot put window functions in GROUP BY
SELECT ntile(4) ORDER BY (age) AS bucket, MIN(age), MAX(age)
FROM customer
GROUP BY ntile(4) ORDER BY (age)

-- Works:
SELECT bucket, MIN(age), MAX(age)
FROM (
  SELECT age, ntile(4) ORDER BY (age) AS bucket
  FROM customer
) c
GROUP BY bucket

Why does it work? Because: Let’s review our original examples with an explanation why they work or why they don’t.

-- Wrong: Because aggregate functions are calculated
-- *after* GROUP BY, and WHERE is applied *before* GROUP BY
SELECT first_name, count(*)
FROM customer
WHERE count(*) > 1
GROUP BY first_name

-- logical order         -- available columns after operation
-------------------------------------------------------------
FROM customer            -- customer.*
WHERE ??? > 1            -- customer.* (count not yet available!)
GROUP BY first_name      -- first_name (customer.* for aggs only)
<aggregate> count(*)     -- first_name, count
SELECT first_name, count -- first_name, count



-- Correct: Because aggregate functions are calculated
-- *after* GROUP BY but *before* HAVING, so they're 
-- available to the HAVING clause.
SELECT first_name, count(*)
FROM customer
GROUP BY first_name
HAVING count(*) > 1

-- logical order         -- available columns after operation
-------------------------------------------------------------
FROM customer            -- customer.*
GROUP BY first_name      -- first_name (customer.* for aggs only)
<aggregate> count(*)     -- first_name, count
HAVING count > 1         -- first_name, count
SELECT first_name, count -- first_name, count



-- Correct: Both SELECT and ORDER BY are applied *after*
-- the aggregation step, so aggregate function results are 
-- available
SELECT first_name, count(*)
FROM customer
GROUP BY first_name
ORDER BY count(*) DESC

-- logical order         -- available columns after operation
-------------------------------------------------------------
FROM customer            -- customer.*
GROUP BY first_name      -- first_name (customer.* for aggs only)
<aggregate> count(*)     -- first_name, count
SELECT first_name, count -- first_name, count
ORDER BY count DESC      -- first_name, count



-- Wrong: Because the GROUP BY clause creates groups of
-- first names, and all the remaining customer columns
-- are aggregated into a list, which is only visiblbe to
-- aggregate functions
SELECT first_name, last_name, count(*)
FROM customer
GROUP BY first_name

-- logical order         -- available columns after operation
-----------------------------------------------------------------
FROM customer            -- customer.*
GROUP BY first_name      -- first_name (customer.* for aggs only)
<aggregate> count(*)     -- first_name, count
                         -- first_name, count (last_name removed)
SELECT first_name, ???, count 



-- Correct: Because now, we're using an aggregate function
-- to access one of the columns that have been put into that
-- list of columns that are otherwise no longer available
-- after the GROUP BY clause
SELECT first_name, MAX(last_name), count(*)
FROM customer
GROUP BY first_name

-- logical order         -- available columns after operation
-----------------------------------------------------------------
FROM customer            -- customer.*
GROUP BY first_name      -- first_name (customer.* for aggs only)
                         -- first_name, max, count
<aggregate> MAX(last_name), count(*) 
                         -- first_name, max, count
SELECT first_name, max, count



-- Wrong: Because we still cannot access the last name column
-- which is in that list after the GROUP BY clause.
SELECT first_name || ' ' || last_name, count(*)
FROM customer
GROUP BY first_name

-- logical order         -- available columns after operation
-----------------------------------------------------------------
FROM customer            -- customer.*
GROUP BY first_name      -- first_name (customer.* for aggs only)
<aggregate> count(*)     -- first_name, count
                         -- first_name, count (last_name removed)
SELECT first_name || ' ' || ???, count 




-- Correct: Because we can access the last name column from
-- aggregate functions, which can see that list
SELECT first_name || ' ' || MAX(last_name), count(*)
FROM customer
GROUP BY first_name

-- logical order         -- available columns after operation
-----------------------------------------------------------------
FROM customer            -- customer.*
GROUP BY first_name      -- first_name (customer.* for aggs only)
                         -- first_name, max, count
<aggregate> MAX(last_name), count(*)  
                         -- first_name, max, count (no last_name)
SELECT first_name || ' ' || max, count




-- Correct: Because both GROUP BY columns and aggregated
-- columns are available to aggregate functions
SELECT MAX(first_name || ' ' || last_name), count(*)
FROM customer
GROUP BY first_name

-- logical order         -- available columns after operation
-----------------------------------------------------------------
FROM customer            -- customer.*
GROUP BY first_name      -- first_name (customer.* for aggs only)
                         -- first_name, max, count
<aggregate> MAX(first_name || ' ' || last_name), count(*)
SELECT max, count        -- first_name, max, count

Always think about the logical order of operations

If you’re not a frequent SQL writer, the syntax can indeed be confusing. Especially GROUP BY and aggregations “infect” the rest of the entire SELECT clause, and things get really weird. When confronted with this weirdness, we have two options: I generally recommend the latter, because then things start making a lot more sense, including the beautiful cumulative daily revenue calculation below, which nests the daily revenue (SUM(amount) aggregate function) inside of the cumulative revenue (SUM(...) OVER (...) window function):

SELECT
  payment_date,
  SUM(SUM(amount)) OVER (ORDER BY payment_date) AS revenue
FROM payment
GROUP BY payment_date

… because aggregations logically happen before window functions.

Caveat: ORDER BY clause

There are some caveats around the ORDER BY clause, which might be contributing to further confusion. By default, continue assuming that the logical order of operations is correct. But then, there are some special cases, in particular: You can reference expressions in ORDER BY, which are not projected by SELECT. The following query is perfectly fine in most databases:

SELECT first_name, last_name
FROM actor
ORDER BY actor_id

There’s a “virtual” / implicit ACTOR_ID projection, as if we had written:

SELECT first_name, last_name, actor_id
FROM actor
ORDER BY actor_id

But then, removed the ACTOR_ID column again from the result. This is very convenient, although it might lead to some confusion about the semantics and the order of operations. Specifically, you cannot use e.g. DISTINCT in such a situation. The following query is invalid:

SELECT DISTINCT first_name, last_name
FROM actor
ORDER BY actor_id -- Oops

Because, what if there are two actors by the same name but with very different IDs? The ordering would now be undefined. With set operations, it is even more clear why this isn’t permitted:

SELECT first_name, last_name
FROM actor
UNION
SELECT first_name, last_name
FROM customer
ORDER BY actor_id -- Oops

In this case, the ACTOR_ID column isn’t present on the CUSTOMER table, so the query makes no sense at all.

Further reading

Want to learn more? We also have these articles for you to read:
Exit mobile version