One of the things that confuse SQL users all the time is how DISTINCT
and ORDER BY
are related in a SQL query.
The Basics
Running some queries against the Sakila database, most people quickly understand:
SELECT DISTINCT length FROM film
This returns results in an arbitrary order, because the database can (and might apply hashing rather than ordering to remove duplicates):
length | -------| 129 | 106 | 120 | 171 | 138 | 80 | ...
Most people also understand:
SELECT length FROM film ORDER BY length
This will give us duplicates, but in order:
length | -------| 46 | 46 | 46 | 46 | 46 | 47 | 47 | 47 | 47 | 47 | 47 | 47 | 48 | ...
And, of course, we can combine the two:
SELECT DISTINCT length FROM film ORDER BY length
Resulting in…
length | -------| 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | ...
Then why doesn’t this work?
Maybe somewhat intuitively, we may want to order the lengths differently, e.g. by title:
SELECT DISTINCT length FROM film ORDER BY title
Most databases fail this query with an exception like Oracle’s:
ORA-01791: not a SELECTed expression
At first sight, this seems funny, because this works after all:
SELECT length FROM film ORDER BY title
Yielding:
length | -------| 86 | 48 | 50 | 117 | 130 | ...
We could add the title to illustrate the ordering
length |title | -------|----------------------------| 86 |ACADEMY DINOSAUR | 48 |ACE GOLDFINGER | 50 |ADAPTATION HOLES | 117 |AFFAIR PREJUDICE | 130 |AFRICAN EGG |
So, how are these different?
We have to rewind and check out the logical order of SQL operations (as opposed to the syntactic order). And always remember, this is the logical order, not the actual order executed by the optimiser.
When we write something like this:
SELECT DISTINCT length FROM film ORDER BY length
The logical order of operations is:
FROM
clause, loading theFILM
tableSELECT
clause, projecting theLENGTH
columnDISTINCT
clause, removing distinct tuples (with projectedLENGTH
columns)ORDER BY
clause, ordering by theLENGTH
column
If we look at this step by step, we have:
Step 1: SELECT * FROM film
The intermediary data set is something like:
film_id |title |length | ... --------|----------------------------|-------| ... 1 |ACADEMY DINOSAUR |86 | ... 2 |ACE GOLDFINGER |48 | ... 3 |ADAPTATION HOLES |50 | ... 4 |AFFAIR PREJUDICE |117 | ... 5 |AFRICAN EGG |130 | ... ... |... |... | ...
Step 2: SELECT length …
The intermediary data set is something like:
length | -------| 86 | 48 | 50 | 117 | 130 | ... 86 | <-- duplicate
Step 3: SELECT DISTINCT length …
Now we’re getting a new random order (due to hashing) and no duplicates anymore:
length | -------| 129 | 106 | 120 | 171 | 138 | ...
Step 4: … ORDER BY length
And we’re getting:
length | -------| 46 | 47 | 48 | 49 | 50 | ...
It seems obvious.
So why did this work?
Remember, this query worked:
SELECT length FROM film ORDER BY title
Even if after projecting the LENGTH
column, it seems as though it is no longer available for sorting, it really is, according to the SQL standard and to common sense. There is a concept called extended sort key columns in the SQL standard, which means the above query has a slightly different order of operations (apart from the fact that there is no DISTINCT
operation):
FROM
clause, loading theFILM
tableSELECT
clause, projecting theLENGTH
column from the select list and theTITLE
from the extended sort key columnsORDER BY
clause, ordering by theTITLE
columnSELECT
clause (implicit), projecting only theLENGTH
column, discarding theTITLE
column
Again, this is what happens logically. Database optimisers may choose other ways to implement this. By example:
Step 1: SELECT * FROM film
Same as before
film_id |title |length | ... --------|----------------------------|-------| ... 1 |ACADEMY DINOSAUR |86 | ... 2 |ACE GOLDFINGER |48 | ... 3 |ADAPTATION HOLES |50 | ... 4 |AFFAIR PREJUDICE |117 | ... 5 |AFRICAN EGG |130 | ... ... |... |... | ...
Step 2: SELECT length, title…
We get that synthetic extended sort key column TITLE
along with the LENGTH
column that we requested
length |title | -------|----------------------------| 86 |ACADEMY DINOSAUR | 114 |ALABAMA DEVIL | 50 |ADAPTATION HOLES | 117 |AFFAIR PREJUDICE | 168 |ANTITRUST TOMATOES | ...
Step 3: … ORDER BY title
… we can now order by that column
length |title | -------|----------------------------| 86 |ACADEMY DINOSAUR | 48 |ACE GOLDFINGER | 50 |ADAPTATION HOLES | 117 |AFFAIR PREJUDICE | 130 |AFRICAN EGG | ...
Step 4: SELECT length
… and finally discard it, because we never wanted it
length | -------| 86 | 48 | 50 | 117 | 130 |
So why can’t we use DISTINCT?
If we try to run this:
SELECT DISTINCT length FROM film ORDER BY title
We would get an additional DISTINCT
operation in our logical set of operations:
FROM
clause, loading theFILM
tableSELECT
clause, projecting theLENGTH
column from the select list and theTITLE
from the extended sort key columnsDISTINCT
clause, removing duplicate(LENGTH, TITLE)
values… OoopsORDER BY
clause, ordering by theTITLE
columnSELECT
clause (implicit), projecting only theLENGTH
column, discarding theTITLE
column
The problem is, since we have synthetically added the extended sort key column TITLE
to the projection in order to be able to ORDER BY
it, DISTINCT
wouldn’t have the same semantics anymore as can be seen here:
SELECT count(*)
FROM (
SELECT DISTINCT length FROM film
) t;
SELECT count(*)
FROM (
SELECT DISTINCT length, title FROM film
) t;
Yielding
140 1000
All titles are distinct. There is no way this query can be executed reasonably. Either DISTINCT
doesn’t work (because the added extended sort key column changes its semantics), or ORDER BY
doesn’t work (because after DISTINCT
we can no longer access the extended sort key column).
A more constructed example. T contains this data:
CREATE TABLE t (a INT, b INT);
INSERT INTO t VALUES (1, 1);
INSERT INTO t VALUES (1, 2);
INSERT INTO t VALUES (2, 3);
INSERT INTO t VALUES (1, 4);
INSERT INTO t VALUES (2, 5);
A B ----- 1 1 1 2 2 3 1 4 2 5
What would this query produce?
SELECT DISTINCT a FROM t ORDER BY b;
Clearly, we should only get 2 rows with values 1, 2, because of DISTINCT a
:
A -- 1 2
Now, how do we order these by B
? There are 3 values of B associated A = 1
and 2 values of B associated with A = 2
:
A B ------------------ 1 Any of 1, 2, 4 2 Any of 3, 5
Should we get 1, 2 or 2, 1 as a result? Impossible to tell.
But there are some exceptions
The way I read the SQL standard, the following exception should be possible. The SQL standard ISO/IEC 9075-2:2016(E), 7.17 <query expression>, Syntax Rules 28) d) i) 6) references the “Left normal form derivation”. But I may be reading this wrong, see also a discussion on the PostgreSQL mailing list:
https://www.postgresql.org/message-id/20030819103859.L69440-100000%40megazone.bigpanda.com
In any case, it still makes sense to me. For instance, we can form expressions on the columns in the select list. This is totally fine in MySQL (strict mode) and Oracle:
SELECT DISTINCT length
FROM film
ORDER BY mod(length, 10), length;
It will produce
length | -------| 50 | 60 | 70 | 80 | 90 | 100 | 110 | 120 | 130 | 140 | 150 | 160 | 170 | 180 | 51 | 61 | 71 |
PostgreSQL doesn’t allow this because the expression MOD(LENGTH, 10)
is not in the select list. How to interpret this? We’re looking again at the order of SQL operations:
FROM
clause, loading theFILM
tableSELECT
clause, projecting theLENGTH
column from the select list.MOD(LENGTH, 10)
does not have to be put in the extended sort key columns, because it can be fully derived from the select list.DISTINCT
clause, removing duplicateLENGTH
values … all fine, because we don’t have the verboten extended sort key columnsORDER BY
clause, ordering by themod(LENGTH, 10), LENGTH
columns. Totally fine, because we can derive all of these order by expressions from expressions in the select list
Makes sense, right?
Back to our constructed table T:
A B ----- 1 1 1 2 2 3 1 4 2 5
We are allowed to write:
SELECT DISTINCT a, b FROM t ORDER BY a - b;
We would get:
A B ----- 1 4 2 5 2 3 1 2 1 1
Again, the order by expressions can be derived completely from the select list. This also works in Oracle:
SELECT DISTINCT a - b FROM t ORDER BY abs(a - b);
The select list contains a column A - B
, so we can derive any ORDER BY
expression from it. But these don’t work:
SELECT DISTINCT a - b FROM t ORDER BY a;
SELECT DISTINCT a - b FROM t ORDER BY b;
SELECT DISTINCT a - b FROM t ORDER BY b - a;
It is easy to build an intuition for why these don’t work. Clearly, the data set we want is:
A - B A B B - A ------------------------------------------ -3 Any of 1, 2 Any of 4, 5 3 -1 Any of 2, 1 Any of 3, 2 1 0 Any of 1 Any of 1 0
Now, how are we supposed to order these by A, B
or B - A
? It looks as though we should be able to sort by B - A
in this case. We could derive a complicated transformation of expressions that can be reasonably transformed into each other, such as A - B = -(B - A)
, but this simply isn’t practical. The expression in the projection is A - B
, and that’s the only expression we can re-use in the ORDER BY
. For example, we could even do this in Oracle:
SELECT DISTINCT a - b FROM t ORDER BY abs((a - b) + (a - b));
Or start using aliases:
SELECT DISTINCT a - b AS x FROM t ORDER BY abs(x + x);
Conclusion
The SQL language is quirky. This is mostly because the syntactical order of operations doesn’t match the logical order of operations. The syntax is meant to be human readable (remember Structured English Query Language?) but when reasoning about a SQL statement, we would often like to directly write down the logical order of operations.
In this article, we haven’t even touched the implications of adding
GROUP BY
TOP / LIMIT / FETCH
UNION
Which add more fun rules to what’s possible and what isn’t. Our previous article on the true logical order of SQL operations explains this completely.
Actually, I would use the following explanation.
…is exactly the same as
So it’s obvious that you can write
…or just by referring field in SELECT clause by number:
However, the below query is invalid because “title” is neither a part of grouping nor a result of aggregate function (like MAX or MIN) :
…because you can’t write even such “unordred” query
However, if we apply aggregation to”title”(if it ever makes sense) we will get the following perfectly valid query:
…so it’s valid to write
…or referring by a column number
…or even exclude MAX(title) from the result set
That’s like saying
WHERE
is the same thing asHAVING
. While you can often achieve the same things, they’re not really exactly the same thing. For example, theGROUP BY
clause is located before theWINDOW
clause in the logical order of SQL operations, which means you cannot use any window functions inGROUP BY
/HAVING
. But you can use them inSELECT
and thus filter on them usingDISTINCT
. See: https://blog.jooq.org/2016/12/09/a-beginners-guide-to-the-true-order-of-sql-operationsSo, while it is useful to know that
DISTINCT
andGROUP BY
work in a similar way, it is misleading (especially for beginners) to claim that they’re exactly the same.Thanks, this article is really helpful!
Can you tell me (or point me to somewhere that I can read about) what the purpose of “t” is in this query?
Thanks for your message. Some RDBMS (but not all, e.g. not Oracle) require that derived tables (subqueries in the FROM clause) have an alias. To play save, even if it is not really useful, I jsut always add it. This way, this query will also work in MySQL, PostgreSQL, SQL Server, and other RDBMS.
Hope this helps
Thanks for such a detailed explanation @lukaseder especially the detailed step by step explanation. Helped a lot!
Though for the postgreSQL “distinct” don’t agree with the explanation: the three sql statements are not equivalent;
Had doubt for the third query and did the following operations on a postgreSQL (EnterpriseDB) database:
As you can notice, the third query is yielding a distinct set of result as opposed to the the first two. If i’m not missing any environment level properties, i suppose it is not wrong to draw the conclusion that the three queries are not equivalent.
Would love to hear from you on this.
Thanks, again!Though
Thanks for your comment. Indeed, the examples were wrong. I think I meant to inverse length and title somehow. Unfortunately, I don’t have time in the near future to fix this, so I’ll just remove that section of the article, which doesn’t really add too much value anyway.
Thanks again for your thorough review!