How SQL DISTINCT and ORDER BY are Related

One of the things that confuse SQL users all the time is how DISTINCT and ORDER BY are related in a SQL query.

The Basics

Running some queries against the Sakila database, most people quickly understand:

SELECT DISTINCT length FROM film

This returns results in an arbitrary order, because the database can (and might apply hashing rather than ordering to remove duplicates):

length |
-------|
129    |
106    |
120    |
171    |
138    |
80     |
...

Most people also understand:

SELECT length FROM film ORDER BY length

This will give us duplicates, but in order:

length |
-------|
46     |
46     |
46     |
46     |
46     |
47     |
47     |
47     |
47     |
47     |
47     |
47     |
48     |
...

And, of course, we can combine the two:

SELECT DISTINCT length FROM film ORDER BY length

Resulting in…

length |
-------|
46     |
47     |
48     |
49     |
50     |
51     |
52     |
53     |
54     |
55     |
56     |
...

Then why doesn’t this work?

Maybe somewhat intuitively, we may want to order the lengths differently, e.g. by title:

SELECT DISTINCT length FROM film ORDER BY title

Most databases fail this query with an exception like Oracle’s:

ORA-01791: not a SELECTed expression

At first sight, this seems funny, because this works after all:

SELECT length FROM film ORDER BY title

Yielding:

length |
-------|
86     |
48     |
50     |
117    |
130    |
...

We could add the title to illustrate the ordering

length |title                       |
-------|----------------------------|
86     |ACADEMY DINOSAUR            |
48     |ACE GOLDFINGER              |
50     |ADAPTATION HOLES            |
117    |AFFAIR PREJUDICE            |
130    |AFRICAN EGG                 |

So, how are these different?

We have to rewind and check out the logical order of SQL operations (as opposed to the syntactic order). And always remember, this is the logical order, not the actual order executed by the optimiser.

When we write something like this:

SELECT DISTINCT length FROM film ORDER BY length

The logical order of operations is:

  • FROM clause, loading the FILM table
  • SELECT clause, projecting the LENGTH column
  • DISTINCT clause, removing distinct tuples (with projected LENGTH columns)
  • ORDER BY clause, ordering by the LENGTH column

If we look at this step by step, we have:

Step 1: SELECT * FROM film

The intermediary data set is something like:

film_id |title                       |length | ...
--------|----------------------------|-------| ...
1       |ACADEMY DINOSAUR            |86     | ...
2       |ACE GOLDFINGER              |48     | ...
3       |ADAPTATION HOLES            |50     | ...
4       |AFFAIR PREJUDICE            |117    | ...
5       |AFRICAN EGG                 |130    | ...
...     |...                         |...    | ...

Step 2: SELECT length …

The intermediary data set is something like:

length |
-------|
86     |
48     |
50     |
117    |
130    |
...
86     | <-- duplicate

Step 3: SELECT DISTINCT length …

Now we’re getting a new random order (due to hashing) and no duplicates anymore:

length |
-------|
129    |
106    |
120    |
171    |
138    |
...

Step 4: … ORDER BY length

And we’re getting:

length |
-------|
46     |
47     |
48     |
49     |
50     |
...

It seems obvious.

So why did this work?

Remember, this query worked:

SELECT length FROM film ORDER BY title

Even if after projecting the LENGTH column, it seems as though it is no longer available for sorting, it really is, according to the SQL standard and to common sense. There is a concept called extended sort key columns in the SQL standard, which means the above query has a slightly different order of operations (apart from the fact that there is no DISTINCT operation):

  • FROM clause, loading the FILM table
  • SELECT clause, projecting the LENGTH column from the select list and the TITLE from the extended sort key columns
  • ORDER BY clause, ordering by the TITLE column
  • SELECT clause (implicit), projecting only the LENGTH column, discarding the TITLE column

Again, this is what happens logically. Database optimisers may choose other ways to implement this. By example:

Step 1: SELECT * FROM film

Same as before

film_id |title                       |length | ...
--------|----------------------------|-------| ...
1       |ACADEMY DINOSAUR            |86     | ...
2       |ACE GOLDFINGER              |48     | ...
3       |ADAPTATION HOLES            |50     | ...
4       |AFFAIR PREJUDICE            |117    | ...
5       |AFRICAN EGG                 |130    | ...
...     |...                         |...    | ...

Step 2: SELECT length, title…

We get that synthetic extended sort key column TITLE along with the LENGTH column that we requested

length |title                       |
-------|----------------------------|
86     |ACADEMY DINOSAUR            |
114    |ALABAMA DEVIL               |
50     |ADAPTATION HOLES            |
117    |AFFAIR PREJUDICE            |
168    |ANTITRUST TOMATOES          |
...

Step 3: … ORDER BY title

… we can now order by that column

length |title                       |
-------|----------------------------|
86     |ACADEMY DINOSAUR            |
48     |ACE GOLDFINGER              |
50     |ADAPTATION HOLES            |
117    |AFFAIR PREJUDICE            |
130    |AFRICAN EGG                 |
...

Step 4: SELECT length

… and finally discard it, because we never wanted it

length |
-------|
86     |
48     |
50     |
117    |
130    |

So why can’t we use DISTINCT?

If we try to run this:

SELECT DISTINCT length FROM film ORDER BY title

We would get an additional DISTINCT operation in our logical set of operations:

  • FROM clause, loading the FILM table
  • SELECT clause, projecting the LENGTH column from the select list and the TITLE from the extended sort key columns
  • DISTINCT clause, removing duplicate (LENGTH, TITLE) values… Ooops
  • ORDER BY clause, ordering by the TITLE column
  • SELECT clause (implicit), projecting only the LENGTH column, discarding the TITLE column

The problem is, since we have synthetically added the extended sort key column TITLE to the projection in order to be able to ORDER BY it, DISTINCT wouldn’t have the same semantics anymore as can be seen here:

SELECT count(*)
FROM (
  SELECT DISTINCT length FROM film
) t;

SELECT count(*)
FROM (
  SELECT DISTINCT length, title FROM film
) t;

Yielding

140
1000

All titles are distinct. There is no way this query can be executed reasonably. Either DISTINCT doesn’t work (because the added extended sort key column changes its semantics), or ORDER BY doesn’t work (because after DISTINCT we can no longer access the extended sort key column).

A more constructed example. T contains this data:

CREATE TABLE t (a INT, b INT);
INSERT INTO t VALUES (1, 1);
INSERT INTO t VALUES (1, 2);
INSERT INTO t VALUES (2, 3);
INSERT INTO t VALUES (1, 4);
INSERT INTO t VALUES (2, 5);
A   B
-----
1   1
1   2
2   3
1   4
2   5

What would this query produce?

SELECT DISTINCT a FROM t ORDER BY b;

Clearly, we should only get 2 rows with values 1, 2, because of DISTINCT a:

A 
--
1
2

Now, how do we order these by B? There are 3 values of B associated A = 1 and 2 values of B associated with A = 2:

A   B
------------------
1   Any of 1, 2, 4
2   Any of 3, 5

Should we get 1, 2 or 2, 1 as a result? Impossible to tell.

But there are some exceptions

The way I read the SQL standard, the following exception should be possible. The SQL standard ISO/IEC 9075-2:2016(E), 7.17 <query expression>, Syntax Rules 28) d) i) 6) references the “Left normal form derivation”. But I may be reading this wrong, see also a discussion on the PostgreSQL mailing list:
https://www.postgresql.org/message-id/20030819103859.L69440-100000%40megazone.bigpanda.com

In any case, it still makes sense to me. For instance, we can form expressions on the columns in the select list. This is totally fine in MySQL (strict mode) and Oracle:

SELECT DISTINCT length 
FROM film 
ORDER BY mod(length, 10), length;

It will produce

length |
-------|
50     |
60     |
70     |
80     |
90     |
100    |
110    |
120    |
130    |
140    |
150    |
160    |
170    |
180    |
51     |
61     |
71     |

PostgreSQL doesn’t allow this because the expression MOD(LENGTH, 10) is not in the select list. How to interpret this? We’re looking again at the order of SQL operations:

  • FROM clause, loading the FILM table
  • SELECT clause, projecting the LENGTH column from the select list. MOD(LENGTH, 10) does not have to be put in the extended sort key columns, because it can be fully derived from the select list.
  • DISTINCT clause, removing duplicate LENGTH values … all fine, because we don’t have the verboten extended sort key columns
  • ORDER BY clause, ordering by the mod(LENGTH, 10), LENGTH columns. Totally fine, because we can derive all of these order by expressions from expressions in the select list

Makes sense, right?

Back to our constructed table T:

A   B
-----
1   1
1   2
2   3
1   4
2   5

We are allowed to write:

SELECT DISTINCT a, b FROM t ORDER BY a - b;

We would get:

A   B
-----
1   4
2   5
2   3
1   2
1   1

Again, the order by expressions can be derived completely from the select list. This also works in Oracle:

SELECT DISTINCT a - b FROM t ORDER BY abs(a - b);

The select list contains a column A - B, so we can derive any ORDER BY expression from it. But these don’t work:

SELECT DISTINCT a - b FROM t ORDER BY a;
SELECT DISTINCT a - b FROM t ORDER BY b;
SELECT DISTINCT a - b FROM t ORDER BY b - a;

It is easy to build an intuition for why these don’t work. Clearly, the data set we want is:

A - B  A             B             B - A
------------------------------------------
-3     Any of 1, 2   Any of 4, 5   3
-1     Any of 2, 1   Any of 3, 2   1
 0     Any of 1      Any of 1      0

Now, how are we supposed to order these by A, B or B - A? It looks as though we should be able to sort by B - A in this case. We could derive a complicated transformation of expressions that can be reasonably transformed into each other, such as A - B = -(B - A), but this simply isn’t practical. The expression in the projection is A - B, and that’s the only expression we can re-use in the ORDER BY. For example, we could even do this in Oracle:

SELECT DISTINCT a - b FROM t ORDER BY abs((a - b) + (a - b));

Or start using aliases:

SELECT DISTINCT a - b AS x FROM t ORDER BY abs(x + x);

Conclusion

The SQL language is quirky. This is mostly because the syntactical order of operations doesn’t match the logical order of operations. The syntax is meant to be human readable (remember Structured English Query Language?) but when reasoning about a SQL statement, we would often like to directly write down the logical order of operations.

In this article, we haven’t even touched the implications of adding

  • GROUP BY
  • TOP / LIMIT / FETCH
  • UNION

Which add more fun rules to what’s possible and what isn’t. Our previous article on the true logical order of SQL operations explains this completely.

Need more explanation? Check this out.

7 thoughts on “How SQL DISTINCT and ORDER BY are Related

  1. Actually, I would use the following explanation.

    SELECT DISTINCT length FROM film
    

    …is exactly the same as

    SELECT length FROM film GROUP BY length
    

    So it’s obvious that you can write

    SELECT length FROM film GROUP BY length ORDER BY length
    

    …or just by referring field in SELECT clause by number:

    SELECT length FROM film GROUP BY length ORDER BY 1
    

    However, the below query is invalid because “title” is neither a part of grouping nor a result of aggregate function (like MAX or MIN) :

    SELECT length FROM film GROUP BY length ORDER BY title
    

    …because you can’t write even such “unordred” query

    SELECT length, title FROM film GROUP BY length
     -- it's ambiguous what row will yields "title", it's not a part of grouping.
    

    However, if we apply aggregation to”title”(if it ever makes sense) we will get the following perfectly valid query:

    SELECT length, MAX(title) FROM film GROUP BY length
    

    …so it’s valid to write

    SELECT length, MAX(title) FROM film GROUP BY length ORDER BY MAX(title)
    

    …or referring by a column number

    SELECT length, MAX(title) FROM film GROUP BY length ORDER BY 2
    

    …or even exclude MAX(title) from the result set

    SELECT length FROM film GROUP BY length ORDER BY MAX(title)
    
    1. Actually, I would use the following explanation.

      SELECT DISTINCT length FROM film
      …is exactly the same as

      SELECT length FROM film GROUP BY length

      That’s like saying WHERE is the same thing as HAVING. While you can often achieve the same things, they’re not really exactly the same thing. For example, the GROUP BY clause is located before the WINDOW clause in the logical order of SQL operations, which means you cannot use any window functions in GROUP BY / HAVING. But you can use them in SELECT and thus filter on them using DISTINCT. See: https://blog.jooq.org/2016/12/09/a-beginners-guide-to-the-true-order-of-sql-operations

      So, while it is useful to know that DISTINCT and GROUP BY work in a similar way, it is misleading (especially for beginners) to claim that they’re exactly the same.

      1. That’s like saying WHERE is the same thing as HAVING
        You are exaggerating. In the context of the article’s example filtering out non-unique rows (DISTINCT) is the same as “reducing” original rowset to unique rows via GROUP BY. Ok, I agree that “almost the same” here is a better statement than “exactly the same”, and, again, in the context given.

  2. Thanks, this article is really helpful!
    Can you tell me (or point me to somewhere that I can read about) what the purpose of “t” is in this query?

    SELECT length
    FROM (
      SELECT length, MIN(title) title
      FROM film
      GROUP BY length
    ) t
    ORDER BY title
    
    1. Thanks for your message. Some RDBMS (but not all, e.g. not Oracle) require that derived tables (subqueries in the FROM clause) have an alias. To play save, even if it is not really useful, I jsut always add it. This way, this query will also work in MySQL, PostgreSQL, SQL Server, and other RDBMS.

      Hope this helps

  3. Thanks for such a detailed explanation @lukaseder especially the detailed step by step explanation. Helped a lot!
    Though for the postgreSQL “distinct” don’t agree with the explanation: the three sql statements are not equivalent;

    --SELECT DISTINCT length FROM film ORDER BY title;
    --SELECT DISTINCT ON (title) length FROM film ORDER BY title;
    --SELECT length FROM (   SELECT length, MIN(title) title   FROM film   GROUP BY length ) t ORDER BY title;
    

    Had doubt for the third query and did the following operations on a postgreSQL (EnterpriseDB) database:

    -------------------------------------------------Output Text-Start---------------------------------------------------------------
    SQL> SELECT * FROM SOAPS;
    
    LENGTH               TITLE
    
    
    
    10                   TITLE A
    20                   TITLE B
    30                   TITLE C
    40                   TITLE D
    50                   TITLE E
    10                   TITLE F
    20                   TITLE G
    
    7 rows retrieved.
    
    SQL> SELECT DISTINCT LENGTH FROM SOAPS ORDER BY TITLE;
    
    LENGTH
    
    10
    20
    30
    40
    50
    10
    20
    
    7 rows retrieved.
    
    SQL> SELECT DISTINCT ON (TITLE) LENGTH FROM SOAPS ORDER BY TITLE;
    
    LENGTH
    
    10
    20
    30
    40
    50
    10
    20
    
    7 rows retrieved.
    
    SQL> SELECT LENGTH FROM (SELECT LENGTH, MIN(title) title FROM SOAPS GROUP BY LENGTH) t ORDER BY title;
    
    LENGTH
    
    10
    20
    30
    40
    50
    
    -------------------------------------------------Output Text- Ends---------------------------------------------------------------
    

    As you can notice, the third query is yielding a distinct set of result as opposed to the the first two. If i’m not missing any environment level properties, i suppose it is not wrong to draw the conclusion that the three queries are not equivalent.

    Would love to hear from you on this.

    Thanks, again!Though

    1. Thanks for your comment. Indeed, the examples were wrong. I think I meant to inverse length and title somehow. Unfortunately, I don’t have time in the near future to fix this, so I’ll just remove that section of the article, which doesn’t really add too much value anyway.

      Thanks again for your thorough review!

Leave a Reply