DISTINCTis something like a function, and that it can take parenthesised arguments. Just recently, I’ve seen this Stack Overflow question where the OP was looking for a way to express this in jOOQ:
Notice the parentheses around
SELECT DISTINCT (emp.id), emp.fname, emp.name FROM employee emp;
(emp.id), which look as though this is some special kind of
DISTINCTusage, which is akin to a
DISTINCTfunction. The idea is often that:
- The behaviour is somewhat different from omitting the parentheses
- The performance is faster, because only the ID needs to be considered for distinctness
This is incorrectThese claims are incorrect, of course. There is no semantic or performance difference between the two. The parentheses are merely parentheses around a column expression, in a similar way as you would use parentheses to influence operator precedence. Think of it this way:
In the above example, we do not apply a “DISTINCT function” to the expression
SELECT DISTINCT (emp.id + 1) * 2, emp.fname, emp.name FROM employee emp;
emp.id + 1. We merely placed parentheses around a column expression
emp.id + 1to make sure the addition happens before the multiplication. The
DISTINCToperator happens after the projection, always. If SQL had used a more logical syntax, rather than following English grammar (it was originally called Structured English QUEry Language, or SEQUEL), then we would write the OP’s statement like this:
FROM employee SELECT id, fname, name DISTINCT
DISTINCToperation always happens after the projection (
SELECTclause content), and is applied to the entirety of the projection. There is no way in standard SQL to apply distinctness only to parts of the projection (there is in PostgreSQL, see further down). To clarify this a bit more, I recommend reading our previous blog post about the logical order of operations in SQL, and how DISTINCT and ORDER BY are related.
What would it mean anyway?We can revert the question back to the OP and ask ourselves, what would it mean for a
DISTINCToperation to apply to only one column, anyway? Let’s assume this data set:
|id |fname|name| |---|-----|----| |1 |A |A | |1 |B |B |If we applied
DISTINCTonly to the ID column (and didn’t project anything else), clearly, we’d only get one row as a result:
SELECT DISTINCT id FROM employee
|id | |---| |1 |But if we wanted to also project
NAME, which row would “win”? Would we display the first or the second row, or any random row? The behaviour would be undefined, and SQL doesn’t like undefined behaviour, so this is not possible. The only reasonable application of
DISTINCTis always on the entire projection.
Exception: PostgreSQLFortunately (or to add more to syntactic confusion: unfortunately), PostgreSQL has implemented an extension to the SQL standard. With
DISTINCT ON, it is effectively possible to apply distinctness only to parts of the projection:
The output is now what the OP desired (but couldn’t use, because they were using MySQL):
WITH emp (id, fname, name) AS ( VALUES (1, 'A', 'A'), (1, 'B', 'B') ) SELECT DISTINCT ON (id) id, fname, name FROM emp ORDER BY id, fname, name
|id |fname|name | |-----------|-----|-----| |1 |A |A |I personally don’t like
DISTINCT ON. While it is very useful, no doubt, it makes something that is already very difficult to explain to SQL beginners even more complicated. With a “more reasonable” syntax, the query would be written like this:
With this syntactic order of operation, there would be no doubt about the semantics of
FROM emp SELECT id, fname, name ORDER BY id, fname, name DISTINCT ON (id)
7 thoughts on “SQL DISTINCT is not a function”
We used to have both SELECT DISTINCT and SELECT ALL operators. The latter one defaulted to the SELECT that we use today.
The standard GROUP BY clause can express this, but if the other selected columns are not already aggregate expressions then something like MIN() is needed to make a clear winner from multiple different names for each id.
That’s treacherous, because you’ll get a per-column MIN() evaluation not a per-row. For example, if your data is this:
Your approach would lead to this result, which is probably not desireable:
I am also not a fan of DISTINCT ON (…). ORDER BY. I prefer to think about such cases in terms of filtered ranking function:
You folks are spoiled with that QUALIFY clause. I really wish this became part of the standard. It makes a lot of sense, too
DISTINCT and DISTINCT ON are often a bad code smell as I’ve experienced them in production code, as it’s common to take a short cut toward the real constraints using it.
GROUP BY and thoroughly modern fun like windowing functions often signify of more complete working out of the constraints.
This isn’t to say there’s never a place for it, just that when you see it, it’s a signal to examine carefully.
I can see DISTINCT being a sign of this problem, but DISTINCT ON? The clause is sophisticated and complicated enough that I would not have expected it to be used as much by people who don’t model their databases properly…