A nice little gem in PostgreSQL’s SQL syntax is the DISTINCT ON
clause, which is as powerful as it is esoteric.
In a previous post, we’ve blogged about some caveats to think of when DISTINCT
and ORDER BY
are used together. The bigger picture can be seen in our article about the logical order of operations in SQL SELECT.
The PostgreSQL documentation explains it well:
SELECT DISTINCT ON ( expression [, ...] )
keeps only the first row of each set of rows where the given expressions evaluate to equal. TheDISTINCT ON
expressions are interpreted using the same rules as forORDER BY
. […] For example:
SELECT DISTINCT ON (location) location, time, report
FROM weather_reports
ORDER BY location, time DESC;
retrieves the most recent weather report for each location. […]
Again, this is quite esoteric as the distinct-ness is now decided only based on the columns listed separately in parentheses. For all the other columns, only the first row according to ORDER BY
is projected. The SQL language is probably one of the only ones where the syntactic order of operations has absolutely nothing to do with the logical order of operations, and this DISTINCT ON
syntax isn’t helping. In a more straightforward language design, the above statement could read, instead:
FROM weather_reports
WINDOW w AS (PARTITION BY location ORDER BY time DESC)
SELECT
location,
FIRST_VALUE (time) OVER w AS time,
FIRST_VALUE (report) OVER w AS report
DISTINCT
ORDER BY location
In other words, we would execute these operations in the following logical order:
FROM
: Access the weather_reports tableWINDOW
: Specify “windows” or “groups” in our data set, grouping by location and ordering contents of each group by time, descendingly (using the term “WINDOW” is no accident as we’ll see afterwards)SELECT
: Project the location (which is the grouping) and per group, the first values of time / report ordered by timeDISTINCT
: Remove all the duplicates, because the above operation will produce the same time and report value for each record that shares the same locationORDER BY
: Finally, order the results per grouping
That’s what really happens, and incidentally, the above synthetic SQL syntax matches the actual logical order of operations in the SQL language, so translating it back to an actual SQL statement would yield:
SELECT DISTINCT
location,
FIRST_VALUE (time) OVER w AS time,
FIRST_VALUE (report) OVER w AS report
FROM weather_reports
WINDOW w AS (PARTITION BY location ORDER BY time DESC)
ORDER BY location
If your database doesn’t support the WINDOW
clause, just expand it into the individual window functions. E.g. in Oracle, write:
SELECT DISTINCT
location,
FIRST_VALUE (time) OVER (PARTITION BY location ORDER BY time DESC),
FIRST_VALUE (report) OVER (PARTITION BY location ORDER BY time DESC)
FROM weather_reports
ORDER BY location
From a readability perspective, I would definitely prefer the standard SQL syntax over DISTINCT ON
.
Want to play around with it? Here’s some sample data:
create table weather_reports (location text, time date, report text);
insert into weather_reports values ('X', DATE '2000-01-01', 'X1');
insert into weather_reports values ('X', DATE '2000-01-02', 'X2');
insert into weather_reports values ('X', DATE '2000-01-03', 'X3');
insert into weather_reports values ('Y', DATE '2000-01-03', 'Y1');
insert into weather_reports values ('Y', DATE '2000-01-05', 'Y2');
insert into weather_reports values ('Z', DATE '2000-01-04', 'Z1');
The result being:
|location|time |report| |--------|----------|------| |X |2000-01-03|X3 | |Y |2000-01-05|Y2 | |Z |2000-01-04|Z1 |
Notice that jOOQ already supports PostgreSQL DISTINCT ON
and in the future, we might emulate it for other dialects using the above technique: https://github.com/jOOQ/jOOQ/issues/3564