SQL is a verbose language, and one of the most verbose features are
window functions.
In a
stack overflow question that I’ve encountered recently, someone asked to calculate the difference between the first and the last value in a time series for any given day:
Input
volume tstamp
---------------------------
29011 2012-12-28 09:00:00
28701 2012-12-28 10:00:00
28830 2012-12-28 11:00:00
28353 2012-12-28 12:00:00
28642 2012-12-28 13:00:00
28583 2012-12-28 14:00:00
28800 2012-12-29 09:00:00
28751 2012-12-29 10:00:00
28670 2012-12-29 11:00:00
28621 2012-12-29 12:00:00
28599 2012-12-29 13:00:00
28278 2012-12-29 14:00:00
Desired output
first last difference date
------------------------------------
29011 28583 428 2012-12-28
28800 28278 522 2012-12-29
How to write the query?
Notice that the value and timestamp progression do not correlate as it may appear. So, there is not a rule that if
Timestamp2 > Timestamp1
then
Value2 < Value1
. Otherwise, this simple query would work (using PostgreSQL syntax):
SELECT
max(volume) AS first,
min(volume) AS last,
max(volume) - min(volume) AS difference,
CAST(tstamp AS DATE) AS date
FROM t
GROUP BY CAST(tstamp AS DATE);
There are several ways to find the first and last values within a group that do not involve window functions. For example:
- In Oracle, you can use the FIRST and LAST functions, which for some arcane reason are not written
FIRST(...) WITHIN GROUP (ORDER BY ...)
or LAST(...) WITHIN GROUP (ORDER BY ...)
, like other sorted set aggregate functions, but some_aggregate_function(...) KEEP (DENSE_RANK FIRST ORDER BY ...)
. Go figure
- In PostgreSQL, you could use the
DISTINCT ON
syntax along with ORDER BY
and LIMIT
More details about the various approaches can be found here:
How to Write Efficient TOP N Queries in SQL
The best performing approach would be to use an aggregate function like Oracle’s, but few databases have this function. So, we’ll resort to using the
FIRST_VALUE
and LAST_VALUE
window functions:
SELECT DISTINCT
first_value(volume) OVER (
PARTITION BY CAST(tstamp AS DATE)
ORDER BY tstamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS first,
last_value(volume) OVER (
PARTITION BY CAST(tstamp AS DATE)
ORDER BY tstamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last,
first_value(volume) OVER (
PARTITION BY CAST(tstamp AS DATE)
ORDER BY tstamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
- last_value(volume) OVER (
PARTITION BY CAST(tstamp AS DATE)
ORDER BY tstamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS diff,
CAST(tstamp AS DATE) AS date
FROM t
ORDER BY CAST(tstamp AS DATE)
Oops 🤔
That doesn’t look too readable. But it will yield the correct result. Granted, we could wrap the definition for the columns
FIRST
and
LAST
in a derived table, but that would still leave us with two repetitions of the window definition:
PARTITION BY CAST(tstamp AS DATE)
ORDER BY tstamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
WINDOW clause to the rescue
Luckily, at least 3 databases have implemented the SQL standard
WINDOW
clause:
- MySQL
- PostgreSQL
- Sybase SQL Anywhere
The above query can be refactored to this one:
SELECT DISTINCT
first_value(volume) OVER w AS first,
last_value(volume) OVER w AS last,
first_value(volume) OVER w
- last_value(volume) OVER w AS diff,
CAST(tstamp AS DATE) AS date
FROM t
WINDOW w AS (
PARTITION BY CAST(tstamp AS DATE)
ORDER BY tstamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
ORDER BY CAST(tstamp AS DATE)
Notice how I can specify a window name with a window specification in a similar way as I can define a common table expression (
WITH
clause):
WINDOW
<window-name> AS (<window-specification>)
{ ,<window-name> AS (<window-specification>)... }
Not only can I reuse entire specifications, I could also build a specification from a partial specification, and reuse only parts. My previous query could have been rewritten as such:
SELECT DISTINCT
first_value(volume) OVER w3 AS first,
last_value(volume) OVER w3 AS last,
first_value(volume) OVER w3
- last_value(volume) OVER w3 AS diff,
CAST(tstamp AS DATE) AS date
FROM t
WINDOW
w1 AS (PARTITION BY CAST(tstamp AS DATE)),
w2 AS (w1 ORDER BY tstamp),
w3 AS (w2 ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
ORDER BY CAST(tstamp AS DATE)
Each window specification can be created from scratch, or be based on a previously defined window specification. Note this is also true when referencing the window definition. If I wanted to reuse the
PARTITION BY
clause and the
ORDER BY
clause, but change the
FRAME
clause (
ROWS ...
), then I could have written this:
SELECT DISTINCT
first_value(volume) OVER (
w2 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS first,
last_value(volume) OVER (
w2 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
) AS last,
first_value(volume) OVER (
w2 ROWS UNBOUNDED PRECEDING
) - last_value(volume) OVER (
w2 ROWS BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING
) AS diff,
CAST(tstamp AS DATE) AS date
FROM t
WINDOW
w1 AS (PARTITION BY CAST(tstamp AS DATE)),
w2 AS (w1 ORDER BY tstamp)
ORDER BY CAST(tstamp AS DATE)
What if my database doesn’t support the WINDOW clause?
In that case, you have to either manually write the window specification on each window function, or you use a SQL builder like jOOQ, which can emulate the window clause:
You can try this translation online on our website:
https://www.jooq.org/translateLike this:
Like Loading...
Hi Lukas,
thanks for the blog!
It seems part of the BETWEEN specification fell out on this line (#9 in last example in
section “WINDOW clause to the rescue”):
The query works (MySQL 8.0) if I add “AND CURRENT ROW”
Thanks,
Dag
Thanks a lot for the pointer, Dag. Fixed. What I meant to write there is
w3 ROWS UNBOUNDED PRECEDING
, i.e. use the syntax that specifies only the lower bound of the window.Starting from SQLite 3.25.0(2018-09-18 ) also supports WINDOW clause/windowed functions.
https://www.sqlite.org/windowfunctions.html
so we have at least 4 DBs :)
You’re right, thanks for the pointer