I’m frequently telling developers to put window functions almost everywhere, because they’re so awesome!
One feature that I rarely see in the wild (even though it is extremely useful for reporting) is called “logical windowing” in Oracle, and it’s most useful when combined with INTERVAL ranges. Let’s see what we may want to do.
I have a payment table in my Sakila DVD rental store database. The payment table has payment dates, as can be seen here:
SELECT CAST(payment_date AS TIMESTAMP) ts
FROM payment
ORDER BY ts
Output:
TS
---------------------------
24.05.05 22:53:30.000000000
24.05.05 22:54:33.000000000
24.05.05 23:03:39.000000000
24.05.05 23:04:41.000000000
24.05.05 23:05:21.000000000
24.05.05 23:08:07.000000000
24.05.05 23:11:53.000000000
24.05.05 23:31:46.000000000
25.05.05 00:00:40.000000000
25.05.05 00:02:21.000000000
25.05.05 00:09:02.000000000
25.05.05 00:19:27.000000000
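By the way, if you want to follow along, here’s a minimal sketch of the payment table, assuming the standard Sakila column list with simplified Oracle types (the real table has a few more columns and constraints):

-- A sketch only: standard Sakila columns, types simplified for Oracle
CREATE TABLE payment (
  payment_id   NUMBER       PRIMARY KEY,
  customer_id  NUMBER       NOT NULL,
  staff_id     NUMBER       NOT NULL,
  rental_id    NUMBER,
  amount       NUMBER(5, 2) NOT NULL,
  payment_date DATE         NOT NULL  -- Oracle DATE includes a time part
);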
Now, let’s assume I’m interested in these things:
- How many payments were there in the same hour as any given payment?
- How many payments were there in the same hour before any given payment?
- How many payments were there within one hour before any given payment?
Those are three entirely different questions. The expected result looks like this:
TS                            1  2  3
-------------------------------------------
24.05.05 22:53:30.000000000   2  1  1
24.05.05 22:54:33.000000000   2  2  2
24.05.05 23:03:39.000000000   6  1  3
24.05.05 23:04:41.000000000   6  2  4
24.05.05 23:05:21.000000000   6  3  5
24.05.05 23:08:07.000000000   6  4  6
24.05.05 23:11:53.000000000   6  5  7
24.05.05 23:31:46.000000000   6  6  8
25.05.05 00:00:40.000000000   4  1  7
25.05.05 00:02:21.000000000   4  2  8
25.05.05 00:09:02.000000000   4  3  5
25.05.05 00:19:27.000000000   4  4  5
As you can see, column #1 always contains the same number of payments for any given hour. If we were using GROUP BY TRUNC(payment_date, 'HH24'), it would be easy to see that this is the result we were looking for:
TS                  1
-----------------------------------
24.05.05 22:00:00   2
24.05.05 23:00:00   6
25.05.05 00:00:00   4
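For reference, a sketch of what that GROUP BY query would look like:

-- A sketch of the GROUP BY equivalent of column #1
SELECT TRUNC(payment_date, 'HH24') ts, COUNT(*)
FROM payment
GROUP BY TRUNC(payment_date, 'HH24')
ORDER BY ts

The difference, of course, is that GROUP BY collapses the result to one row per hour, whereas the window function we’ll see below keeps every individual payment row.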
Column #2 is a bit more sophisticated, as it also aggregates the number of payments per hour, but it counts only those payments that precede any given payment. For instance, in the hour of 24.05.05 23:00:00, we have 6 payments (see above), and column #2 numbers them within that hour. E.g. 24.05.05 23:11:53 is the fifth payment within that hour:
TS                            2
-----------------------------------
24.05.05 23:03:39.000000000   1
24.05.05 23:04:41.000000000   2
24.05.05 23:05:21.000000000   3
24.05.05 23:08:07.000000000   4
24.05.05 23:11:53.000000000   5
24.05.05 23:31:46.000000000   6
Finally, column #3 uses a sliding interval. For each payment, it counts how many payments there were in the time range between one hour before that payment and the payment itself. For example, the last payment at 25.05.05 00:19:27 looks back to 24.05.05 23:19:27 and finds 5 payments in that range, including itself. Now, excuse my ASCII art:
TS                              3
-----------------------------------
24.05.05 22:53:30.000000000     1
24.05.05 22:54:33.000000000     2
24.05.05 23:03:39.000000000     3 <------\
24.05.05 23:04:41.000000000     4        |
24.05.05 23:05:21.000000000     5        |
24.05.05 23:08:07.000000000     6        |
24.05.05 23:11:53.000000000     7 <----\ |
24.05.05 23:31:46.000000000     8 <--\ | |
25.05.05 00:00:40.000000000     7    | | |
25.05.05 00:02:21.000000000     8 ---|-|-/
25.05.05 00:09:02.000000000     5 ---|-/
25.05.05 00:19:27.000000000     5 ---/
All of these are really useful for reporting, and all of them can be done rather easily with window functions. I’m using Oracle syntax here:
SELECT
  CAST(payment_date AS TIMESTAMP) ts,

  -- Column 1: Payments in the same hour
  COUNT(*) OVER (
    PARTITION BY TRUNC(payment_date, 'HH24')
  ),

  -- Column 2: Preceding payments in the same hour
  COUNT(*) OVER (
    PARTITION BY TRUNC(payment_date, 'HH24')
    ORDER BY payment_date
  ),

  -- Column 3: Preceding payments within one hour
  COUNT(*) OVER (
    ORDER BY payment_date
    RANGE BETWEEN INTERVAL '1' HOUR PRECEDING
              AND CURRENT ROW
  )
FROM payment
ORDER BY ts
Remember how column #1 is essentially the same thing as running a GROUP BY operation? That’s simple with window functions, too. Just use a single PARTITION BY clause, and you get that grouping by the hour:
COUNT(*) OVER (
  PARTITION BY TRUNC(payment_date, 'HH24')
)
Column #2 is more interesting. Because we’re ordering the window and we’re not specifying any explicit frame clause (ROWS or RANGE), the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is implied. The following two are the same:
COUNT(*) OVER (
  PARTITION BY TRUNC(payment_date, 'HH24')
  ORDER BY payment_date
)

COUNT(*) OVER (
  PARTITION BY TRUNC(payment_date, 'HH24')
  ORDER BY payment_date
  RANGE BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW
)
In other words, we group (PARTITION BY) all payments by the hour, order each group by the actual date, and limit / frame the window to the payments that precede the current payment.
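As an aside: the implicit frame uses RANGE, so payments sharing the exact same payment_date are peers and all get the same count. If you wanted to count physical rows instead, you would spell the frame out with ROWS, sketched like this:

-- A sketch: ROWS counts physical rows, so two payments with an identical
-- payment_date could get different counts, whereas RANGE gives peers the
-- same count
COUNT(*) OVER (
  PARTITION BY TRUNC(payment_date, 'HH24')
  ORDER BY payment_date
  ROWS BETWEEN UNBOUNDED PRECEDING
           AND CURRENT ROW
)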
Column #3 is the most interesting and the least known, however. Unfortunately, it is not yet supported in all databases, even though it is part of the SQL standard.
COUNT(*) OVER (
  ORDER BY payment_date
  RANGE BETWEEN INTERVAL '1' HOUR PRECEDING
            AND CURRENT ROW
)
We’re no longer using the PARTITION BY clause, because we don’t want to group payments into hours. Instead, we want to look back one hour from each individual payment and see how many payments there were in that hour. These “hours” no longer start and end at minute/second 00:00; each one ends at a given payment’s payment_date (where Oracle DATE is really a timestamp).
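Side note, since the comments below bring it up: databases that have added this part of the standard (e.g. PostgreSQL 11 and later) accept essentially the same expression. A minimal sketch, assuming the same payment table:

-- A sketch of column #3 in PostgreSQL 11+ syntax
SELECT payment_date,
       COUNT(*) OVER (
         ORDER BY payment_date
         RANGE BETWEEN INTERVAL '1 hour' PRECEDING
                   AND CURRENT ROW
       )
FROM payment
ORDER BY payment_date;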
This is a really nice feature that is available only with DATE or TIMESTAMP columns, when you pass an INTERVAL data type to the frame clause. I bet, however, that most of you have already had one or two reports where this would have been useful to know!
Want to learn more?
At the risk of being accused of self-marketing, I’ve done a 30-video series on analytic functions (including windowing functions) here.
Cheers,
Connor
That’s a huge risk, but a really interesting-looking series! Thanks for sharing!
I was pleasantly surprised they are present in PostgreSQL: https://www.postgresql.org/docs/9.1/static/tutorial-window.html
Window functions: Yes of course :). But this particular feature (logical windowing over date ranges) is not yet available, unfortunately.
Correction: It was added to PostgreSQL 11