PostgreSQL 11’s Support for SQL Standard GROUPS and EXCLUDE Window Function Clauses

Exciting discovery when playing around with PostgreSQL 11! New SQL standard window function clauses have been supported. If you want to play with this, you can do so very easily using docker:

docker pull postgres:11
docker run --name POSTGRES11 -e POSTGRES_PASSWORD=postgres -d postgres:11
docker run -it --rm --link POSTGRES11:postgres postgres psql -h postgres -U postgres

The frame clause

When working with window functions, in some cases you want to add the optional frame clause. For example, to get a sliding average over your data, you will write:


SELECT 
  payment_date,
  amount,
  avg(amount) OVER (
    ORDER BY payment_date, payment_id
    ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
  )::DECIMAL(10, 2),
  array_agg(amount) OVER (
    ORDER BY payment_date, payment_id
    ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
  )
FROM payment;

As always I will be running queries against the Sakila database. The above query yields:

payment_date        |amount |avg  |array_agg                   |
--------------------|-------|-----|----------------------------|
2005-05-24 22:53:30 |2.99   |3.32 |          {2.99,2.99,3.99}  |
2005-05-24 22:54:33 |2.99   |3.74 |     {2.99,2.99,3.99,4.99}  |
2005-05-24 23:03:39 |3.99   |4.39 |{2.99,2.99,3.99,4.99,6.99}  |
2005-05-24 23:04:41 |4.99   |3.99 |{2.99,3.99,4.99,6.99,0.99}  |
2005-05-24 23:05:21 |6.99   |3.79 |{3.99,4.99,6.99,0.99,1.99}  |
2005-05-24 23:08:07 |0.99   |3.99 |{4.99,6.99,0.99,1.99,4.99}  |
2005-05-24 23:11:53 |1.99   |3.99 |{6.99,0.99,1.99,4.99,4.99}  |
2005-05-24 23:31:46 |4.99   |3.79 |{0.99,1.99,4.99,4.99,5.99}  |

The array_agg function helps display how the sliding average came to be. For each average value, we’re looking 2 rows ahead and 2 rows behind in the ordered window. In the above query, I’m using the optional frame clause to specify the frame size. It has three “modes” or “units”:

<window frame units> ::=
  ROWS
| RANGE
| GROUPS

Almost all databases that support window functions support the first two unit types. To my knowledge, only PostgreSQL 11 and H2 1.4.198 now also supports GROUPS. The difference is rather simple to explain:

ROWS counts the exact number of rows in the frame.
RANGE performs logical windowing where we don’t count the number of rows, but look for a value offset.
GROUPS counts all groups of tied rows within the window.

I think this is best explained by example. Let’s look at payments with payment timestamps truncated to the hour:


WITH hourly_payment AS (
  SELECT 
    payment_id,
    date_trunc('h', payment_date) AS hour,
    amount
  FROM payment
)
SELECT *
FROM hourly_payment
ORDER BY hour;

This gives us:

payment_id |hour                |amount |
-----------|--------------------|-------|
12377      |2005-05-24 22:00:00 |2.99   | \  Tied group
3504       |2005-05-24 22:00:00 |2.99   | /

6440       |2005-05-24 23:00:00 |4.99   | \
11032      |2005-05-24 23:00:00 |3.99   |  |
8987       |2005-05-24 23:00:00 |4.99   |  | Tied group
6003       |2005-05-24 23:00:00 |6.99   |  |
14728      |2005-05-24 23:00:00 |0.99   |  |
7274       |2005-05-24 23:00:00 |1.99   | /

12025      |2005-05-25 00:00:00 |0.99   | \
3831       |2005-05-25 00:00:00 |8.99   |  |
7044       |2005-05-25 00:00:00 |4.99   |  |
8623       |2005-05-25 00:00:00 |9.99   |  | Tied group
3386       |2005-05-25 00:00:00 |4.99   |  |
8554       |2005-05-25 00:00:00 |4.99   |  |
10785      |2005-05-25 00:00:00 |5.99   |  |
9014       |2005-05-25 00:00:00 |6.99   | /

15394      |2005-05-25 01:00:00 |2.99   | \
10499      |2005-05-25 01:00:00 |4.99   |  |
5020       |2005-05-25 01:00:00 |2.99   |  | Tied group
490        |2005-05-25 01:00:00 |0.99   |  |
12305      |2005-05-25 01:00:00 |4.99   | /

11796      |2005-05-25 02:00:00 |4.99   | \
9463       |2005-05-25 02:00:00 |4.99   |  | Tied group
13711      |2005-05-25 02:00:00 |4.99   | /

Now we can see that for each hour, we have several payments. When we order payments by hour, there are some “tied” payments within that hour (or “group”), i.e. the order among payments on 2005-05-24 22:00:00 are not ordered deterministically among themselves. The payment ids are pretty random. Now, if we look at the three window frame units again, how do they behave? ROWS


WITH hourly_payment AS (
  SELECT 
    payment_id,
    date_trunc('h', payment_date) AS hour
  FROM payment
)
SELECT 
  payment_id,
  hour,
  array_agg(payment_id) OVER (
    ORDER BY hour
    ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
  )
FROM hourly_payment
ORDER BY hour;

We can see that the size of the window is always precisely 5 rows (except at the beginning and end of the data set):

payment_id |hour                |array_agg                      |
-----------|--------------------|-------------------------------|
12377      |2005-05-24 22:00:00 |{12377,3504,6440}              |
3504       |2005-05-24 22:00:00 |{12377,3504,6440,11032}        |
6440       |2005-05-24 23:00:00 |{12377,3504,6440,11032,8987}   |
11032      |2005-05-24 23:00:00 |{3504,6440,11032,8987,6003}    |
8987       |2005-05-24 23:00:00 |{6440,11032,8987,6003,14728}   |
6003       |2005-05-24 23:00:00 |{11032,8987,6003,14728,7274}   |
14728      |2005-05-24 23:00:00 |{8987,6003,14728,7274,12025}   |
7274       |2005-05-24 23:00:00 |{6003,14728,7274,12025,3831}   |
12025      |2005-05-25 00:00:00 |{14728,7274,12025,3831,7044}   |
3831       |2005-05-25 00:00:00 |{7274,12025,3831,7044,8623}    |
7044       |2005-05-25 00:00:00 |{12025,3831,7044,8623,3386}    |
8623       |2005-05-25 00:00:00 |{3831,7044,8623,3386,8554}     |
3386       |2005-05-25 00:00:00 |{7044,8623,3386,8554,10785}    |
8554       |2005-05-25 00:00:00 |{8623,3386,8554,10785,9014}    |
10785      |2005-05-25 00:00:00 |{3386,8554,10785,9014,15394}   |
9014       |2005-05-25 00:00:00 |{8554,10785,9014,15394,10499}  |
15394      |2005-05-25 01:00:00 |{10785,9014,15394,10499,5020}  |
10499      |2005-05-25 01:00:00 |{9014,15394,10499,5020,490}    |
5020       |2005-05-25 01:00:00 |{15394,10499,5020,490,12305}   |
490        |2005-05-25 01:00:00 |{10499,5020,490,12305,11796}   |
12305      |2005-05-25 01:00:00 |{5020,490,12305,11796,9463}    |
11796      |2005-05-25 02:00:00 |{490,12305,11796,9463,13711}   |
9463       |2005-05-25 02:00:00 |{12305,11796,9463,13711,8167}  |
13711      |2005-05-25 02:00:00 |{11796,9463,13711,8167,1011}   |

There is no notion of a “group” among the rows in the window. But the problem is that we’re getting random PAYMENT_ID values unless we also add the PAYMENT_ID to the ORDER BY clause. This isn’t really what we want, most of the time, so we use: RANGE


WITH hourly_payment AS (
  SELECT 
    payment_id,
    date_trunc('h', payment_date) AS hour
  FROM payment
)
SELECT 
  payment_id,
  hour,
  EXTRACT(epoch FROM hour) / 3600,
  array_agg(payment_id) OVER (
    ORDER BY EXTRACT(epoch FROM hour) / 3600
    RANGE BETWEEN 2 PRECEDING AND 2 FOLLOWING
  )
FROM hourly_payment
ORDER BY hour;

I have switched from ROWS to RANGE and now the ORDER BY clause works on a number based on the epoch of the hour. What happens now? This now yields:

payment_id |hour                |?column? |array_agg                                                                                                                                                              
-----------|--------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
12377      |2005-05-24 22:00:00 |310270   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014}
3504       |2005-05-24 22:00:00 |310270   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014}

6440       |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}
11032      |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}
8987       |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}
6003       |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}
14728      |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}
7274       |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}

12025      |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
3831       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
7044       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
8623       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
3386       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
8554       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
10785      |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
9014       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}

15394      |2005-05-25 01:00:00 |310273   |{6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245}
10499      |2005-05-25 01:00:00 |310273   |{6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245}
5020       |2005-05-25 01:00:00 |310273   |{6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245}
490        |2005-05-25 01:00:00 |310273   |{6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245}
12305      |2005-05-25 01:00:00 |310273   |{6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245}

11796      |2005-05-25 02:00:00 |310274   |{12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245,14396,13055,15984,9975,8188,5596,2388,7347,11598,6186}
9463       |2005-05-25 02:00:00 |310274   |{12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245,14396,13055,15984,9975,8188,5596,2388,7347,11598,6186}
13711      |2005-05-25 02:00:00 |310274   |{12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245,14396,13055,15984,9975,8188,5596,2388,7347,11598,6186}

I’ve visually separated the rows by their hour and the array aggregation by the “tied” payment_ids, i.e. the payment IDs that have the same hour. Observations:

We get the same aggregation value for the entire set of tied rows, so if in two rows, HOUR is the same, then ARRAY_AGG is the same as well
The window size is now a logical size, no longer an offset size, so we’re going back 2 hours and ahead 2 hours (instead of 2 rows). This is why I’ve extracted epoch and divided it by hour, so I will get consecutive integer values for consecutive hours

The same result could have been achieved using interval types:


WITH hourly_payment AS (
  SELECT 
    payment_id,
    date_trunc('h', payment_date) AS hour
  FROM payment
)
SELECT 
  payment_id,
  hour,
  EXTRACT(epoch FROM hour) / 3600,
  array_agg(payment_id) OVER (
    ORDER BY hour
    RANGE BETWEEN INTERVAL '2 hours' PRECEDING 
              AND INTERVAL '2 hours' FOLLOWING
  )
FROM hourly_payment
ORDER BY hour;

Summary of ROWS, RANGE, GROUPS

The above case was a real world use-case. A more constructed example that might be easier to digest, visually, can be seen here:


WITH t(id, v) AS (
  VALUES (1, 1), (2, 1), (3, 3), (4, 5), (5, 5), (6, 5), (7, 6)
)
SELECT
  id,
  v,
  array_agg(id) OVER rows,
  array_agg(v)  OVER rows,
  array_agg(id) OVER range,
  array_agg(v)  OVER range,
  array_agg(id) OVER groups,
  array_agg(v)  OVER groups
FROM t
WINDOW 
  o AS (ORDER BY v),
  rows AS (o ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING),
  range AS (o RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING),
  groups AS (o GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING)

Notice, I’m using the SQL standard WINDOW clause to be able to name and reuse a repeated window specification. I’ve seen this clause to be supported in:

MySQL 8.0
PostgreSQL
Sybase SQL Anywhere

The query yields:

id |v |array_agg |array_agg |array_agg |array_agg |array_agg     |array_agg     |
---|--|----------|----------|----------|----------|--------------|--------------|
1  |1 |{1,2}     |{1,1}     |{1,2}     |{1,1}     |{1,2,3}       |{1,1,3}       |
2  |1 |{1,2,3}   |{1,1,3}   |{1,2}     |{1,1}     |{1,2,3}       |{1,1,3}       |
3  |3 |{2,3,4}   |{1,3,5}   |{3}       |{3}       |{1,2,3,4,5,6} |{1,1,3,5,5,5} |
4  |5 |{3,4,5}   |{3,5,5}   |{4,5,6,7} |{5,5,5,6} |{3,4,5,6,7}   |{3,5,5,5,6}   |
5  |5 |{4,5,6}   |{5,5,5}   |{4,5,6,7} |{5,5,5,6} |{3,4,5,6,7}   |{3,5,5,5,6}   |
6  |5 |{5,6,7}   |{5,5,6}   |{4,5,6,7} |{5,5,5,6} |{3,4,5,6,7}   |{3,5,5,5,6}   |
7  |6 |{6,7}     |{5,6}     |{4,5,6,7} |{5,5,5,6} |{4,5,6,7}     |{5,5,5,6}     |

Observation:

The ROWS framed window is of size 3 max in this case (1 row preceding, the current row, and 1 row following)
The RANGE framed window is a logical window that looks behind a value of 1 and ahead a value of 1
The GROUPS framed window is of size 3 groups max in this case (1 group preceding, the current group, and 1 group following)

Neat, huh? jOOQ 3.12 will add support for this feature: https://github.com/jOOQ/jOOQ/issues/7646

EXCLUDE clause

This is probably a bit less frequently useful than the new GROUPS clause. There is now a new window frame exclusion clause:

<window frame exclusion> ::=
  EXCLUDE CURRENT ROW
| EXCLUDE GROUP
| EXCLUDE TIES
| EXCLUDE NO OTHERS

It can be used to exclude some rows around the current row from being in the window. I have yet to think of a use case for this. Here’s how it works for: ROWS


WITH t(v) AS (
  VALUES (1), (1), (3), (5), (5), (5), (6)
)
SELECT
  v,
  array_agg(v) OVER (o ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
                       EXCLUDE CURRENT ROW) AS current_row,
  array_agg(v) OVER (o ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE GROUP) AS group,
  array_agg(v) OVER (o ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE TIES) AS ties,
  array_agg(v) OVER (o ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE NO OTHERS) AS no_others
FROM t
WINDOW o AS (ORDER BY v)

Resulting in:

v |current_row |group |ties    |no_others |
--|------------|------|--------|----------|
1 |{1}         |NULL  |{1}     |{1,1}     |
1 |{1,3}       |{3}   |{1,3}   |{1,1,3}   |
3 |{1,5}       |{1,5} |{1,3,5} |{1,3,5}   |
5 |{3,5}       |{3}   |{3,5}   |{3,5,5}   |
5 |{5,5}       |NULL  |{5}     |{5,5,5}   |
5 |{5,6}       |{6}   |{5,6}   |{5,5,6}   |
6 |{5}         |{5}   |{5,6}   |{5,6}     |

As you can see, the window may now be completely empty, which results in NULL being emitted.

Excluding the current row seems obvious
Excluding the current group also
Excluding ties excludes all other rows from the group
Excluding no others is the default, just like when you don’t put this EXCLUDE clause

RANGE The exclusion can be applied to logical windowing as well:


WITH t(v) AS (
  VALUES (1), (1), (3), (5), (5), (5), (6)
)
SELECT
  v,
  array_agg(v) OVER (o RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE CURRENT ROW) AS current_row,
  array_agg(v) OVER (o RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE GROUP) AS group,
  array_agg(v) OVER (o RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE TIES) AS ties,
  array_agg(v) OVER (o RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE NO OTHERS) AS no_others
FROM t
WINDOW o AS (ORDER BY v)

Resulting in:

v |current_row |group   |ties      |no_others |
--|------------|--------|----------|----------|
1 |{1}         |NULL    |{1}       |{1,1}     |
1 |{1}         |NULL    |{1}       |{1,1}     |
3 |NULL        |NULL    |{3}       |{3}       |
5 |{5,5,6}     |{6}     |{5,6}     |{5,5,5,6} |
5 |{5,5,6}     |{6}     |{5,6}     |{5,5,5,6} |
5 |{5,5,6}     |{6}     |{5,6}     |{5,5,5,6} |
6 |{5,5,5}     |{5,5,5} |{5,5,5,6} |{5,5,5,6} |

GROUPS Same for grouped windows:


WITH t(v) AS (
  VALUES (1), (1), (3), (5), (5), (5), (6)
)
SELECT
  v,
  array_agg(v) OVER (o GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE CURRENT ROW) AS current_row,
  array_agg(v) OVER (o GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE GROUP) AS group,
  array_agg(v) OVER (o GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE TIES) AS ties,
  array_agg(v) OVER (o GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE NO OTHERS) AS no_others
FROM t
WINDOW o AS (ORDER BY v)

Resulting in:

v |current_row |group       |ties          |no_others     |
--|------------|------------|--------------|--------------|
1 |{1,3}       |{3}         |{1,3}         |{1,1,3}       |
1 |{1,3}       |{3}         |{1,3}         |{1,1,3}       |
3 |{1,1,5,5,5} |{1,1,5,5,5} |{1,1,3,5,5,5} |{1,1,3,5,5,5} |
5 |{3,5,5,6}   |{3,6}       |{3,5,6}       |{3,5,5,5,6}   |
5 |{3,5,5,6}   |{3,6}       |{3,5,6}       |{3,5,5,5,6}   |
5 |{3,5,5,6}   |{3,6}       |{3,5,6}       |{3,5,5,5,6}   |
6 |{5,5,5}     |{5,5,5}     |{5,5,5,6}     |{5,5,5,6}     |

Needless to say that this clause will be supported in jOOQ 3.12 as well: https://github.com/jOOQ/jOOQ/issues/7647 Bonus points for the reader who can think of a real world use-case for this clause, please leave a comment!

PostgreSQL 11’s Support for SQL Standard GROUPS and EXCLUDE Window Function Clauses

The frame clause

Summary of ROWS, RANGE, GROUPS

EXCLUDE clause

Like this:

Published by lukaseder

Leave a ReplyCancel reply

The frame clause

Summary of ROWS, RANGE, GROUPS

EXCLUDE clause

Like this:

Published by lukaseder

Leave a ReplyCancel reply

Discover more from Java, SQL and jOOQ.