PostgreSQL 11’s Support for SQL Standard GROUPS and EXCLUDE Window Function Clauses

An exciting discovery when playing around with PostgreSQL 11: several new SQL standard window function clauses are now supported. If you want to play with this, you can do so very easily using docker:

docker pull postgres:11
docker run --name POSTGRES11 -e POSTGRES_PASSWORD=postgres -d postgres:11
docker run -it --rm --link POSTGRES11:postgres postgres psql -h postgres -U postgres

See also: https://hub.docker.com/r/_/postgres

The frame clause

When working with window functions, in some cases you want to add the optional frame clause. For example, to get a sliding average over your data, you will write:

SELECT 
  payment_date,
  amount,
  avg(amount) OVER (
    ORDER BY payment_date, payment_id
    ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
  )::DECIMAL(10, 2),
  array_agg(amount) OVER (
    ORDER BY payment_date, payment_id
    ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
  )
FROM payment;

As always I will be running queries against the Sakila database. The above query yields:

payment_date        |amount |avg  |array_agg                   |
--------------------|-------|-----|----------------------------|
2005-05-24 22:53:30 |2.99   |3.32 |          {2.99,2.99,3.99}  |
2005-05-24 22:54:33 |2.99   |3.74 |     {2.99,2.99,3.99,4.99}  |
2005-05-24 23:03:39 |3.99   |4.39 |{2.99,2.99,3.99,4.99,6.99}  |
2005-05-24 23:04:41 |4.99   |3.99 |{2.99,3.99,4.99,6.99,0.99}  |
2005-05-24 23:05:21 |6.99   |3.79 |{3.99,4.99,6.99,0.99,1.99}  |
2005-05-24 23:08:07 |0.99   |3.99 |{4.99,6.99,0.99,1.99,4.99}  |
2005-05-24 23:11:53 |1.99   |3.99 |{6.99,0.99,1.99,4.99,4.99}  |
2005-05-24 23:31:46 |4.99   |3.79 |{0.99,1.99,4.99,4.99,5.99}  |

The array_agg function helps display how the sliding average came to be. For each average value, we’re looking 2 rows ahead and 2 rows behind in the ordered window.

In the above query, I’m using the optional frame clause to specify the frame size. It has three “modes” or “units”:

<window frame units> ::=
  ROWS
| RANGE
| GROUPS

Almost all databases that support window functions support the first two unit types. To my knowledge, only PostgreSQL 11 and H2 1.4.198 now also support GROUPS. The difference is rather simple to explain:

  • ROWS counts the exact number of rows in the frame.
  • RANGE performs logical windowing where we don’t count the number of rows, but look for a value offset.
  • GROUPS counts all groups of tied rows within the window.

I think this is best explained by example. Let’s look at payments with payment timestamps truncated to the hour:

WITH hourly_payment AS (
  SELECT 
    payment_id,
    date_trunc('h', payment_date) AS hour,
    amount
  FROM payment
)
SELECT *
FROM hourly_payment
ORDER BY hour;

This gives us:

payment_id |hour                |amount |
-----------|--------------------|-------|
12377      |2005-05-24 22:00:00 |2.99   | \  Tied group
3504       |2005-05-24 22:00:00 |2.99   | /

6440       |2005-05-24 23:00:00 |4.99   | \
11032      |2005-05-24 23:00:00 |3.99   |  |
8987       |2005-05-24 23:00:00 |4.99   |  | Tied group
6003       |2005-05-24 23:00:00 |6.99   |  |
14728      |2005-05-24 23:00:00 |0.99   |  |
7274       |2005-05-24 23:00:00 |1.99   | /

12025      |2005-05-25 00:00:00 |0.99   | \
3831       |2005-05-25 00:00:00 |8.99   |  |
7044       |2005-05-25 00:00:00 |4.99   |  |
8623       |2005-05-25 00:00:00 |9.99   |  | Tied group
3386       |2005-05-25 00:00:00 |4.99   |  |
8554       |2005-05-25 00:00:00 |4.99   |  |
10785      |2005-05-25 00:00:00 |5.99   |  |
9014       |2005-05-25 00:00:00 |6.99   | /

15394      |2005-05-25 01:00:00 |2.99   | \
10499      |2005-05-25 01:00:00 |4.99   |  |
5020       |2005-05-25 01:00:00 |2.99   |  | Tied group
490        |2005-05-25 01:00:00 |0.99   |  |
12305      |2005-05-25 01:00:00 |4.99   | /

11796      |2005-05-25 02:00:00 |4.99   | \
9463       |2005-05-25 02:00:00 |4.99   |  | Tied group
13711      |2005-05-25 02:00:00 |4.99   | /

Now we can see that for each hour, we have several payments. When we order payments by hour, there are some “tied” payments within that hour (or “group”), i.e. the payments on 2005-05-24 22:00:00 are not ordered deterministically among themselves. The payment IDs are pretty random.

Now, if we look at the three window frame units again, how do they behave?

ROWS

WITH hourly_payment AS (
  SELECT 
    payment_id,
    date_trunc('h', payment_date) AS hour
  FROM payment
)
SELECT 
  payment_id,
  hour,
  array_agg(payment_id) OVER (
    ORDER BY hour
    ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
  )
FROM hourly_payment
ORDER BY hour;

We can see that the size of the window is always precisely 5 rows (except at the beginning and end of the data set):

payment_id |hour                |array_agg                      |
-----------|--------------------|-------------------------------|
12377      |2005-05-24 22:00:00 |{12377,3504,6440}              |
3504       |2005-05-24 22:00:00 |{12377,3504,6440,11032}        |
6440       |2005-05-24 23:00:00 |{12377,3504,6440,11032,8987}   |
11032      |2005-05-24 23:00:00 |{3504,6440,11032,8987,6003}    |
8987       |2005-05-24 23:00:00 |{6440,11032,8987,6003,14728}   |
6003       |2005-05-24 23:00:00 |{11032,8987,6003,14728,7274}   |
14728      |2005-05-24 23:00:00 |{8987,6003,14728,7274,12025}   |
7274       |2005-05-24 23:00:00 |{6003,14728,7274,12025,3831}   |
12025      |2005-05-25 00:00:00 |{14728,7274,12025,3831,7044}   |
3831       |2005-05-25 00:00:00 |{7274,12025,3831,7044,8623}    |
7044       |2005-05-25 00:00:00 |{12025,3831,7044,8623,3386}    |
8623       |2005-05-25 00:00:00 |{3831,7044,8623,3386,8554}     |
3386       |2005-05-25 00:00:00 |{7044,8623,3386,8554,10785}    |
8554       |2005-05-25 00:00:00 |{8623,3386,8554,10785,9014}    |
10785      |2005-05-25 00:00:00 |{3386,8554,10785,9014,15394}   |
9014       |2005-05-25 00:00:00 |{8554,10785,9014,15394,10499}  |
15394      |2005-05-25 01:00:00 |{10785,9014,15394,10499,5020}  |
10499      |2005-05-25 01:00:00 |{9014,15394,10499,5020,490}    |
5020       |2005-05-25 01:00:00 |{15394,10499,5020,490,12305}   |
490        |2005-05-25 01:00:00 |{10499,5020,490,12305,11796}   |
12305      |2005-05-25 01:00:00 |{5020,490,12305,11796,9463}    |
11796      |2005-05-25 02:00:00 |{490,12305,11796,9463,13711}   |
9463       |2005-05-25 02:00:00 |{12305,11796,9463,13711,8167}  |
13711      |2005-05-25 02:00:00 |{11796,9463,13711,8167,1011}   |

There is no notion of a “group” among the rows in the window. But the problem is that we’re getting effectively random PAYMENT_ID values, unless we also add PAYMENT_ID to the ORDER BY clause.
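A minimal sketch of that deterministic variant (simply using PAYMENT_ID as a tie-breaker in the ordered window):

WITH hourly_payment AS (
  SELECT 
    payment_id,
    date_trunc('h', payment_date) AS hour
  FROM payment
)
SELECT 
  payment_id,
  hour,
  array_agg(payment_id) OVER (
    -- payment_id breaks ties deterministically within each hour
    ORDER BY hour, payment_id
    ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
  )
FROM hourly_payment
ORDER BY hour, payment_id;

Pinning down the order like this isn’t really what we want most of the time, though, so instead we use: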

RANGE

WITH hourly_payment AS (
  SELECT 
    payment_id,
    date_trunc('h', payment_date) AS hour
  FROM payment
)
SELECT 
  payment_id,
  hour,
  EXTRACT(epoch FROM hour) / 3600,
  array_agg(payment_id) OVER (
    ORDER BY EXTRACT(epoch FROM hour) / 3600
    RANGE BETWEEN 2 PRECEDING AND 2 FOLLOWING
  )
FROM hourly_payment
ORDER BY hour;

I have switched from ROWS to RANGE, and the ORDER BY clause now operates on a number derived from the epoch of the hour. What happens now?

This now yields:

payment_id |hour                |?column? |array_agg                                                                                                                                                              
-----------|--------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
12377      |2005-05-24 22:00:00 |310270   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014}
3504       |2005-05-24 22:00:00 |310270   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014}

6440       |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}
11032      |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}
8987       |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}
6003       |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}
14728      |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}
7274       |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}

12025      |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
3831       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
7044       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
8623       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
3386       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
8554       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
10785      |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
9014       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}

15394      |2005-05-25 01:00:00 |310273   |{6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245}
10499      |2005-05-25 01:00:00 |310273   |{6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245}
5020       |2005-05-25 01:00:00 |310273   |{6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245}
490        |2005-05-25 01:00:00 |310273   |{6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245}
12305      |2005-05-25 01:00:00 |310273   |{6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245}

11796      |2005-05-25 02:00:00 |310274   |{12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245,14396,13055,15984,9975,8188,5596,2388,7347,11598,6186}
9463       |2005-05-25 02:00:00 |310274   |{12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245,14396,13055,15984,9975,8188,5596,2388,7347,11598,6186}
13711      |2005-05-25 02:00:00 |310274   |{12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245,14396,13055,15984,9975,8188,5596,2388,7347,11598,6186}

I’ve visually separated the rows by their hour and the array aggregation by the “tied” payment_ids, i.e. the payment IDs that have the same hour.

Observations:

  1. We get the same aggregation value for the entire set of tied rows: if HOUR is the same in two rows, then ARRAY_AGG is the same as well.
  2. The window size is now a logical size, no longer a count of rows, so we’re going back 2 hours and ahead 2 hours (instead of 2 rows). This is why I’ve extracted the epoch and divided it by 3600, so I get consecutive integer values for consecutive hours.

The same result could have been achieved using interval types:

WITH hourly_payment AS (
  SELECT 
    payment_id,
    date_trunc('h', payment_date) AS hour
  FROM payment
)
SELECT 
  payment_id,
  hour,
  EXTRACT(epoch FROM hour) / 3600,
  array_agg(payment_id) OVER (
    ORDER BY hour
    RANGE BETWEEN INTERVAL '2 hours' PRECEDING 
              AND INTERVAL '2 hours' FOLLOWING
  )
FROM hourly_payment
ORDER BY hour;

See also this article for details:
https://blog.jooq.org/2016/10/31/a-little-known-sql-feature-use-logical-windowing-to-aggregate-sliding-ranges/

GROUPS

The third frame unit is quite useful, as we can now make the frame span a number of groups of tied rows. In our case, all payments of the same hour are in the same group. So, in order to get a similar result again, we can write:

WITH hourly_payment AS (
  SELECT 
    payment_id,
    payment_date,
    date_trunc('h', payment_date) AS hour
  FROM payment
)
SELECT 
  payment_id,
  hour,
  array_agg(payment_id) OVER (
    ORDER BY hour
    GROUPS BETWEEN 2 PRECEDING AND 2 FOLLOWING
  )
FROM hourly_payment
ORDER BY hour;

In fact, this is not exactly the same result, because if we have gaps in the hours, GROUPS will simply jump over the gaps, whereas RANGE will not.
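A tiny constructed sketch (not from the Sakila data set) makes the difference visible. With the values 1, 2, and 4 (3 is missing), a frame of 1 PRECEDING behaves differently in the two modes:

WITH t(v) AS (VALUES (1), (2), (4))
SELECT
  v,
  -- RANGE looks for values between v - 1 and v: for v = 4 there is no 3, so only {4}
  array_agg(v) OVER (ORDER BY v
    RANGE BETWEEN 1 PRECEDING AND CURRENT ROW) AS range_frame,
  -- GROUPS jumps to the previous group of tied rows: for v = 4 that is the value 2, so {2,4}
  array_agg(v) OVER (ORDER BY v
    GROUPS BETWEEN 1 PRECEDING AND CURRENT ROW) AS groups_frame
FROM t;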

Summary of ROWS, RANGE, GROUPS

The above case was a real world use-case. A more contrived example that might be easier to digest visually can be seen here:

WITH t(id, v) AS (
  VALUES (1, 1), (2, 1), (3, 3), (4, 5), (5, 5), (6, 5), (7, 6)
)
SELECT
  id,
  v,
  array_agg(id) OVER rows,
  array_agg(v)  OVER rows,
  array_agg(id) OVER range,
  array_agg(v)  OVER range,
  array_agg(id) OVER groups,
  array_agg(v)  OVER groups
FROM t
WINDOW 
  o AS (ORDER BY v),
  rows AS (o ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING),
  range AS (o RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING),
  groups AS (o GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING)

Notice, I’m using the SQL standard WINDOW clause to be able to name and reuse a repeated window specification. I’ve seen this clause supported in:

  • MySQL 8.0
  • PostgreSQL
  • Sybase SQL Anywhere

The query yields:

id |v |array_agg |array_agg |array_agg |array_agg |array_agg     |array_agg     |
---|--|----------|----------|----------|----------|--------------|--------------|
1  |1 |{1,2}     |{1,1}     |{1,2}     |{1,1}     |{1,2,3}       |{1,1,3}       |
2  |1 |{1,2,3}   |{1,1,3}   |{1,2}     |{1,1}     |{1,2,3}       |{1,1,3}       |
3  |3 |{2,3,4}   |{1,3,5}   |{3}       |{3}       |{1,2,3,4,5,6} |{1,1,3,5,5,5} |
4  |5 |{3,4,5}   |{3,5,5}   |{4,5,6,7} |{5,5,5,6} |{3,4,5,6,7}   |{3,5,5,5,6}   |
5  |5 |{4,5,6}   |{5,5,5}   |{4,5,6,7} |{5,5,5,6} |{3,4,5,6,7}   |{3,5,5,5,6}   |
6  |5 |{5,6,7}   |{5,5,6}   |{4,5,6,7} |{5,5,5,6} |{3,4,5,6,7}   |{3,5,5,5,6}   |
7  |6 |{6,7}     |{5,6}     |{4,5,6,7} |{5,5,5,6} |{4,5,6,7}     |{5,5,5,6}     |

Observation:

  • The ROWS framed window is of size 3 max in this case (1 row preceding, the current row, and 1 row following)
  • The RANGE framed window is a logical window that looks behind a value of 1 and ahead a value of 1
  • The GROUPS framed window is of size 3 groups max in this case (1 group preceding, the current group, and 1 group following)

Neat, huh?

jOOQ 3.12 will add support for this feature: https://github.com/jOOQ/jOOQ/issues/7646

EXCLUDE clause

This is probably a bit less frequently useful than the new GROUPS clause. There is now a new window frame exclusion clause:

<window frame exclusion> ::=
  EXCLUDE CURRENT ROW
| EXCLUDE GROUP
| EXCLUDE TIES
| EXCLUDE NO OTHERS

It can be used to exclude some rows around the current row from being in the window. I have yet to think of a use case for this. Here’s how it works for:

ROWS

WITH t(v) AS (
  VALUES (1), (1), (3), (5), (5), (5), (6)
)
SELECT
  v,
  array_agg(v) OVER (o ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
                       EXCLUDE CURRENT ROW) AS current_row,
  array_agg(v) OVER (o ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE GROUP) AS group,
  array_agg(v) OVER (o ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE TIES) AS ties,
  array_agg(v) OVER (o ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE NO OTHERS) AS no_others
FROM t
WINDOW o AS (ORDER BY v)

Resulting in:

v |current_row |group |ties    |no_others |
--|------------|------|--------|----------|
1 |{1}         |NULL  |{1}     |{1,1}     |
1 |{1,3}       |{3}   |{1,3}   |{1,1,3}   |
3 |{1,5}       |{1,5} |{1,3,5} |{1,3,5}   |
5 |{3,5}       |{3}   |{3,5}   |{3,5,5}   |
5 |{5,5}       |NULL  |{5}     |{5,5,5}   |
5 |{5,6}       |{6}   |{5,6}   |{5,5,6}   |
6 |{5}         |{5}   |{5,6}   |{5,6}     |

As you can see, the window may now be completely empty, which results in NULL being emitted.

  • Excluding the current row seems obvious
  • Excluding the current group also
  • Excluding ties excludes the other rows of the current row’s group, but keeps the current row itself
  • Excluding no others is the default, just like when you don’t put this EXCLUDE clause
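To make the exclusion semantics a bit more tangible with an aggregate other than array_agg, here’s a minimal sketch (same constructed data set as above) that compares each row against the average of its neighbouring rows only:

WITH t(v) AS (
  VALUES (1), (1), (3), (5), (5), (5), (6)
)
SELECT
  v,
  -- The frame spans the previous and the next row, but not the current row itself
  avg(v) OVER (
    ORDER BY v
    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
    EXCLUDE CURRENT ROW
  ) AS avg_of_neighbours
FROM t;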

RANGE

The exclusion can be applied to logical windowing as well:

WITH t(v) AS (
  VALUES (1), (1), (3), (5), (5), (5), (6)
)
SELECT
  v,
  array_agg(v) OVER (o RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE CURRENT ROW) AS current_row,
  array_agg(v) OVER (o RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE GROUP) AS group,
  array_agg(v) OVER (o RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE TIES) AS ties,
  array_agg(v) OVER (o RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE NO OTHERS) AS no_others
FROM t
WINDOW o AS (ORDER BY v)

Resulting in:

v |current_row |group   |ties      |no_others |
--|------------|--------|----------|----------|
1 |{1}         |NULL    |{1}       |{1,1}     |
1 |{1}         |NULL    |{1}       |{1,1}     |
3 |NULL        |NULL    |{3}       |{3}       |
5 |{5,5,6}     |{6}     |{5,6}     |{5,5,5,6} |
5 |{5,5,6}     |{6}     |{5,6}     |{5,5,5,6} |
5 |{5,5,6}     |{6}     |{5,6}     |{5,5,5,6} |
6 |{5,5,5}     |{5,5,5} |{5,5,5,6} |{5,5,5,6} |

GROUPS

Same for grouped windows:

WITH t(v) AS (
  VALUES (1), (1), (3), (5), (5), (5), (6)
)
SELECT
  v,
  array_agg(v) OVER (o GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE CURRENT ROW) AS current_row,
  array_agg(v) OVER (o GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE GROUP) AS group,
  array_agg(v) OVER (o GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE TIES) AS ties,
  array_agg(v) OVER (o GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE NO OTHERS) AS no_others
FROM t
WINDOW o AS (ORDER BY v)

Resulting in:

v |current_row |group       |ties          |no_others     |
--|------------|------------|--------------|--------------|
1 |{1,3}       |{3}         |{1,3}         |{1,1,3}       |
1 |{1,3}       |{3}         |{1,3}         |{1,1,3}       |
3 |{1,1,5,5,5} |{1,1,5,5,5} |{1,1,3,5,5,5} |{1,1,3,5,5,5} |
5 |{3,5,5,6}   |{3,6}       |{3,5,6}       |{3,5,5,5,6}   |
5 |{3,5,5,6}   |{3,6}       |{3,5,6}       |{3,5,5,5,6}   |
5 |{3,5,5,6}   |{3,6}       |{3,5,6}       |{3,5,5,5,6}   |
6 |{5,5,5}     |{5,5,5}     |{5,5,5,6}     |{5,5,5,6}     |

Needless to say that this clause will be supported in jOOQ 3.12 as well: https://github.com/jOOQ/jOOQ/issues/7647

Bonus points for the reader who can think of a real world use-case for this clause, please leave a comment!

Using UNPIVOT to Traverse a Configuration Table’s Rows and Columns

Imagine you have a configuration table like the following:

CREATE TABLE rule (
  name     VARCHAR2(50)         NOT NULL PRIMARY KEY,
  enabled  NUMBER(1)  DEFAULT 1 NOT NULL CHECK (enabled IN (0,1)),
  priority NUMBER(10) DEFAULT 0 NOT NULL,
  flag1    NUMBER(3)  DEFAULT 0 NOT NULL,
  flag2    NUMBER(3)  DEFAULT 0 NOT NULL,
  flag3    NUMBER(3)  DEFAULT 0 NOT NULL,
  flag4    NUMBER(3)  DEFAULT 0 NOT NULL,
  flag5    NUMBER(3)  DEFAULT 0 NOT NULL
);

It specifies a set of rules that

  • Can be enabled / disabled
  • Can be given a priority among themselves
  • Include a set of flags which correspond to the thing you want to configure (e.g. some check to execute)
  • Those flags can be ordered as well

So, given the following data:

INSERT INTO rule (name, priority, flag1, flag5) 
  VALUES ('RULE 1', 1, 1, 2);
INSERT INTO rule (name, priority, flag2, flag5) 
  VALUES ('RULE 2', 2, 2, 1);
INSERT INTO rule (name, priority, flag3, flag4, flag5) 
  VALUES ('RULE 3', 3, 3, 1, 2);
INSERT INTO rule (name, priority, flag3) 
  VALUES ('RULE 4', 4, 1);

SELECT * FROM rule;

We’ll get our configuration “spreadsheet”:

NAME    ENABLED  PRIORITY  FLAG1  FLAG2  FLAG3  FLAG4  FLAG5
------------------------------------------------------------
RULE 1  1        1         1      0      0      0      2
RULE 2  1        2         0      2      0      0      1
RULE 3  1        3         0      0      3      1      2
RULE 4  1        4         0      0      1      0      0

This form is really useful to edit the configuration. If we want to activate FLAG2 in RULE 1, we just go to that cell in some SQL tool like Oracle SQL Developer, and change the value.
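If you would rather script such an edit than type it into a grid, the equivalent UPDATE is a one-liner (the value 1 is chosen arbitrarily for illustration):

-- Turn on FLAG2 for RULE 1
UPDATE rule
SET flag2 = 1
WHERE name = 'RULE 1';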

But reading the configuration is a bit different. FLAG1 through FLAG5 are not nicely normalised. How to read the data as though it were normalised?

Using UNPIVOT

In Oracle and SQL Server, we can use UNPIVOT for this use case. I’m using Oracle syntax in this blog post. SQL Server’s is just slightly different. Consider the following query:

SELECT name, flag, value
FROM rule
UNPIVOT (
  value FOR flag IN (
    flag1,  
    flag2,  
    flag3,  
    flag4,  
    flag5
  )
)
WHERE enabled = 1
AND value > 0
ORDER BY priority, value;

This will result in the following result set:

NAME    FLAG    VALUE
---------------------
RULE 1  FLAG1   1
RULE 1  FLAG5   2
RULE 2  FLAG5   1
RULE 2  FLAG2   2
RULE 3  FLAG4   1
RULE 3  FLAG5   2
RULE 3  FLAG3   3
RULE 4  FLAG3   1

In this representation, the rules are ordered by priority, and the flags are ordered by their respective value within a rule. The flags that are not turned on (value 0) are simply omitted. This form is much easier to traverse procedurally, when “consuming” the configuration.
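Since I mentioned that SQL Server’s UNPIVOT syntax is just slightly different, here’s a sketch of what the equivalent query might look like there (untested, assuming an analogous table with integer flag columns):

SELECT name, flag, value
FROM rule
UNPIVOT (
  value FOR flag IN (flag1, flag2, flag3, flag4, flag5)
) AS u
WHERE enabled = 1
AND value > 0
ORDER BY priority, value;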

How does it work?

In principle, UNPIVOT is just syntactic sugar for a bunch of UNION ALL subqueries. We could have written our query like this, instead:

SELECT name, flag, value
FROM (
  SELECT rule.*, 'FLAG1' AS flag, FLAG1 AS value FROM rule
  UNION ALL
  SELECT rule.*, 'FLAG2' AS flag, FLAG2 AS value FROM rule
  UNION ALL
  SELECT rule.*, 'FLAG3' AS flag, FLAG3 AS value FROM rule
  UNION ALL
  SELECT rule.*, 'FLAG4' AS flag, FLAG4 AS value FROM rule
  UNION ALL
  SELECT rule.*, 'FLAG5' AS flag, FLAG5 AS value FROM rule
) rule
WHERE enabled = 1
AND value > 0
ORDER BY priority, value;

Which is decidedly more code. It’s also more work for the database. The execution plans are different (I’m using Oracle 12.2.0.1.0):

UNPIVOT version – single table access

---------------------------------------------
| Id  | Operation            | Name | Rows  |
---------------------------------------------
|   0 | SELECT STATEMENT     |      |       |
|   1 |  SORT ORDER BY       |      |     5 |
|*  2 |   VIEW               |      |     5 |
|   3 |    UNPIVOT           |      |       |
|*  4 |     TABLE ACCESS FULL| RULE |     1 |
---------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(("unpivot_view_005"."VALUE">0 AND 
              "unpivot_view_005"."ENABLED"=1))
   4 - filter("RULE"."ENABLED"=1)

UNION ALL version – multi table access

---------------------------------------------
| Id  | Operation            | Name | Rows  |
---------------------------------------------
|   0 | SELECT STATEMENT     |      |       |
|   1 |  SORT ORDER BY       |      |     8 |
|   2 |   VIEW               |      |     8 |
|   3 |    UNION-ALL         |      |       |
|*  4 |     TABLE ACCESS FULL| RULE |     1 |
|*  5 |     TABLE ACCESS FULL| RULE |     1 |
|*  6 |     TABLE ACCESS FULL| RULE |     2 |
|*  7 |     TABLE ACCESS FULL| RULE |     1 |
|*  8 |     TABLE ACCESS FULL| RULE |     3 |
---------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   4 - filter(("RULE"."ENABLED"=1 AND "FLAG1">0))
   5 - filter(("RULE"."ENABLED"=1 AND "FLAG2">0))
   6 - filter(("RULE"."ENABLED"=1 AND "FLAG3">0))
   7 - filter(("RULE"."ENABLED"=1 AND "FLAG4">0))
   8 - filter(("RULE"."ENABLED"=1 AND "FLAG5">0))

We can also measure the time it takes to execute these queries thousands of times. The following shows resulting times relative to the fastest execution (1):

Run 1, Statement 1 : 1.155
Run 1, Statement 2 : 1.88056

Run 2, Statement 1 : 1.04333
Run 2, Statement 2 : 1.95148

Run 3, Statement 1 : 1.02185
Run 3, Statement 2 : 1.86074

Run 4, Statement 1 : 1
Run 4, Statement 2 : 1.85241

Run 5, Statement 1 : 1.0263
Run 5, Statement 2 : 1.82944

The UNION ALL version is consistently about 2x slower on this very small data set. This is significant in the use case presented here, as a configuration table is probably read many times per day.

Knowing when a rule starts and when it ends

The real world use case that is behind this blog post also needed to know when a rule started and when it ended. I.e., which flag entry was the first and which was the last of the rule. This was easy in the non-normalised representation where each rule was a single row.

In the normalised version, we can use LEAD() and LAG().

Using this query:

SELECT 
  CASE WHEN lag(name, 1, 'NULL') 
            OVER (ORDER BY priority, value) != name 
       THEN 1 ELSE 0 END rule_begin,
  CASE WHEN lead(name, 1, 'NULL') 
            OVER (ORDER BY priority, value) != name 
       THEN 1 ELSE 0 END rule_end,
  name, flag, value
FROM rule
UNPIVOT (
  value FOR flag IN (
    flag1,  
    flag2,  
    flag3,  
    flag4,  
    flag5
  )
)
WHERE enabled = 1
AND value > 0
ORDER BY priority, value;

We’re now getting (with some visual emphasis):

RULE_BEGIN  RULE_END  NAME    FLAG    VALUE
-------------------------------------------
1           0         RULE 1  FLAG1   1
0           1         RULE 1  FLAG5   2

1           0         RULE 2  FLAG5   1
0           1         RULE 2  FLAG2   2

1           0         RULE 3  FLAG4   1
0           0         RULE 3  FLAG5   2
0           1         RULE 3  FLAG3   3

1           1         RULE 4  FLAG3   1

LEAD() looks ahead one row to see if the rule name there is different from the rule name on the current row.

LAG() looks behind one row to see if the rule name there is different from the rule name on the current row.

That’s it – very simple. The window functions part of this example is part of my 10 SQL Tricks talk, which I highly recommend you watch.

Truth First, or Why You Should Mostly Implement Database First Designs

In this much overdue article, I will explain why I think that in almost all cases, you should implement a “database first” design in your application’s data models, rather than a “Java first” design (or whatever your client language is), the latter approach leading to a long road of pain and suffering, once your project grows.

This article is inspired by a recent Stack Overflow question.

Interesting reddit discussions on /r/java and /r/programming.

Code generation

To my surprise, a small group of first time jOOQ users seem to be appalled by the fact that jOOQ heavily relies on source code generation. No one keeps you from using jOOQ the way you want and you don’t have to use code generation, but the default way to use jOOQ according to the manual is to start with a (legacy) database schema, reverse engineer that using jOOQ’s code generator to get a bunch of classes representing your tables, and then to write type safe queries against those tables:

for (Record2<String, String> record : DSL.using(configuration)
//   ^^^^^^^^^^^^^^^^^^^^^^^ Type information derived from the 
//   generated code referenced from the below SELECT clause

       .select(ACTOR.FIRST_NAME, ACTOR.LAST_NAME)
//           vvvvv ^^^^^^^^^^^^  ^^^^^^^^^^^^^^^ Generated names
       .from(ACTOR)
       .orderBy(1, 2)) {
    // ...
}

The code is generated either manually outside of the build, or automatically with every build. For instance, such a re-generation could follow immediately after a Flyway database migration, which can also be run either manually or automatically.

Source code generation

There are different philosophies, advantages, and disadvantages regarding these manual/automatic approaches, which I don’t want to discuss in this article. But essentially, the point of generated code is that it provides a Java representation of something that we take for granted (a “truth”) either within or outside of our system. In a way, compilers do the same thing when they generate byte code, machine code, or some other type of source code from the original sources – we get a representation of our “truth” in a different language, for whatever reason.

There are many such code generators out there. For instance, XJC can generate Java code from XSD or WSDL files. The principle is always the same:

  • There is some truth (internal or external), like a specification, data model, etc.
  • We need a local representation of that truth in our programming language

And it almost always makes sense to generate the latter, to avoid redundancy.

Type providers and annotation processing

Noteworthy: Another, more modern approach to jOOQ’s particular code generation use-case would be Type Providers, as implemented by F#, in case of which the code is generated by the compiler while compiling. It never really exists in its source form. A similar (but less sophisticated) tool in Java is annotation processing, e.g. Lombok.

In a way, this does the same thing except:

  • You don’t see the generated code (perhaps that’s less appalling to some?)
  • You must ensure the types can be provided, i.e. the “truth” must always be available. Easy in the case of Lombok, which annotates the “truth”. A bit more difficult with database models, which rely on an always available live connection.

What’s the problem with code generation?

Apart from the tricky question whether to trigger code generation manually or automatically, some people seem to think that code must not be generated at all. The reason I hear the most is the idea that it is difficult to set up in a build pipeline. And yes, that is true. There is extra infrastructure overhead. Especially if you’re new to a certain product (like jOOQ, or JAXB, or Hibernate, etc.), setting up an environment takes time you would rather spend learning the API itself and getting value out of it.

If the overhead of learning how the code generator works is too high, then indeed, the API failed to make the code generator easy to use (and later on, to customise). That should be a high priority for any such API. But that’s the only argument against code generation. Other than that, it makes absolutely no sense at all to hand-write the local representation of the internal or external truth.

Many people argue that they don’t have time for that stuff. They need to ship their MVPs. They can finalise their build pipelines later. It’s the proverbial cart with square wheels, whose builders are too busy to stop and fit the round ones.

“But Hibernate / JPA makes coding Java first easy”

Yes that’s true. And it’s both a bliss and a curse for Hibernate and its users. In Hibernate, you can just write a couple of entities, such as:

@Entity
class Book {
  @Id
  int id;
  String title;
}

And you’re almost set. Let Hibernate generate the boring “details” of how to define this entity in your SQL dialect’s DDL:

CREATE TABLE book (
  id INTEGER GENERATED ALWAYS AS IDENTITY,
  title VARCHAR(50),

  CONSTRAINT pk_book PRIMARY KEY (id)
);

CREATE INDEX i_book_title ON book (title);

… and start running the application. That’s really cool to get started quickly and to try out things.

But, huh, wait. I cheated.

  • Will Hibernate really apply that named primary key definition?
  • Will it create the index on TITLE, which I know we’ll need?
  • Will it add an identity specification?

Probably not. While you’re developing your greenfield project, it is convenient to always throw away your entire database and re-generate it from scratch, once you’ve added the additional annotations. So, the Book entity would eventually look like this:

@Entity
@Table(name = "book", indexes = {
  @Index(name = "i_book_title", columnList = "title")
})
class Book {
  @Id
  @GeneratedValue(strategy = IDENTITY)
  int id;
  String title;
}

Cool. Re-generate. Again, this makes it really easy to get started.

But you’ll pay the price later on

At some point, you go to production. And that’s when this model no longer works. Because

Once you go live, you can no longer throw away your database, as your database has become legacy.

From now on, you have to write DDL migration scripts, e.g. using Flyway. And then, what happens to your entities? You can either adapt them manually (so you double the work), or have Hibernate re-generate them for you (how big are your chances of the generation matching your expectations?) You can only lose.

Because once you go to production, you need hotfixes. And those have to go live fast. And since you didn’t prepare for pipelining your migrations to production smoothly, you’ll patch things wildly. And then you run out of time to do it right™. And you’ll blame Hibernate, because it’s always someone else’s fault…

Instead, you could have done things entirely differently from the beginning. Like using those round wheels.

Go Database First

The real “truth” of your database schema, and the “sovereignty” over it, resides with your database. The database is the only place where the schema is defined, and all clients have a copy of the database schema, not vice versa. The data is in your database, not in your client, so it makes perfect sense to enforce the schema and its integrity in the database, right where the data is.

This is old wisdom, nothing new. Primary and unique keys are good. Foreign keys are good. Check constraints are good. Assertions (when they’re finally implemented) are good.

And that’s not where it ends. For instance, if you’re using Oracle, you may want to specify:

  • In what tablespace your table resides
  • What PCTFREE value it has
  • What the cache size of your sequence (behind the identity) is

Maybe all of this doesn’t matter in small systems, but you don’t have to go “big data” before you can profit from vendor-specific storage optimisations such as the above. None of the ORMs I’ve ever seen (including jOOQ) will allow you to use the full set of DDL options that you may want to use on your database. ORMs offer some tools to help you write DDL.

But ultimately, a well-designed schema is hand written in DDL. All generated DDL is only an approximation of that.

What about the client model?

As mentioned before, you will need a copy of your database schema in your client, a client representation. Needless to say that this client representation needs to be in-sync with the real model. How to best do that? By using a code generator.

All databases expose their meta information through SQL. Here’s how to get all tables from your database in various SQL dialects:

-- H2, HSQLDB, MySQL, PostgreSQL, SQL Server
SELECT table_schema, table_name
FROM information_schema.tables

-- DB2
SELECT tabschema, tabname
FROM syscat.tables

-- Oracle
SELECT owner, table_name
FROM all_tables

-- SQLite
SELECT name
FROM sqlite_master

-- Teradata
SELECT databasename, tablename
FROM dbc.tables

These queries (or similar ones, e.g. depending on whether views, materialised views, table valued functions should also be considered) are also run by JDBC’s DatabaseMetaData.getTables() call, or by the jOOQ-meta module.

From the result of such queries, it’s relatively easy to generate any client representation of your database model, regardless what your client technology is.

  • If you’re using JDBC or Spring, you can create a bunch of String constants (see the sketch after this list)
  • If you’re using JPA, you can generate the entities themselves
  • If you’re using jOOQ, you can generate the jOOQ meta model
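The first option doesn’t even need a dedicated tool. A quick and dirty sketch that derives such String constants directly from INFORMATION_SCHEMA (PostgreSQL syntax; the naming convention is invented for illustration):

-- Generates one Java constant declaration per table
SELECT 'public static final String ' || upper(table_name) ||
       ' = "' || table_name || '";'
FROM information_schema.tables
WHERE table_schema = 'public';

The output can then be written into a class as part of the build.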

Depending on the amount of features your client API offers (e.g. jOOQ or JPA), the generated meta model can be really rich and complete. Consider, for instance, jOOQ 3.11’s implicit join feature, which relies on generated meta information about the foreign key relationships between your tables.

Now, any database increment will automatically lead to updated client code. For instance, imagine:

ALTER TABLE book RENAME COLUMN title TO book_title;

Would you really want to do this work twice? No way. Just commit the DDL, run it through your build pipeline, and have an updated entity:

@Entity
@Table(name = "book", indexes = {

  // Would you have thought of this?
  @Index(name = "i_book_title", columnList = "book_title")
})
class Book {
  @Id
  @GeneratedValue(strategy = IDENTITY)
  int id;

  @Column(name = "book_title")
  String bookTitle;
}

Or an updated jOOQ class. Plus: Your client code might no longer compile, which can be a good thing! Most DDL changes are also semantic changes, not just syntactic ones. So, it’s great to be able to see in compiled client source code, what code is (or may be) affected by your database increment.

A single truth

Regardless what technology you’re using, there’s always one model that contains the single truth for a subsystem – or at least, we should aim for this goal and avoid the enterprisey mess where “truth” is everywhere and nowhere. It just makes everything much simpler. If you exchange XML files with some other system, you’re going to use XSD. Like jOOQ’s INFORMATION_SCHEMA meta model in XML form:
https://www.jooq.org/xsd/jooq-meta-3.10.0.xsd

  • XSD is well understood
  • XSD specifies XML content very well, and allows for validation in all client languages
  • XSD can be versioned easily, and evolved backwards compatibly
  • XSD can be translated to Java code using XJC

The last bullet is important. When communicating with an external system through XML messages, we want to be sure our messages are valid. That’s really really easy to do with JAXB, XJC, and XSD. It would be outright nuts to think that a Java-first approach where we design our messages as Java objects could somehow be reasonably mapped to XML for someone else to consume. That generated XML would be of very poor quality, undocumented, and hard to evolve. If there’s an SLA on such an interface, we’d be screwed.

Frankly, that’s what happens to JSON APIs all the time, but that’s another story, another rant…

Databases: Same thing

When you’re using databases, it’s the same thing. The database owns its data and it should be the master of the schema. All modifications to the schema should be implemented using DDL directly, to update the single truth.

Once that truth is updated, all clients need to update their copies of the model as well. Some clients may be written in Java, using either (or both of) jOOQ and Hibernate, or JDBC. Other clients may be written in Perl (good luck to them). Even other clients may be written in C#. It doesn’t matter. The main model is in the database. ORM-generated models are of poor quality, not well documented, and hard to evolve.

So, don’t do it. And, don’t do it from the very beginning. Instead, go database first. Build a deployment pipeline that can be automated. Include code generators to copy your database model back into the clients. And stop worrying about code generation. It’s a good thing. You’ll be productive. All it takes is a bit of initial effort to set it up, and you’ll get years of improved productivity for the rest of your project.

Thank me later.

Clarification

Just to be sure: This article in no way asserts that your database model should be imposed on your entire system (e.g. your domain, your business logic, etc. etc.). The claim I made here is that client code interacting with the database should act upon the database model, and not have its own first class model of the database instead. This logic typically resides in the data access layer of your client.

In 2-tier architectures, which still have their place sometimes, that may be the only model of your system. In most systems, however, I consider the data access layer a “subsystem” that encapsulates the database model. So, there.

Exceptions

There are always exceptions, and I promised that the database first and code generation approach may not always be the right choice. These exceptions are (probably not exhaustive):

  • When the schema is unknown and must be discovered. E.g. you’re a tool vendor helping users navigate any schema. Duh… No code generation. But still database first.
  • When the schema needs to be generated on the fly for some task. This sounds a lot like a more or less sophisticated version of the entity attribute value pattern, i.e. you don’t really have a well-defined schema. In that case, it’s often not even sure if an RDBMS will be the right choice.

The nature of exceptions is that they’re exceptional. In the majority of RDBMS usage, the schema is known in advance, placed inside the RDBMS as the single source of “truth”, and clients will have derived copies from it – ideally generated using a code generator.

How to Group By “Nothing” in SQL

The SQL standard has a lesser-known feature called GROUPING SETS. One particular side-effect of that feature is that we can group by “nothing” in SQL. E.g. when querying the Sakila database:

SELECT count(*)
FROM film
GROUP BY ()

This will yield:

count |
------|
1000  |

What’s the point, you’re asking? Can’t we just omit the GROUP BY clause? Of course, this will yield the same result:

SELECT count(*)
FROM film

Yet, the two versions of the query are subtly different. The latter will always return exactly one row. The former will perform grouping and return all the groups. How is this different? Just add a predicate!

SELECT count(*)
FROM film
WHERE 1 = 0
GROUP BY ();

SELECT count(*)
FROM film
WHERE 1 = 0;

Now, the first query will produce nothing!

count |
------|

Whereas the second one produces:

count |
------|
0     |

Subtle, eh? Note that unlike DB2, Oracle, and SQL Server (which expose the above behaviour), PostgreSQL does not produce the above result, as it seems to implement the SQL standard (so, always producing a row), as pointed out by Markus Winand.

In SQL:1999 (when it was introduced), the <empty grouping set> was called <grand total>, akin to a grand total that can be calculated in a Microsoft Excel Pivot Table. It does make more sense for grand totals to always be present in the result, despite the absence of any input data.

Standards…
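By the way, the empty grouping set mostly shows up in practice as part of a larger GROUPING SETS specification, where it contributes exactly that grand total row. A sketch against the Sakila FILM table:

-- One row per rating, plus a grand total row (rating is NULL there)
SELECT rating, count(*)
FROM film
GROUP BY GROUPING SETS ((rating), ())
ORDER BY rating;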

What if your database doesn’t support grouping sets?

Not all databases support the awesome GROUPING SETS feature. Among the ones supported by jOOQ, these do:

  • DB2 LUW
  • HANA
  • Oracle
  • PostgreSQL 9.5+
  • SQL Server
  • Sybase SQL Anywhere
  • Teradata

Note that the following databases support a vendor-specific syntax for ROLLUP, which doesn’t help with the empty grouping set.

  • CUBRID
  • MariaDB
  • MySQL
  • Vertica

So, can we emulate it for the other databases?

Of course. There are two ways to emulate the empty grouping set:

By using a constant

You could try using a constant literal:

SELECT count(*)
FROM film
WHERE 1 = 0
GROUP BY 'a';

Sometimes, you’ll have to tweak the database into thinking it is not a constant literal, because it will not accept that:

SELECT count(*)
FROM film
WHERE 1 = 0
GROUP BY 'a' || 'b';

And if that’s also not supported, try wrapping the literal in a subquery:

SELECT count(*)
FROM film
WHERE 1 = 0
GROUP BY (SELECT 1);

One of the above three syntaxes is usually accepted by these databases:

  • Firebird
  • HSQLDB
  • MariaDB
  • MySQL
  • PostgreSQL
  • Redshift
  • SQLite
  • Vertica

By using a dummy table

In rare cases, none of the above works as the database’s SQL parser tries to be “clever” and rejects my silly attempts to fool it. But no one can fool me!

I’ll just cross join whatever is in the FROM clause with a dummy table (akin to an emulation of table dee) and then group by the dummy table’s column:

SELECT count(*)
FROM film, (SELECT 1 x) dummy
WHERE 1 = 0
GROUP BY dummy.x;

This is guaranteed to work, including on these databases:

  • Access
  • Informix
  • Ingres
  • SQL Data Warehouse
  • Sybase ASE

Q.E.D. 👏

(needless to say that jOOQ supports this emulation. You can play around with it here: https://www.jooq.org/translate)

Selecting all Columns Except One in PostgreSQL

Google’s BigQuery has a very interesting SQL language feature, which I’ve missed many times in other databases:

select:
    SELECT  [{ ALL | DISTINCT }]
        { [ expression. ]* [ EXCEPT ( column_name [, ...] ) ]
            [ REPLACE ( expression [ AS ] column_name [, ...] ) ]
        | expression [ [ AS ] alias ] } [, ...]
    [ FROM from_item  [, ...] ]
    [ WHERE bool_expression ]
    ...

Notice the two keywords EXCEPT and REPLACE that can be used along with an asterisked expression.
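EXCEPT is what the rest of this post is about. For completeness, REPLACE keeps the full projection but substitutes an expression for a column of the same name, e.g. (a BigQuery sketch, assuming an analogous film table):

-- Project everything, but override the length column with a computed value
SELECT * REPLACE (length / 60.0 AS length)
FROM film;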

An Example

For example, when running a query like this one (which fetches the longest film(s) every actor in the Sakila database played in):

SELECT *
FROM (
  SELECT 
    a.*, 
    f.*, 
    RANK() OVER (PARTITION BY actor_id ORDER BY length DESC) rk
  FROM film f
  JOIN film_actor fa USING (film_id)
  JOIN actor a USING (actor_id)
) t
WHERE rk = 1
ORDER BY first_name, last_name

This is one way to apply TOP-N per category filtering in SQL, which works with most modern databases, including MySQL 8.0. Essentially, we’re calculating the rank of a film per actor ordered by the film’s length.

The result looks like this:

actor_id |first_name  |last_name    |..|title                  |length|..|rk |
---------|------------|-------------|..|-----------------------|------|--|---|
71       |ADAM        |GRANT        |..|GLADIATOR WESTWARD     |   173|..|1  |
71       |ADAM        |GRANT        |..|BALLROOM MOCKINGBIRD   |   173|..|1  |
132      |ADAM        |HOPPER       |..|TORQUE BOUND           |   179|..|1  |
165      |AL          |GARLAND      |..|JACKET FRISCO          |   181|..|1  |

Let’s assume for a moment that we really need the entire projection of the ACTOR and FILM tables (so, SELECT * is fine), but we certainly don’t need the useless RK column, which is always 1.

Sometimes, having some excess columns is not going to be a problem, but sometimes it is. How to remove it? We can’t reference the ACTOR and FILM tables anymore in the outer query:

SELECT a.*, f.* -- Would be great, but wrong syntax
FROM (
  SELECT 
    a.*, 
    f.*, 
    RANK() OVER (PARTITION BY actor_id ORDER BY length DESC) rk
  FROM film f
  JOIN film_actor fa USING (film_id)
  JOIN actor a USING (actor_id)
) t
WHERE rk = 1
ORDER BY first_name, last_name

The outer query only has one table, and that’s the (derived) table T.

How to Solve This

In BigQuery syntax, we could now simply write

SELECT * EXCEPT (rk)
FROM (...) t
WHERE rk = 1
ORDER BY first_name, last_name

Which is really quite convenient! We want to project everything, except this one column. But none of the more popular SQL databases support this syntax.

Luckily, in PostgreSQL, we can use a workaround: Nested records:

SELECT (a).*, (f).* -- Unnesting the records again
FROM (
  SELECT 
    a, -- Nesting the actor table
    f, -- Nesting the film table
    RANK() OVER (PARTITION BY actor_id ORDER BY length DESC) rk
  FROM film f
  JOIN film_actor fa USING (film_id)
  JOIN actor a USING (actor_id)
) t
WHERE rk = 1
ORDER BY (a).first_name, (a).last_name;

Notice how we’re no longer projecting A.* and F.* inside of the derived table T, but instead, the entire table (record). In the outer query, we have to use some slightly different syntax to unnest the record again (e.g. (A).FIRST_NAME), and we’re done.

How Does This Work?

Informix, Oracle, PostgreSQL, and maybe a few lesser known ones, have implemented the SQL standard’s ORDBMS features to various degrees. ORDBMS attempted to combine relational and object oriented features in the SQL language (and in the storage model). For all practical purposes, this essentially just means that we can nest records and collections.

For instance, in PostgreSQL, we can write:

-- Explicit ROW constructor
SELECT 1, ROW(2, ROW(3, 4))

-- Implicit ROW constructor
SELECT 1, (2, (3, 4))

And we’ll get:

x        |row       |
---------|----------|
1        |(2,(3,4)) |

Along with ordinary “scalar” values, we can have nested rows (or nested tuples) constructed very easily. Conveniently, we can also reference a table without its column names in the projection, such as:

SELECT a, f
FROM film f
JOIN film_actor fa USING (film_id)
JOIN actor a USING (actor_id)

Which produces the following result:

a    |f    |
-----|-----|
(...)|(...)|
(...)|(...)|
(...)|(...)|
...

Similar things are possible in Oracle as well, except that Oracle doesn’t support structural row/tuple types, only nominal ones. We’d have to create some types first, prior to being able to use them:

CREATE TYPE film_t AS OBJECT (...);
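A minimal sketch of what that might look like (the attribute list of the OBJECT type is invented for illustration, not the full FILM table):

-- Hypothetical nominal type
CREATE TYPE film_t AS OBJECT (
  film_id NUMBER(10),
  title   VARCHAR2(255),
  length  NUMBER(5)
);

-- Unlike PostgreSQL's implicit ROW values, the nested record
-- has to be constructed explicitly from the nominal type:
SELECT film_t(f.film_id, f.title, f.length) AS f
FROM film f;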

Bonus

Of course, if you’re using SQL Server or Oracle, you wouldn’t have this problem, because then you could use the totally underrated WITH TIES clause along with CROSS APPLY:

SQL Server

SELECT *
FROM actor a
CROSS APPLY (
  SELECT TOP 1 WITH TIES f.*
  FROM film f
  JOIN film_actor fa 
    ON f.film_id = fa.film_id
    AND fa.actor_id = a.actor_id
  ORDER BY length DESC
) f
ORDER BY first_name, last_name;

Oracle

(Do check performance on this!)

SELECT *
FROM actor a
CROSS APPLY (
  SELECT f.*
  FROM film f
  JOIN film_actor fa 
    ON f.film_id = fa.film_id
    AND fa.actor_id = a.actor_id
  ORDER BY length DESC
  FETCH FIRST ROW WITH TIES
) f
ORDER BY first_name, last_name;

PostgreSQL and DB2 support the LATERAL keyword, which could be used with FETCH FIRST ROW ONLY semantics (so, no ties are selected).
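For instance, a PostgreSQL LATERAL version of the above might look like this (a sketch; with FETCH FIRST ROW ONLY, ties are not returned):

SELECT *
FROM actor a
CROSS JOIN LATERAL (
  SELECT f.*
  FROM film f
  JOIN film_actor fa
    ON f.film_id = fa.film_id
    AND fa.actor_id = a.actor_id
  ORDER BY f.length DESC
  FETCH FIRST ROW ONLY
) f
ORDER BY a.first_name, a.last_name;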

For more details about TOP-N per category queries, see this blog post.

How to Run a Bulk INSERT .. RETURNING Statement With Oracle and JDBC

When inserting records into SQL databases, we often want to fetch back generated IDs and possibly other trigger, sequence, or default generated values. Let’s assume we have the following table:

-- DB2
CREATE TABLE x (
  i INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, 
  j VARCHAR(50), 
  k DATE DEFAULT CURRENT_DATE
);

-- PostgreSQL
CREATE TABLE x (
  i SERIAL4 PRIMARY KEY, 
  j VARCHAR(50), 
  k DATE DEFAULT CURRENT_DATE
);

-- Oracle
CREATE TABLE x (
  i INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, 
  j VARCHAR2(50), 
  k DATE DEFAULT SYSDATE
);

DB2

DB2 is the only database currently supported by jOOQ that implements the SQL standard feature of selecting from the result of any INSERT statement, including:

SELECT *
FROM FINAL TABLE (
  INSERT INTO x (j)
  VALUES ('a'), ('b'), ('c')
);

The above query returns:

I |J |K          |
--|--|-----------|
1 |a |2018-05-02 |
2 |b |2018-05-02 |
3 |c |2018-05-02 |

Pretty neat! This query can simply be run like any other query in JDBC, and you don’t have to go through any hassles.

PostgreSQL and Firebird

These databases have a vendor specific extension that does the same thing, almost as powerful:

-- Simple INSERT .. RETURNING query
INSERT INTO x (j)
VALUES ('a'), ('b'), ('c')
RETURNING *;

-- If you want to do more fancy stuff
WITH t AS (
  INSERT INTO x (j)
  VALUES ('a'), ('b'), ('c')
  RETURNING *
)
SELECT * FROM t;

Both syntaxes work equally well; the latter is just as powerful as DB2’s, where the result of an insertion (or update, delete, merge) can be joined to other tables. Again, no problem with JDBC.
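For instance, the CTE form lets us post-process the returned rows like any other table. A sketch that aggregates the generated keys (using the table x defined above):

WITH t AS (
  INSERT INTO x (j)
  VALUES ('a'), ('b'), ('c')
  RETURNING *
)
SELECT min(i) AS first_id, max(i) AS last_id, count(*) AS inserted_rows
FROM t;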

Oracle

In Oracle, this is a bit more tricky. The Oracle SQL language doesn’t have an equivalent of DB2’s FINAL TABLE (DML statement). The Oracle PL/SQL language, however, does support the same syntax as PostgreSQL and Firebird. This is perfectly valid PL/SQL:

-- Create a few auxiliary types first
CREATE TYPE t_i AS TABLE OF NUMBER(38);
/
CREATE TYPE t_j AS TABLE OF VARCHAR2(50);
/
CREATE TYPE t_k AS TABLE OF DATE;
/

DECLARE 
  -- These are the input values
  in_j t_j := t_j('a', 'b', 'c');
  
  out_i t_i;
  out_j t_j;
  out_k t_k;
  
  c1 SYS_REFCURSOR;
  c2 SYS_REFCURSOR;
  c3 SYS_REFCURSOR;
BEGIN

  -- Use PL/SQL's FORALL command to bulk insert the
  -- input array type and bulk return the results
  FORALL i IN 1 .. in_j.COUNT
    INSERT INTO x (j)
    VALUES (in_j(i))
    RETURNING i, j, k
    BULK COLLECT INTO out_i, out_j, out_k;
  
  -- Fetch the results and display them to the console
  OPEN c1 FOR SELECT * FROM TABLE(out_i);  
  OPEN c2 FOR SELECT * FROM TABLE(out_j);  
  OPEN c3 FOR SELECT * FROM TABLE(out_k); 
  
  dbms_sql.return_result(c1);
  dbms_sql.return_result(c2);
  dbms_sql.return_result(c3);
END;
/

A bit verbose, but it has the same effect. Now, from JDBC:

try (Connection con = DriverManager.getConnection(url, props);
    Statement s = con.createStatement();

    // The statement itself is much more simple as we can
    // use OUT parameters to collect results into, so no
    // auxiliary local variables and cursors are needed
    CallableStatement c = con.prepareCall(
        "DECLARE "
      + "  v_j t_j := ?; "
      + "BEGIN "
      + "  FORALL j IN 1 .. v_j.COUNT "
      + "    INSERT INTO x (j) VALUES (v_j(j)) "
      + "    RETURNING i, j, k "
      + "    BULK COLLECT INTO ?, ?, ?; "
      + "END;")) {

    try {

        // Create the table and the auxiliary types
        s.execute(
            "CREATE TABLE x ("
          + "  i INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,"
          + "  j VARCHAR2(50),"
          + "  k DATE DEFAULT SYSDATE"
          + ")");
        s.execute("CREATE TYPE t_i AS TABLE OF NUMBER(38)");
        s.execute("CREATE TYPE t_j AS TABLE OF VARCHAR2(50)");
        s.execute("CREATE TYPE t_k AS TABLE OF DATE");

        // Bind input and output arrays
        c.setArray(1, ((OracleConnection) con).createARRAY(
            "T_J", new String[] { "a", "b", "c" })
        );
        c.registerOutParameter(2, Types.ARRAY, "T_I");
        c.registerOutParameter(3, Types.ARRAY, "T_J");
        c.registerOutParameter(4, Types.ARRAY, "T_K");

        // Execute, fetch, and display output arrays
        c.execute();
        Object[] i = (Object[]) c.getArray(2).getArray();
        Object[] j = (Object[]) c.getArray(3).getArray();
        Object[] k = (Object[]) c.getArray(4).getArray();

        System.out.println(Arrays.asList(i));
        System.out.println(Arrays.asList(j));
        System.out.println(Arrays.asList(k));
    }
    finally {
        try {
            s.execute("DROP TYPE t_i");
            s.execute("DROP TYPE t_j");
            s.execute("DROP TYPE t_k");
            s.execute("DROP TABLE x");
        }
        catch (SQLException ignore) {}
    }
}

The above code will display:

[1, 2, 3]
[a, b, c]
[2018-05-02 10:40:34.0, 2018-05-02 10:40:34.0, 2018-05-02 10:40:34.0]

Exactly what we wanted.

jOOQ support

A future version of jOOQ will emulate the above PL/SQL block from the jOOQ INSERT .. RETURNING statement:

DSL.using(configuration)
   .insertInto(X)
   .columns(X.J)
   .values("a")
   .values("b")
   .values("c")
   .returning(X.I, X.J, X.K)
   .fetch();

This will correctly emulate the query for all of the databases that natively support the syntax. In the case of Oracle, since jOOQ cannot create nor assume any SQL TABLE types, PL/SQL types from the DBMS_SQL package will be used.

The relevant issue is here: https://github.com/jOOQ/jOOQ/issues/5863

The Performance Difference Between SQL Row-by-row Updating, Batch Updating, and Bulk Updating

Something that has been said many times, but needs constant repeating until every developer is aware of it: the performance difference between row-by-row updating and bulk updating. If you cannot guess which one will be much faster, remember that row-by-row kinda rhymes with slow-by-slow (hint hint).

Disclaimer: This article will discuss only non-concurrent updates, which are much easier to reason about. In a concurrent update situation, a lot of additional factors will add complexity to the problem, including the locking strategy, transaction isolation levels, or simply how the database vendor implements things in detail. For the sake of simplicity, I’ll assume no concurrent updates are being made.

Example query

Let’s say we have a simple table for our blog posts (using Oracle syntax, but the effect is the same on all databases):

CREATE TABLE post (
  id INT NOT NULL PRIMARY KEY,
  text VARCHAR2(1000) NOT NULL,
  archived NUMBER(1) NOT NULL CHECK (archived IN (0, 1)),
  creation_date DATE NOT NULL
);

CREATE INDEX post_creation_date_i ON post (creation_date);

Now, let’s add some 10000 rows:

INSERT INTO post
SELECT 
  level,
  lpad('a', 1000, 'a'),
  0 AS archived,
  DATE '2017-01-01' + (level / 10)
FROM dual
CONNECT BY level <= 10000;

EXEC dbms_stats.gather_table_stats('TEST', 'POST');

Now imagine we want to update this table and set all posts to ARCHIVED = 1 if they are from last year, e.g. CREATION_DATE < DATE '2018-01-01'. There are various ways to do this, but you should have built an intuition that doing the update in one single UPDATE statement is probably better than looping over each individual row and updating it explicitly. Right?

Right.

Then, why do we keep doing it?

Let me ask this differently:

Does it matter?

The best way to find out is to benchmark. I’m doing two benchmarks for this:

  1. One that is run in PL/SQL, showing the performance difference between different approaches that are available to PL/SQL (namely looping, the FORALL syntax, and a single bulk UPDATE)
  2. One that is run in Java, doing JDBC calls, showing the performance difference between different approaches available to Java (namely looping, caching PreparedStatement but still looping, batching, and a single bulk UPDATE)

Benchmarking PL/SQL

The code of the benchmark can be found in this gist. I will also include it at the bottom of this blog post. The results are:

Run 1, Statement 1 : .01457 (avg : .0098)
Run 1, Statement 2 : .0133  (avg : .01291)
Run 1, Statement 3 : .02351 (avg : .02519)
Run 2, Statement 1 : .00882 (avg : .0098)
Run 2, Statement 2 : .01159 (avg : .01291)
Run 2, Statement 3 : .02348 (avg : .02519)
Run 3, Statement 1 : .01012 (avg : .0098)
Run 3, Statement 2 : .01453 (avg : .01291)
Run 3, Statement 3 : .02544 (avg : .02519)
Run 4, Statement 1 : .00799 (avg : .0098)
Run 4, Statement 2 : .01346 (avg : .01291)
Run 4, Statement 3 : .02958 (avg : .02519)
Run 5, Statement 1 : .00749 (avg : .0098)
Run 5, Statement 2 : .01166 (avg : .01291)
Run 5, Statement 3 : .02396 (avg : .02519)

These figures show the time it takes for each statement type to complete, each time updating 3649 out of 10000 rows. The difference between Statement 1 and Statement 3 is a factor of 2.5x.

The winner is:

Statement 1, running a bulk update

It looks like this:

UPDATE post
SET archived = 1
WHERE archived = 0 AND creation_date < DATE '2018-01-01';

Runner-up (not too far away) is:

Statement 2, using the PL/SQL FORALL syntax

It works like this:

DECLARE
  TYPE post_ids_t IS TABLE OF post.id%TYPE;
  v_post_ids post_ids_t;
BEGIN
  SELECT id 
  BULK COLLECT INTO v_post_ids
  FROM post 
  WHERE archived = 0 AND creation_date < DATE '2018-01-01';

  FORALL i IN 1 .. v_post_ids.count
    UPDATE post
    SET archived = 1
    WHERE id = v_post_ids(i);
END;

Loser (by a factor of 2.5x on our specific data set) is:

Statement 3, using an ordinary LOOP and running row-by-row updates

FOR rec IN (
  SELECT id 
  FROM post 
  WHERE archived = 0 AND creation_date < DATE '2018-01-01'
) LOOP
  UPDATE post
  SET archived = 1
  WHERE id = rec.id;
END LOOP;

It does not really come as a surprise. We’re switching between the PL/SQL engine and the SQL engine many many times, and also, instead of running through the post table only once in O(N) time, we’re looking up individual ID values in O(log N) time, N times, so the complexity went from

O(N) -> O(N log N)

We’d get far worse results for larger tables!

What about doing this from Java?

The difference is much more drastic if each call to the SQL engine has to be done over the network from another process. Again, the benchmark code is available from a gist, and I will paste it at the end of this blog post as well.

The result is (times are again in seconds, printed as java.time.Duration values):

Run 0, Statement 1: PT4.546S
Run 0, Statement 2: PT3.52S
Run 0, Statement 3: PT0.144S
Run 0, Statement 4: PT0.028S
Run 1, Statement 1: PT3.712S
Run 1, Statement 2: PT3.185S
Run 1, Statement 3: PT0.138S
Run 1, Statement 4: PT0.025S
Run 2, Statement 1: PT3.481S
Run 2, Statement 2: PT3.007S
Run 2, Statement 3: PT0.122S
Run 2, Statement 4: PT0.026S
Run 3, Statement 1: PT3.518S
Run 3, Statement 2: PT3.077S
Run 3, Statement 3: PT0.113S
Run 3, Statement 4: PT0.027S
Run 4, Statement 1: PT3.54S
Run 4, Statement 2: PT2.94S
Run 4, Statement 3: PT0.123S
Run 4, Statement 4: PT0.03S

The difference between Statement 1 and 4 is a factor of 100x!

So, who’s winning? Again (by far):

Statement 4, running the bulk update

In fact, the time is not too far away from the time taken by PL/SQL. With larger data sets being updated, the two results will converge. The code is:

try (Statement s = c.createStatement()) {
    s.executeUpdate(
        "UPDATE post\n" +
        "SET archived = 1\n" +
        "WHERE archived = 0\n" +
        "AND creation_date < DATE '2018-01-01'\n");
}

Followed by the runner-up, which is not that much worse in absolute terms (but still roughly 4-5x slower than the bulk update):

Statement 3, running the batch update

Batching can be compared to PL/SQL's FORALL statement. While we're still running individual row-by-row updates, we're sending all the update statements to the SQL engine in one batch. This saves a lot of time on the network and all the layers in between.

The code looks like this:

try (Statement s = c.createStatement();
    ResultSet rs = s.executeQuery(
        "SELECT id FROM post WHERE archived = 0\n"
      + "AND creation_date < DATE '2018-01-01'"
    );
    PreparedStatement u = c.prepareStatement(
        "UPDATE post SET archived = 1 WHERE id = ?"
    )) {

    while (rs.next()) {
        u.setInt(1, rs.getInt(1));
        u.addBatch();
    }

    u.executeBatch();
}

Followed by the losers:

Statements 1 and 2, running row-by-row updates

The difference between Statements 1 and 2 is that Statement 2 prepares the PreparedStatement only once and reuses it, which allows the driver to reuse some resources. This can be a good thing, but it didn't have a very significant effect in our case, compared to the batch / bulk alternatives. The code is:

// Statement 1:
try (Statement s = c.createStatement();
    ResultSet rs = s.executeQuery(
        "SELECT id FROM post\n"
      + "WHERE archived = 0\n"
      + "AND creation_date < DATE '2018-01-01'"
    )) {

    while (rs.next()) {
        try (PreparedStatement u = c.prepareStatement(
            "UPDATE post SET archived = 1 WHERE id = ?"
        )) {
            u.setInt(1, rs.getInt(1));
            u.executeUpdate();
        }
    }
}

// Statement 2:
try (Statement s = c.createStatement();
    ResultSet rs = s.executeQuery(
        "SELECT id FROM post\n"
      + "WHERE archived = 0\n"
      + "AND creation_date < DATE '2018-01-01'"
    );
    PreparedStatement u = c.prepareStatement(
        "UPDATE post SET archived = 1 WHERE id = ?"
    )) {

    while (rs.next()) {
        u.setInt(1, rs.getInt(1));
        u.executeUpdate();
    }
}

Conclusion

As shown previously on this blog, there is a significant cost of JDBC server roundtrips, which can be seen in the JDBC benchmark. This cost is much more severe if we unnecessarily create many server roundtrips for a task that could be done in a single roundtrip, namely by using a SQL bulk UPDATE statement.

This is not only true for updates, but also for all the other statements, including SELECT, DELETE, INSERT, and MERGE. If doing everything in a single statement isn't possible due to the limitations of SQL, we can still save roundtrips by grouping statements in a block, either by using anonymous blocks in databases that support them:

BEGIN
  statement1;
  statement2;
  statement3;
END;

(you can easily send these anonymous blocks over JDBC, as well!)
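For illustration, here is a minimal sketch of sending such a block from JDBC, assuming an existing connection c to the post table from this article (the two updates inside the block are just placeholders):

try (CallableStatement call = c.prepareCall(
    "BEGIN "
  + "  UPDATE post SET archived = 1 WHERE id = ?; "
  + "  UPDATE post SET archived = 1 WHERE id = ?; "
  + "END;")) {

    // Both updates travel to the server in a single roundtrip
    call.setInt(1, 1);
    call.setInt(2, 2);
    call.execute();
}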

Or by emulating anonymous blocks using the JDBC batch API (which has its limitations), or by writing stored procedures.
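For the batch API variant, a minimal sketch could look like this (again assuming an existing connection c; one of the limitations is that such a batch may only contain statements returning update counts, so no result sets):

try (Statement s = c.createStatement()) {
    // Each statement is still an individual SQL string, but the driver
    // may send the whole group in far fewer roundtrips, depending on the vendor
    s.addBatch("UPDATE post SET archived = 1 WHERE id = 1");
    s.addBatch("UPDATE post SET archived = 1 WHERE id = 2");
    s.addBatch("UPDATE post SET archived = 1 WHERE id = 3");
    s.executeBatch();
}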

The performance gain is not always worth the trouble of moving logic from the client to the server, but very often (as in the above case), the move is a no-brainer and there’s absolutely no reason against it.

So, remember: Stop doing row-by-row (slow-by-slow) operations when you could run the same operation in bulk, in a single SQL statement.

Hint: Always know what your ORM (if you're using one) is doing, because the ORM can help you with automatic batching / bulking in many cases. But often it cannot, or it is too difficult to make it do so, in which case resorting to SQL is the way to go.
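For example, a minimal sketch assuming Hibernate as the ORM: its hibernate.jdbc.batch_size setting makes it collect pending DML into JDBC batches, which helps with the roundtrip problem (though the server still executes row-by-row statements):

// Hypothetical configuration snippet; only the property name is standard Hibernate
Properties properties = new Properties();
properties.setProperty("hibernate.jdbc.batch_size", "50");
// ... pass these properties to your SessionFactory / EntityManagerFactory setup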

Code

PL/SQL benchmark

SET SERVEROUTPUT ON

DROP TABLE post;

CREATE TABLE post (
  id INT NOT NULL PRIMARY KEY,
  text VARCHAR2(1000) NOT NULL,
  archived NUMBER(1) NOT NULL CHECK (archived IN (0, 1)),
  creation_date DATE NOT NULL
);

CREATE INDEX post_creation_date_i ON post (creation_date);

ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;

CREATE TABLE results (
  run     NUMBER(2),
  stmt    NUMBER(2),
  elapsed NUMBER
);

DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  
  PROCEDURE reset_post IS
  BEGIN
    EXECUTE IMMEDIATE 'TRUNCATE TABLE post';
    INSERT INTO post
    SELECT 
      level AS id,
      lpad('a', 1000, 'a') AS text,
      0 AS archived,
      DATE '2017-01-01' + (level / 10) AS creation_date
    FROM dual
    CONNECT BY level <= 10000;
    dbms_stats.gather_table_stats('TEST', 'POST');
  END reset_post;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
  
    reset_post;
    v_ts := SYSTIMESTAMP;
    
    UPDATE post
    SET archived = 1
    WHERE archived = 0 AND creation_date < DATE '2018-01-01';
  
    INSERT INTO results VALUES (r, 1, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    
    reset_post;
    v_ts := SYSTIMESTAMP;
    
    DECLARE
      TYPE post_ids_t IS TABLE OF post.id%TYPE;
      v_post_ids post_ids_t;
    BEGIN
      SELECT id 
      BULK COLLECT INTO v_post_ids
      FROM post 
      WHERE archived = 0 AND creation_date < DATE '2018-01-01';
    
      FORALL i IN 1 .. v_post_ids.count
        UPDATE post
        SET archived = 1
        WHERE id = v_post_ids(i);
    END;
    
    INSERT INTO results VALUES (r, 2, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    
    reset_post;
    v_ts := SYSTIMESTAMP;
      
    FOR rec IN (
      SELECT id 
      FROM post 
      WHERE archived = 0 AND creation_date < DATE '2018-01-01'
    ) LOOP
      UPDATE post
      SET archived = 1
      WHERE id = rec.id;
    END LOOP;
      
    INSERT INTO results VALUES (r, 3, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
  END LOOP;
  
  FOR rec IN (
    SELECT 
      run, stmt, 
      CAST(elapsed AS NUMBER(10, 5)) ratio,
      CAST(AVG(elapsed) OVER (PARTITION BY stmt) AS NUMBER(10, 5)) avg_ratio
    FROM results
    ORDER BY run, stmt
  )
  LOOP
    dbms_output.put_line('Run ' || rec.run || 
      ', Statement ' || rec.stmt || 
      ' : ' || rec.ratio || ' (avg : ' || rec.avg_ratio || ')');
  END LOOP;
  
  dbms_output.put_line('');
  dbms_output.put_line('Copyright Data Geekery GmbH');
  dbms_output.put_line('https://www.jooq.org/benchmark');
END;
/

DROP TABLE results;

JDBC benchmark

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.time.Duration;
import java.time.Instant;
import java.util.Properties;

public class OracleUpdate {

    public static void main(String[] args) throws Exception {
        Class.forName("oracle.jdbc.OracleDriver");

        String url = "jdbc:oracle:thin:@192.168.99.100:1521:ORCLCDB";
        String user = "TEST";
        String password = "TEST";

        Properties properties = new Properties();
        properties.setProperty("user", user);
        properties.setProperty("password", password);

        try (Connection c = DriverManager.getConnection(url, properties)) {
            for (int i = 0; i < 5; i++) {
                Instant ts;

                resetPost(c);
                ts = Instant.now();

                try (Statement s = c.createStatement();
                    ResultSet rs = s.executeQuery(
                        "SELECT id FROM post WHERE archived = 0 AND creation_date < DATE '2018-01-01'"
                    )) {

                    while (rs.next()) {
                        try (PreparedStatement u = c.prepareStatement(
                            "UPDATE post SET archived = 1 WHERE id = ?"
                        )) {
                            u.setInt(1, rs.getInt(1));
                            u.executeUpdate();
                        }
                    }
                }

                System.out.println("Run " + i + ", Statement 1: " + Duration.between(ts, Instant.now()));

                resetPost(c);
                ts = Instant.now();

                try (Statement s = c.createStatement();
                    ResultSet rs = s.executeQuery(
                        "SELECT id FROM post WHERE archived = 0 AND creation_date < DATE '2018-01-01'"
                    );
                    PreparedStatement u = c.prepareStatement(
                        "UPDATE post SET archived = 1 WHERE id = ?"
                    )) {

                    while (rs.next()) {
                        u.setInt(1, rs.getInt(1));
                        u.executeUpdate();
                    }
                }

                System.out.println("Run " + i + ", Statement 2: " + Duration.between(ts, Instant.now()));

                resetPost(c);
                ts = Instant.now();

                try (Statement s = c.createStatement();
                    ResultSet rs = s.executeQuery(
                        "SELECT id FROM post WHERE archived = 0 AND creation_date < DATE '2018-01-01'"
                    );
                    PreparedStatement u = c.prepareStatement(
                        "UPDATE post SET archived = 1 WHERE id = ?"
                    )) {

                    while (rs.next()) {
                        u.setInt(1, rs.getInt(1));
                        u.addBatch();
                    }

                    u.executeBatch();
                }
                System.out.println("Run " + i + ", Statement 3: " + Duration.between(ts, Instant.now()));

                resetPost(c);
                ts = Instant.now();

                try (Statement s = c.createStatement()) {
                    s.executeUpdate("UPDATE post\n" +
                        "SET archived = 1\n" +
                        "WHERE archived = 0 AND creation_date < DATE '2018-01-01'\n");
                }

                System.out.println("Run " + i + ", Statement 4: " + Duration.between(ts, Instant.now()));
            }
        }
    }

    static void resetPost(Connection c) throws SQLException {
        try (Statement s = c.createStatement()) {
            s.executeUpdate("TRUNCATE TABLE post");
            s.executeUpdate("INSERT INTO post\n" +
                "    SELECT \n" +
                "      level,\n" +
                "      lpad('a', 1000, 'a'),\n" +
                "      0,\n" +
                "      DATE '2017-01-01' + (level / 10)\n" +
                "    FROM dual\n" +
                "    CONNECT BY level <= 10000");
            s.executeUpdate("BEGIN dbms_stats.gather_table_stats('TEST', 'POST'); END;");
        }
    }
}