A Rarely Seen, but Useful SQL Feature: CORRESPONDING

I recently stumbled upon a standard SQL feature that was implemented, to my surprise, in HSQLDB. The keyword is CORRESPONDING, and it can be used with all set operations, including UNION, INTERSECT, and EXCEPT.

Let’s look at the sakila database. It has three tables containing people:

CREATE TABLE actor (
    actor_id integer NOT NULL PRIMARY KEY,
    first_name varchar(45) NOT NULL,
    last_name varchar(45) NOT NULL,
    last_update timestamp
);

CREATE TABLE customer (
    customer_id integer NOT NULL PRIMARY KEY,
    store_id smallint NOT NULL,
    first_name varchar(45) NOT NULL,
    last_name varchar(45) NOT NULL,
    email varchar(50),
    address_id smallint NOT NULL,
    create_date date NOT NULL,
    last_update timestamp,
    active boolean
);

CREATE TABLE staff (
    staff_id integer NOT NULL PRIMARY KEY,
    first_name varchar(45) NOT NULL,
    last_name varchar(45) NOT NULL,
    address_id smallint NOT NULL,
    email varchar(50),
    store_id smallint NOT NULL,
    active boolean NOT NULL,
    username varchar(16) NOT NULL,
    password varchar(40),
    last_update timestamp,
    picture blob
);

Similar, but not the same. What if we wanted to get all the “people” from our database? One way to do that in any ordinary database product is:

SELECT first_name, last_name
FROM actor
UNION ALL
SELECT first_name, last_name
FROM customer
UNION ALL
SELECT first_name, last_name
FROM staff
ORDER BY first_name, last_name

The result might look like this:

|first_name|last_name|
|----------|---------|
|AARON     |SELBY    |
|ADAM      |GOOCH    |
|ADAM      |GRANT    |
|ADAM      |HOPPER   |
|ADRIAN    |CLARY    |
|AGNES     |BISHOP   |
|AL        |GARLAND  |
|ALAN      |DREYFUSS |
|...       |...      |

Using CORRESPONDING

Now, in HSQLDB, and in standard SQL, you can use CORRESPONDING for this kind of task. For example:

SELECT *
FROM actor
UNION ALL CORRESPONDING
SELECT *
FROM customer
UNION ALL CORRESPONDING
SELECT *
FROM staff
ORDER BY first_name, last_name

The result is this:

|first_name|last_name|last_update            |
|----------|---------|-----------------------|
|AARON     |SELBY    |2006-02-15 04:57:20.000|
|ADAM      |GOOCH    |2006-02-15 04:57:20.000|
|ADAM      |GRANT    |2006-02-15 04:34:33.000|
|ADAM      |HOPPER   |2006-02-15 04:34:33.000|
|ADRIAN    |CLARY    |2006-02-15 04:57:20.000|
|AGNES     |BISHOP   |2006-02-15 04:57:20.000|
|AL        |GARLAND  |2006-02-15 04:34:33.000|
|ALAN      |DREYFUSS |2006-02-15 04:34:33.000|
|...       |...      |...                    |

So, what has happened? The columns FIRST_NAME, LAST_NAME, and LAST_UPDATE are common to all three tables. In other words, if you run this query against the INFORMATION_SCHEMA in HSQLDB:

SELECT column_name
FROM information_schema.columns
WHERE table_name = 'ACTOR'
INTERSECT
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'CUSTOMER'
INTERSECT
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'STAFF'

You get exactly these 3 columns:

|COLUMN_NAME|
|-----------|
|FIRST_NAME |
|LAST_NAME  |
|LAST_UPDATE|

In other words, CORRESPONDING creates the intersection of columns among the subqueries of a set operation (i.e. the “shared columns”), projects those, and applies the set operation to that projection. In a way, this is similar to a NATURAL JOIN, which also tries to find that intersection of columns to produce a join predicate. However, NATURAL JOIN then projects all of the columns (or the union of the columns), not just the shared ones.

Using CORRESPONDING BY

Just like NATURAL JOIN, this is a risky operation. As soon as one subquery changes its projection (e.g. because of a table column rename), the result of all such queries will change as well, and it might not even produce a syntax error, just a different result.

In fact, in the above example, we probably didn’t even care about that LAST_UPDATE column. It was included in the UNION ALL set operation by accident, just like NATURAL JOIN would join using LAST_UPDATE by accident.

With joins, we can use JOIN .. USING (first_name, last_name) to at least specify by which shared column names we want to join the two tables. With CORRESPONDING, we can supply the optional BY clause for the same purpose:

SELECT *
FROM actor
UNION ALL CORRESPONDING BY (first_name, last_name)
SELECT *
FROM customer
UNION ALL CORRESPONDING BY (first_name, last_name)
SELECT *
FROM staff
ORDER BY first_name, last_name;

This now produces only the two desired columns:

|first_name|last_name|
|----------|---------|
|AARON     |SELBY    |
|ADAM      |GOOCH    |
|ADAM      |GRANT    |
|ADAM      |HOPPER   |
|ADRIAN    |CLARY    |
|AGNES     |BISHOP   |
|AL        |GARLAND  |
|ALAN      |DREYFUSS |
|...       |...      |

In fact, this way, we could even use the syntax meaningfully for INTERSECT and EXCEPT, e.g. to find customers who share their names with an actor:

SELECT *
FROM actor
INTERSECT CORRESPONDING BY (first_name, last_name)
SELECT *
FROM customer
ORDER BY first_name, last_name;

Producing:

|first_name|last_name|
|----------|---------|
|JENNIFER  |DAVIS    |

Other dialects

I haven’t encountered this syntax in many other dialects before. Perhaps it will ship in PostgreSQL in the future; a branch has been worked on by Vik Fearing.

jOOQ might soon support it in the API / parser / translator:

https://github.com/jOOQ/jOOQ/issues/5285

Using jOOQ’s DiagnosticsConnection to detect N+1 Queries

N+1 queries are a common problem in many applications that run SQL queries. The problem can be described easily as follows:

  • 1 query fetching a parent value is run
  • N queries fetching each individual child value are run

This problem isn’t limited to SQL, it can happen with any poorly designed API that does not allow for batch and/or bulk processing (even stored procedures). But with SQL, it’s particularly painful, because in many cases, running tons of logic in a single query would be totally possible, especially with jOOQ’s MULTISET and SQL/XML or SQL/JSON support.
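For example, fetching actors together with a nested collection of their films can be done in a single round trip using MULTISET. A minimal sketch, assuming the generated ACTOR and FILM_ACTOR tables of the sakila database:

// Assuming these static imports
import static org.jooq.impl.DSL.multiset;
import static org.jooq.impl.DSL.select;

var actorsWithFilms =
ctx.select(
        ACTOR.FIRST_NAME,
        ACTOR.LAST_NAME,

        // Nest all film titles per actor into a single collection
        multiset(
            select(FILM_ACTOR.film().TITLE)
            .from(FILM_ACTOR)
            .where(FILM_ACTOR.ACTOR_ID.eq(ACTOR.ACTOR_ID))
        ).as("films"))
   .from(ACTOR)
   .fetch();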

In the worst case, the N+1 problem is caused by a third party ORM – or rather, its poor implementation / configuration, but some ORMs make it really easy to shoot oneself in the foot with N+1 problems…

An example

Let’s stick with JDBC for now to illustrate how N+1 queries happen.

try (Statement stmt = conn.createStatement()) {

    // The parent query, fetching actors
    try (ResultSet r1 = stmt.executeQuery(
        """
        SELECT actor_id, first_name, last_name
        FROM actor
        LIMIT 5
        """
    )) {
        while (r1.next()) {
            System.out.println();
            System.out.println(
                "Actor: " + r1.getString(2) + " " + r1.getString(2));

            // The child query, fetching films per actor
            try (PreparedStatement pstmt = conn.prepareStatement(
                """
                SELECT count(*) FROM film_actor WHERE actor_id = ?
                """
            )) {
                pstmt.setInt(1, r1.getInt(1));

                try (ResultSet r2 = pstmt.executeQuery()) {
                    while (r2.next()) {
                        System.out.println("Films: " + r2.getInt(1));
                    }
                }
            }
        }
    }
}

When run against the sakila database, the above prints:

Actor: PENELOPE GUINESS
Films: 19

Actor: NICK WAHLBERG
Films: 25

Actor: ED CHASE
Films: 22

Actor: JENNIFER DAVIS
Films: 22

Actor: JOHNNY LOLLOBRIGIDA
Films: 29

Obviously correct, but we could have easily run this in a single query:

SELECT
  a.first_name,
  a.last_name,
  count(fa.film_id)
FROM actor AS a
LEFT JOIN film_actor AS fa USING (actor_id)
GROUP BY a.actor_id

Given that we have 200 actors in total, which do you prefer? Running 1 + 200 queries, or just 1 query? If you’re in control of your SQL, this mistake is much less likely to happen. But what if you’re not in (complete) control, because the SQL is generated based on eager/lazy loading configurations and complex entity graph annotations? Then you’ll be happy you can just plug jOOQ’s DiagnosticsConnection and its repeated statements diagnostic into your integration test environment (not necessarily into production, as there’s some overhead to parse and normalise all the SQL strings).

Applied to the above JDBC example:

DSLContext ctx = DSL.using(connection);
ctx.configuration().set(new DefaultDiagnosticsListener() {
    @Override
    public void repeatedStatements(DiagnosticsContext c) {

        // Custom callback, may also throw exceptions, etc.
        System.out.println(
            "Repeated statement: " + c.normalisedStatement());
    }
});

Connection conn = ctx.diagnosticsConnection();

You’re now getting the following output:

Actor: PENELOPE GUINESS
Films: 19

Actor: NICK WAHLBERG
Repeated statement: select count(*) from film_actor where actor_id = ?;
Films: 25

Actor: ED CHASE
Repeated statement: select count(*) from film_actor where actor_id = ?;
Films: 22

Actor: JENNIFER DAVIS
Repeated statement: select count(*) from film_actor where actor_id = ?;
Films: 22

Actor: JOHNNY LOLLOBRIGIDA
Repeated statement: select count(*) from film_actor where actor_id = ?;
Films: 29

As you can see, the diagnostics connection starts logging after the first repetition of the statement, the assumption being that within a transaction, it is generally unnecessary to run the same statement more than once, because there’s almost always a better way.

Using this with JPA / Hibernate

You probably don’t write JDBC statements manually like this, but it doesn’t matter who calls JDBC (you, jOOQ, JdbcTemplate, Hibernate, etc.). If you proxy your connection (or DataSource) with jOOQ’s DiagnosticsConnection or DiagnosticsDataSource, then you can intercept such events easily, no matter the cause.
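For example, with a DataSource based setup (Spring, Hibernate, etc.), a minimal sketch might look like this, where originalDataSource stands for whatever DataSource your framework is already configured with:

DSLContext ctx = DSL.using(originalDataSource, SQLDialect.POSTGRES);
ctx.configuration().set(new DefaultDiagnosticsListener() {
    @Override
    public void repeatedStatements(DiagnosticsContext c) {
        System.out.println(
            "Repeated statement: " + c.normalisedStatement());
    }
});

// Hand this proxying DataSource to Hibernate / Spring instead of the original one
DataSource proxy = ctx.diagnosticsDataSource();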

Future versions of jOOQ will add a lot more diagnostics via https://github.com/jOOQ/jOOQ/issues/7527.

To see what’s available in jOOQ already, refer to the manual.

The Useful BigQuery * EXCEPT Syntax

One of the coolest things about using and making jOOQ is that we get to discover the best extensions to the standard SQL language by vendors, and add support for those clauses in jOOQ via emulations.

One of these syntaxes is BigQuery’s * EXCEPT syntax. Everyone who has ever written ad-hoc SQL queries will have wished for this syntax occasionally. Why is it needed? Look at this query:

SELECT * FROM actor

The result being:

|actor_id|first_name|last_name   |last_update            |
|--------|----------|------------|-----------------------|
|1       |PENELOPE  |GUINESS     |2006-02-15 04:34:33.000|
|2       |NICK      |WAHLBERG    |2006-02-15 04:34:33.000|
|3       |ED        |CHASE       |2006-02-15 04:34:33.000|
|4       |JENNIFER  |DAVIS       |2006-02-15 04:34:33.000|
|5       |JOHNNY    |LOLLOBRIGIDA|2006-02-15 04:34:33.000|
|6       |BETTE     |NICHOLSON   |2006-02-15 04:34:33.000|
|...     |...       |...         |...                    |

But that LAST_UPDATE column is mighty annoying, especially if we want to NATURAL JOIN things. E.g. this doesn’t work:

SELECT actor_id, a.first_name, a.last_name, count(fa.film_id)
FROM actor AS a
NATURAL LEFT JOIN film_actor AS fa
GROUP BY actor_id

The result is just actors without films, because accidentally, the LAST_UPDATE column was included in the NATURAL JOIN:

|actor_id|first_name|last_name|count|
|--------|----------|---------|-----|
|58      |CHRISTIAN |AKROYD   |0    |
|8       |MATTHEW   |JOHANSSON|0    |
|116     |DAN       |STREEP   |0    |
|184     |HUMPHREY  |GARLAND  |0    |
|87      |SPENCER   |PECK     |0    |

This is the biggest flaw of NATURAL JOIN, making it almost useless for schemas that aren’t designed specifically for NATURAL JOIN usage. But this is ad-hoc SQL, and it would have been nice to use it here.

We could, if we had * EXCEPT like this:

SELECT 
  a.actor_id, 
  a.first_name, 
  a.last_name, 
  count(fa.film_id)
FROM (
  SELECT * EXCEPT (last_update) FROM actor
) AS a
NATURAL LEFT JOIN (
  SELECT * EXCEPT (last_update) FROM film_actor
) AS fa
GROUP BY 
  a.actor_id, 
  a.first_name, 
  a.last_name

Unfortunately, this doesn’t work in PostgreSQL and other dialects, but jOOQ can emulate it. If you provide the online SQL translator with the sakila database table definitions:

CREATE TABLE actor (
    actor_id integer NOT NULL,
    first_name character varying(45) NOT NULL,
    last_name character varying(45) NOT NULL,
    last_update timestamp without time zone DEFAULT now() NOT NULL
);
CREATE TABLE film_actor (
    actor_id smallint NOT NULL,
    film_id smallint NOT NULL,
    last_update timestamp without time zone DEFAULT now() NOT NULL
);

Then, it can resolve the * EXCEPT syntax of the query and produce this:

SELECT
  a.actor_id,
  a.first_name,
  a.last_name,
  count(fa.film_id)
FROM (
  SELECT actor.actor_id, actor.first_name, actor.last_name
  FROM actor
) a
  NATURAL LEFT OUTER JOIN (
    SELECT film_actor.actor_id, film_actor.film_id
    FROM film_actor
  ) fa
GROUP BY
  a.actor_id,
  a.first_name,
  a.last_name

Obviously, we could have just omitted NATURAL JOIN to achieve the same result, but sometimes, it’s just nice to have yet another tool in the tool chain to write a nice query. With jOOQ, the query would read:

Actor a = ACTOR.as("a");
FilmActor fa = FILM_ACTOR.as("fa");

ctx.select(
        a.ACTOR_ID,
        a.FIRST_NAME,
        a.LAST_NAME,
        count(fa.FILM_ID))
   .from(
        select(asterisk().except(a.LAST_UPDATE)).from(a).asTable(a))
   .naturalLeftOuterJoin(
        select(asterisk().except(fa.LAST_UPDATE)).from(fa).asTable(fa))
   .groupBy(a.ACTOR_ID, a.FIRST_NAME, a.LAST_NAME)
   .fetch();

3.16.0 Release with a new Public Query Object Model API, Spatial Support, YugabyteDB Support and Much More

This release tackles two long-standing and complex feature requests that users
have asked for: a public API for manipulating jOOQ’s query object model (QOM),
and spatial support.

New Query Object Model (QOM)

Every jOOQ query is modeled as an expression tree constructed via our intuitive
DSL. For some use-cases there exist historic model API versions (e.g.
SelectQuery), but those models aren’t readable or transformable. We’re now
starting to offer most of our expression tree model as a public API for users to
consume and manipulate. Every expression tree element has a corresponding type
in org.jooq.impl.QOM. Every type provides access to its components using "$"
prefixed method names, e.g.:

// Create an expression using the DSL API:
Field<String> field = substring(BOOK.TITLE, 2, 4);

// Access the expression's internals using the model API
if (field instanceof QOM.Substring substring) {
    Field<String> string = substring.$string();
    Field<? extends Number> startingPosition = substring.$startingPosition();
    Field<? extends Number> length = substring.$length();
}

The new API is experimental and might change in the next minor release.

Licensed power users will get auxiliary APIs to traverse and transform the
expression tree, e.g. traversal:

// Contains 7 query parts
long count = BOOK.ID.eq(1).or(BOOK.ID.eq(2))
    .$traverse(Traversers.collecting(Collectors.counting()));

Or replacement:

// Contains redundant operators
Condition c = not(not(BOOK.ID.eq(1)));
System.out.println(c.$replace(q ->
    q instanceof QOM.Not n1 && n1.$arg1() instanceof QOM.Not n2
        ? n2.$arg1()
        : q
));

The above prints the following, having removed the redundant NOT operators:

"BOOK"."ID" = 1

This new API is very powerful for even more sophisticated dynamic SQL use-cases,
such as:

  • Optimising SQL expressions, as in the above NOT NOT example
  • Row level security
  • Soft deletion
  • Shared schema multi tenancy
  • Audit column support
  • And much more (stay tuned for future blogs and out-of-the-box transformations)

For more info, see:

Spatial support

A long awaited feature that starts shipping to our commercially licensed
customers is spatial support. Many dialects support the ISO/IEC 13249-3:2016 SQL
standard extension, and finally, so do we.

jOOQ introduces new auxiliary data types for GEOMETRY and GEOGRAPHY data to use
as standardised bind variables, containing WKB or WKT data, as well as a
variety of out of the box functions and predicates.

Future versions will add more support for other functions and dialects.

New dialect and versions

Another new officially supported SQL dialect has been added to all jOOQ editions
including the jOOQ Open Source Edition: YugabyteDB. This was a sponsored
integration, thank you very much Yugabyte!

Other dialects have upgraded versions, meaning:

  • We support this version in all editions now
  • The jOOQ Open Source Edition now requires this version if applicable

The versions are:

  • Firebird 4.0
  • H2 2.0.202
  • MariaDB 10.6
  • PostgreSQL 14
  • Oracle 21c

Computed and readonly columns including ROWID

A lot of dialects support computed columns (“generated” columns), and we now
support those as well in jOOQ. For most use-cases, this does not affect jOOQ
usage, but especially when writing CRUD code, the new readonly column feature
can be very helpful to avoid having to exclude computed columns manually from
the CRUD operation.

This also includes a new, improved ROWID support, where users can choose to work
with a synthetic ROWID column instead of the primary key, when performing CRUD
operations.

All of these features are available in the commercial jOOQ editions.

Jakarta EE

We’ve moved on from Java EE to Jakarta EE dependencies. This change is
currently backwards incompatible; we decided against supporting both sets of
dependencies in parallel because:

  • It greatly facilitates maintaining the related code
  • It prevents tons of user problems resulting from having both dependencies
  • We’re not actually integrating tightly with Java EE / Jakarta EE

The following Jakarta EE modules are affected:

  • JAXB, which we’re using to load our configuration objects.
  • Validation, whose annotations can be generated by the code generator.
  • JPA, which is used by the DefaultRecordMapper and the JPADatabase.

If this breaking change causes issues, please get in touch at
https://github.com/jOOQ/jOOQ/issues/9641

Various improvements

As with every minor release, a lot of smaller improvements have been
implemented. Some highlights include:

  • PostgreSQL procedures are now supported in code generation and at runtime.
  • SQLite JSON support has been added, which includes the MULTISET emulation!
  • A lot of MULTISET / ROW improvements have been implemented
  • R2DBC 0.9 has been released, and we’ve upgraded our dependency
  • The Java 17 distribution now requires Java 17 instead of Java 16
  • Pre jOOQ 3.6 deprecations have been removed

Full release notes here

How to customise a jOOQ Configuration that is injected using Spring Boot

Starting from Spring Boot 2.5, there’s a handy new callback that you can implement, called DefaultConfigurationCustomizer, where the word DefaultConfiguration corresponds to jOOQ’s DefaultConfiguration. You can simply create a class like this in your project:

import org.jooq.conf.RenderQuotedNames;
import org.jooq.impl.DefaultConfiguration;
import org.springframework.boot.autoconfigure.jooq.*;
import org.springframework.context.annotation.*;

@Configuration
public class Config {
	@Bean
	public DefaultConfigurationCustomizer configurationCustomiser() {
		return (DefaultConfiguration c) -> c.settings()
			.withRenderQuotedNames(
				RenderQuotedNames.EXPLICIT_DEFAULT_UNQUOTED
			);
	}
}

The above callback receives the DefaultConfiguration at its initialisation stage, during which you can still safely mutate it to change any of the defaults, or add additional settings, etc. That’s it!

You can set a breakpoint inside of the lambda to validate when this initialisation is invoked:

Caused by: java.lang.RuntimeException
	at org.jooq.example.spring.Config.lambda$0(Config.java:30)
	at org.springframework.boot.autoconfigure.jooq.JooqAutoConfiguration$DslContextConfiguration.lambda$jooqConfiguration$1(JooqAutoConfiguration.java:104)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
	at java.base/java.util.stream.SortedOps$RefSortingSink.end(SortedOps.java:395)
	at java.base/java.util.stream.Sink$ChainedReference.end(Sink.java:258)
	at java.base/java.util.stream.Sink$ChainedReference.end(Sink.java:258)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:510)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
	at org.springframework.boot.autoconfigure.jooq.JooqAutoConfiguration$DslContextConfiguration.jooqConfiguration(JooqAutoConfiguration.java:104)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:154)
	... 119 more

To play around with this example, check out our jOOQ-spring-boot-example on github.

Using JDK Collectors to De-duplicate parent/child nested collections

In classic SQL (i.e. before jOOQ’s awesome MULTISET operator), nested collections were fetched using ordinary (outer) joins. An example of such a query would be a query running against the sakila database to fetch actors and their films. Using jOOQ:

Result<?> result =
ctx.select(
        ACTOR.ACTOR_ID,
        ACTOR.FIRST_NAME,
        ACTOR.LAST_NAME,
        FILM.FILM_ID,
        FILM.TITLE)
   .from(ACTOR)
   .leftJoin(FILM_ACTOR).on(ACTOR.ACTOR_ID.eq(FILM_ACTOR.ACTOR_ID))
   .leftJoin(FILM).on(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
   .orderBy(
       ACTOR.ACTOR_ID,
       FILM.FILM_ID)
   .fetch();

The result from the jOOQ debug log would look something like this:

+--------+----------+---------+-------+---------------------+
|actor_id|first_name|last_name|film_id|title                |
+--------+----------+---------+-------+---------------------+
|       1|PENELOPE  |GUINESS  |      1|ACADEMY DINOSAUR     |
|       1|PENELOPE  |GUINESS  |     23|ANACONDA CONFESSIONS |
|       1|PENELOPE  |GUINESS  |     25|ANGELS LIFE          |
|       1|PENELOPE  |GUINESS  |    106|BULWORTH COMMANDMENTS|
|       1|PENELOPE  |GUINESS  |    140|CHEAPER CLYDE        |
+--------+----------+---------+-------+---------------------+

As expected for a SQL join operation, this denormalises the data, leading to duplicate entries for each actor, or if you sort differently, e.g. by FILM_ID, then you’d also see the duplicate entries per film:

+--------+----------+---------+-------+----------------+
|actor_id|first_name|last_name|film_id|title           |
+--------+----------+---------+-------+----------------+
|       1|PENELOPE  |GUINESS  |      1|ACADEMY DINOSAUR|
|      10|CHRISTIAN |GABLE    |      1|ACADEMY DINOSAUR|
|      20|LUCILLE   |TRACY    |      1|ACADEMY DINOSAUR|
|      30|SANDRA    |PECK     |      1|ACADEMY DINOSAUR|
|      40|JOHNNY    |CAGE     |      1|ACADEMY DINOSAUR|
+--------+----------+---------+-------+----------------+

This is just how a join operation works. It creates a cartesian product and then filters by primary key / foreign key matches, which JPA users know under the name MultipleBagFetchException.

De-duplicating and nesting the collections with fetchGroups()

What we usually want is some sort of nested data structure, e.g. fetch the films per actor. A simple utility in jOOQ is to just use fetchGroups():

Map<ActorRecord, Result<FilmRecord>> result =
ctx.select(
        ACTOR.ACTOR_ID,
        ACTOR.FIRST_NAME,
        ACTOR.LAST_NAME,
        FILM.FILM_ID,
        FILM.TITLE)
   .from(ACTOR)
   .leftJoin(FILM_ACTOR).on(ACTOR.ACTOR_ID.eq(FILM_ACTOR.ACTOR_ID))
   .leftJoin(FILM).on(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
   .orderBy(
       ACTOR.ACTOR_ID,
       FILM.FILM_ID)
   .fetchGroups(ACTOR, FILM);

This is super convenient, it’s order preserving, but it has a flaw. It is not “smart” enough to remember the LEFT JOIN semantics and do The Right Thing, which is to produce an empty list of FilmRecord in case an actor doesn’t have any films. If that’s the case, then there’s a NULL film in the result set in SQL:

+--------+----------+---------+-------+---------------------+
|actor_id|first_name|last_name|film_id|title                |
+--------+----------+---------+-------+---------------------+
|       1|PENELOPE  |GUINESS  |      1|ACADEMY DINOSAUR     |
|       1|PENELOPE  |GUINESS  |     23|ANACONDA CONFESSIONS |
|       1|PENELOPE  |GUINESS  |     25|ANGELS LIFE          |
|       1|PENELOPE  |GUINESS  |    106|BULWORTH COMMANDMENTS|
|       1|PENELOPE  |GUINESS  |    140|CHEAPER CLYDE        |
|     ...|...       |...      |    ...|...                  |
|     201|UNKNOWN   |ACTOR    | {null}|{null}               |
+--------+----------+---------+-------+---------------------+

So, we won’t get an empty list, but a list containing an empty FilmRecord, just like when you GROUP BY actor and COUNT(*) the number of films:

var r =
ctx.select(
        ACTOR.ACTOR_ID,
        ACTOR.FIRST_NAME,
        ACTOR.LAST_NAME,
        count(),
        count(FILM.FILM_ID))
   .from(ACTOR)
   .leftJoin(FILM_ACTOR).on(ACTOR.ACTOR_ID.eq(FILM_ACTOR.ACTOR_ID))
   .leftJoin(FILM).on(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
   .groupBy(
        ACTOR.ACTOR_ID,
        ACTOR.FIRST_NAME,
        ACTOR.LAST_NAME)
   .orderBy(
       ACTOR.ACTOR_ID)
   .fetch();

The result of this query might look like this:

+--------+----------+------------+-----+-----+
|actor_id|first_name|last_name   |count|count|
+--------+----------+------------+-----+-----+
|       1|PENELOPE  |GUINESS     |   19|   19|
|       2|NICK      |WAHLBERG    |   25|   25|
|       3|ED        |CHASE       |   22|   22|
|       4|JENNIFER  |DAVIS       |   22|   22|
|       5|JOHNNY    |LOLLOBRIGIDA|   29|   29|
|     ...|...       |...         |  ...|  ...|
|     201|UNKNOWN   |ACTOR       |    1|    0|
+--------+----------+------------+-----+-----+

Observe how the desired count value of 0 can only be achieved when we pass the nullable FILM.FILM_ID column as an argument. So, what would be the equivalent fetchGroups() call that exposes this behaviour?

Deduplicating and nesting the collections with JDK Collectors

A very much underestimated JDK feature is the Collector. While collectors were introduced specifically for use with the Stream API, they can, in principle, be used with any type of Iterable, and I’m still hoping that a future JDK will offer Iterable.collect(), among other things.

With jOOQ, you can collect the results of any query by calling ResultQuery.collect(). To translate the above fetchGroups() example, we can write this, producing almost the same result:

Map<ActorRecord, List<FilmRecord>> result =
ctx.select(
        ACTOR.ACTOR_ID,
        ACTOR.FIRST_NAME,
        ACTOR.LAST_NAME,
        FILM.FILM_ID,
        FILM.TITLE)
   .from(ACTOR)
   .leftJoin(FILM_ACTOR).on(ACTOR.ACTOR_ID.eq(FILM_ACTOR.ACTOR_ID))
   .leftJoin(FILM).on(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
   .orderBy(
       ACTOR.ACTOR_ID,
       FILM.FILM_ID)
   .collect(groupingBy(
       r -> r.into(ACTOR), filtering(
           r -> r.get(FILM.FILM_ID) != null, mapping(
               r -> r.into(FILM), toList()
           )
       )
   ));

The above collect() call nests these operations:

  • It groups by actor (just like fetchGroups())
  • It filters group contents by those films whose ID isn’t NULL (this can’t be achieved with fetchGroups()).
  • It maps the group contents to contain only FILM content, not the entire projection.

So, this is obviously more verbose, but also much more powerful, and it makes it easy to introduce custom client side aggregation behaviour, in case you can’t move the aggregation logic into your SQL statement.
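For example, here’s a sketch of such custom client side aggregation, counting the films per actor instead of collecting them into a list (using the same Collectors static imports as above, plus Collectors.counting()):

Map<ActorRecord, Long> filmCountPerActor =
ctx.select(
        ACTOR.ACTOR_ID,
        ACTOR.FIRST_NAME,
        ACTOR.LAST_NAME,
        FILM.FILM_ID)
   .from(ACTOR)
   .leftJoin(FILM_ACTOR).on(ACTOR.ACTOR_ID.eq(FILM_ACTOR.ACTOR_ID))
   .leftJoin(FILM).on(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
   .collect(groupingBy(
       r -> r.into(ACTOR), filtering(

           // Don't count the NULL film of film-less actors
           r -> r.get(FILM.FILM_ID) != null, counting()
       )
   ));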

More on powerful collectors in this article here:

Nesting collections directly in SQL

No article on this blog would be complete without plugging the awesome MULTISET alternative to nest collections. After all, the above deduplication algorithm only really works if you’re joining down a single parent-child relationship path, and it’s quite inefficient when there are a lot of duplicate data sets.

Assuming these auxiliary data types:

record Film(String title) {}
record Actor(String firstName, String lastName) {}
record Category(String name) {}

You can write a query like this:

// We're importing the new Records::mapping method for convenience
import static org.jooq.Records.mapping;

Result<Record3<Film, List<Actor>, List<Category>>> result = ctx
    .select(
        FILM.TITLE.convertFrom(Film::new),
        multiset(
            select(
                FILM_ACTOR.actor().FIRST_NAME,
                FILM_ACTOR.actor().LAST_NAME
            )
            .from(FILM_ACTOR)
            .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
        ).convertFrom(r -> r.map(mapping(Actor::new))),
        multiset(
            select(FILM_CATEGORY.category().NAME)
            .from(FILM_CATEGORY)
            .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))
        ).convertFrom(r -> r.map(mapping(Category::new)))
    )
    .from(FILM)
    .orderBy(FILM.TITLE)
    .fetch();

Notice that we’re now getting a result set containing

  • All the films
  • The actors per film as a nested collection
  • The categories per film as another nested collection

The example is using implicit joins to avoid some of the more verbose join syntaxes inside of the MULTISET expressions, but that’s not strictly relevant for this example.

The result being something like:

+----------------------------+--------------------------------------------------+----------------------------+
|title                       |multiset                                          |multiset                    |
+----------------------------+--------------------------------------------------+----------------------------+
|Film[title=ACADEMY DINOSAUR]|[Actor[firstName=PENELOPE, lastName=GUINESS], A...|[Category[name=Documentary]]|
|Film[title=ACE GOLDFINGER]  |[Actor[firstName=BOB, lastName=FAWCETT], Actor[...|[Category[name=Horror]]     |
|Film[title=ADAPTATION HOLES]|[Actor[firstName=NICK, lastName=WAHLBERG], Acto...|[Category[name=Documentary]]|
|Film[title=AFFAIR PREJUDICE]|[Actor[firstName=JODIE, lastName=DEGENERES], Ac...|[Category[name=Horror]]     |
|Film[title=AFRICAN EGG]     |[Actor[firstName=GARY, lastName=PHOENIX], Actor...|[Category[name=Family]]     |
+----------------------------+--------------------------------------------------+----------------------------+

Conclusion

There are many ways that lead to Rome. Classic SQL based approaches to nest collections used some sort of deduplication technique in the client. In jOOQ, you could always do this with fetchGroups(), and since recently also with collect() directly (collect() was always available via an intermediary stream() call).

For truly powerful nesting of collections, however, our recommendation is always to move your logic into SQL directly, using native array functionality, or using MULTISET.

Why You Should Use jOOQ With Code Generation

I’m answering many jOOQ questions on Stack Overflow, and a lot of times, the problem has the same cause: people not using jOOQ’s code generator. The main reason people seem not to be using it is that it takes some extra time to set up, but as with anything well designed, the initial investment always pays off.

In this article, I want to briefly summarise what the code generator is, why you should use it, and what the main reason not to use it is (hint: you probably don’t have that case).

What’s the jOOQ code generator?

jOOQ is an internal DSL that lets you write type safe, embedded, dynamic SQL directly in Java, just like you can in PL/SQL, PL/pgSQL, T-SQL, SQL/PSM, and all the other procedural dialects. Being an internal DSL, the big difference is that dynamic SQL is very easy, because jOOQ makes SQL “composable”.

As an internal DSL, it models the SQL language as a chain of Java method calls. The technique behind the fluent API design is described here. A typical jOOQ query looks like this:

var author =
ctx.select(AUTHOR.FIRST_NAME, AUTHOR.LAST_NAME)
   .from(AUTHOR)
   .where(AUTHOR.ID.eq(10))
   .fetchOne();

Now those AUTHOR table references, as well as those AUTHOR.ID column references are generated code. They are Java representations of your database meta data, and include a variety of information:

  • The catalog / schema they belong to
  • The table they belong to (if it’s a column)
  • The data type
  • Some attached converters and data type bindings
  • Other useful meta data, such as constraints (primary key, unique key, foreign key), identities, default expressions, generated expressions, etc.
  • Auxiliary DSL methods, such as type safe table aliasing methods

As you can see, you get a lot of out of the box features!

What’s the price to pay for code generation?

As I understand it, a lot of users (especially new ones) are eager to try out jOOQ immediately and don’t want to bother setting up the code generator, which seems to be an unnecessary step at first. E.g. when using JAXB, you don’t have to use XJC to generate annotated bindings for your XSD files (you might not even have an XSD). But with jOOQ, it’s different.

Yes, the manual can be overwhelming, but the tutorial isn’t, and neither are the examples. You don’t even have to use the Maven or Gradle plugins for your first attempts with jOOQ. Why not just use jbang? With just a few lines of CLI, you can start right away.

To try out basic things, you can also use the DDLDatabase, which generates code from SQL files directly, without a live connection to an actual database. Not that a live connection is much of an issue anymore: with testcontainers.org and/or Docker, you can spin up a sample database in a matter of seconds.

What I’m trying to say here is: Don’t be overwhelmed. Yes, there’s some initial investment, but it will pay off massively (see below), will prevent tons of problems (see below), and it just requires a few more minutes of setting up your project. Heck, you could even start playing around with the bug reporting template, which includes code generation configuration for Java, Scala, and Kotlin!

What do you get in return?

Let’s start with the obvious:

1. Compile time type safety

You already get some degree of compile time type safety from using jOOQ’s DSL alone: namely, you cannot mistype keywords, such as SELECT, or forget your parentheses around an IN predicate, such as A IN (B, C). And you get auto-completion for your SQL syntax.

But without code generation, you won’t get type safety for your schema objects. This is a big deal, and once you’ve gotten used to it, you will never want to miss it again!

Every table, every column, every constraint, every index, every sequence, every procedure, every function, every package, every schema, every catalog, every domain, every enum type, every object type gets a generated Java representation in jOOQ. You don’t have to go to your ERD, or database, or wherever it is to look up your schema. You have it right there in your code. As a first step, you can use this for auto completion in your favourite IDE!

Not just that. You can also never make a mistake with your meta data:

Yep. No more typos. But also! No more regressions when someone renames a column in the database. As soon as a column name changes, or is deleted, your Java code stops compiling.

No more wrong data types. See the above screenshot? It says that FIRST_NAME is a TableField<AuthorRecord, String>. So your Java compiler, through jOOQ’s elaborate generic API, will already know that you’re projecting a String column. It’s not 2003 anymore. We have generics to prevent type casts or unsafe conversions. Check out the compilation error on this:
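A minimal sketch of the kind of query that no longer compiles, assuming the generated AUTHOR table:

ctx.select(AUTHOR.FIRST_NAME, AUTHOR.LAST_NAME)
   .from(AUTHOR)

   // Compilation error: FIRST_NAME is a Field<String>, so eq() expects a
   // String (or another Field<String>), not an int
   .where(AUTHOR.FIRST_NAME.eq(42))
   .fetchOne();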

Why ever worry about such data type problems again? You already declared the data type once, when you created the table:

CREATE TABLE author (
  id INTEGER NOT NULL PRIMARY KEY,
  first_name TEXT NOT NULL,
  last_name TEXT NOT NULL,
  ...
);

Why go through all that busywork of re-declaring the data type again in Java? In fact, why go through all that busywork of re-declaring anything of the above again in Java? No one enjoys typing all that stuff twice or keeping it in sync. In fact, it’s an outright bad idea. There should only be one source of truth for your meta model, and that’s the DDL you ran in your database to create the schema.

Sure, you can have other representations of your data, e.g. some DDD version, some JSON version for your frontend, whatever. But whenever you query your database, there’s absolutely no reason not to use the exact meta model which you’ve already declared in your database.

I think that’s kind of obvious? Why waive all that for a little (initial) extra setup convenience?

2. Schema introspection

You don’t just get that benefit when writing your queries. You also get it when reading them. You can always quickly navigate to the column you’re interested in and read its SQL comments in the Javadoc. Maybe you have this specified? (Totally underused SQL feature!)

COMMENT ON COLUMN author.first_name IS 'The author''s first name';

Now look what jOOQ’s code generator does with it:

Seems obvious, no? Or, you want to see who is using this column? Just check the call hierarchy or whatever IDE search tool you’re interested in:

This is much better than just text searching for FIRST_NAME, which might be case sensitive and which matches any such string, not just that particular column of the AUTHOR table.

3. Runtime meta model: Data types

In fact, you won’t just profit from this type safety at compile time, but also at runtime. Trust me. There are so many edge cases in some JDBC drivers and/or SQL dialects where the data type must be communicated to the database explicitly.

Sure, you can write something like this in jOOQ, and it works:

var author =
ctx.select(field("author.first_name"), field("author.last_name"))
   .from("author")
   .where(field("author.id").eq(10))
   .fetchOne();

The above is using plain SQL templates, a feature that is normally used by users to extend jOOQ with custom functionality. Sure, it can be used instead of using code generation, too, and occasionally, that’s just the right thing to do.

When bind variables have insufficient type contexts

But there are many cases where you should give the SQL engine more context about the bind value. For example, in Derby (or older versions of Db2), you can’t just do this:

select null
from sysibm.sysdummy1;

You’ll get an error:

SQL Error [30000] [42X01]: Syntax error: Encountered “null” at line 1, column 8.

Instead, you have to CAST:

select cast(null as int)
from sysibm.sysdummy1;

jOOQ always adds those casts for you whenever we find out, through our extensive integration tests, that a given situation in a given dialect might require one. But jOOQ can only do that if you provide it with a type. Often you do, implicitly, but sometimes you don’t, and then you have to debug a query that used to work before. That can be very unexpected.
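With generated code, the column’s data type always travels with the column, so this context is never missing. Without generated code, you can still attach the type explicitly, e.g. to a NULL bind value. A minimal sketch:

// Assuming these static imports
import static org.jooq.impl.DSL.val;
import static org.jooq.impl.SQLDataType.INTEGER;

// The explicit data type allows jOOQ to render a cast (e.g. cast(? as int))
// in dialects that need one
ctx.select(val(null, INTEGER)).fetch();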

When a Field<Object> leads to compilation errors

This one is tricky. The Java language version 8 has made a peculiar decision around overload resolution with generics. Assuming you have overloads like this (and jOOQ is full of these, for convenience reasons):

public <T> void setValue(Parameter<T> parameter, T value) {}
public <T> void setValue(Parameter<T> parameter, Field<T> value) {}

Then Java 8 introduced a backwards incompatible change in compilation behaviour.

While this works perfectly fine:

Parameter<String> p1 = ...
Field<String> f1 = ...
setValue(p1, f1);

This call doesn’t compile:

Parameter<Object> p2 = ...
Field<Object> f2 = ...
setValue(p2, f2);

It seems that Field<Object> is not more specific than Object in this case, which seems weird, but that’s how the JLS designed generic overload resolution. In fact, it’s still more specific, but another rule declares the two types as ambiguous before we get to the specificity resolution. Read the above link for details.

Let’s look again at the previous jOOQ query that doesn’t use code generation:

Record2<Object, Object> author =
ctx.select(field("author.first_name"), field("author.last_name"))
   .from("author")
   .where(field("author.id").eq(10)) // Field<Object>
   .fetchOne();

It’s all full of Field<Object>, so you’ll run into this problem eventually, trust me, mostly with the UPDATE .. SET clause. And it will be quite the puzzler.

The workaround is to always attach type information to your columns:

// Assuming this static import
import static org.jooq.impl.SQLDataType.*;

Record2<String, String> author =
ctx.select(
        field("author.first_name", VARCHAR),
        field("author.last_name", VARCHAR))
   .from("author")
   .where(field("author.id", INTEGER).eq(10)) // Field<Integer>
   .fetchOne();

But why do all that, when the code generator could have already done it for you, automatically? And what if you change your AUTHOR.ID column to be a BIGINT instead of an INTEGER? Will you manually update every query?

4. Emulations

Some emulations are not possible without generated meta data. They mostly involve the emulation of the INSERT .. RETURNING syntax, which relies on knowing both primary key and identity information, but other syntaxes may be affected as well.

You wouldn’t believe how many times I’ve supported users who then wanted to use internal API to tell jOOQ about primary key or identity meta information in convoluted ways to make this work, rather than just generate code. With generated code, the following query works in a lot of RDBMS out of the box. Out. Of. The. Box:

AuthorRecord author =
ctx.insertInto(AUTHOR, AUTHOR.FIRST_NAME, AUTHOR.LAST_NAME)
   .values("John", "Doe")
   .returning(AUTHOR.ID)
   .fetchOne();

Behind the scenes, depending on your dialect, jOOQ figures out:

  • Can this be run in a single query (e.g. in Firebird, Oracle, PostgreSQL as above, in SQL Server using OUTPUT, in Db2, H2 using data change delta tables)?
  • Can it be run in a single query using JDBC’s Statement.getGeneratedKeys()?
  • Does it require a second query to fetch the identity and/or other columns?

jOOQ can always find a way for you in this case, but only if it knows which column is the identity.

5. Converters

Basic SQL only really has a few built-in data types:

  • Various numeric types
  • Various string and binary string types
  • Various temporal types

But is it a good idea to work with these low level types, everywhere throughout your application? Do you want to use BigDecimal for your monetary amounts all over the place? Or would it be better to create a more useful Amount domain class? One where you can attach functionality, currency, etc.?

With jOOQ, you can use converters for that (or bindings, if that affects how you bind the values to JDBC). Have a fancy FirstName class that models various types of first names?

/**
 * Not just any ordinary first name string!
 */
record LeFirstName(String firstName) {}

Yes, you could insist on still not using code generation and attach that to your plain SQL templating query:

Record2<LeFirstName, String> author =
ctx.select(
        field("author.first_name", VARCHAR.asConvertedDataType(
            LeFirstName.class, LeFirstName::new, LeFirstName::firstName
        )), 
        field("author.last_name", VARCHAR))
   .from("author")
   .where(field("author.id", INTEGER).eq(10))
   .fetchOne();

Yes, you could extract that field definition in an auxiliary class for better reuse:

class LeAuthor {
    static Field<LeFirstName> firstName = field("author.first_name", 
        VARCHAR.asConvertedDataType(
            LeFirstName.class, LeFirstName::new, LeFirstName::firstName
        ));
    static Field<String> lastName = field("author.last_name", VARCHAR);
}

Record2<LeFirstName, String> author =
ctx.select(LeAuthor.firstName, LeAuthor.lastName)
   .from("author")
   .where(field("author.id", INTEGER).eq(10))
   .fetchOne();

And now, it looks almost like jOOQ’s code generator was used. Except it wasn’t. Why not? You can attach your converter to all of your FIRST_NAME columns of all your tables automatically with just a bit of extra configuration in the code generator, so you’ll never forget it. That’s especially valuable if your converter implements some must-use logic, such as hashing or encryption. So, the above example would just be:

// After tweaking the codegen configuration, you now
// get your custom type whenever you reference FIRST_NAME
Record2<LeFirstName, String> author =
ctx.select(AUTHOR.FIRST_NAME, AUTHOR.LAST_NAME)
   .from(AUTHOR)
   .where(AUTHOR.ID.eq(10))
   .fetchOne();
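The bit of extra code generator configuration could look like this. A sketch using the programmatic configuration API (org.jooq.meta.jaxb); the LeFirstNameConverter class is hypothetical and would wrap the two method references shown earlier:

import org.jooq.meta.jaxb.*;

Configuration configuration = new Configuration()
    .withGenerator(new Generator()
        .withDatabase(new Database()
            .withForcedTypes(new ForcedType()

                // Apply the user type and converter to every matching column
                .withUserType("com.example.LeFirstName")
                .withConverter("com.example.LeFirstNameConverter")
                .withIncludeExpression(".*\\.first_name"))));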

Why write manually what the machine can write for you? You wouldn’t manually write other derived code, such as byte code or assembly, either.

6. Type safe aliasing

Have a complex query and want to alias your tables? No problem with jOOQ. Just declare the aliases up front and you’re all set:

// Declare the alias:
var a = AUTHOR.as("a");

// Use the alias. Columns are still there, type safe
var author =
ctx.select(a.FIRST_NAME, a.LAST_NAME)
   .from(a)
   .where(a.ID.eq(10))
   .fetchOne();

While aliasing also works without code generation (everything works without code generation, which is just convenience for more complex API calls), you won’t get type safe access to your column names from aliases like the above. And you can still use the call hierarchy in your IDE to check where the FIRST_NAME column is referenced:

Or get compilation errors on typos, or auto completion, etc. etc. What’s not to love?

7. Implicit joins

One of my favourite jOOQ features is implicit joins. Look at this jOOQ query:

ctx.select(
          BOOK.author().FIRST_NAME,
          BOOK.author().LAST_NAME,
          BOOK.TITLE,
          BOOK.language().CD.as("language"))
   .from(BOOK)
   .fetch();

Starting from a child table (e.g. BOOK), jOOQ can implicitly (left) join your parent tables AUTHOR and LANGUAGE, because life’s too short to repeatedly type out all the same joins and join predicates again and again. The above is equivalent to:

ctx.select(
          AUTHOR.FIRST_NAME,
          AUTHOR.LAST_NAME,
          BOOK.TITLE,
          LANGUAGE.CD.as("language"))
   .from(BOOK)
   .leftJoin(AUTHOR).on(BOOK.AUTHOR_ID.eq(AUTHOR.ID))
   .leftJoin(LANGUAGE).on(BOOK.LANGUAGE_ID.eq(LANGUAGE.ID))
   .fetch();

It’s a matter of taste and style, yes. You don’t have to use implicit joins everywhere. Some joins cannot be expressed as implicit joins, but very often, this style greatly simplifies your queries by reducing noise.

And it’s only possible with generated code!

Why? Because those BOOK.author() and BOOK.language() methods are generated, and they return again instances of generated tables containing generated columns (with implicit table aliases), and data types, and possibly converters, and what not.

Are you still categorically excluding working with the code generator just because it takes a little extra time to set up?

8. CRUD

jOOQ isn’t a full fledged ORM like JPA implementations, but you can still get some convenience to avoid having to spell out CRUD queries all the time. jOOQ calls these UpdatableRecord.

When you fetch the UpdatableRecord like this:

BookRecord book =
ctx.selectFrom(BOOK)
   .where(BOOK.ID.eq(17))
   .fetchOne();

Then, you can change its values (e.g. in a rich client), and store the record again to generate an UPDATE statement:

book.setTitle("New title");
book.store();

There are a few opt in extra features, like optimistic locking, and you guessed it:

All available only with generated code.
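For example, optimistic locking is a simple opt in via Settings. A sketch (the version or timestamp column itself would be configured in the code generator, e.g. via recordVersionFields):

import org.jooq.conf.Settings;

DSLContext ctx = DSL.using(connection, SQLDialect.POSTGRES,
    new Settings().withExecuteWithOptimisticLocking(true));

BookRecord book =
ctx.selectFrom(BOOK)
   .where(BOOK.ID.eq(17))
   .fetchOne();

book.setTitle("New title");

// Throws DataChangedException if the row was changed concurrently
book.store();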

9. Stored procedures

If you have those (and I highly recommend you do if only for performance reasons), then using code generation is even more of a no brainer.

If you have, in PostgreSQL:

CREATE FUNCTION p_count_authors_by_name (
  author_name VARCHAR, 
  OUT result INTEGER
)
AS $$
DECLARE
  v_result INT;
BEGIN
  SELECT COUNT(*)
  INTO v_result
  FROM author
  WHERE first_name LIKE author_name
  OR last_name LIKE author_name;

  result := v_result;
END;
$$ LANGUAGE plpgsql;

Now jOOQ’s code generator generates the following routine call for you:

public static Integer pCountAuthorsByName(
      Configuration configuration
    , String authorName
) {
    // Do you really care?
}

Yep, that’s it. Do you really care what happens behind the scenes? This binding logic is about as exciting as sorting your socks. It’s the same story every time you have some RPC technology. We didn’t care about the implementation in CORBA, with SOAP, with RMI, with remote EJBs, and we don’t care how to bind to a stored procedure.

It’s just some remote callable that accepts a String and returns an Integer. Code generation is your friend here, again. And yes, as always, if you think this function is missing a second parameter, you’ll just add it and your client code stops compiling offering you a chance to refactor!
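Calling it is then a one liner. A sketch, assuming the generated Routines class:

// Count the authors whose first or last name starts with "A"
Integer count = Routines.pCountAuthorsByName(ctx.configuration(), "A%");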

And the best thing is: This supports all of your weird edge cases including:

  • SQL types (enums, arrays, object types, UDTs)
  • PL/SQL types (record types, table types, associative arrays)
  • Ref cursor types
  • Table valued functions

10. Multitenancy

A cool feature in jOOQ is that jOOQ supports schema level multi tenancy out of the box: you can dynamically rename your catalogs, schemas, and table names at runtime, depending on the tenant (e.g. the current user).

This feature is called schema mapping, and it obviously doesn’t work if you use plain SQL templates without code generation (because a plain SQL template can contain any kind of SQL string, that’s what templates are for).

Not just that. With generated code, all of your objects are fully qualified by default, and you can omit qualification when you like. With your own templates, you can’t easily change that. Want to port your schema from A to B? Well, you’ll hand-rewrite everything, good luck. With jOOQ’s generated code, it’s just flags, configurations, etc.
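A minimal sketch of such a runtime schema mapping via Settings (the schema names are illustrative):

import org.jooq.conf.MappedSchema;
import org.jooq.conf.RenderMapping;
import org.jooq.conf.Settings;

Settings settings = new Settings()
    .withRenderMapping(new RenderMapping()
        .withSchemata(new MappedSchema()

            // Render TENANT_42 wherever generated code references SAKILA
            .withInput("SAKILA")
            .withOutput("TENANT_42")));

DSLContext ctx = DSL.using(connection, SQLDialect.POSTGRES, settings);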

11. Embedded types

Remember converters? Even more sophisticated are embedded types, i.e. jOOQ’s way of wrapping multiple database columns into a single client side value, pretending your database supports UDTs (such as Oracle, PostgreSQL).

Because when you work with currencies, what you really want to do is combine an AMOUNT column with a CURRENCY column, because after all, USD 1.00 isn’t really the same thing as EUR 1.00, nor can they be directly compared or combined in arithmetics.

Embedded types are currently only available using the code generator, which produces all the meta data required for jOOQ’s runtime to map / unmap your flat result sets.

This is especially powerful when you use:

  • Embedded keys (the only thing you ever compare a BOOK.AUTHOR_ID with is AUTHOR.ID, or maybe another foreign key that points to the same primary key, so why not use a type to enforce type safety?)
  • Embedded domains (you already declared a semantic type in your database, so you want to reuse that name also in client code)

12. Data change management

Another thing that people often don’t want to set up correctly and cleanly up front, because it does take some extra time, is data change management (e.g. using Flyway or Liquibase). And the regrets are similar once the project grows.

Using jOOQ’s code generator kind of forces you to also think about data change management early on, and that is a good thing!

Your data model changes over time, and so does your client code working on that model. jOOQ’s code generation process should be embedded in your CI/CD pipeline:

  • You automatically apply a change in your model
  • You automatically re-generate your code
  • You automatically run your integration tests
  • You automatically deploy the tested result as a unit

Here’s how to do all of that with testcontainers and jOOQ, see an example in action here.

Let me stress this once more.

If you don’t cleanly set up your CI/CD pipeline including your database early on in your project, you’ll do it later on, and it’ll be much harder. But if you do it, then adding jOOQ’s code generation to the game is a no-brainer, and you get everything above for free!

Conclusion

There are many other minor things that you get when you use jOOQ’s code generator. It’s such a powerful tool to get the most out of both jOOQ and your database. It actively encourages you to use all of your RDBMS features, including:

  • Complex models (because joining is much simpler once you have all the meta data, and even tools like implicit joins)
  • Views (those are generated too)
  • Data types (generated too)
  • Procedures (same)
  • Etc.

Using code generation in jOOQ is a clear path towards becoming a power user of your own database, which you (hopefully) design with a lot of love and craftsmanship.

Exception

I promised an exception. You can’t use the code generator when your schema is dynamic (i.e. unknown at compile time). But only a few systems are designed like that.

Fun with PostGIS: Mandelbrot Set, Game of Life, and More

The upcoming jOOQ 3.16 will finally offer support for the various RDBMS GIS extensions via issue #982. This is great news per se, and will be covered in a future blog post, when the integration is ready. This post here is about something else.

Adding support for such a feature is a great source of procrastination. You see, when you develop a backend library, all you ever do is implement algorithms, API, and other abstract stuff. But GIS is different. With GIS, a SQL developer like me suddenly has access to a “UI”, and having access to a “UI” woke the inner programmer child in me.

I was pleasantly surprised that DBeaver, my preferred SQL editor, has out of the box support for GIS’s WKT format. E.g. run a query like this:

SELECT st_geomfromtext('polygon((0 0, 2 3, 0 6, -2 3, 0 0))');

The output being

Let’s play around with GIS

So, naturally, what other thing to do than play around with it? The most obvious next thing to generate would be your favourite logo, which happens to be very easy to map to polygons. Let’s look at a few GIS features step by step. For this blog post, I’ll be using PostGIS, which you can get very easily from docker hub.

WITH
  -- These "sprites" are just reusable 2D polygons, which I declared
  -- to avoid having to deal with the lengthy WKT format
  sprites AS (
    SELECT 

      -- We project the original 1x1 square, and a scaled 4x4 version
      s AS square,
      st_scale(s, 4, 4) AS square4

    -- Here, we're creating a square called "s"
    FROM (VALUES 
      (st_polygonfromtext('polygon ((0 0, 1 0, 1 1, 0 1, 0 0))'))
    ) t (s)
  )

-- st_union combines 2 polygons into a multipolygon
SELECT st_union(square, st_translate(square4, 2, 0))
FROM sprites

The output of the above query is

MULTIPOLYGON (((0 0, 1 0, 1 1, 0 1, 0 0)), ((2 0, 6 0, 6 4, 2 4, 2 0)))

Or, using DBeaver’s visualisation utility:

The st_union isn’t much different from any other set union. Note that I’ve translated the larger square 2 units to the right, so they don’t overlap. Otherwise, the union would have just been the bigger square.

A polygon describes a set of points in the 2D number plane. Creating a union of those two sets of points is natural. We can also create a difference of the two squares, instead:

WITH
  sprites AS (
    SELECT 
      s AS square,
      st_scale(s, 4, 4) AS square4
    FROM (VALUES 
      (st_polygonfromtext('polygon ((0 0, 1 0, 1 1, 0 1, 0 0))'))
    ) t (s)
  )
SELECT st_difference(square4, square)
FROM sprites

Resulting in:

POLYGON ((0 4, 4 4, 4 0, 1 0, 1 1, 0 1, 0 4))

Or, visually:

With these simple tools, we can now create all 4 letters of the jOOQ logo. As a reminder, the tools were:

  • st_polygonfromtext: Create a polygon from a WKT text representation
  • st_scale: Scale any geometry to a new size
  • st_translate: Translate any geometry to a new position
  • st_union: Combine two geometries (or more, if used as an aggregate function)
  • st_difference: Remove one geometry from another

The GIS API is vast, both in PostGIS as well as in the ISO/IEC 13249-3:2016 standard version, but these simple tools shall suffice for now. Let’s look at this query:

WITH
  -- Creating the two squares to work with
  sprites AS (
    SELECT 
      s AS square,
      st_scale(s, 4, 4) AS square4
    FROM (VALUES 
      (st_polygonfromtext('polygon ((0 0, 1 0, 1 1, 0 1, 0 0))'))
    ) t (s)
  ),
  
  -- All the 4 letters j, o1, o2, q of the jOOQ logo
  letters AS (
    SELECT
    
      -- The letter j
      st_difference(
        st_difference(
          st_difference(
            square4, 
            st_translate(st_scale(square, 2, 3), 1, 1)
          ),
          st_translate(st_scale(square, 1, 2), 0, 2)
        ),
        st_translate(st_scale(square, 1, 0.5), 3, 2.5)
      ) AS j,
      
      -- The letter o
      st_difference(square4, st_translate(square, 1, 1)) AS o1,
      
      -- The letter o
      st_difference(square4, st_translate(square, 2, 2)) AS o2,
      
      -- The letter q
      st_union(
        st_difference(
          square4, 
          st_translate(st_scale(square, 2, 2), 1, 1)
        ),
        st_translate(st_scale(square, 1, 2.5), 1.5, -1)
      ) as q
    FROM sprites
  )
SELECT

  -- Combine all the 4 letters
  st_union(v)
FROM
  letters,
  
  -- Arrange the 4 letters next to each other
  LATERAL (VALUES
    (st_translate(j, 0, 5)),
    (st_translate(o1, 5, 5)),
    (o2),
    (st_translate(q, 5, 0))
  ) t (v);

This produces:

Neat, huh?

Next step: The Mandelbrot Set

A natural next step for any procrastinator is to generate the Mandelbrot set. To prepare ourselves for what’s behind the Mandelbrot, have a look at this neat video (more procrastination):

There are different ways to calculate it with SQL, here’s one that I came up with:

WITH RECURSIVE

  -- These are the dimensions that you can play around with
  dims (r1, r2, i1, i2, s, it, p) AS (
    VALUES (
      
      -- The dimensions of the real axis
      -2::float, 1::float, 
      
      -- The dimensions of the imaginary axis
      -1.5::float, 1.5::float, 
      
      -- The step size on each axis, per pixel or sprite
      0.01::float, 
      
      -- The maximum number of iterations
      100, 
      
      -- "Infinity", i.e. when to stop
      256.0::float
    )
  ),
  
  -- The square again, as before
  sprites (s) AS (VALUES 
    (st_polygonfromtext('polygon ((0 0, 0 1, 1 1, 1 0, 0 0))'))
  ),
  
  -- The number plane, as ints
  n1 (r, i) AS (
    SELECT r, i 
    FROM 
      dims, 
      generate_series((r1 / s)::int, (r2 / s)::int) r,
      generate_series((i1 / s)::int, (i2 / s)::int) i
  ),
  
  -- The number plane as scaled floats
  n2 (r, i) AS (
    SELECT r::float * s::float, i::float * s::float
    FROM dims, n1
  ),
  
  -- The recursive calculation of the Mandelbrot formula
  -- zn = (zn-1)^2 + c
  l (cr, ci, zr, zi, g, it, p) AS (
    SELECT r, i, 0::float, 0::float, 0, it, p FROM n2, dims
    UNION ALL
    SELECT cr, ci, zr*zr - zi*zi + cr, 2*zr*zi + ci, g + 1, it, p 
    FROM l
    
    -- The recursion stops when we reach the maximum number of iterations
    WHERE g < it
    
    -- Or, when we reach "infinity"
    AND zr*zr + zi*zi < p
  ),
  
  -- Find the last calculated value per point in the
  -- complex number plane c (cr, ci), discard the others
  x (cr, ci, zr, zi, g) AS (
    SELECT DISTINCT ON (cr, ci) cr, ci, zr, zi, g
    FROM l
    ORDER BY cr, ci, g DESC
  )
  
-- Turn every calculated point into a square
SELECT 
  st_union(
    st_translate(sprites.s, round(cr / dims.s), round(ci / dims.s))
  )
FROM x, sprites, dims

-- Retain only the points *not* belonging to the Mandelbrot set
-- You can also inverse the equation to retain the points that
-- belong to the set
WHERE zr*zr + zi*zi > p;

Note, I’m using an artificial “infinity” here, because:

  1. That speeds things up with little noticeable difference at this zoom scale
  2. I couldn’t figure out how to make PostgreSQL overflow float operations to float infinities, like Java or CockroachDB or other IEEE 754 float implementations do. Any help appreciated, see this Stack Overflow question.
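
To illustrate the second point, here’s a minimal sketch (the exact error message may differ between versions):

-- PostgreSQL rejects float overflow with an error like
-- "value out of range: overflow", instead of returning 'Infinity'
SELECT 1e308::float * 10::float;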

The output is the well known shape

You can play around with this and use different values for “dims” to zoom in, e.g.

dims (r1, r2, i1, i2, s, it, p) as (values (
  (-0.925-0.032)::float, (-0.925+0.032)::float, 
  (0.266-0.032)::float, (0.266+0.032)::float, 
  0.00005::float, 100, 256.0::float)
)

This will generate some really neat zooms, all with SQL:

Granted, it’s not the most efficient way to calculate these things, but that wasn’t the point here, was it?

Game of Life

I can be nerd-sniped:

And of course, I had to accept that challenge too! Now, the Game of Life is a simple “game” where we have the x,y natural number plane (e.g. “pixels”), and with each coordinate, we associate whether that “cell” is dead or alive. Then, we establish the following set of rules:

  1. Any live cell with two or three live neighbours survives.
  2. Any dead cell with three live neighbours becomes a live cell.
  3. All other live cells die in the next generation. Similarly, all other dead cells stay dead.

Now, it’s hard to “animate” things in SQL using spatial extensions, so as a workaround, I’ll just display the iterations of the Game of Life as 100×100 pixel tiles next to each other. The first iteration is just the starting point, e.g. a random number of “alive” cells:

WITH RECURSIVE

  -- Again, generate a "sprite" from a square polygon
  sprites (s) AS (VALUES (
    st_polygonfromtext('polygon ((0 0, 0 1, 1 1, 1 0, 0 0))')
  )),

  -- Generate the x, y number plane and associate a boolean value
  -- with each coordinate, generated randomly.
  -- 10% of all cells are alive
  m (x, y, b) AS (
    SELECT x, y, 
      CASE WHEN random() > 0.9 THEN 1 ELSE 0 END 
    FROM 
      generate_series(1, 100) x, 
      generate_series(1, 100) y
  )
SELECT st_union(st_translate(sprites.s, x::float, y::float))
FROM m, sprites

-- Display only "alive" cells
WHERE b = 1

This will produce something like:

Now, all we have to do is iterate the game formula. Recursive SQL has quite a few limitations, though. E.g. I couldn’t get PostgreSQL to aggregate recursive data, nor self-join the recursive table to find the nearest neighbors of any cell. But with window functions, this is definitely possible. So here goes:

WITH RECURSIVE
  sprites (s) AS (VALUES (
    st_polygonfromtext('polygon ((0 0, 0 1, 1 1, 1 0, 0 0))')
  )),
  m (x, y, b) AS (
    SELECT x, y, 
      CASE WHEN random() > 0.9 THEN 1 ELSE 0 END
    FROM 
      generate_series(1, 100) x, 
      generate_series(1, 100) y
  ),
  
  -- This is the recursion of the Game of Life
  e (x, y, b, g) AS (
  
    -- The recursion starts with the above initial data
    SELECT x, y, b, 1 FROM m
    UNION ALL
    SELECT
      e.x, e.y,
      CASE
      
        -- If this cell is alive, and 2 or 3
        -- neighbors are alive, too, then this
        -- cell stays alive
        WHEN e.b = 1 AND
          sum(e.b) OVER w1
        + sum(e.b) OVER w2
        + sum(e.b) OVER w3 IN (2, 3)
          THEN 1
          
        -- If this cell is dead, and 3 neighbors
        -- are alive, then this cell becomes alive
        WHEN e.b = 0 AND
          sum(e.b) OVER w1
        + sum(e.b) OVER w2
        + sum(e.b) OVER w3 = 3
          THEN 1
          
        -- Otherwise, this cell dies or stays dead
        ELSE 0
      END,
      e.g + 1
    FROM e
    
    -- The recursion stops after 100 iterations
    WHERE e.g < 100
    WINDOW
    
      -- We order the data set by x, y. In SQL, there isn't 
      -- a notion of a 2-dimensional number plane. We only
      -- have 1 dimension.
      o AS (ORDER BY x, y),
      
      -- w1 is all the neighbors of the previous row in the x, y
      -- plane, i.e. all the SQL rows that are between 99 and 101
      -- rows before the current row.
      w1 AS (o ROWS BETWEEN 101 PRECEDING AND  99 PRECEDING),
      
      -- w2 is all the neighbors of the current row in the x, y
      -- number plane, excluding the current cell
      w2 AS (o ROWS BETWEEN   1 PRECEDING AND   1 FOLLOWING 
               EXCLUDE CURRENT ROW),
               
      -- w3 is all the neighbors of the next row
      w3 AS (o ROWS BETWEEN  99 FOLLOWING AND 101 FOLLOWING)
  )

-- Finally, combine all the iterations  
SELECT st_union(st_translate(
  sprites.s, 
  (x + (g - 1) % 10 * 120)::float, 
  (y - (g - 1) / 10 * 120)::float
))
FROM e, sprites
WHERE b = 1
;

Think of the window specifications as follows:

----+-------------+-------------+-------------+-------------+----
... | 105: (1, 5) | 106: (1, 6) | 107: (1, 7) | 108: (1, 8) | ...
... | 205: (2, 5) | 206: (2, 6) | 207: (2, 7) | 208: (2, 8) | ...
... | 305: (3, 5) | 306: (3, 6) | 307: (3, 7) | 308: (3, 8) | ...
... | 405: (4, 5) | 406: (4, 6) | 407: (4, 7) | 408: (4, 8) | ...

So, in a 100×100 grid, x=3,y=7 can be encoded as 307, and its neighbors are

  • w1: 206, 207, 208, i.e. between 101 preceding and 99 preceding of 307
  • w2: 306, 308, i.e. between 1 preceding and 1 following of 307
  • w3: 406, 407, 408, i.e. between 99 following and 101 following of 307
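
If you want to convince yourself of these offsets, the following small sketch (not part of the actual query) shows how ORDER BY x, y maps the 2D coordinates onto consecutive row numbers:

-- With this ordering, the cell at (x, y) gets rn = (x - 1) * 100 + y,
-- so (x - 1, y) sits exactly 100 rows earlier, and its horizontal
-- neighbours sit 99 and 101 rows earlier
SELECT x, y, row_number() OVER (ORDER BY x, y) AS rn
FROM 
  generate_series(1, 100) x, 
  generate_series(1, 100) y
ORDER BY rn;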

The output looks like this:

Or, zooming in on iterations 1-4:

Or 21-24:

That’s really cool, huh!

Glider Gun

The nerd-snipe was to animate the glider gun pattern, which just means we’ll have to replace the random generation of the first iteration by something constant.

WITH RECURSIVE
  sprites (s) AS (VALUES (
    st_polygonfromtext('polygon ((0 0, 0 1, 1 1, 1 0, 0 0))')
  )),
  m (x, y, b) AS (
    SELECT x, y, 

      -- Initial data here
      CASE WHEN (x, y) IN (
        (2, 6), (2, 7), (3, 6), (3, 7), (12, 6), (12, 7), (12, 8), 
        (13, 5), (13, 9), (14, 4), (14, 10), (15, 4), (15, 10),
        (16, 7), (17, 5), (17, 9), (18, 6), (18, 7), (18, 8), 
        (19, 7), (22, 4), (22, 5), (22, 6), (23, 4), (23, 5), 
        (23, 6), (24, 3), (24, 7), (26, 2), (26, 3), (26, 7), 
        (26, 8), (36, 4), (36, 5), (37, 4), (37, 5)
      ) THEN 1 ELSE 0 END 
    FROM 
      generate_series(1, 100) x, 
      generate_series(1, 100) y
  ),
  e (x, y, b, g) AS (
    SELECT x, y, b, 1 FROM m
    UNION ALL
    SELECT
      e.x, e.y,
      CASE
        WHEN e.b = 1 AND
          sum(e.b) OVER w1
        + sum(e.b) OVER w2
        + sum(e.b) OVER w3 IN (2, 3)
          THEN 1
        WHEN e.b = 0 AND
          sum(e.b) OVER w1
        + sum(e.b) OVER w2
        + sum(e.b) OVER w3 = 3
          THEN 1
        ELSE 0
      END,
      e.g + 1
    FROM e
    WHERE e.g < 100
    WINDOW
      o AS (ORDER BY x, y),
      w1 AS (o ROWS BETWEEN 101 PRECEDING AND  99 PRECEDING),
      w2 AS (o ROWS BETWEEN   1 PRECEDING AND   1 FOLLOWING 
               EXCLUDE CURRENT ROW),
      w3 AS (o ROWS BETWEEN  99 FOLLOWING AND 101 FOLLOWING)
  )
SELECT st_union(st_translate(
  sprites.s, 
  (x + (g - 1) % 10 * 120)::float, 
  (y - (g - 1) / 10 * 120)::float
))
FROM e, sprites
WHERE b = 1
;

And, as you can see, the gun works:

It started with this:

And eventually produces the well known gliders:

PostgreSQL 14’s enable_memoize For Improved Performance of Nested Loop Joins

I’ve recently discovered a pleasant new addition to PostgreSQL 14, the new enable_memoize flag that improves the performance of some nested loop joins where statistics hint at this being appropriate. I mean, who can resist this temptation:

Improving query speed by 1000x hints at something very suboptimal having been going on before, and a tool like memoization can be of great help. But will it also help with an “ordinary” join? I wanted to try myself.

What’s memoization?

In a perfect world free of side effects (and SQL is such a perfect world, in theory), memoization means that we can substitute y for f(x) in any computation, given that y = f(x). For example, no matter how many times you calculate UPPER('x'), you’ll always get 'X'. If the calculation of such a function is costly, and there are only a few possible input values, then why not just maintain a hash map that maps all previous input values to their results, and use that to look up known (or at least frequent) values instead of computing them again?

As I’ve shown previously on this blog, Oracle 11g introduced a feature called scalar subquery caching, which you can activate in jOOQ to avoid costly PL/SQL context switches.

In the case of PostgreSQL’s enable_memoize, this can be particularly useful for nested loop joins in SQL, and to reference the above tweet, a lateral join is often executed via a nested loop join.

Turning the feature on and off

I’ve created a schema like this:

CREATE TABLE t AS
SELECT i, i % 5 AS j 
FROM generate_series(1, 100000) AS t(i);

CREATE TABLE u AS
SELECT i, i % 20000 as j 
FROM generate_series(1, 100000) AS t(i);

CREATE INDEX uj ON u(j);

In summary:

  • Both tables t and u have 100000 rows.
  • t.j has only 5 distinct values, each value appears 20000 times.
  • u.j has 20000 distinct values, each value appears 5 times.
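
If you want to double check those cardinalities on your own installation, a quick verification query might look like this:

-- Expecting 5 distinct values in t.j and 20000 distinct values in u.j
SELECT
  (SELECT count(DISTINCT j) FROM t) AS t_j_distinct,
  (SELECT count(DISTINCT j) FROM u) AS u_j_distinct;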

When running this on PostgreSQL 14:

SELECT current_setting('enable_memoize');

I get:

|current_setting|
|---------------|
|on             |

So, the feature is active, which I can also see in an EXPLAIN of the following query:

EXPLAIN
SELECT *
FROM t JOIN u ON t.j = u.j;

The plan is:

|QUERY PLAN                                                            |
|----------------------------------------------------------------------|
|Nested Loop  (cost=0.30..8945.41 rows=496032 width=16)                |
|  ->  Seq Scan on t  (cost=0.00..1443.00 rows=100000 width=8)         |
|  ->  Memoize  (cost=0.30..0.41 rows=5 width=8)                       | 
|        Cache Key: t.j                                                |
|        ->  Index Scan using uj on u  (cost=0.29..0.40 rows=5 width=8)|
|              Index Cond: (j = t.j)                                   |

Without memoization, when joining the two tables like that, I have to look up the 5 matching rows in u once for each of the 100000 rows in t, i.e. 100000 times. But if memoization kicks in, I will have to perform the lookup only 5 times, because there are only 5 distinct values of t.j.
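
To watch the cache at work, you can run EXPLAIN ANALYZE instead of plain EXPLAIN. In PostgreSQL 14, the Memoize node then also reports cache statistics such as hits and misses (the exact figures depend on your data and memory settings):

EXPLAIN ANALYZE
SELECT *
FROM t JOIN u ON t.j = u.j;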

We can play around with execution plans by turning the feature on or off:

SET enable_memoize = ON;
SET enable_memoize = OFF;

When turned off, PostgreSQL seems to choose a hash join or merge join instead, on my machine (between multiple executions, the plan might switch):

|QUERY PLAN                                                         |
|-------------------------------------------------------------------|
|Hash Join  (cost=3084.00..11568.51 rows=499351 width=16)           |
|  Hash Cond: (t.j = u.j)                                           |
|  ->  Seq Scan on t  (cost=0.00..1443.00 rows=100000 width=8)      |
|  ->  Hash  (cost=1443.00..1443.00 rows=100000 width=8)            |
|        ->  Seq Scan on u  (cost=0.00..1443.00 rows=100000 width=8)|


|QUERY PLAN                                                              |
|------------------------------------------------------------------------|
|Merge Join  (cost=9748.11..763846.11 rows=50000000 width=16)            |
|  Merge Cond: (u.j = t.j)                                               |
|  ->  Index Scan using uj on u  (cost=0.29..3848.29 rows=100000 width=8)|
|  ->  Sort  (cost=9747.82..9997.82 rows=100000 width=8)                 |
|        Sort Key: t.j                                                   |
|        ->  Seq Scan on t  (cost=0.00..1443.00 rows=100000 width=8)     |

Let’s Benchmark

We’re using the usual benchmark technique described here:

  • We repeat an operation 25x in mode A and mode B and compare (or more than 25, if it’s a fast operation)
  • We repeat the above 5x to mitigate any warmup and other caching effects

You can run the following benchmark on the above schema yourself, to verify:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 25;
  rec RECORD;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();
    SET enable_memoize = OFF;
  
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT t.*
        FROM t JOIN u ON t.j = u.j
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', r, (clock_timestamp() - v_ts);
    v_ts := clock_timestamp();
    SET enable_memoize = ON;

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT t.*
        FROM t JOIN u ON t.j = u.j
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', r, (clock_timestamp() - v_ts);
    RAISE INFO '';
  END LOOP;
END$$;

On my machine, the results are consistent, decent, not too impressive, but still significant:

Run 1, Statement 1: 00:00:03.763426
Run 1, Statement 2: 00:00:03.401346

Run 2, Statement 1: 00:00:03.769419
Run 2, Statement 2: 00:00:03.375677

Run 3, Statement 1: 00:00:03.771465
Run 3, Statement 2: 00:00:03.374413

Run 4, Statement 1: 00:00:03.769136
Run 4, Statement 2: 00:00:03.398734

Run 5, Statement 1: 00:00:03.772544
Run 5, Statement 2: 00:00:03.375272

I.e. a 10% speedup. Across the whole system, that alone would already be worth it.

Optimising LATERAL

Let’s try optimising LATERAL instead. We could run a query like this:

SELECT *
FROM 
  t, 
  LATERAL (
    SELECT count(*) 
    FROM u 
    WHERE t.j = u.j
  ) AS u(j)

The EXPLAIN of the above is

|QUERY PLAN                                                                       |
|---------------------------------------------------------------------------------|
|Nested Loop  (cost=4.40..3969.47 rows=100000 width=16)                           |
|  ->  Seq Scan on t  (cost=0.00..1443.00 rows=100000 width=8)                    |
|  ->  Memoize  (cost=4.40..4.42 rows=1 width=8)                                  |
|        Cache Key: t.j                                                           |
|        ->  Aggregate  (cost=4.39..4.40 rows=1 width=8)                          |
|              ->  Index Only Scan using uj on u  (cost=0.29..4.38 rows=5 width=0)|
|                    Index Cond: (j = t.j)                                        |

So, we can again cache the computation of the COUNT(*) value for each of the 5 distinct t.j input values, rather than re-calculating this every time. Surely, this must be even better than before?

Benchmark time!

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 25;
  rec RECORD;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();
    SET enable_memoize = OFF;
  
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT *
        FROM 
          t, 
          LATERAL (
            SELECT count(*) 
            FROM u 
            WHERE t.j = u.j
          ) AS u(j)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', r, (clock_timestamp() - v_ts);
    v_ts := clock_timestamp();
    SET enable_memoize = ON;

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT *
        FROM 
          t, 
          LATERAL (
            SELECT count(*) 
            FROM u 
            WHERE t.j = u.j
          ) AS u(j)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', r, (clock_timestamp() - v_ts);
    RAISE INFO '';
  END LOOP;
END$$;

And this time, we can see a significant speedup!

Run 1, Statement 1: 00:00:03.419728
Run 1, Statement 2: 00:00:01.083941

Run 2, Statement 1: 00:00:03.404954
Run 2, Statement 2: 00:00:01.098404

Run 3, Statement 1: 00:00:03.425725
Run 3, Statement 2: 00:00:01.093883

Run 4, Statement 1: 00:00:03.441691
Run 4, Statement 2: 00:00:01.127837

Run 5, Statement 1: 00:00:03.420172
Run 5, Statement 2: 00:00:01.097943

That’s great news! Wait, does this also work for ordinary correlated subqueries? Because the above LATERAL correlated subquery could be rewritten as:

SELECT 
  t.*, 
  (
    SELECT count(*)
    FROM u
    WHERE t.j = u.j
  ) j
FROM t;

Regrettably, the plan doesn’t show memoization:

|QUERY PLAN                                                                   |
|-----------------------------------------------------------------------------|
|Seq Scan on t  (cost=0.00..441693.00 rows=100000 width=16)                   |
|  SubPlan 1                                                                  |
|    ->  Aggregate  (cost=4.39..4.40 rows=1 width=8)                          |
|          ->  Index Only Scan using uj on u  (cost=0.29..4.38 rows=5 width=0)|
|                Index Cond: (j = t.j)                                        |

And the benchmark (you can paste the query into the benchmark logic yourself to reproduce) confirms that there’s no memoization effect:

Run 1, Statement 1: 00:00:03.617562
Run 1, Statement 2: 00:00:03.605765

Run 2, Statement 1: 00:00:03.610084
Run 2, Statement 2: 00:00:03.682064

Run 3, Statement 1: 00:00:03.725952
Run 3, Statement 2: 00:00:03.705622

Run 4, Statement 1: 00:00:03.672669
Run 4, Statement 2: 00:00:03.644612

Run 5, Statement 1: 00:00:03.645741
Run 5, Statement 2: 00:00:03.642717

Could it be that, with this new feature, correlated subqueries will be rewritten to nested loop outer joins in the future? Other optimisers already do this, and we would then effectively have the same feature here as Oracle’s scalar subquery caching.
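
Until optimisers do that rewrite for us, you can do it manually where memoization matters. Here’s a sketch that is equivalent to the correlated subquery above, because the COUNT(*) aggregate always produces exactly one row:

SELECT t.*, u.j
FROM t
CROSS JOIN LATERAL (
  SELECT count(*)
  FROM u
  WHERE t.j = u.j
) AS u(j);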

Conclusion

The feature is turned on by default in PostgreSQL 14. Apart from some additional memory consumption, which might be a small problem if the optimiser is wrong and statistics are off, I don’t see any drawback of this new feature. SQL is a side-effect free (in theory) 4GL, meaning the optimiser can replace any computation by a cached value that depends only on the computation’s input values.

A correlated subquery is a function, whose input parameters are the predicates and other references to the outer query’s columns. As such, the result of a correlated subquery can be cached, or memoized. As shown above, this has drastic effects on your present-day SQL queries, from which you can profit just by upgrading to PostgreSQL 14.

Java’s Checked Exceptions Are Just Weird Union Types

This fun fact has been on my mind for a while, and a recent reddit thread about “Smuggling Checked Exceptions with Sealed Interfaces” made me write this post. Namely, Java had union types before they were cool! (If you squint hard.)

What are union types?

Ceylon is an underrated JVM language that never really took off, which is too bad, because the concepts it introduced are very elegant (see e.g. how it implemented nullable types as syntax sugar on top of union types, which IMO is much better than anything monadic using Option types, or Kotlin’s ad-hoc type system extension).

So, one of those concepts is union types. One of the most popular languages supporting them, currently, is TypeScript, though C++, PHP, and Python also have something similar. (Whether the union type is tagged or not isn’t relevant to this post.)

If you understand Java’s intersection types A & B (meaning that something is both a subtype of A and of B), then it’s easy to understand union types A | B (meaning that something is a subtype of either A or B). TypeScript shows a simple example of this:

function printId(id: number | string) {
  console.log("Your ID is: " + id);
}
// OK
printId(101);
// OK
printId("202");
// Error
printId({ myID: 22342 });

Structural vs nominal typing

Such a union type (or intersection type) is a structural type, as opposed to what we’ve been doing in Java via nominal types, where you have to declare a named type for this union every time you want to use it. E.g. in jOOQ, we have things like:

interface FieldOrRow {}
interface Field<T> extends FieldOrRow {}
interface Row extends FieldOrRow {}

Very soon, the Java 17 distribution will seal the above type hierarchy as follows (incomplete, subtypes of Field<T> and Row are omitted for brevity):

sealed interface FieldOrRow permits Field, Row {}
sealed interface Field<T> extends FieldOrRow permits ... {}
sealed interface Row extends FieldOrRow permits ... {}

There are pros and cons for doing these things structurally or nominally:

Structural typing:

  • Pro: You can create any ad-hoc union of any set of types anywhere
  • Pro: You don’t have to change the existing type hierarchy, which is essential when you don’t have access to it, e.g. when you want to do something like the above number | string type. This kinda works like JSR 308 type annotations that were introduced in Java 8.

Nominal typing:

  • Pro: You can attach documentation to the type, and reuse it formally (rather than structurally). TypeScript and many other languages offer type aliases for this kind of stuff, so you can have a bit of both worlds, though the alias is erased, meaning you keep the complex structural type, leading to curious error messages.
  • Pro: You can seal the type hierarchy to allow for exhaustiveness checking among subtypes (e.g. above, there can only be Field<T> or Row subtypes of FieldOrRow). A structurally typed union type is implicitly “sealed” ad hoc by the union type description (not sure if that’s what it’s called), but with nominal types, you can make sure no one else can extend the type hierarchy (except where you permit it explicitly using the non-sealed keyword).

Ultimately, as ever so often, things like structural and nominal typing are two sides of the same coin, pros and cons mostly depending on taste and on how much you control a code base.

So, how are checked exceptions union types?

When you declare a method that throws checked exceptions, the return type of the method is really such a union type. Look at this example in Java:

public String getTitle(int id) throws SQLException;

The call-site now has to “check” the result of this method call using try-catch, or declare re-throwing the checked exception(s):

try {
    String title = getTitle(1);
    doSomethingWith(title);
}
catch (SQLException e) {
    handle(e);
}

If early Java had union types rather than checked exceptions, we might have declared this as follows, instead:

public String|SQLException getTitle(int id);

Likewise, a caller of this method will have to “check” the result of this method call. There’s no simple way of re-throwing it, so if we do want to re-throw, we’d need some syntax sugar, or repeat the same code all the time, Go-style:

// Hypothetical Java syntax:
String|SQLException result = getTitle(1);

switch (result) {
    case String title -> doSomethingWith(title);
    case SQLException e -> handle(e);
}

It would be obvious how such a JEP 406 style switch pattern matching statement or expression could implement an exhaustiveness check, just like with the existing JEP 409 sealed classes approach, the only difference, again, being that everything is now structurally typed, rather than nominally typed.

In fact, if you declare multiple checked exceptions, such as the JDK’s reflection API:

public Object invoke(Object obj, Object... args)
throws 
    IllegalAccessException, 
    IllegalArgumentException,
    InvocationTargetException

With union types, this would just be this, instead:

// Hypothetical Java syntax:
public Object
    | IllegalAccessException
    | IllegalArgumentException
    | InvocationTargetException invoke(Object obj, Object... args)

And the union type syntax from the catch block, which checks for exhaustiveness (yes, we have union types in catch!)…

try {
    Object returnValue = method.invoke(obj);
    doSomethingWith(returnValue);
}
catch (IllegalAccessException | IllegalArgumentException e) {
    handle1(e);
}
catch (InvocationTargetException e) {
    handle2(e);
}

Could still check for exhaustiveness with the switch pattern matching approach:

// Hypothetical Java syntax:
Object
    | IllegalAccessException
    | IllegalArgumentException
    | InvocationTargetException result = method.invoke(obj);

switch (result) {
    case IllegalAccessException, 
         IllegalArgumentException e -> handle1(e);
    case InvocationTargetException e -> handle2(e);
    case Object returnValue -> doSomethingWith(returnValue);
}

A subtle caveat here is that exceptions are subtypes of Object, so we must put that case at the end, as it “dominates” the others (see JEP 406 for a discussion about dominance). Again, we can prove exhaustiveness, because all types that are involved in the union type have a switch case.

Can we emulate union types with checked exceptions?

You know what Jeff Goldblum would say

But this blog is known to do it anyway. Assuming that for every possible type, we had a synthetic (code generated?) checked exception that wraps it (because in Java, exceptions are not allowed to be generic):

// Use some "protective" base class, so no one can introduce 
// RuntimeExceptions to the type hierarchy
class E extends Exception {

    // Just in case you're doing this in performance sensitive code...
    @Override
    public Throwable fillInStackTrace() {
        return this;
    }
}

// Now create a wrapper exception for every type you want to represent
class EString extends E {
    String s;
    EString(String s) {
        this.s = s;
    }
}
class Eint extends E {
    int i;
    Eint(int i) {
        this.i = i;
    }
}

The benefit of this is that we don’t have to wait for Valhalla to support primitive types in generics, nor to reify them. We’ve already emulated that, as you can see above.

Next, we need a switch emulation for arbitrary degrees (22 will probably be enough?). Here’s one for degree 2:

// Create an arbitrary number of switch utilities for each arity up 
// to, say 22 as is best practice
class Switch2<E1 extends E, E2 extends E> {
    E1 e1;
    E2 e2;

    private Switch2(E1 e1, E2 e2) {
        this.e1 = e1;
        this.e2 = e2;
    }

    static <E1 extends E, E2 extends E> Switch2<E1, E2> of1(E1 e1) {
        return new Switch2<>(e1, null);
    }

    static <E1 extends E, E2 extends E> Switch2<E1, E2> of2(E2 e2) {
        return new Switch2<>(null, e2);
    }

    void check() throws E1, E2 {
        if (e1 != null)
            throw e1;
        else
            throw e2;
    }
}

And finally, here’s how we can emulate our exhaustiveness checking switch with catch blocks!

// "Union type" emulating String|int
Switch2<EString, Eint> s = Switch2.of1(new EString("hello"));

// Doesn't compile, Eint isn't caught (catches aren't exhaustive)
try {
    s.check();
}
catch (EString e) {}

// Compiles fine
try {
    s.check();
}
catch (EString e) {}
catch (Eint e) {}

// Also compiles fine
try {
    s.check();
}
catch (EString | Eint e) {}

// Doesn't compile because Eint "partially dominates" EString | Eint
try {
    s.check();
}
catch (Eint e) {}
catch (EString | Eint e) {}

“Neat”, huh? We could even imagine destructuring within the catch block, such that we can automatically unwrap the value from the auxiliary “E” type.

Since we already have “union types” in Java (in catch blocks), and since checked exception declarations could be retrofitted to form a union type with the method’s actual return type, my hopes are still that in some distant future, a more powerful Java will be available where these “union types” (and also intersection types) will be made first class. APIs like jOOQ would greatly profit from this!