Why You Should Use jOOQ With Code Generation

I’m answering many jOOQ questions on Stack Overflow, and a lot of the time, the problem has the same cause: people not using jOOQ’s code generator. The main reason people seem not to be using it is that it takes some extra time to set up, but as with anything well designed, the initial investment will always pay off.

In this article, I want to briefly summarise what the code generator is, why you should use it, and what is the main reason not to use it (hint: you probably don’t have that case).

What’s the jOOQ code generator?

jOOQ is an internal DSL that lets you write type safe, embedded, dynamic SQL directly in Java, just like you can in PL/SQL, PL/pgSQL, T-SQL, SQL/PSM, and all the other procedural dialects. Being internal, the big difference is that dynamic SQL is very easy, because jOOQ makes SQL “composable”.

As an internal DSL, it models the SQL language as a chain of Java method calls. The technique behind the fluent API design is described here. A typical jOOQ query looks like this:

var author =
ctx.select(AUTHOR.FIRST_NAME, AUTHOR.LAST_NAME)
   .from(AUTHOR)
   .where(AUTHOR.ID.eq(10))
   .fetchOne();

Now those AUTHOR table references, as well as those AUTHOR.ID column references are generated code. They are Java representations of your database meta data, and include a variety of information:

  • The catalog / schema they belong to
  • The table they belong to (if it’s a column)
  • The data type
  • Some attached converters and data type bindings
  • Other useful meta data, such as constraints (primary key, unique key, foreign key), identities, default expressions, generated expressions, etc.
  • Auxiliary DSL methods, such as type safe table aliasing methods

As you can see, you get a lot of out-of-the-box features!
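To give you an idea, a generated table class looks roughly like this (a heavily simplified sketch; the actual output depends on your configuration and jOOQ version):

public class Author extends TableImpl<AuthorRecord> {

    // The singleton instance that you reference in queries
    public static final Author AUTHOR = new Author();

    // Columns, with their data types and nullability attached
    public final TableField<AuthorRecord, Integer> ID = createField(
        DSL.name("id"), SQLDataType.INTEGER.nullable(false), this, "");

    public final TableField<AuthorRecord, String> FIRST_NAME = createField(
        DSL.name("first_name"), SQLDataType.VARCHAR.nullable(false), this, "");

    Author() {
        super(DSL.name("author"));
    }

    // Aliasing methods, key and constraint meta data, etc. omitted
}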

What’s the price to pay for code generation?

As I understand it, a lot of users (especially new ones) are eager to try out jOOQ immediately and don’t want to bother setting up the code generator, which seems like an unnecessary step at first. E.g. when using JAXB, you don’t have to use XJC to generate annotated bindings for your XSD files (you might not even have an XSD). But with jOOQ, it’s different.

Yes, the manual can be overwhelming, but the tutorial isn’t, and neither are the examples. You don’t even have to use the Maven or Gradle plugins for your first attempts with jOOQ. Why not just use jbang? With just a few lines of CLI, you can start right away.

To try out basic things, you can also use the DDLDatabase, which generates code from SQL files directly, without a live connection to an actual database. Though that isn’t really a big issue anymore with testcontainers.org and/or docker, which allow you to spin up a sample database in a matter of seconds.

What I’m trying to say here is: Don’t be overwhelmed. Yes, there’s some initial investment, but it will pay off massively (see below), will prevent tons of problems (see below), and it just requires a few more minutes of setting up your project. Heck, you could even start playing around with the bug reporting template, which includes code generation configuration for Java, Scala, and Kotlin!

What do you get in return?

Let’s start with the obvious:

1. Compile time type safety

You already get some degree of compile time type safety from using jOOQ’s DSL alone: namely, you cannot mistype keywords, such as SELECT, or forget your parentheses around an IN predicate, such as A IN (B, C). And you get auto-completion for your SQL syntax.

But without code generation, you won’t get type safety for your schema objects. This is a big deal, and once you’ve gotten used to it, you will never want to miss it again!

Every table, every column, every constraint, every index, every sequence, every procedure, every function, every package, every schema, every catalog, every domain, every enum type, every object type gets a generated Java representation in jOOQ. You don’t have to go to your ERD, or database, or wherever it is to look up your schema. You have it right there in your code. As a first step, you can use this for auto completion in your favourite IDE!

Not just that. You can also never make a mistake with your meta data.

Yep. No more typos. But also! No more regressions when someone renames a column in the database. As soon as a column is renamed or deleted, your Java code stops compiling.

No more wrong data types, either. The generated code declares FIRST_NAME as a TableField<AuthorRecord, String>, so your Java compiler, through jOOQ’s elaborate generic API, will already know that you’re projecting a String column. It’s not 2003 anymore. We have generics to prevent type casts or unsafe conversions.
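Check out the compilation error on something like this (a minimal sketch; both statements are rejected by the compiler):

// AUTHOR.FIRST_NAME is a TableField<AuthorRecord, String>, not a number
Field<Integer> firstName = AUTHOR.FIRST_NAME; // error: incompatible types

ctx.select(AUTHOR.FIRST_NAME)
   .from(AUTHOR)
   .where(AUTHOR.FIRST_NAME.eq(5)) // error: eq(String) doesn't accept an int
   .fetchOne();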

Why ever worry about such data type problems again? You already declared the data type once, when you created the table:

CREATE TABLE author (
  id INTEGER NOT NULL PRIMARY KEY,
  first_name TEXT NOT NULL,
  last_name TEXT NOT NULL,
  ...
);

Why go through all that busywork of re-declaring the data type again in Java? In fact, why go through all that busywork of re-declaring anything of the above again in Java? No one enjoys typing all that stuff twice or keeping it in sync. In fact, it’s an outright bad idea. There should only be one source of truth for your meta model, and that’s the DDL you ran in your database to create the schema.

Sure, you can have other representations of your data, e.g. some DDD version, some JSON version for your frontend, whatever. But whenever you query your database, there’s absolutely no reason not to use the exact meta model which you’ve already declared in your database.

I think that’s kind of obvious? Why waive all that for a little (initial) extra setup convenience?

2. Schema introspection

You don’t just get that benefit when writing your queries. You also get it when reading them. You can always quickly navigate to the column you’re interested in and read its SQL comments in the Javadoc. Maybe you have this specified? (Totally underused SQL feature!)

COMMENT ON COLUMN author.first_name IS 'The author''s first name';

Now look what jOOQ’s code generator does with it:
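Roughly this (a sketch of the generated output; the details vary between jOOQ versions):

/**
 * The author's first name
 */
public final TableField<AuthorRecord, String> FIRST_NAME = createField(
    DSL.name("first_name"), SQLDataType.VARCHAR.nullable(false), this,
    "The author's first name");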

Seems obvious, no? Or, you want to see who is using this column? Just check the call hierarchy or whatever search tool your IDE offers.

This is much better than just text searching for FIRST_NAME, which might be case sensitive and matches any string like that, not just that particular column of the AUTHOR table.

3. Runtime meta model: Data types

In fact, you won’t just profit from this type safety at compile time, but also at runtime. Trust me. There are so many edge cases in some JDBC drivers and/or SQL dialects where the data type must be communicated to the database explicitly.

Sure, you can write something like this in jOOQ, and it works:

var author =
ctx.select(field("author.first_name"), field("author.last_name"))
   .from("author")
   .where(field("author.id").eq(10))
   .fetchOne();

The above is using plain SQL templates, a feature that is normally used by users to extend jOOQ with custom functionality. Sure, it can be used instead of using code generation, too, and occasionally, that’s just the right thing to do.

When bind variables have insufficient type contexts

But there are many cases where you should give the SQL engine more context about the bind value. For example, in Derby (or older versions of Db2), you can’t just do this:

select null
from sysibm.sysdummy1;

You’ll get an error:

SQL Error [30000] [42X01]: Syntax error: Encountered “null” at line 1, column 8.

Instead, you have to CAST:

select cast(null as int)
from sysibm.sysdummy1;

jOOQ always adds those casts for you, if we find out through our extensive integration tests that in any given situation, a dialect might require it. But jOOQ can only do that if you provide it with a type. Often you do, implicitly, but sometimes, you don’t, and then you have to debug a query that used to work before. That can be very unexpected.
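For example (a sketch, using org.jooq.impl.DSL.val() and org.jooq.impl.SQLDataType.INTEGER):

// Without a type, jOOQ can't know what to cast a NULL bind value to.
// With an explicit data type, it renders the cast where the dialect needs it:
ctx.select(val(null, INTEGER)).fetch();

// Renders roughly, in Derby: select cast(? as int) from sysibm.sysdummy1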

When a Field<Object> leads to compilation errors

This one is tricky. The Java language made a peculiar decision around overload resolution with generics. Assuming you have overloads like this (and jOOQ is full of these, for convenience reasons):

public <T> void setValue(Parameter<T> parameter, T value) {}
public <T> void setValue(Parameter<T> parameter, Field<T> value) {}

Then there has been a backwards incompatible change of compilation behaviour in Java 8.

While this works perfectly fine:

Parameter<String> p1 = ...
Field<String> f1 = ...
setValue(p1, f1);

This call doesn’t compile:

Parameter<Object> p2 = ...
Field<Object> f2 = ...
setValue(p2, f2);

It seems that Field<Object> is not more specific than Object in this case, which seems weird, but that’s how the JLS designed generic overload resolution. In fact, it’s still more specific, but another rule declares the two types as ambiguous before we get to the specificity resolution. Read the above link for details.

Let’s look again at the previous jOOQ query that doesn’t use code generation:

Record2<Object, Object> author =
ctx.select(field("author.first_name"), field("author.last_name"))
   .from("author")
   .where(field("author.id").eq(10)) // Field<Object>
   .fetchOne();

It’s all full of Field<Object>, so you’ll run into this problem eventually, trust me, mostly with the UPDATE .. SET clause. And it will be quite the puzzler.
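Here’s a minimal sketch of the problem (both arguments are Field<Object>, so the call is ambiguous between set(Field<T>, T) and set(Field<T>, Field<T>)):

// This doesn't compile:
ctx.update(table("author"))
   .set(field("first_name"), field("last_name")) // error: ambiguous call
   .execute();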

The workaround is to always attach type information to your columns:

// Assuming this static import
import static org.jooq.impl.SQLDataType.*;

Record2<String, String> author =
ctx.select(
        field("author.first_name", VARCHAR),
        field("author.last_name", VARCHAR))
   .from("author")
   .where(field("author.id", INTEGER).eq(10)) // Field<Integer>
   .fetchOne();

But why do all that, when the code generator could have already done it for you, automatically? And what if you change your AUTHOR.ID column to be a BIGINT instead of an INTEGER? Will you manually update every query?

4. Emulations

Some emulations are not possible without generated meta data. They mostly involve the emulation of the INSERT .. RETURNING syntax, which relies on knowing both primary key and identity information, but other syntaxes may be affected as well.

You wouldn’t believe how many times I’ve supported users who then wanted to use internal API to tell jOOQ about primary key or identity meta information in convoluted ways to make this work, rather than just generate code. With generated code, the following query works in a lot of RDBMS out of the box. Out. Of. The. Box:

AuthorRecord author =
ctx.insertInto(AUTHOR, AUTHOR.FIRST_NAME, AUTHOR.LAST_NAME)
   .values("John", "Doe")
   .returning(AUTHOR.ID)
   .fetchOne();

jOOQ figures out, depending on your dialect:

  • Can this be run in a single query (e.g. in Firebird, Oracle, PostgreSQL as above, in SQL Server using OUTPUT, in Db2, H2 using data change delta tables)?
  • Can it be run in a single query using JDBC’s Statement.getGeneratedKeys()?
  • Does it require a second query to fetch the identity and/or other columns?

jOOQ can always find a way for you in this case, but only if it knows which column the identity is.

5. Converters

Basic SQL only really has a few built-in data types:

  • Various numeric types
  • Various string and binary string types
  • Various temporal types

But is it a good idea to work with these low level types, everywhere throughout your application? Do you want to use BigDecimal for your monetary amounts all over the place? Or would it be better to create a more useful Amount domain class? One where you can attach functionality, currency, etc.?

With jOOQ, you can use converters for that (or bindings, if that affects how you bind the values to JDBC). Have a fancy FirstName class that models various types of first names?

/**
 * Not just any ordinary first name string!
 */
record LeFirstName(String firstName) {}

Yes, you could insist on still not using code generation and attach that to your plain SQL templating query:

Record2<LeFirstName, String> author =
ctx.select(
        field("author.first_name", VARCHAR.asConvertedDataType(
            LeFirstName.class, LeFirstName::new, LeFirstName::firstName
        )), 
        field("author.last_name", VARCHAR))
   .from("author")
   .where(field("author.id", INTEGER).eq(10))
   .fetchOne();

Yes, you could extract that field definition in an auxiliary class for better reuse:

class LeAuthor {
    static Field<LeFirstName> firstName = field("author.first_name", 
        VARCHAR.asConvertedDataType(
            LeFirstName.class, LeFirstName::new, LeFirstName::firstName
        ));
    static Field<String> lastName = field("author.last_name", VARCHAR);
}

Record2<LeFirstName, String> author =
ctx.select(LeAuthor.firstName, LeAuthor.lastName)
   .from("author")
   .where(field("author.id", INTEGER).eq(10))
   .fetchOne();

And now, it looks almost as if jOOQ’s code generator had been used. Except it wasn’t. Why not? With just a bit of extra configuration in the code generator, you can attach your converter to all of the FIRST_NAME columns of all your tables automatically, so you’ll never forget. That’s especially valuable if your converter implements some must-use logic, such as hashing or encryption. So, the above example would just be:

// After tweaking the codegen configuration, you now
// get your custom type whenever you reference FIRST_NAME
Record2<LeFirstName, String> author =
ctx.select(AUTHOR.FIRST_NAME, AUTHOR.LAST_NAME)
   .from(AUTHOR)
   .where(AUTHOR.ID.eq(10))
   .fetchOne();
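The converter itself is written only once (a sketch using jOOQ’s Converter.ofNullable(); registering it for all FIRST_NAME columns happens in the code generator’s forced types configuration):

// How to read the database value into LeFirstName, and how to write it back
Converter<String, LeFirstName> converter = Converter.ofNullable(
    String.class, LeFirstName.class,
    LeFirstName::new,
    LeFirstName::firstName
);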

Why write manually what the machine can write for you? You wouldn’t manually write other derived code, such as byte code or assembly, either.

6. Type safe aliasing

Have a complex query and want to alias your tables? No problem with jOOQ. Just declare the aliases up front and you’re all set:

// Declare the alias:
var a = AUTHOR.as("a");

// Use the alias. Columns are still there, type safe
var author =
ctx.select(a.FIRST_NAME, a.LAST_NAME)
   .from(a)
   .where(a.ID.eq(10))
   .fetchOne();

While aliasing also works without code generation (everything works without code generation, which is just convenience for more complex API calls), you won’t get type safe access to your column names from aliases like the above. And you can still use the call hierarchy in your IDE to check where the FIRST_NAME column is referenced.

Or get compilation errors on typos, or auto completion, etc. etc. What’s not to love?

7. Implicit joins

One of my favourite jOOQ features is implicit joins. Look at this jOOQ query:

ctx.select(
          BOOK.author().FIRST_NAME,
          BOOK.author().LAST_NAME,
          BOOK.TITLE,
          BOOK.language().CD.as("language"))
   .from(BOOK)
   .fetch();

Starting from a child table (e.g. BOOK), jOOQ can implicitly (left) join your parent tables AUTHOR and LANGUAGE, because life’s too short to repetitively type out all the same joins and join predicates again and again. The above is equivalent to:

ctx.select(
          AUTHOR.FIRST_NAME,
          AUTHOR.LAST_NAME,
          BOOK.TITLE,
          LANGUAGE.CD.as("language"))
   .from(BOOK)
   .leftJoin(AUTHOR).on(BOOK.AUTHOR_ID.eq(AUTHOR.ID))
   .leftJoin(LANGUAGE).on(BOOK.LANGUAGE_ID.eq(LANGUAGE.ID))
   .fetch();

It’s a matter of taste and style, yes. You don’t have to use implicit joins everywhere. Some joins cannot be expressed as implicit joins, but very often, this style greatly simplifies your queries by reducing noise.

And it’s only possible with generated code!

Why? Because those BOOK.author() and BOOK.language() methods are generated, and they return again instances of generated tables containing generated columns (with implicit table aliases), and data types, and possibly converters, and what not.

Are you still categorically excluding working with the code generator just because it takes a little extra time to set up?

8. CRUD

jOOQ isn’t a full-fledged ORM like JPA implementations, but you can still get some convenience that avoids having to spell out CRUD queries all the time. jOOQ calls this UpdatableRecord.

When you fetch the UpdatableRecord like this:

BookRecord book =
ctx.selectFrom(BOOK)
   .where(BOOK.ID.eq(17))
   .fetchOne();

Then, you can change its values (e.g. in a rich client), and store the record again to generate an UPDATE statement:

book.setTitle("New title");
book.store();
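Creating new rows works the same way (a sketch, assuming the generated BOOK table; the ID value is made up):

// A record that wasn't fetched is new, so store() generates an INSERT
BookRecord newBook = ctx.newRecord(BOOK);
newBook.setId(18);
newBook.setTitle("Another title");
newBook.store();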

There are a few opt-in extra features, like optimistic locking, and you guessed it:

All available only with generated code.
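For example, optimistic locking is just a Settings flag away (a sketch; it is typically combined with a version or timestamp column configured in the code generator):

DSLContext ctx = DSL.using(connection, SQLDialect.POSTGRES,
    new Settings().withExecuteWithOptimisticLocking(true));

// store() now fails if someone else modified the row in the meantime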

9. Stored procedures

If you have those (and I highly recommend you do, if only for performance reasons), then using code generation is even more of a no-brainer.

If you have, in PostgreSQL:

CREATE FUNCTION p_count_authors_by_name (
  author_name VARCHAR, 
  OUT result INTEGER
)
AS $$
DECLARE
  v_result INT;
BEGIN
  SELECT COUNT(*)
  INTO v_result
  FROM author
  WHERE first_name LIKE author_name
  OR last_name LIKE author_name;

  result := v_result;
END;
$$ LANGUAGE plpgsql;

Now jOOQ’s code generator generates the following routine call for you:

public static Integer pCountAuthorsByName(
      Configuration configuration
    , String authorName
) {
    // Do you really care?
}

Yep, that’s it. Do you really care what happens behind the scenes? This binding logic is about as exciting as sorting your socks. It’s the same story every time you have some RPC technology. We didn’t care about the implementation in CORBA, with SOAP, with RMI, with remote EJBs, and we don’t care how to bind to a stored procedure.

It’s just some remote callable that accepts a String and returns an Integer. Code generation is your friend here, again. And yes, as always, if you think this function is missing a second parameter, you’ll just add it, and your client code stops compiling, offering you a chance to refactor!
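Calling the generated routine is then a one-liner (a sketch, assuming the generated Routines class and an existing Configuration):

// All the binding logic is hidden behind this generated static method
Integer count = Routines.pCountAuthorsByName(configuration, "John%");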

And the best thing is: This supports all of your weird edge cases including:

  • SQL types (enums, arrays, object types, UDTs)
  • PL/SQL types (record types, table types, associative arrays)
  • Ref cursor types
  • Table valued functions

10. Multitenancy

A cool feature in jOOQ is its out-of-the-box support for schema-level multi-tenancy: you can dynamically rename your catalogs, schemas, and table names at runtime, depending on whatever you like (e.g. the current user or tenant).

This feature is called schema mapping, and it obviously… doesn’t work if you use plain SQL templates without code generation (because a plain SQL template can contain any kind of SQL string; that’s what they’re for).

Not just that. With generated code, all of your objects are fully qualified by default, and you can omit qualification when you like. With your own templates, you can’t easily change that. Want to port your schema from A to B? Well, then you’ll hand-rewrite everything. Good luck. With jOOQ’s generated code, it’s just a matter of flags and configuration.
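Schema mapping itself is just runtime configuration (a sketch; the schema names are made up):

// Map the schema that the code was generated from onto the
// current tenant's schema, at runtime
Settings settings = new Settings()
    .withRenderMapping(new RenderMapping()
        .withSchemata(new MappedSchema()
            .withInput("PUBLIC")
            .withOutput("TENANT_42")));

DSLContext ctx = DSL.using(connection, SQLDialect.POSTGRES, settings);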

11. Embedded types

Remember converters? Even more sophisticated are embedded types, i.e. jOOQ’s way of wrapping multiple database columns into a single client side value, pretending your database supports UDTs (as Oracle and PostgreSQL actually do).

Because when you work with currencies, what you really want to do is combine an AMOUNT column with a CURRENCY column, because after all, USD 1.00 isn’t really the same thing as EUR 1.00, nor can they be directly compared or combined in arithmetic.

Embedded types are currently only available using the code generator, which produces all the meta data required for jOOQ’s runtime to map / unmap your flat result sets.

This is especially powerful when you use:

  • Embedded keys (the only thing you ever compare a BOOK.AUTHOR_ID with is AUTHOR.ID, or maybe another foreign key that points to the same primary key, so why not use a type to enforce type safety?)
  • Embedded domains (you already declared a semantic type in your database, so you want to reuse that name also in client code)

12. Data change management

Another thing that people often don’t want to set up correctly and cleanly up front, because it takes some extra time, is data change management (e.g. using Flyway or Liquibase). And the regrets are similar once the project grows.

Using jOOQ’s code generator kind of forces you to also think about data change management early on, and that is a good thing!

Your data model changes over time, and so does your client code working on that model. jOOQ’s code generation process should be embedded in your CI/CD pipeline:

  • You automatically apply a change in your model
  • You automatically re-generate your code
  • You automatically run your integration tests
  • You automatically deploy the tested result as a unit

Here’s how to do all of that with testcontainers and jOOQ; see an example in action here.

Let me stress this once more.

If you don’t cleanly set up your CI/CD pipeline including your database early on in your project, you’ll do it later on, and it’ll be much harder. But if you do it, then adding jOOQ’s code generation to the game is a no-brainer, and you get everything above for free!

Conclusion

There are many other minor things that you get when you use jOOQ’s code generator. It’s such a powerful tool to get the most out of both jOOQ and your database. It actively encourages you to use all of your RDBMS features, including:

  • Complex models (because joining is much simpler once you have all the meta data, and even tools like implicit joins)
  • Views (those are generated too)
  • Data types (generated too)
  • Procedures (same)
  • Etc.

Using code generation in jOOQ is a clear path towards becoming a power user of your own database, which you (hopefully) design with a lot of love and craftsmanship.

Exception

I promised an exception. You can’t use the code generator when your schema is dynamic (i.e. unknown at compile time). But only a few systems are designed like that.

Fun with PostGIS: Mandelbrot Set, Game of Life, and More

The upcoming jOOQ 3.16 will finally offer support for the various RDBMS GIS extensions via issue #982. This is great news per se, and will be covered in a future blog post, when the integration is ready. This post here is about something else.

Adding support for such a feature is a great source of procrastination. You see, when you develop a backend library, all you ever do is implement algorithms, API, and other abstract stuff. But GIS is different. With GIS, a SQL developer like me suddenly has access to a “UI”, and having access to a “UI” woke the inner programmer child in me.

I was pleasantly surprised that DBeaver, my preferred SQL editor, has out of the box support for GIS’s WKT format. E.g. run a query like this:

SELECT st_geomfromtext('polygon((0 0, 2 3, 0 6, -2 3, 0 0))');

The output is a diamond-shaped polygon, rendered visually by DBeaver.

Let’s play around with GIS

So, naturally, what other thing to do than play around with it? The most obvious next thing to generate would be your favourite logo, which happens to be very easy to map to polygons. Let’s look at a few GIS features step by step. And for this blog post, I’ll be using PostGIS, which you can get very easily from docker hub:

WITH
  -- These "sprites" are just reusable 2D polygons, which I declared
  -- to avoid having to deal with the lengthy WKT format
  sprites AS (
    SELECT 

      -- We project the original 1x1 square, and a scaled 4x4 version
      s AS square,
      st_scale(s, 4, 4) AS square4

    -- Here, we're creating a square called "s"
    FROM (VALUES 
      (st_polygonfromtext('polygon ((0 0, 1 0, 1 1, 0 1, 0 0))'))
    ) t (s)
  )

-- st_union combines 2 polygons into a multipolygon
SELECT st_union(square, st_translate(square4, 2, 0))
FROM sprites

The output of the above query is

MULTIPOLYGON (((0 0, 1 0, 1 1, 0 1, 0 0)), ((2 0, 6 0, 6 4, 2 4, 2 0)))

Or, using DBeaver’s visualisation utility:

The st_union isn’t much different from any other set union. Note that I’ve translated the larger square 2 units to the right, so they don’t overlap. Otherwise, the union would have just been the bigger square.

A polygon describes a set of points in the 2D number plane. Creating a union of those two sets of points is natural. We can also create a difference of the two squares, instead:

WITH
  sprites AS (
    SELECT 
      s AS square,
      st_scale(s, 4, 4) AS square4
    FROM (VALUES 
      (st_polygonfromtext('polygon ((0 0, 1 0, 1 1, 0 1, 0 0))'))
    ) t (s)
  )
SELECT st_difference(square4, square)
FROM sprites

Resulting in:

POLYGON ((0 4, 4 4, 4 0, 1 0, 1 1, 0 1, 0 4))

Or, visually:

With these simple tools, we can now create all 4 letters of the jOOQ logo. As a reminder, the tools were:

  • st_polygonfromtext: Create a polygon from a WKT text representation
  • st_scale: Scale any geometry to a new size
  • st_translate: Translate any geometry to a new position
  • st_union: Combine two geometries (or more, if used as an aggregate function)
  • st_difference: Remove one geometry from another

The GIS API is vast, both in PostGIS as well as in the ISO/IEC 13249-3:2016 standard version, but these simple tools shall suffice for now. Let’s look at this query:

WITH
  -- Creating the two squares to work with
  sprites AS (
    SELECT 
      s AS square,
      st_scale(s, 4, 4) AS square4
    FROM (VALUES 
      (st_polygonfromtext('polygon ((0 0, 1 0, 1 1, 0 1, 0 0))'))
    ) t (s)
  ),
  
  -- All the 4 letters j, o1, o2, q of the jOOQ logo
  letters AS (
    SELECT
    
      -- The letter j
      st_difference(
        st_difference(
          st_difference(
            square4, 
            st_translate(st_scale(square, 2, 3), 1, 1)
          ),
          st_translate(st_scale(square, 1, 2), 0, 2)
        ),
        st_translate(st_scale(square, 1, 0.5), 3, 2.5)
      ) AS j,
      
      -- The letter o
      st_difference(square4, st_translate(square, 1, 1)) AS o1,
      
      -- The letter o
      st_difference(square4, st_translate(square, 2, 2)) AS o2,
      
      -- The letter q
      st_union(
        st_difference(
          square4, 
          st_translate(st_scale(square, 2, 2), 1, 1)
        ),
        st_translate(st_scale(square, 1, 2.5), 1.5, -1)
      ) as q
    FROM sprites
  )
SELECT

  -- Combine all the 4 letters
  st_union(v)
FROM
  letters,
  
  -- Arrange the 4 letters next to each other
  LATERAL (VALUES
    (st_translate(j, 0, 5)),
    (st_translate(o1, 5, 5)),
    (o2),
    (st_translate(q, 5, 0))
  ) t (v);

This produces:

Neat, huh?

Next step: The Mandelbrot Set

A natural next step for any procrastinator is to generate the Mandelbrot set. To prepare ourselves for what’s behind the Mandelbrot, have a look at this neat video (more procrastination):

There are different ways to calculate it with SQL; here’s one that I came up with:

WITH RECURSIVE

  -- These are the dimensions that you can play around with
  dims (r1, r2, i1, i2, s, it, p) AS (
    VALUES (
      
      -- The dimensions of the real axis
      -2::float, 1::float, 
      
      -- The dimensions of the imaginary axis
      -1.5::float, 1.5::float, 
      
      -- The step size on each axis, per pixel or sprite
      0.01::float, 
      
      -- The maximum number of iterations
      100, 
      
      -- "Infinity", i.e. when to stop
      256.0::float
    )
  ),
  
  -- The square again, as before
  sprites (s) AS (VALUES 
    (st_polygonfromtext('polygon ((0 0, 0 1, 1 1, 1 0, 0 0))'))
  ),
  
  -- The number plane, as ints
  n1 (r, i) AS (
    SELECT r, i 
    FROM 
      dims, 
      generate_series((r1 / s)::int, (r2 / s)::int) r,
      generate_series((i1 / s)::int, (i2 / s)::int) i
  ),
  
  -- The number plane as scaled floats
  n2 (r, i) AS (
    SELECT r::float * s::float, i::float * s::float
    FROM dims, n1
  ),
  
  -- The recursive calculation of the Mandelbrot formula
  -- zn = (zn-1)^2 + c
  l (cr, ci, zr, zi, g, it, p) AS (
    SELECT r, i, 0::float, 0::float, 0, it, p FROM n2, dims
    UNION ALL
    SELECT cr, ci, zr*zr - zi*zi + cr, 2*zr*zi + ci, g + 1, it, p 
    FROM l
    
    -- The recursion stops when we reach the maximum
    WHERE g < it
    
    -- Or, when we reach "infinity"
    AND zr*zr + zi*zi < p
  ),
  
  -- Find the last calculated value per point in the
  -- complex number plane c (cr, ci), discard the others
  x (cr, ci, zr, zi, g) AS (
    SELECT DISTINCT ON (cr, ci) cr, ci, zr, zi, g
    FROM l
    ORDER BY cr, ci, g DESC
  )
  
-- Turn every calculated point into a square
SELECT 
  st_union(
    st_translate(sprites.s, round(cr / dims.s), round(ci / dims.s))
  )
FROM x, sprites, dims

-- Retain only the points *not* belonging to the Mandelbrot set
-- You can also inverse the equation to retain the points that
-- belong to the set
WHERE zr*zr + zi*zi > p;

Note, I’m using an artificial “infinity” here, because:

  1. That speeds things up with little noticeable difference at this zoom scale
  2. I couldn’t figure out how to make PostgreSQL overflow float operations to float infinities, like Java or CockroachDB or other IEEE 754 float implementations do. Any help appreciated, see this Stack Overflow question.

The output is the well-known shape.

You can play around with this and use different values for “dims” to zoom in, e.g.

dims (r1, r2, i1, i2, s, it, p) as (values (
  (-0.925-0.032)::float, (-0.925+0.032)::float, 
  (0.266-0.032)::float, (0.266+0.032)::float, 
  0.00005::float, 100, 256.0::float)
)

This will generate some really neat zooms, all with SQL:

Granted, it’s not the most efficient way to calculate these things, but that wasn’t the point here, was it?

Game of Life

I can be nerd-sniped:

And of course, I had to accept that challenge too! Now, the Game of Life is a simple “game” where we have the x,y natural number plane (e.g. “pixels”), and with each coordinate, we associate whether that “cell” is dead or alive. Then, we establish the following set of rules:

  1. Any live cell with two or three live neighbours survives.
  2. Any dead cell with three live neighbours becomes a live cell.
  3. All other live cells die in the next generation. Similarly, all other dead cells stay dead.

Now, it’s hard to “animate” things in SQL using spatial extensions, so as a workaround, I’ll just display iterations of 100×100 pixel tiles of the Game of Life next to each other. The first iteration is just the starting point, e.g. a random number of “alive” cells:

WITH RECURSIVE

  -- Again generate a "sprite" from a square polygon
  sprites (s) AS (VALUES (
    st_polygonfromtext('polygon ((0 0, 0 1, 1 1, 1 0, 0 0))')
  )),

  -- Generate the x, y number plane and associate a boolean value
  -- with each coordinate, generated randomly.
  -- 10% of all cells are alive
  m (x, y, b) AS (
    SELECT x, y, 
      CASE WHEN random() > 0.9 THEN 1 ELSE 0 END 
    FROM 
      generate_series(1, 100) x, 
      generate_series(1, 100) y
  )
SELECT st_union(st_translate(sprites.s, x::float, y::float))
FROM m, sprites

-- Display only "alive" cells
WHERE b = 1

This will produce something like:

Now, all we have to do is iterate the game formula. Recursive SQL has quite a few limitations, though. E.g. I couldn’t get PostgreSQL to aggregate recursive data, nor self-join the recursive table to find the nearest neighbors of any cell. But with window functions, this is definitely possible. So here goes:

WITH RECURSIVE
  sprites (s) AS (VALUES (
    st_polygonfromtext('polygon ((0 0, 0 1, 1 1, 1 0, 0 0))')
  )),
  m (x, y, b) AS (
    SELECT x, y, 
      CASE WHEN random() > 0.9 THEN 1 ELSE 0 END
    FROM 
      generate_series(1, 100) x, 
      generate_series(1, 100) y
  ),
  
  -- This is the recursion of the Game of Life
  e (x, y, b, g) AS (
  
    -- The recursion starts with the above initial data
    SELECT x, y, b, 1 FROM m
    UNION ALL
    SELECT
      e.x, e.y,
      CASE
      
        -- If this cell is alive, and 2 or 3
        -- neighbors are alive, too, then this
        -- cell stays alive
        WHEN e.b = 1 AND
          sum(e.b) OVER w1
        + sum(e.b) OVER w2
        + sum(e.b) OVER w3 IN (2, 3)
          THEN 1
          
        -- If this cell is dead, and 3 neighbors
        -- are alive, then this cell becomes alive
        WHEN e.b = 0 AND
          sum(e.b) OVER w1
        + sum(e.b) OVER w2
        + sum(e.b) OVER w3 = 3
          THEN 1
          
        -- Otherwise, this cell dies or stays dead
        ELSE 0
      END,
      e.g + 1
    FROM e
    
    -- The recursion stops after 100 iterations
    WHERE e.g < 100
    WINDOW
    
      -- We order the data set by x, y. In SQL, there isn't 
      -- a notion of a 2-dimensional number plane. We only
      -- have 1 dimension.
      o AS (ORDER BY x, y),
      
      -- w1 is all the neighbors of the previous row in the x, y
      -- plane, which is all the SQL rows that are 101-99 rows
      -- before the current row.
      w1 AS (o ROWS BETWEEN 101 PRECEDING AND  99 PRECEDING),
      
      -- w2 is all the neighbors of the current row in the x, y
      -- number plane, excluding the current cell
      w2 AS (o ROWS BETWEEN   1 PRECEDING AND   1 FOLLOWING 
               EXCLUDE CURRENT ROW),
               
      -- w3 is all the neighbors of the next row
      w3 AS (o ROWS BETWEEN  99 FOLLOWING AND 101 FOLLOWING)
  )

-- Finally, combine all the iterations  
SELECT st_union(st_translate(
  sprites.s, 
  (x + (g - 1) % 10 * 120)::float, 
  (y - (g - 1) / 10 * 120)::float
))
FROM e, sprites
WHERE b = 1
;

Think of the window specifications as follows:

----+-------------+-------------+-------------+-------------+----
... | 105: (1, 5) | 106: (1, 6) | 107: (1, 7) | 108: (1, 8) | ...
... | 205: (2, 5) | 206: (2, 6) | 207: (2, 7) | 208: (2, 8) | ...
... | 305: (3, 5) | 306: (3, 6) | 307: (3, 7) | 308: (3, 8) | ...
... | 405: (4, 5) | 406: (4, 6) | 407: (4, 7) | 408: (4, 8) | ...

So, in a 100×100 grid, x=3,y=7 can be encoded as 307, and its neighbors are

  • w1: 206, 207, 208, i.e. between 101 preceding and 99 preceding of 307
  • w2: 306, 308, i.e. between 1 preceding and 1 following of 307
  • w3: 406, 407, 408, i.e. between 99 following and 101 following of 307

The output looks like this:

Or, zooming in on iterations 1-4:

Or 21-24:

That’s really cool, huh!

Glider Gun

The nerd-snipe was to animate the glider gun pattern, which just means we’ll have to replace the random generation of the first iteration by something constant.

WITH RECURSIVE
  sprites (s) AS (VALUES (
    st_polygonfromtext('polygon ((0 0, 0 1, 1 1, 1 0, 0 0))')
  )),
  m (x, y, b) AS (
    SELECT x, y, 

      -- Initial data here
      CASE WHEN (x, y) IN (
        (2, 6), (2, 7), (3, 6), (3, 7), (12, 6), (12, 7), (12, 8), 
        (13, 5), (13, 9), (14, 4), (14, 10), (15, 4), (15, 10),
        (16, 7), (17, 5), (17, 9), (18, 6), (18, 7), (18, 8), 
        (19, 7), (22, 4), (22, 5), (22, 6), (23, 4), (23, 5), 
        (23, 6), (24, 3), (24, 7), (26, 2), (26, 3), (26, 7), 
        (26, 8), (36, 4), (36, 5), (37, 4), (37, 5)
      ) THEN 1 ELSE 0 END 
    FROM 
      generate_series(1, 100) x, 
      generate_series(1, 100) y
  ),
  e (x, y, b, g) AS (
    SELECT x, y, b, 1 FROM m
    UNION ALL
    SELECT
      e.x, e.y,
      CASE
        WHEN e.b = 1 AND
          sum(e.b) OVER w1
        + sum(e.b) OVER w2
        + sum(e.b) OVER w3 IN (2, 3)
          THEN 1
        WHEN e.b = 0 AND
          sum(e.b) OVER w1
        + sum(e.b) OVER w2
        + sum(e.b) OVER w3 = 3
          THEN 1
        ELSE 0
      END,
      e.g + 1
    FROM e
    WHERE e.g < 100
    WINDOW
      o AS (ORDER BY x, y),
      w1 AS (o ROWS BETWEEN 101 PRECEDING AND  99 PRECEDING),
      w2 AS (o ROWS BETWEEN   1 PRECEDING AND   1 FOLLOWING 
               EXCLUDE CURRENT ROW),
      w3 AS (o ROWS BETWEEN  99 FOLLOWING AND 101 FOLLOWING)
  )
SELECT st_union(st_translate(
  sprites.s, 
  (x + (g - 1) % 10 * 120)::float, 
  (y - (g - 1) / 10 * 120)::float
))
FROM e, sprites
WHERE b = 1
;

And, as you can see, the gun works:

It started with this:

And eventually produces the well known gliders:

PostgreSQL 14’s enable_memoize For Improved Performance of Nested Loop Joins

I’ve recently discovered a pleasant new addition to PostgreSQL 14, the new enable_memoize flag that improves the performance of some nested loop joins where statistics hint at this being appropriate. I mean, who can resist this temptation:

Improving query speed by 1000x hints at something very suboptimal having been going on before, and a tool like memoization can be of great help. But will it also help with an “ordinary” join? I wanted to try it myself.

What’s memoization?

In a perfect world free of side effects (and SQL is such a perfect world, in theory), memoization means that we can substitute y for f(x) in any computation, given that y = f(x). For example, no matter how many times you calculate UPPER('x'), you’ll always get 'X'. If the calculation of such a function is costly, and there are only few possible input values, then why not just maintain a hash map that maps all previous input values to their results, and use that map to look up known (or at least frequent) values instead of computing them again?
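In code, memoization can be as simple as a map-backed cache around the function (a minimal Java sketch):

import java.util.HashMap;
import java.util.Map;

class Memoization {
    static final Map<String, String> CACHE = new HashMap<>();

    // Recompute only for inputs we haven't seen before
    static String upper(String s) {
        return CACHE.computeIfAbsent(s, String::toUpperCase);
    }
}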

As I’ve shown previously on this blog, Oracle 11g introduced a feature called scalar subquery caching, which you can activate in jOOQ to avoid costly PL/SQL context switches.

In the case of PostgreSQL’s enable_memoize, this can be particularly useful for nested loop joins in SQL, and to reference the above tweet, a lateral join is often executed via a nested loop join.

Turning the feature on and off

I’ve created a schema like this:

CREATE TABLE t AS
SELECT i, i % 5 AS j 
FROM generate_series(1, 100000) AS t(i);

CREATE TABLE u AS
SELECT i, i % 20000 as j 
FROM generate_series(1, 100000) AS t(i);

CREATE INDEX uj ON u(j);

In summary:

  • Both tables t and u have 100000 rows.
  • t.j has only 5 distinct values, each value appears 20000 times.
  • u.j has 20000 distinct values, each value appears 5 times.

When running this on PostgreSQL 14:

SELECT current_setting('enable_memoize');

I get:

|current_setting|
|---------------|
|on             |

So, the feature is active, which I can also see in an EXPLAIN of the following query:

EXPLAIN
SELECT *
FROM t JOIN u ON t.j = u.j;

The plan is:

|QUERY PLAN                                                            |
|----------------------------------------------------------------------|
|Nested Loop  (cost=0.30..8945.41 rows=496032 width=16)                |
|  ->  Seq Scan on t  (cost=0.00..1443.00 rows=100000 width=8)         |
|  ->  Memoize  (cost=0.30..0.41 rows=5 width=8)                       | 
|        Cache Key: t.j                                                |
|        ->  Index Scan using uj on u  (cost=0.29..0.40 rows=5 width=8)|
|              Index Cond: (j = t.j)                                   |

Without memoization, when joining the two tables like that, then for the 100000 rows in t, I have to look up the 5 matching rows in u 100000 times. But if memoization kicks in, then I have to perform the lookup only 5 times, because there are only 5 distinct values of t.j.

We can play around with execution plans by turning the feature on or off:

SET enable_memoize = ON;
SET enable_memoize = OFF;

When turned off, PostgreSQL seems to choose a hash join or merge join instead, on my machine (between multiple executions, the plan might switch):

|QUERY PLAN                                                         |
|-------------------------------------------------------------------|
|Hash Join  (cost=3084.00..11568.51 rows=499351 width=16)           |
|  Hash Cond: (t.j = u.j)                                           |
|  ->  Seq Scan on t  (cost=0.00..1443.00 rows=100000 width=8)      |
|  ->  Hash  (cost=1443.00..1443.00 rows=100000 width=8)            |
|        ->  Seq Scan on u  (cost=0.00..1443.00 rows=100000 width=8)|


|QUERY PLAN                                                              |
|------------------------------------------------------------------------|
|Merge Join  (cost=9748.11..763846.11 rows=50000000 width=16)            |
|  Merge Cond: (u.j = t.j)                                               |
|  ->  Index Scan using uj on u  (cost=0.29..3848.29 rows=100000 width=8)|
|  ->  Sort  (cost=9747.82..9997.82 rows=100000 width=8)                 |
|        Sort Key: t.j                                                   |
|        ->  Seq Scan on t  (cost=0.00..1443.00 rows=100000 width=8)     |

Let’s Benchmark

We’re using the usual benchmark technique described here:

  • We repeat an operation 25x in mode A and mode B and compare (or more than 25, if it’s a fast operation)
  • We repeat the above 5x to mitigate any warmup and other caching effects

You can run the following benchmark on the above schema yourself, to verify:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 25;
  rec RECORD;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();
    SET enable_memoize = OFF;
  
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT t.*
        FROM t JOIN u ON t.j = u.j
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', r, (clock_timestamp() - v_ts);
    v_ts := clock_timestamp();
    SET enable_memoize = ON;

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT t.*
        FROM t JOIN u ON t.j = u.j
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', r, (clock_timestamp() - v_ts);
    RAISE INFO '';
  END LOOP;
END$$;

On my machine, the results are consistent, decent, not too impressive, but still significant:

Run 1, Statement 1: 00:00:03.763426
Run 1, Statement 2: 00:00:03.401346

Run 2, Statement 1: 00:00:03.769419
Run 2, Statement 2: 00:00:03.375677

Run 3, Statement 1: 00:00:03.771465
Run 3, Statement 2: 00:00:03.374413

Run 4, Statement 1: 00:00:03.769136
Run 4, Statement 2: 00:00:03.398734

Run 5, Statement 1: 00:00:03.772544
Run 5, Statement 2: 00:00:03.375272

I.e. a 10% speedup. Across the whole system, that alone would already be worth it.

Optimising LATERAL

Let’s try optimising LATERAL instead. We could run a query like this:

SELECT *
FROM 
  t, 
  LATERAL (
    SELECT count(*) 
    FROM u 
    WHERE t.j = u.j
  ) AS u(j)

The EXPLAIN of the above is

|QUERY PLAN                                                                       |
|---------------------------------------------------------------------------------|
|Nested Loop  (cost=4.40..3969.47 rows=100000 width=16)                           |
|  ->  Seq Scan on t  (cost=0.00..1443.00 rows=100000 width=8)                    |
|  ->  Memoize  (cost=4.40..4.42 rows=1 width=8)                                  |
|        Cache Key: t.j                                                           |
|        ->  Aggregate  (cost=4.39..4.40 rows=1 width=8)                          |
|              ->  Index Only Scan using uj on u  (cost=0.29..4.38 rows=5 width=0)|
|                    Index Cond: (j = t.j)                                        |

So, we can again cache the computation of the COUNT(*) value for each of the 5 distinct t.j input values, rather than re-calculating this every time. Surely, this must be even better than before?

Benchmark time!

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 25;
  rec RECORD;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();
    SET enable_memoize = OFF;
  
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT *
        FROM 
          t, 
          LATERAL (
            SELECT count(*) 
            FROM u 
            WHERE t.j = u.j
          ) AS u(j)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', r, (clock_timestamp() - v_ts);
    v_ts := clock_timestamp();
    SET enable_memoize = ON;

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT *
        FROM 
          t, 
          LATERAL (
            SELECT count(*) 
            FROM u 
            WHERE t.j = u.j
          ) AS u(j)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', r, (clock_timestamp() - v_ts);
    RAISE INFO '';
  END LOOP;
END$$;

And this time, we can see a significant speedup!

Run 1, Statement 1: 00:00:03.419728
Run 1, Statement 2: 00:00:01.083941

Run 2, Statement 1: 00:00:03.404954
Run 2, Statement 2: 00:00:01.098404

Run 3, Statement 1: 00:00:03.425725
Run 3, Statement 2: 00:00:01.093883

Run 4, Statement 1: 00:00:03.441691
Run 4, Statement 2: 00:00:01.127837

Run 5, Statement 1: 00:00:03.420172
Run 5, Statement 2: 00:00:01.097943

That’s great news! Wait, does this work also for ordinary correlated subqueries? Because the above LATERAL correlated subquery could be rewritten as:

SELECT 
  t.*, 
  (
    SELECT count(*)
    FROM u
    WHERE t.j = u.j
  ) j
FROM t;

Regrettably, the plan doesn’t show memoization:

|QUERY PLAN                                                                   |
|-----------------------------------------------------------------------------|
|Seq Scan on t  (cost=0.00..441693.00 rows=100000 width=16)                   |
|  SubPlan 1                                                                  |
|    ->  Aggregate  (cost=4.39..4.40 rows=1 width=8)                          |
|          ->  Index Only Scan using uj on u  (cost=0.29..4.38 rows=5 width=0)|
|                Index Cond: (j = t.j)                                        |

And the benchmark (you can paste the query into the benchmark logic yourself to reproduce) confirms that there’s no memoization effect:

Run 1, Statement 1: 00:00:03.617562
Run 1, Statement 2: 00:00:03.605765

Run 2, Statement 1: 00:00:03.610084
Run 2, Statement 2: 00:00:03.682064

Run 3, Statement 1: 00:00:03.725952
Run 3, Statement 2: 00:00:03.705622

Run 4, Statement 1: 00:00:03.672669
Run 4, Statement 2: 00:00:03.644612

Run 5, Statement 1: 00:00:03.645741
Run 5, Statement 2: 00:00:03.642717

It seems that with this new feature, correlated subqueries could be rewritten to nested loop outer joins in the future? Other optimisers already do this, and then we would effectively have the same feature here as Oracle’s scalar subquery caching.

Conclusion

The feature is turned on in PostgreSQL 14. Apart from some additional memory consumption, which might be a small problem if the optimiser is wrong and statistics are off, I don’t see any drawback of this new feature. SQL is a side-effect free (in theory) 4GL, meaning the optimiser can replace any computation by a cached value that depends only on the computation’s input values.

A correlated subquery is a function whose input parameters are the predicates and other references to the outer query’s columns. As such, the result of a correlated subquery can be cached, or memoized. As shown above, this has drastic effects on your present day SQL queries, from which you can profit just by upgrading to PostgreSQL 14.

Java’s Checked Exceptions Are Just Weird Union Types

This fun fact has been on my mind for a while, and a recent reddit thread about “Smuggling Checked Exceptions with Sealed Interfaces” made me write this post here. Namely, Java had union types before it was cool! (If you squint hard).

What are union types?

Ceylon is an underrated JVM language that never really took off, which is too bad, because the concepts it introduced are very elegant (see e.g. how they implemented nullable types as syntax sugar on top of union types, which IMO is much better than anything monadic using Option types, or Kotlin’s ad-hoc type system extension).

So, one of those concepts is union types. One of the most popular languages that supports them, currently, is TypeScript, though C++, PHP, and Python also have something similar. (Whether the union type is tagged or not isn’t relevant to this post.)

If you understand Java’s intersection types A & B (meaning that something is both a subtype of A and of B), then it’s easy to understand union types A | B (meaning that something is a subtype of either A or B). TypeScript shows a simple example of this:

function printId(id: number | string) {
  console.log("Your ID is: " + id);
}
// OK
printId(101);
// OK
printId("202");
// Error
printId({ myID: 22342 });

Structural vs nominal typing

Such a union type (or intersection type) is a structural type, as opposed to what we’ve been doing in Java via nominal types, where you have to declare a named type for this union every time you want to use it. E.g. in jOOQ, we have things like:

interface FieldOrRow {}
interface Field<T> extends FieldOrRow {}
interface Row extends FieldOrRow {}

Very soon, with Java 17, the jOOQ distribution will seal the above type hierarchy as follows (incomplete; subtypes of Field<T> and Row are omitted for brevity):

sealed interface FieldOrRow permits Field, Row {}
sealed interface Field<T> extends FieldOrRow permits ... {}
sealed interface Row extends FieldOrRow permits ... {}

There are pros and cons for doing these things structurally or nominally:

Structural typing:

  • Pro: You can create any ad-hoc union of any set of types anywhere
  • Pro: You don’t have to change the existing type hierarchy, which is essential when you don’t have access to it, e.g. when you want to do something like the above number | string type. This kinda works like JSR 308 type annotations that were introduced in Java 8.

Nominal typing:

  • Pro: You can attach documentation to the type, and reuse it formally (rather than structurally). TypeScript and many other languages offer type aliases for this kind of stuff, so you can have a bit of both worlds, though the alias is erased, meaning you keep the complex structural type leading to curious error messages.
  • Pro: You can seal the type hierarchy to allow for exhaustiveness checking among subtypes (e.g. above, there can only be Field<T> or Row subtypes of FieldOrRow). A structurally typed union type is implicitly “sealed” ad-hoc by the union type description (not sure if that’s what it’s called), but with nominal types, you can make sure no one else can extend the type hierarchy (except where you permit it explicitly using the non-sealed keyword)

Ultimately, as ever so often, things like structural and nominal typing are two sides of the same coin, pros and cons mostly depending on taste and on how much you control a code base.

So, how are checked exceptions union types?

When you declare a method that throws checked exceptions, the return type of the method is really such a union type. Look at this example in Java:

public String getTitle(int id) throws SQLException;

The call-site now has to “check” the result of this method call using try-catch, or declare re-throwing the checked exception(s):

try {
    String title = getTitle(1);
    doSomethingWith(title);
}
catch (SQLException e) {
    handle(e);
}

If early Java had union types rather than checked exceptions, we might have declared this as follows, instead:

public String|SQLException getTitle(int id);

Likewise, a caller of this method will have to “check” the result of this method call. There’s no simple way of re-throwing it, so if we do want to re-throw, we’d need some syntax sugar, or repeat the same code all the time, Go-style:

// Hypothetical Java syntax:
String|SQLException result = getTitle(1);

switch (result) {
    case String title -> doSomethingWith(title);
    case SQLException e -> handle(e);
}

It would be obvious how such a JEP 406 style switch pattern matching statement or expression could implement an exhaustiveness check, just like with the existing JEP 409 sealed classes approach, the only difference, again, being that everything is now structurally typed, rather than nominally typed.

In fact, if you declare multiple checked exceptions, such as the JDK’s reflection API:

public Object invoke(Object obj, Object... args)
throws 
    IllegalAccessException, 
    IllegalArgumentException,
    InvocationTargetException;

With union types, this would just be this, instead:

// Hypothetical Java syntax:
public Object
    | IllegalAccessException
    | IllegalArgumentException
    | InvocationTargetException invoke(Object obj, Object... args)

And the union type syntax from the catch block, which checks for exhaustiveness (yes, we have union types in catch!)…

try {
    Object returnValue = method.invoke(obj);
    doSomethingWith(returnValue);
}
catch (IllegalAccessException | IllegalArgumentException e) {
    handle1(e);
}
catch (InvocationTargetException e) {
    handle2(e);
}

Could still check for exhaustiveness with the switch pattern matching approach:

// Hypothetical Java syntax:
Object
    | IllegalAccessException
    | IllegalArgumentException
    | InvocationTargetException result = method.invoke(obj);

switch (result) {
    case IllegalAccessException, 
         IllegalArgumentException e -> handle1(e);
    case InvocationTargetException e -> handle2(e);
    case Object returnValue -> doSomethingWith(returnValue);
}

A subtle caveat here is that exceptions are subtypes of Object, so we must put that case at the end, as it “dominates” the others (see JEP 406 for a discussion about dominance). Again, we can prove exhaustiveness, because all types that are involved in the union type have a switch case.

Can we emulate union types with checked exceptions?

You know what Jeff Goldblum would say

But this blog is known to do it anyway. Assuming that for every possible type, we had a synthetic (code generated?) checked exception that wraps it (because in Java, exceptions are not allowed to be generic):

// Use some "protective" base class, so no one can introduce 
// RuntimeExceptions to the type hierarchy
class E extends Exception {

    // Just in case you're doing this in performance sensitive code...
    @Override
    public Throwable fillInStackTrace() {
        return this;
    }
}

// Now create a wrapper exception for every type you want to represent
class EString extends E {
    String s;
    EString(String s) {
        this.s = s;
    }
}
class Eint extends E {
    int i;
    Eint(int i) {
        this.i = i;
    }
}

The benefit of this is we don’t have to wait for Valhalla to support primitive types in generics, nor to reify them. We’ve already emulated that as you can see above.

Next, we need a switch emulation for arbitrary degrees (22 will probably be enough?). Here’s one for degree 2:

// Create an arbitrary number of switch utilities for each arity up 
// to, say 22 as is best practice
class Switch2<E1 extends E, E2 extends E> {
    E1 e1;
    E2 e2;

    private Switch2(E1 e1, E2 e2) {
        this.e1 = e1;
        this.e2 = e2;
    }

    static <E1 extends E, E2 extends E> Switch2<E1, E2> of1(E1 e1) {
        return new Switch2<>(e1, null);
    }

    static <E1 extends E, E2 extends E> Switch2<E1, E2> of2(E2 e2) {
        return new Switch2<>(null, e2);
    }

    void check() throws E1, E2 {
        if (e1 != null)
            throw e1;
        else
            throw e2;
    }
}

And finally, here’s how we can emulate our exhaustiveness checking switch with catch blocks!

// "Union type" emulating String|int
Switch2<EString, Eint> s = Switch2.of1(new EString("hello"));

// Doesn't compile, Eint isn't caught (catches aren't exhaustive)
try {
    s.check();
}
catch (EString e) {}

// Compiles fine
try {
    s.check();
}
catch (EString e) {}
catch (Eint e) {}

// Also compiles fine
try {
    s.check();
}
catch (EString | Eint e) {}

// Doesn't compile because Eint "partially dominates" EString | Eint
try {
    s.check();
}
catch (Eint e) {}
catch (EString | Eint e) {}

“Neat”, huh? We could even imagine destructuring within the catch block, such that we can automatically unwrap the value from the auxiliary “E” type.

Since we already have “union types” in Java (in catch blocks), and since checked exception declarations could be retrofitted to form a union type with the method’s actual return type, my hopes are still that in some distant future, a more powerful Java will be available where these “union types” (and also intersection types) will be made first class. APIs like jOOQ would greatly profit from this!

Functional Dependencies in SQL GROUP BY

The SQL standard knows an interesting feature: when a primary (or unique) key is listed in the GROUP BY clause, you can project any column that is functionally dependent on it, without having to add that column to the GROUP BY clause explicitly.

What does this mean? Consider this simple schema:

CREATE TABLE author (
  id INT NOT NULL PRIMARY KEY,
  name TEXT NOT NULL
);

CREATE TABLE book (
  id INT NOT NULL PRIMARY KEY,
  author_id INT NOT NULL REFERENCES author,
  title TEXT NOT NULL
);

In order to count the number of books by author, we tend to write:

SELECT a.name, count(b.id)
FROM author a
LEFT JOIN book b ON a.id = b.author_id
GROUP BY 
  a.id,  -- Required, because names aren't unique
  a.name -- Required in some dialects, but not in others

We have to group by something unique in this case, because if two authors are called John Doe, we still want them to produce separate groups. So GROUP BY a.id is a given.

We’re used to also listing a.name in the GROUP BY clause, especially in those dialects that require it, given that we project a.name in the SELECT clause:

  • Db2
  • Derby
  • Exasol
  • Firebird
  • HANA
  • Informix
  • Oracle
  • SQL Server

But is it really required? It isn’t, as per the SQL standard, because there is a functional dependency between author.id and author.name. In other words, for each value of author.id, there is exactly one possible value of author.name, or put differently: author.name is a function of author.id.

This means that it does not matter if we GROUP BY both columns, or only the primary key. The result must be the same in both cases, hence this is possible:

SELECT a.name, count(b.id)
FROM author a
LEFT JOIN book b ON a.id = b.author_id
GROUP BY a.id

Which SQL dialects support this?

At least the following SQL dialects support this language feature:

  • CockroachDB
  • H2
  • HSQLDB
  • MariaDB
  • MySQL
  • PostgreSQL
  • SQLite
  • Yugabyte

It’s noteworthy that MySQL used to simply ignore whether a column could be projected unambiguously or not in the presence of GROUP BY. While the following query was rejected in most dialects, MySQL accepted it, prior to enabling the ONLY_FULL_GROUP_BY mode by default:

SELECT author_id, title, count(*)
FROM book
GROUP BY author_id

What should we display for book.title, if an author has written more than one book? It doesn’t make sense, yet MySQL used to allow it anyway, and would just project an arbitrary value from the group.

Today, MySQL only allows for projecting columns with a functional dependency on the GROUP BY clause, as is permitted by the SQL standard.

Pros & Cons

While the shorter syntax that avoids the extra columns might be easier to maintain (it’s easy to project additional columns, if required), there is some risk of queries breaking in production, namely when the underlying constraints are disabled, e.g. for a migration. While it is unlikely that a primary key is disabled in a live system, it could still happen, and without the key, a previously valid query will no longer be valid, for the same reason why MySQL’s old interpretation was invalid: there’s no longer any guarantee of a functional dependency.

Other syntax

Starting from jOOQ 3.16, and #11834, it will be possible to reference tables directly in the GROUP BY clause, instead of individual columns. For example:

SELECT a.name, count(b.id)
FROM author a
LEFT JOIN book b ON a.id = b.author_id
GROUP BY a

The semantics will be:

  • If the table has a primary key (composite or not), use that in the GROUP BY clause instead.
  • If the table doesn’t have a primary key, list all the columns of the table instead.

Since none of the RDBMS supported by jOOQ currently supports this syntax, it is a purely synthetic jOOQ feature.
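
In jOOQ’s DSL, usage might look like this (a sketch, assuming jOOQ 3.16+, generated AUTHOR and BOOK tables, and the usual static imports from DSL):

// Grouping by the table is expanded to its primary key, or to all of
// its columns, if there is no primary key
var result =
ctx.select(AUTHOR.NAME, count(BOOK.ID))
   .from(AUTHOR)
   .leftJoin(BOOK).on(AUTHOR.ID.eq(BOOK.AUTHOR_ID))
   .groupBy(AUTHOR)
   .fetch();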

Write C-Style Local Static Variables in Java 16

Java 16 includes an improvement that makes the language a bit more regular via JEP 395. The JEP says:

Static members of inner classes

It is currently specified to be a compile-time error if an inner class declares a member that is explicitly or implicitly static, unless the member is a constant variable. This means that, for example, an inner class cannot declare a record class member, since nested record classes are implicitly static.

We relax this restriction in order to allow an inner class to declare members that are either explicitly or implicitly static. In particular, this allows an inner class to declare a static member that is a record class.

What sounds like a minor necessary evil to make a new feature (record classes) more versatile actually has a life of its own. We can use it to emulate C-style local static variables, i.e. local variables that are:

  • Initialised only once (and lazily at that)
  • Shared among multiple executions of a method

This sounds like a rather hairy feature: global variables that are visible only locally. But in fact, it’s something I’ve wanted for a long time, especially when I wanted to cache regular expression patterns without polluting the class namespace.

Consider this code:

package p;

import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        check("a");
        check("b");
    }
    
    static Pattern compile(String pattern) {
        System.out.println("compile(" + pattern + ")");
        return Pattern.compile(pattern);
    }
    
    static void check(String string) {
        // Re-compiling the pattern every time: Bad
        // Keeping the pattern local to the method: Good
        System.out.println("check(" + string + "): " 
            + compile("a").matcher(string).find());
    }
}

It prints:

compile(a)
check(a): true
compile(a)
check(b): false

Compiling a pattern can be costly if done frequently, so we’d better cache it. We used to do that like this:

package p;

import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        check("a");
        check("b");
    }
    
    static Pattern compile(String pattern) {
        System.out.println("compile(" + pattern + ")");
        return Pattern.compile(pattern);
    }
    
    // Compiling the pattern only once: Good
    // Placing the pattern in a class namespace: Bad
    static final Pattern P_CHECK = compile("a");

    static void check(String string) {
        System.out.println("check(" + string + "): " 
            + P_CHECK.matcher(string).find());
    }
}

This now prints an improved output:

compile(a)
check(a): true
check(b): false

I.e. the regular expression pattern is compiled only once. But unfortunately, we had to pollute the entire class’s namespace, which can quickly become cumbersome if we have dozens of such regular expressions. Can we scope the P_CHECK variable to the check() method only? We now can!

package p;

import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        check("a");
        check("b");
    }
    
    static Pattern compile(String pattern) {
        System.out.println("compile(" + pattern + ")");
        return Pattern.compile(pattern);
    }
    
    static void check(String string) {

        // Compiling the pattern only once: Good
        // Keeping the pattern local to the method: Good
        // Capturing scope: Egh...
        var patterns = new Object() { 
            static final Pattern P_CHECK = compile("a");
        };
        
        System.out.println("check(" + string + "): " 
            + patterns.P_CHECK.matcher(string).find());
    }
}

This again prints the desired, optimal output:

compile(a)
check(a): true
check(b): false

The combination of using var to capture a non-denotable anonymous type (whose members we can dereference) with the new ability to put static members in inner classes effectively emulates local static variables, just like in C!

Since the inner class is unlikely to escape its scope, the fact that it may be capturing scope isn’t as big a risk as illustrated previously in a criticism of the double curly brace anti pattern. You’re still creating an extra class and a useless object (which escape analysis hopefully prevents from being allocated), so it’s not exactly a very clean solution, but it’s nice to know that this is now possible.

The jOOQ Parser Ignore Comment Syntax

jOOQ’s parser can’t parse every possible SQL syntax. Try this random PostgreSQL syntax:

ALTER SYSTEM RESET ALL

And the jOOQ parser will complain:

DOMAIN, INDEX, SCHEMA, SEQUENCE, SESSION, TABLE, TYPE, or VIEW expected: [1:7] ALTER [*]SYSTEM RESET ALL

That’s perfectly fine. The goal of the jOOQ parser isn’t to understand all vendor specific syntax. The goal is to offer parser functionality for all of the syntax supported by jOOQ so far, and to be able to translate it between dialects. For a lot of syntax like the above, there isn’t an equivalent in other dialects anyway.

But if you’re using the jOOQ parser to simulate a database migration, e.g. to create a schema diff, or to generate jOOQ code using the DDLDatabase, then you may have some bits of vendor specific SQL in your scripts that you want jOOQ to ignore. For example, the above command may be in the middle of a migration script:

CREATE TABLE a (i int);
ALTER SYSTEM RESET ALL;
CREATE TABLE b (i int);

Luckily, with jOOQ, you can add special markers around the commands that you want jOOQ to ignore. For this, just enable the Settings.parseIgnoreComments flag, and use this special comment syntax:

CREATE TABLE a (i int);

/* [jooq ignore start] */
ALTER SYSTEM RESET ALL;
/* [jooq ignore stop] */

CREATE TABLE b (i int);

The syntax is transparent to your RDBMS, because the markers are just comments. The RDBMS will thus see and execute this, just like before, and your actual database migration isn’t affected:

CREATE TABLE a (i int);

/*                     */
ALTER SYSTEM RESET ALL;
/*                    */

CREATE TABLE b (i int);

But jOOQ, on the other hand, will interpret the comments a bit differently, and see and execute this, instead:

CREATE TABLE a (i int);

/*

                      */

CREATE TABLE b (i int);

Note, if you don’t like the [jooq ignore start] and [jooq ignore stop] tokens, you can customise them as well via Settings.parseIgnoreCommentStart and Settings.parseIgnoreCommentStop.
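
In code, that configuration might look like this (a sketch; the custom tokens are illustrative only):

// org.jooq.conf.Settings, e.g. attached to your Configuration
Settings settings = new Settings()
    .withParseIgnoreComments(true)

    // Optional: replace the default [jooq ignore start] / [jooq ignore stop]
    .withParseIgnoreCommentStart("[custom ignore start]")
    .withParseIgnoreCommentStop("[custom ignore stop]");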

Like any purely text-based preprocessor syntax, this works completely transparently, even within a single statement. Assuming you’re using some vendor-specific syntax in a CREATE TABLE statement’s DEFAULT expression, you can just exclude that DEFAULT expression from jOOQ’s parsing:

CREATE TABLE t (
  a int 
    /* [jooq ignore start] */ 
    DEFAULT some_fancy_expression() 
    /* [jooq ignore stop] */
);

Obviously, you don’t have to format things this way; that’s just for illustration purposes on this blog. Now, again, the RDBMS will see and execute this:

CREATE TABLE t (
  a int 
    /*                     */ 
    DEFAULT some_fancy_expression() 
    /*                    */
);

Whereas jOOQ’s parser will see and execute this:

CREATE TABLE t (
  a int 
    /*

                          */
);

For more information about this topic, please refer to the jOOQ manual.

Use jOOλ’s Sneaky Throw to Avoid Checked Exceptions

Don’t you hate how you have to wrap checked exception throwing code in static initialisers? E.g. you cannot write this in Java:

public class Test {
    static final Class<?> klass = Class.forName("org.h2.Driver");
}

There’s an unhandled ClassNotFoundException, and you can’t catch / rethrow it simply. A static initialiser is needed:

public class Test {
    static final Class<?> klass;

    static {
        try {
            klass = Class.forName("org.h2.Driver");
        }
        catch (ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}

Yuck.

Luckily, one of jOOλ’s lesser known features is the Sneaky class, which contains a bunch of utility methods that wrap a JDK functional interface in an equivalent, “sneaky-throwing” functional interface that doesn’t declare the checked exception.

In short, you can write:

public class Test {
    static final Class<?> klass = Sneaky.supplier(
        () -> Class.forName("org.h2.Driver")
    ).get();
}

The exception is simply re-thrown “sneakily”, as the JVM doesn’t care about an exception’s checked-ness. If you don’t have H2 on your classpath, you’ll get:

Exception in thread "main" java.lang.ExceptionInInitializerError
Caused by: java.lang.ClassNotFoundException: org.h2.Driver
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:375)
	at org.jooq.test.util.Test.lambda$0(Test.java:44)
	at org.jooq.lambda.Unchecked.lambda$supplier$38(Unchecked.java:1695)
	at org.jooq.test.util.Test.<clinit>(Test.java:44)
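
In case you’re wondering how such APIs work internally: the trick is usually a generic rethrow method along the lines of this minimal sketch (not jOOλ’s actual implementation). With no other constraints on E, the compiler infers RuntimeException at the call site, so nothing needs to be declared or caught:

// The unchecked cast is erased at runtime, so the original Throwable
// is thrown unchanged, without the compiler insisting on a throws
// clause at the call site
@SuppressWarnings("unchecked")
static <E extends Throwable> void sneakyThrow(Throwable t) throws E {
    throw (E) t;
}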

You can use this approach wherever else a JDK functional interface is required, and you don’t care about an exception’s checked-ness, e.g. in streams:

// Doesn't compile:
Stream
    .generate(
        () -> Class.forName("org.h2.Driver"))
    .limit(1)
    .forEach(System.out::println);

// So ugly
Stream
    .generate(
        () -> {
            try {
                return Class.forName("org.h2.Driver");
            }
            catch (ClassNotFoundException e) {
                throw new RuntimeException(e);
            }
        })
    .limit(1)
    .forEach(System.out::println);

// Use jOO位's Sneaky supplier
Stream
    .generate(Sneaky.supplier(
        () -> Class.forName("org.h2.Driver")))
    .limit(1)
    .forEach(System.out::println);

Get jOOλ here: https://github.com/jOOQ/jOOL

Using Testcontainers to Generate jOOQ Code

Database first is at the core of jOOQ’s design. jOOQ has been made primarily for classic systems where the database is always there, always has been, and will never leave. This is because we think “data have mass”.

This not only translates to moving logic closer to the data (see our previous posts about the cost of JDBC round trips or generating vendor agnostic procedural logic), but also to avoiding moving data around between migrations (e.g. between RDBMS products).

Compared to how “heavy” data is, applications and UIs come and go. Speaking of go, maybe you’ll replace all of your Java code tomorrow with some Go code. But you will keep the database, if it isn’t trivial.

With this in mind, jOOQ assumes you have a pre-existing schema, which you manage with Flyway or Liquibase, and that you then use jOOQ to reverse engineer your updated schema using the code generator.

The old days

In the old days, setting up an Oracle instance was very heavy, and also hard. I remember working at a company where we had shared development and test instances. The schema was always in flux. We couldn’t assume a stable dev version.

As such, pointing the jOOQ code generator towards a live database used to be a bit of a challenge, which is why jOOQ offers alternative, connection-free code generation modes, including:

  • The JPADatabase, if you have a pre-existing JPA entity based meta model.
  • The XMLDatabase, if you have some form of XML version of your schema, which you can XSL transform to jOOQ’s format
  • The DDLDatabase, which can interpret your DDL scripts, e.g. the ones you pass to Flyway, or the ones produced by pg_dump.
  • The LiquibaseDatabase, which simulates a Liquibase database migration and uses the simulated database output as a source for meta information of the code generator

But all of the above share the same limitation: you can’t really use many vendor-specific features, such as advanced stored procedures, data types, etc.

A modern approach using testcontainers

Ideally, unless you’re supporting several RDBMS products (most people don’t), you should work only with your production database product, say PostgreSQL.

Thanks to testcontainers.org, it’s very easy to start up a PostgreSQL instance of any version in a Docker container, programmatically or through configuration. If you have a SQL script that contains your database schema, you can supply it to the testcontainers JDBC URL, e.g. like this:

jdbc:tc:postgresql:13:///sakila?TC_TMPFS=/testtmpfs:rw&TC_INITSCRIPT=file:${basedir}/src/main/resources/postgres-sakila-schema.sql

For more information, see their docs about JDBC support. Now, add the testcontainers dependency to your project classpath, e.g.

<dependency>
  <groupId>org.testcontainers</groupId>
  <artifactId>postgresql</artifactId>
</dependency>

And use the ContainerDatabaseDriver instead of the actual PostgreSQL driver for your code generation configuration in jOOQ, e.g. when using Maven:

<plugin>
  <groupId>org.jooq</groupId>
  <artifactId>jooq-codegen-maven</artifactId>

  <executions>
    <execution>
      <id>java-generator</id>
      <phase>generate-sources</phase>
      <goals>
        <goal>generate</goal>
      </goals>

      <configuration>
        <jdbc>
          <driver>
            org.testcontainers.jdbc.ContainerDatabaseDriver
          </driver>
          <url>${db.url}</url>
          <user>${db.username}</user>
          <password>${db.password}</password>
        </jdbc>
        <generator>
          <database>
            <inputSchema>public</inputSchema>
          </database>
          <target>
            <packageName>org.jooq.example.tc.db</packageName>
            <directory>src/main/java</directory>
          </target>
        </generator>
      </configuration>
    </execution>
  </executions>

  <dependencies>

    <!-- JUnit seems to be a transitive dependency of testcontainers? -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.13.1</version>
    </dependency>
  </dependencies>
</plugin>

As simple as that! Check out the jOOQ-testcontainers-example for a runnable example that uses testcontainers for code generation using the above approach.

Adding database change management

A real world example would again use Flyway or Liquibase, etc., to apply a complete database migration to your PostgreSQL instance inside of testcontainers, prior to generating code and/or running your integration tests.

This just slightly complicates things, but doesn’t create any unsolvable problems. Instead of creating throwaway containers with a single TC_INITSCRIPT, you will now have to make sure the following steps are executed consecutively in your build somehow:

  1. A testcontainers instance of your database is started
  2. A Flyway or Liquibase migration is run inside of that database
  3. The jOOQ code generator reverse engineers that database
  4. Optionally, your integration tests also reuse the database of that container

Of course you should integration test your code! But for the sake of this discussion, that might be an optional step, as you may have different preferences on how to run those tests, e.g. more globally than just for this module. But in our example, let’s include the tests.

You can find the full example using testcontainers/flyway/jOOQ here.

Start the testcontainers instance

Unfortunately, testcontainers doesn’t ship any Maven / Gradle plugins yet to invoke container lifecycle management during a build. I’ve created a feature request for that here, which you should upvote: https://github.com/testcontainers/testcontainers-java/issues/4397.

But we can easily help ourselves by using the ever powerful Maven escape hatch that is the groovy-maven-plugin (the ideas are the same for Gradle):

<plugin>
  <groupId>org.codehaus.gmaven</groupId>
  <artifactId>groovy-maven-plugin</artifactId>
  <version>2.1.1</version>
  <executions>
    <execution>
      <!-- Start the container in any phase before the actual code
           generation is required, i.e. at the latest in
           generate-sources -->
      <phase>generate-sources</phase>
      <goals>
        <goal>execute</goal>
      </goals>
      <configuration>
        <source>
          db = new org.testcontainers.containers.PostgreSQLContainer(
                  "postgres:latest")
            .withUsername("${db.username}")
            .withDatabaseName("postgres")
            .withPassword("${db.password}");
            
          db.start();

          // After you've started the container, collect its generated
          // JDBC URL (which contains a random port)
          project.properties.setProperty('db.url', db.getJdbcUrl());
        </source>
      </configuration>
    </execution>
  </executions>
  
  <dependencies>
    <dependency>
      <groupId>org.testcontainers</groupId>
      <artifactId>postgresql</artifactId>
      <version>1.15.2</version>
    </dependency>
  </dependencies>
</plugin>

So, that starts a container and keeps it running until the build terminates. Graceful shutdown isn’t strictly needed for this example, but it’s easy to add.
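
One possible approach (a sketch, not part of the example) is to register a JVM shutdown hook right after db.start(), in the same Groovy source:

// Stop the container when the JVM running the build exits
Runtime.getRuntime().addShutdownHook(new Thread() {
    @Override
    public void run() {
        db.stop();
    }
});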

Now, migrate your database

The above database is empty. Now, let’s run the migration, using Flyway in this example, though it would work the same way with Liquibase.

<plugin>
  <groupId>org.flywaydb</groupId>
  <artifactId>flyway-maven-plugin</artifactId>
  <version>7.14.0</version>
  <executions>
    <execution>

      <!-- We run the migration in the same phase, before jOOQ's
           code generation -->
      <phase>generate-sources</phase>
      <goals>
        <goal>migrate</goal>
      </goals>
      <configuration>

        <!-- This URL has been set by groovy, above -->
        <url>${db.url}</url>
        <user>${db.username}</user>
        <password>${db.password}</password>
        <locations>
          <location>
            filesystem:src/main/resources/db/migration
          </location>
        </locations>
      </configuration>
    </execution>
  </executions>
</plugin>

Add all the additional complexity of your migration in this configuration if you like. jOOQ wouldn’t know anything about it.

Now, generate the code

Again, nothing special here:

<plugin>
  <groupId>org.jooq</groupId>
  <artifactId>jooq-codegen-maven</artifactId>

  <executions>
    <execution>
      <id>java-generator</id>

      <!-- Same phase as above, but the previous plugins have already
           executed, so we're generating the db post migration -->
      <phase>generate-sources</phase>
      <goals>
        <goal>generate</goal>
      </goals>

      <configuration>
        <jdbc>

          <!-- Again, this URL has been set by groovy, above -->
          <url>${db.url}</url>
          <user>${db.username}</user>
          <password>${db.password}</password>
        </jdbc>
        <generator>
          <database>
            <inputSchema>public</inputSchema>
          </database>
          <target>
            <packageName>org.jooq.example.db</packageName>
          </target>
        </generator>
      </configuration>
    </execution>
  </executions>
</plugin>

And finally, optionally, integration test

If you want to re-use the above container with migrated database also in your integration tests, you could just pass along the generated JDBC URL to the maven-surefire-plugin as follows:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <systemPropertyVariables>

      <!-- Again, this URL has been set by groovy, above -->
      <db.url>${db.url}</db.url>
      <db.username>${db.username}</db.username>
      <db.password>${db.password}</db.password>
    </systemPropertyVariables>
  </configuration>
</plugin>

There are many ways to achieve the same thing; this is one that works decently out of the box. You can check out a full example from GitHub here, and play around with it:

https://github.com/jOOQ/jOOQ/tree/main/jOOQ-examples/jOOQ-testcontainers-flyway-example

More about testcontainers

To learn more about testcontainers, see our interview with Richard North here.

Using jOOQ to write vendor agnostic SQL with JPA’s native query or @Formula

If your legacy JPA application is using occasional native queries, Hibernate’s @Formula annotation, or Spring Data’s @Query annotation with vendor specific native SQL embedded in it, you can use jOOQ’s parsing connection and parsing data source to translate between dialects, without having to go all in on your jOOQ adoption – though I think that’s inevitable once you see what jOOQ can do for you.

Now, let’s design a table like this:

CREATE TABLE author (
  id INT NOT NULL,
  first_name TEXT,
  last_name TEXT NOT NULL,

  CONSTRAINT pk_author PRIMARY KEY (id)
);

Now, you may want to write a native query on this table using JPA’s EntityManager.createNativeQuery(), mapping that to entities. You could be using jOOQ’s DSL API for this, but let’s say you’re not ready for a migration to jOOQ yet, or you want to use actual SQL as provided by your DBA, not jOOQ’s DSL.

So, in MariaDB, you might be writing something like this:

List<Author> result =
em.createNativeQuery("""
    select a.id, nvl(a.first_name, 'N/A') as first_name, a.last_name
    from author as a
    order by a.id
    """, Author.class)
  .getResultList();

Where your entity is defined like this:

@Entity
@Table(name = "author")
public class Author {
    @Id
    public int id;

    @Column(name = "first_name")
    public String firstName;

    @Column(name = "last_name")
    public String lastName;

    // Constructors, getters, setters, equals, hashCode, etc
}

The above runs just fine and produces all the authors in MariaDB, which implemented support for Oracle’s NVL() function. But what about Oracle itself? The query fails on Oracle with:

ORA-00933: SQL command not properly ended

This is because in Oracle, you can’t use the AS keyword to alias tables, only to alias columns. Sure, you can remove that, but what about NVL()? You want this to work on MySQL and SQL Server, as well, but they complain:

MySQL

SQL Error [1305] [42000]: FUNCTION test.nvl does not exist

SQL Server

SQL Error [195] [S0010]: ‘nvl’ is not a recognized built-in function name.

Now, you have these options:

  • Use jOOQ to generate the SQL string for you, using the DSL
  • Use JPQL instead of a native query (but then heavily rewrite it, because JPQL is much less powerful than SQL)
  • Try your luck writing actual vendor agnostic SQL, manually
  • Or…

jOOQ’s parsing connection

You could use jOOQ’s parsing connection, which acts as a proxy to your actual connection, intercepting every SQL statement on the JDBC level in order to translate it to the target dialect.

This is as simple as wrapping your existing JDBC Connection or DataSource as follows:

DataSource originalDataSource = ...
DataSource parsingDataSource = DSL
    .using(originalDataSource, dialect)
    .parsingDataSource();

That’s it! Optionally, you can pass some additional configuration Settings after the dialect, e.g. (a sketch; the shown setting is illustrative only):
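
DataSource parsingDataSource = DSL
    .using(originalDataSource, dialect, new Settings()

        // Use whatever settings you need; this one is just an example
        .withRenderFormatted(true))
    .parsingDataSource();

The new DataSource can now run your SQL query on all the aforementioned dialects, e.g. you might see this in your DEBUG logs: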

On MySQL:

-- org.hibernate.SQL
   select a.id, nvl(a.first_name, 'N/A') as first_name, a.last_name 
   from author as a 
   order by a.id
-- org.jooq.impl.ParsingConnection Translating from:
   select a.id, nvl(a.first_name, 'N/A') as first_name, a.last_name
   from author as a 
   order by a.id
-- org.jooq.impl.ParsingConnection Translation cache miss: 
   select a.id, nvl(a.first_name, 'N/A') as first_name, a.last_name
   from author as a 
   order by a.id
-- org.jooq.impl.ParsingConnection Translating to: 
   select a.id, ifnull(a.first_name, 'N/A') as first_name, a.last_name
   from author as a 
   order by a.id

On SQL Server:

-- org.hibernate.SQL
   select a.id, nvl(a.first_name, 'N/A') as first_name, a.last_name 
   from author as a 
   order by a.id
-- org.jooq.impl.ParsingConnection Translating from:
   select a.id, nvl(a.first_name, 'N/A') as first_name, a.last_name
   from author as a 
   order by a.id
-- org.jooq.impl.ParsingConnection Translation cache miss: 
   select a.id, nvl(a.first_name, 'N/A') as first_name, a.last_name
   from author as a 
   order by a.id
-- org.jooq.impl.ParsingConnection Translating to: 
   select a.id, coalesce(a.first_name, 'N/A') first_name, a.last_name 
   from author a 
   order by a.id

Hibernate was tricked by jOOQ! The NVL function was translated to MySQL’s IFNULL, or to COALESCE on SQL Server, and the AS table aliasing keyword was removed from the SQL Server query. These are just simple examples; your actual SQL may be much more complex. Play around with the feature set online, here.

Also, the Settings.cacheParsingConnectionLRUCacheSize setting, which defaults to 8192, makes sure the same query doesn’t get re-translated all the time, so you won’t spend too much time in jOOQ’s parser.

@Formula too, not just native queries

A quick win in Hibernate, when you want to project additional values (akin to SQL’s own computed columns, which are available in many SQL dialects), is the @Formula annotation, which can be added to any entity. Assuming this additional column:

ALTER TABLE author ADD year_of_birth INT;

We might have the following amended entity:

@Entity
@Table(name = "author")
public class Author {
    @Id
    public int id;

    @Column(name = "first_name")
    public String firstName;

    @Column(name = "last_name")
    public String lastName;

    @Column(name = "year_of_birth")
    public Integer yearOfBirth;

    @Formula("year_of_birth between 1981 and 1996")
    public Boolean millenial;

    // Constructors, getters, setters, equals, hashCode, etc
}

But unfortunately, there are still so many RDBMS that don’t actually support boolean types, and the @Formula annotation is purely static, and doesn’t allow for vendor-specific overrides. Are we going to rewrite that SQL query manually to make sure a SQL-92, vendor agnostic native SQL fragment is present that works on all dialects?

Or will we just plug in jOOQ’s parsing connection again? Let’s try the latter with:

Author author = em.find(Author.class, 1);

The MySQL log contains:

-- org.hibernate.SQL
   select 
     jpaauthorw0_.id as id1_4_0_,
     jpaauthorw0_.first_name as first_na2_4_0_, 
     jpaauthorw0_.last_name as last_nam3_4_0_, 
     jpaauthorw0_.year_of_birth between 1981 and 1996 as formula1_0_ 
   from author jpaauthorw0_ 
   where jpaauthorw0_.id=?
-- org.jooq.impl.ParsingConnection Translating from: [...]
-- org.jooq.impl.ParsingConnection Translation cache miss: [...]
-- org.jooq.impl.ParsingConnection Translating to: 
   select 
     jpaauthorw0_.id as id1_4_0_, 
     jpaauthorw0_.first_name as first_na2_4_0_, 
     jpaauthorw0_.last_name as last_nam3_4_0_, 
     jpaauthorw0_.year_of_birth between 1981 and 1996 as formula1_0_ 
   from author as jpaauthorw0_ 
   where jpaauthorw0_.id = ?

As you can see, jOOQ re-added the AS keyword to alias the table on MySQL for good measure, because we like to be explicit about aliasing, and because that’s the default for Settings.renderOptionalAsKeywordForTableAliases.

Whereas the SQL Server log contains:

-- org.hibernate.SQL 
   select 
     jpaauthorw0_.id as id1_4_0_, 
     jpaauthorw0_.first_name as first_na2_4_0_, 
     jpaauthorw0_.last_name as last_nam3_4_0_, 
     jpaauthorw0_.year_of_birth between 1981 and 1996 as formula1_0_ 
   from author jpaauthorw0_ 
   where jpaauthorw0_.id=?
-- org.jooq.impl.ParsingConnection Translating from: [...]
-- org.jooq.impl.ParsingConnection Translation cache miss: [...]
-- org.jooq.impl.ParsingConnection Translating to: 
   select
     jpaauthorw0_.id id1_4_0_, 
     jpaauthorw0_.first_name first_na2_4_0_, 
     jpaauthorw0_.last_name last_nam3_4_0_, 
     case 
       when jpaauthorw0_.year_of_birth between 1981 and 1996 
         then 1 
       when not (jpaauthorw0_.year_of_birth between 1981 and 1996) 
         then 0 
     end formula1_0_ 
   from author jpaauthorw0_ 
   where jpaauthorw0_.id = ?

Notice the NULL-safe BOOLEAN type emulation: if YEAR_OF_BIRTH is NULL (i.e. UNKNOWN), then MILLENIAL must be NULL (i.e. UNKNOWN), too.

Spring Data @Query annotation

Another case where native SQL can appear in JPA integrations is Spring Data JPA with its @Query annotation, especially when used with @Query(nativeQuery = true). Just like Hibernate’s @Formula, this annotation is compile-time static, with no way to override the value of the native query at runtime, short of maybe subtyping repositories per dialect.

But why go through all that hassle? It’s always the same thing: just patch the DataSource with jOOQ’s parsing connection or parsing data source, and you’re set.
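
In a Spring application, for example, the wiring could be as simple as this sketch (bean setup and the chosen input dialect are hypothetical):

@Bean
public DataSource dataSource() {
    DataSource original = ...; // Your actual driver's DataSource

    // All JPA / Spring Data native queries now run through jOOQ's
    // translating proxy
    return DSL.using(original, SQLDialect.MYSQL)
              .parsingDataSource();
}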

Conclusion

Even if you’re not using jOOQ’s DSL API, you can profit from jOOQ in many ways in your existing JDBC, R2DBC, JPA, MyBatis, etc. based applications by hooking in the jOOQ parsing connection and translating your vendor specific input dialect to any number of configurable output dialects.

If jOOQ’s parser can’t handle a feature, chances are, you can work around this limitation using the ParseListener SPI, e.g. when you want to support a hypothetical LOGICAL_XOR predicate (which is supported natively in MySQL):

Query query = configuration
    .derive(ParseListener.onParseCondition(ctx -> {
        if (ctx.parseFunctionNameIf("LOGICAL_XOR")) {
            ctx.parse('(');
            Condition c1 = ctx.parseCondition();
            ctx.parse(',');
            Condition c2 = ctx.parseCondition();
            ctx.parse(')');

            return CustomCondition.of(c -> {
                switch (c.family()) {
                    case MARIADB:
                    case MYSQL:
                        c.visit(condition("{0} xor {1}", c1, c2));
                        break;
                    default:

                        // a XOR b == (a AND NOT b) OR (b AND NOT a)
                        c.visit(c1.andNot(c2).or(c2.andNot(c1)));
                        break;
                }
            });
        }

        // Let the parser take over if we don't know the token
        return null;
    }))
    .dsl()
    .parser()
    .parseQuery(
        "select * from t where logical_xor(t.a = 1, t.b = 2)"
    );
  
System.out.println(DSL.using(SQLDialect.MYSQL).render(query));
System.out.println(DSL.using(SQLDialect.ORACLE).render(query));

The above program will print:

-- MYSQL:
select * 
from t
where (t.a = 1 xor t.b = 2);

-- ORACLE:
select * 
from t 
where (t.a = 1 and not (t.b = 2)) or (t.b = 2 and not (t.a = 1));

So, profit from using jOOQ to migrate your application’s vendor specific SQL off one RDBMS onto another, or to support multiple RDBMS products in a single application, with or without using jOOQ’s DSL!

Side note: Query transformation

This isn’t the topic of this blog post, but once you have jOOQ parse your every SQL statement, you can also use jOOQ to transform this SQL and tamper with the expression tree, for example by implementing client side row level security. The possibilities are endless!