Standard SQL/JSON – The Sobering Parts

It’s been almost a year now since jOOQ 3.14 was released on October 19, 2020 with SQL/JSON (and SQL/XML) support. Half a year later, we’ve released jOOQ 3.15 with MULTISET support, which builds on top of these features to offer type-safe nested collections, the way every ORDBMS should implement them.

Building (dogfooding) on top of our own SQL/JSON API has revealed a lot of caveats in the various SQL/JSON implementations across vendors, and to be frank, it’s been a bit of a sobering experience. Despite there now being the ISO/IEC TR 19075-6 standard (mostly driven by Oracle this time), many vendors had already implemented some kind of JSON support, and it looks different in every dialect – to the extent that writing vendor-agnostic SQL/JSON is almost impossible in hand-written native SQL. You’ll need an API like jOOQ or some other abstraction to standardise over the different dialects.

In this article, I’d like to share a few of the biggest caveats encountered over the past year. There are many more, just try to translate some standard SQL/JSON to various dialects with jOOQ.

JSON Type or String?

JSON documents can be seen as simple strings with formatted content. Sure, I mean you can parse a string into a JSON document, so why complicate the SQL type system with new types?

Unfortunately, treating JSON as just another string has proven to be a poor decision by most vendors. PostgreSQL got this right by offering the dedicated JSON and JSONB types, the latter being quite performant even when storing and indexing JSON documents.
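
For comparison, here’s a minimal PostgreSQL sketch (hedged; using the vendor-specific jsonb_build_array function) showing that nested jsonb values keep their JSON type, even across a derived table:

select jsonb_build_array(nested, 3)
from (select jsonb_build_array(1, 2) as nested) as t
-- Produces: [[1, 2], 3]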

First off, and this comes as no surprise to anyone working with almost any language other than SQL: Types are semantic. Stringly typed environments are bad. Or as the quote goes:

The Bitterness of Poor Quality Remains Long After the Sweetness of Low Price is Forgotten — Benjamin Franklin

With “Low Price” being the low price of the quick and dirty stringly-typed feature addition without a formal type. We already had this time and again with BOOLEAN types, and dialects like MySQL that pretend that supporting this is a good idea:

SELECT *
FROM t
WHERE 1 OR 0 -- TRUE OR FALSE

Let’s look at examples of what I mean.

MariaDB and MySQL

Let’s look at MariaDB’s and MySQL’s syntax first. Here’s how to create a JSON array:

select json_array(1, 2)

Producing

[1, 2]

That’s great! And it even conforms to standard SQL syntax. A JSON array can also be nested easily:

select json_array(json_array(1, 2), 3)

To produce:

[[1, 2], 3]

Now, what if that nested array originates from a derived table?

select json_array(nested, 3)
from (select json_array(1, 2) nested) as t

It results in:

-- MySQL, MariaDB 10.6
[[1, 2], 3]

-- MariaDB 10.5
["[1, 2]", 3]

Bummer. In MariaDB 10.5, the nested array lost its “JSON type annotation” and went back to the stringly typed version of itself. As a string, it needs to be quoted. This seems to be a bug (https://jira.mariadb.org/browse/MDEV-26134), which has apparently already been fixed under a different issue in MariaDB 10.6. It’s not the only one, though. The bug tracker is full of similar issues, e.g. https://jira.mariadb.org/browse/MDEV-13701.

MySQL seems to fare a bit better by now, though there are caveats when aggregating (see further down).

The only workaround I’ve found so far for the above bugs is extremely laborious:

select json_array(json_merge_preserve('[]', nested), 3)
from (select json_array(1, 2) nested) as t

Imagine doing that every time you nest JSON. It works, but it really needs to be automated (e.g. through jOOQ).

Oracle

Oracle defined most of the SQL/JSON standard, and I really like how SQL-idiomatic its syntax feels, just like SQL/XML. Unfortunately, they waited until Oracle 21c to introduce a dedicated JSON type (can’t wait to play with it). Until then, we have to choose whether to represent JSON as VARCHAR2 by default (limited to 4000 bytes on some systems, or 32kb at most!) or as CLOB. If you’re doing serious JSON, you can probably always add the RETURNING CLOB clause to every single one of your JSON function calls. Effectively:

select json_array(1, 2 returning clob)
from dual

The nested version looks like this:

select json_array(nested, 3 returning clob)
from (select json_array(1, 2 returning clob) nested from dual) t;

This works much better in Oracle, but there are still many edge cases that it doesn’t get right. Try this on Oracle 18c:

select json_arrayagg(coalesce(json_object(), json_object()))
from dual

And it produces another instance of erroneously stringly-typed JSON:

["{}"]

The workaround is to add FORMAT JSON just about everywhere, just to be sure, e.g.

select json_arrayagg(
  coalesce(json_object(), json_object()) format json
)
from dual

Now, the result is as expected:

[{}]

To play it safe, you’ll probably have to write FORMAT JSON just about everywhere again, just like RETURNING CLOB.

Truncation

Truncation is probably the most sobering part of using JSON. Why do we still have size-limited data types in SQL? It hardly ever makes sense from an application perspective. Yet, here we are.

Oracle

Run this query in Oracle:

select json_arrayagg(owner) from all_objects;

And you’ll get:

SQL Error [40478] [99999]: ORA-40478: output value too large (maximum: 4000)

Increasing the maximum VARCHAR2 size to 32kb will only postpone the problem. There is no “reasonable” size limit to such documents, so again, you’ll have to RETURNING CLOB all the time.

select json_arrayagg(owner returning clob) from all_objects;

The price is the same as always. CLOBs are just a bit more annoying to work with than VARCHAR2, both within Oracle as well as in the client application (e.g. JDBC based), because you have to work with yet another resource when you just wanted a string. jOOQ will just add the clause everywhere for you; there’s hardly a reason not to do that with jOOQ, as the fetching of CLOB values is completely transparent to jOOQ users.

MySQL

Up until recently, you couldn’t call JSON_ARRAYAGG() in MySQL, and the MariaDB version crashed the server (https://jira.mariadb.org/browse/MDEV-21912). At the time this blog post was written, neither implementation supported the ORDER BY clause in JSON_ARRAYAGG(), which I find quite essential, so the workaround is to use GROUP_CONCAT:

select concat('[', group_concat(id), ']')
from t_book

Of course, that is very wrong if the concatenated values are not numbers, so we also need to use JSON_QUOTE, e.g.

select concat('[', group_concat(json_quote(title)), ']')
from t_book
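
Since GROUP_CONCAT() supports an ORDER BY clause inside the function call (unlike JSON_ARRAYAGG() at the time), the emulation can also guarantee a deterministic element order – a sketch:

select concat('[', group_concat(json_quote(title) order by title), ']')
from t_book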

And, if you’re embedding that stuff in other JSON structures, you have to turn what is now really a string back into JSON, using JSON_MERGE (now deprecated) or JSON_MERGE_PRESERVE, e.g.

select json_object(
  'titles', 
  json_merge_preserve(
    '[]', 
    concat('[', group_concat(json_quote(title)), ']')
  )
)
from t_book

To produce a document like this:

{"titles": ["1984", "Animal Farm", "O Alquimista", "Brida"]}

Without that JSON_MERGE_PRESERVE, you’d be getting:

{"titles": "[\"1984\",\"Animal Farm\",\"O Alquimista\",\"Brida\"]"}

Definitely not something you can ever remember.

Anyway. This section was about truncation! What happens with large, aggregated JSON documents in MySQL? Try this:

select 
  concat('[', group_concat(json_quote(table_name)), ']')
from information_schema.tables

It produces (on my machine):

["innodb_table_stats","innodb_index_stats","CHARACTER_SETS","CHECK_CONSTRAINTS","COLLATIONS","COLLATION_CHARACTER_SET_APPLICABILITY","COLUMNS","COLUMNS_EXTENSIONS","COLUMN_STATISTICS","EVENTS","FILES","INNODB_DATAFILES","INNODB_FOREIGN","INNODB_FOREIGN_COLS","INNODB_FIELDS","INNODB_TABLESPACES_BRIEF","KEY_COLUMN_USAGE","KEYWORDS","PARAMETERS","PARTITIONS","REFERENTIAL_CONSTRAINTS","RESOURCE_GROUPS","ROUTINES","SCHEMATA","SCHEMATA_EXTENSIONS","ST_SPATIAL_REFERENCE_SYSTEMS","ST_UNITS_OF_MEASURE","ST_GEOMETRY_COLUMNS","STATISTICS","TABLE_CONSTRAINTS","TABLE_CONSTRAINTS_EXTENSIONS","TABLES","TABLES_EXTENSIONS","TABLESPACES_EXTENSIONS","TRIGGERS","VIEW_ROUTINE_USAGE","VIEW_TABLE_USAGE","VIEWS","COLUMN_PRIVILEGES","ENGINES","OPTIMIZER_TRACE","PLUGINS","PROCESSLIST","PROFILING","SCHEMA_PRIVILEGES","TABLESPACES","TABLE_PRIVILEGES","USER_PRIVILEGES","cond_instances","error_log","events_waits_current","events_waits_history","events_waits_history_long","events_waits_summary_by_host_by_event_name","events_waits_summary_by]

Wait, what is that at the end?

,"events_waits_summary_by]

Invalid JSON. Because the GROUP_CONCAT string got truncated. We could set the following:

set @@group_concat_max_len = 4294967295;

And now the output is correct, and much longer:

["innodb_table_stats",...,"t_identity_pk","t_triggers"]

An API like jOOQ can set this session variable automatically for you, but with native SQL, do you really want to have to think about this every time?

Data Type Support

JSON knows a few data types. Namely:

  • string
  • number
  • boolean
  • null
  • object
  • array

That’s less than SQL, but often good enough (anything can be encoded as a string, after all). But when you don’t have a BOOLEAN type in SQL (e.g. MySQL, Oracle), then you have to manually map your already manual BOOLEAN encoding to JSON booleans.

MySQL

MySQL makes you believe that this works:

select json_array(true, false)

Because the above produces

[true, false]

But it doesn’t really work. That seems to be hard-coded in the parser. As soon as your true and false values are expressions instead of literals, e.g. originating from a derived table:

select json_array(t, f)
from (select true as t, false as f) as t

You’ll get:

[1, 0]

There are different ways to emulate this. One being:

select json_array(
  json_extract(case when t = 1 then 'true' when t = 0 then 'false' end, '$'),
  json_extract(case when f = 1 then 'true' when f = 0 then 'false' end, '$')
)
from (select true as t, false as f) as t

Now, we’re getting again:

[true, false]

Oracle

Unlike MySQL / MariaDB, Oracle SQL doesn’t pretend it has a BOOLEAN type. Instead, people encode it as a NUMBER(1) or CHAR(1) or some other thing. Irrespective of the encoding, this is the solution:

select json_array(
  case when t = 1 then 'true' when t = 0 then 'false' end format json,
  case when f = 1 then 'true' when f = 0 then 'false' end format json
)
from (
  select 1 as t, 0 as f
  from dual
) t

Producing

[true,false]

NULL handling

When using NULL with SQL/JSON, there are various caveats. First off, SQL NULL is not the same thing as JSON NULL. Using PostgreSQL:

select 
  a, 
  b, 
  a = b as equal, 
  a is null as a_is_null, 
  b is null as b_is_null
from (
  select null::jsonb as a, 'null'::jsonb as b
) as t

Produces:

|a  |b   |equal|a_is_null|b_is_null|
|---|----|-----|---------|---------|
|   |null|     |true     |false    |

Where the empty cells are SQL NULL values, and the null value is a JSON null value that is not “SQL NULL”. It’s the only reasonable thing to do, really. Do dialects agree?

MySQL

select 
  a, 
  b,
  a = b as equal, 
  a is null as a_is_null, 
  b is null as b_is_null
from (
  select null as a, json_extract('null', '$') as b
) as t

Producing also:

|a  |b   |equal|a_is_null|b_is_null|
|---|----|-----|---------|---------|
|   |null|     |1        |0        |

So, yes!
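
If in doubt, MySQL’s JSON_TYPE() function can tell the two apart – a small sketch (the 'NULL' string is MySQL’s way of reporting the JSON null type):

select
  json_type(json_extract('null', '$')) as json_null, -- 'NULL' (the JSON type name)
  json_type(null) as sql_null                        -- SQL NULL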

Oracle

Let’s see. This could be a way to create a JSON null value:

select 
  a, 
  b, 
  case when a = b then 1 else 0 end as equal, 
  case when a is null then 1 else 0 end as a_is_null, 
  case when b is null then 1 else 0 end as b_is_null
from (
  select null as a, json_value('[null]', '$[0]') as b
  from dual
) t

Yet, Oracle is known to have a funky relationship with NULL strings. This is the result:

|A  |B  |EQUAL|A_IS_NULL|B_IS_NULL|
|---|---|-----|---------|---------|
|   |   |0    |1        |1        |

There does not seem to be an actual JSON null representation! I haven’t found a workaround for this yet. Probably I will, eventually. But this is very unfortunate, as it leads to many conversion edge cases.

On the other hand, Oracle is the only dialect among the ones presented in this article that introduced a very useful NULL handling clause for aggregate functions. Check this out:

select
  json_arrayagg(a),
  json_arrayagg(a absent on null),
  json_arrayagg(a null on null)
from (
  select 1 as a from dual union all
  select null from dual
) t

Producing

|A  |B  |C       |
|---|---|--------|
|[1]|[1]|[1,null]|

Note the above query doesn’t produce the correct result on Oracle 18c yet, because of a parser / optimiser bug. Use this to work around the bug for this example:

select
  json_arrayagg(a) a,
  json_arrayagg(a + 0 absent on null) b,
  json_arrayagg(a - 0 null on null) c
from (
  select 1 as a from dual union all
  select null from dual
) t

Other dialects have different opinions on how to aggregate NULL values to JSON documents. In SQL, aggregate functions tend to ignore NULL values, such as Oracle above, but with JSON, it is often essential to include the value, especially when creating a JSON_OBJECT, where an absent key is not strictly the same thing as an absent value.
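
To illustrate the difference (a hedged Oracle sketch; the ON NULL clause is standard SQL/JSON syntax): an absent key and an explicitly null value are two distinct documents:

select
  json_object(key 'a' value null absent on null) as absent_key, -- {}
  json_object(key 'a' value null null on null) as null_value    -- {"a":null}
from dual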

Db2

Db2 has a very limited implementation of SQL/JSON. It is very standards-compliant in terms of syntax, but suffers from severe bugs that make it quite unusable for now. Surely, this will improve in the near future.

SQL Server

I’ve also largely left out SQL Server in this article. SQL Server has had JSON and XML support for a while, but implemented both completely differently: you can’t just form arbitrary JSON objects or arrays easily, but you can transform result sets into JSON in an intuitive way.

This offers a quick win when streaming results, but doesn’t compose very well. E.g. you can’t create a JSON_ARRAY with scalar, non-object content, though a JSON_OBJECT can be created like this:

select 1 as a, 2 as b
for json path, without_array_wrapper

Producing

{"a":1,"b":2}

A limited number of JSON features can also be emulated in SQL Server, and as with all dialects that don’t support an actual JSON type, constant escaping / unescaping may be required.
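
For example, SQL Server’s JSON_QUERY() can mark a string as being JSON already, so that FOR JSON doesn’t escape it again – a hedged sketch:

select json_query('[1,2]') as arr
for json path, without_array_wrapper

-- With json_query:             {"arr":[1,2]}
-- With a plain '[1,2]' string: {"arr":"[1,2]"}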

Conclusion

SQL/JSON was standardised relatively late, mostly by Oracle. The standard is very sound. But regrettably, a lot of dialects disagree on syntax and behaviour, and some are still quite buggy.

The soundest of all implementations is that of PostgreSQL, which introduced proper data types and offers a rich set of vendor-specific functions to manipulate JSON directly in SQL. In the near future, PostgreSQL will embrace standard SQL/JSON, just like Oracle, and enhance its own implementation. I don’t think there will be anything fundamentally new, just more portable standard syntax.

There is nothing insurmountable for a library like jOOQ, which abstracts over all the dialects and offers the SQL/JSON (and SQL/XML) functionality through a uniform, standards-inspired API. As dialects improve, jOOQ will adopt new, better syntax from individual vendors, so you can write your SQL/JSON queries forward-compatibly.

So, as always, use jOOQ to work around the subtle and boring differences of the various vendor syntaxes and start using SQL/JSON already today. Despite the caveats presented here, SQL/JSON is one of the most exciting things about modern SQL!

Read more:

jOOQ 3.15’s New Multiset Operator Will Change How You Think About SQL

This is how SQL should have been used all along.

They called it The Third Manifesto, ORDBMS, or other things. Regrettably, it never really took off. Because most vendors didn’t adopt it. And those who did, didn’t agree on syntax.

But this is about to change. Thanks to the now ubiquitous SQL/JSON support (which jOOQ 3.14 has already covered), we can now emulate the most powerful ORDBMS feature that you will want to use everywhere: Nested collections!

How We Used to do Things: With Joins

We’re going to be using the Sakila database for this example. It’s a DVD rental store with things like ACTOR, FILM, CATEGORY (of films) and other nice relational things. Let’s write a query for this requirement:

Get me all the films with their actors and their categories

Classically, we’d go ahead and use jOOQ to write:

ctx.select(
      FILM.TITLE,
      ACTOR.FIRST_NAME,
      ACTOR.LAST_NAME,
      CATEGORY.NAME
   )
   .from(ACTOR)
   .join(FILM_ACTOR)
     .on(ACTOR.ACTOR_ID.eq(FILM_ACTOR.ACTOR_ID))
   .join(FILM)
     .on(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
   .join(FILM_CATEGORY)
     .on(FILM.FILM_ID.eq(FILM_CATEGORY.FILM_ID))
   .join(CATEGORY)
     .on(FILM_CATEGORY.CATEGORY_ID.eq(CATEGORY.CATEGORY_ID))
   .orderBy(1, 2, 3, 4)
   .fetch();

And the result? Not so nice. A denormalised, flat table containing tons of repetition:

+----------------+----------+---------+-----------+
|title           |first_name|last_name|name       |
+----------------+----------+---------+-----------+
|ACADEMY DINOSAUR|CHRISTIAN |GABLE    |Documentary|
|ACADEMY DINOSAUR|JOHNNY    |CAGE     |Documentary|
|ACADEMY DINOSAUR|LUCILLE   |TRACY    |Documentary|
|ACADEMY DINOSAUR|MARY      |KEITEL   |Documentary|
|ACADEMY DINOSAUR|MENA      |TEMPLE   |Documentary|
|ACADEMY DINOSAUR|OPRAH     |KILMER   |Documentary|
|ACADEMY DINOSAUR|PENELOPE  |GUINESS  |Documentary|
|ACADEMY DINOSAUR|ROCK      |DUKAKIS  |Documentary|
|ACADEMY DINOSAUR|SANDRA    |PECK     |Documentary|
|ACADEMY DINOSAUR|WARREN    |NOLTE    |Documentary|
|ACE GOLDFINGER  |BOB       |FAWCETT  |Horror     |
|ACE GOLDFINGER  |CHRIS     |DEPP     |Horror     |
|ACE GOLDFINGER  |MINNIE    |ZELLWEGER|Horror     |
 ...

If we don’t consume this result unmodified as it is (e.g. when displaying tabular data to a user), we’d then go and de-duplicate things, shoehorning them back into some nested data structures (e.g. for consumption by some JSON-based UI), and spend hours trying to untangle the cartesian products between the 2 nested collections FILM -> ACTOR and FILM -> CATEGORY (because ACTOR and CATEGORY now form a cartesian product, which we didn’t want!)

In the worst case, we don’t even notice! This example database only has 1 category per film, but it is designed to support multiple categories.

jOOQ can help with that deduplication, but just look at the number of questions for “jOOQ many to many” on Stack Overflow! You’ll probably still have to write at least 2 queries to separate the nested collections.

ENTER the Stage: Multiset

The standard SQL <multiset value constructor> operator allows for collecting the data from a correlated subquery into a nested data structure, a MULTISET. Everything in SQL is a MULTISET, so the operator isn’t too surprising. But the nesting is where it shines.
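
In standard SQL, the multiset value constructor wraps a correlated subquery. Here’s a sketch of the standard syntax (hardly any dialect implements it natively, which is why jOOQ emulates it via SQL/JSON):

select
  f.title,
  multiset(
    select a.first_name, a.last_name
    from film_actor fa
    join actor a on fa.actor_id = a.actor_id
    where fa.film_id = f.film_id
  ) as actors
from film f

The previous query can now be re-written in jOOQ as follows: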

var result =
dsl.select(
      FILM.TITLE,
      multiset(
        select(
          FILM_ACTOR.actor().FIRST_NAME, 
          FILM_ACTOR.actor().LAST_NAME)
        .from(FILM_ACTOR)
        .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
      ).as("actors"),
      multiset(
        select(FILM_CATEGORY.category().NAME)
        .from(FILM_CATEGORY)
        .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))
      ).as("films")
   )
   .from(FILM)
   .orderBy(FILM.TITLE)
   .fetch();

Note, it’s not relevant to this task, but I’m using jOOQ’s type-safe implicit to-one join feature, which further helps tame the joins, syntactically. A matter of taste.

How to read this query? Easy:

  • Get all the films
  • For each FILM, get all the actors as a nested collection
  • For each FILM, get all the categories as a nested collection

You’re going to like Java 10’s var keyword even more after this :) Because what type is result? It’s of this type:

Result<Record3<
  String,                          // FILM.TITLE
  Result<Record2<String, String>>, // ACTOR.FIRST_NAME, ACTOR.LAST_NAME
  Result<Record1<String>>          // CATEGORY.NAME
>>

That’s quite something. Not too complex if you think about it. There’s a result with 3 columns:

  • TITLE
  • A first nested result that has 2 string columns: ACTOR.FIRST_NAME and ACTOR.LAST_NAME
  • A second nested result that has 1 string column: CATEGORY.NAME

Using var or other type inference mechanisms, you don’t have to denote this type. Or even better (stay tuned): We’ll type-safely map the structural type to our nominal DTO type hierarchy, with just a few additional lines of code. I’ll explain that later.

What Does the Result Look Like?

Calling toString() on the above Result type yields something like this:

+---------------------------+--------------------------------------------------+---------------+
|title                      |actors                                            |films          |
+---------------------------+--------------------------------------------------+---------------+
|ACADEMY DINOSAUR           |[(PENELOPE, GUINESS), (CHRISTIAN, GABLE), (LUCI...|[(Documentary)]|
|ACE GOLDFINGER             |[(BOB, FAWCETT), (MINNIE, ZELLWEGER), (SEAN, GU...|[(Horror)]     |
|ADAPTATION HOLES           |[(NICK, WAHLBERG), (BOB, FAWCETT), (CAMERON, ST...|[(Documentary)]|
|AFFAIR PREJUDICE           |[(JODIE, DEGENERES), (SCARLETT, DAMON), (KENNET...|[(Horror)]     |
|AFRICAN EGG                |[(GARY, PHOENIX), (DUSTIN, TAUTOU), (MATTHEW, L...|[(Family)]     |
|AGENT TRUMAN               |[(KIRSTEN, PALTROW), (SANDRA, KILMER), (JAYNE, ...|[(Foreign)]    |
 ...

Notice how we’re back to getting each FILM.TITLE entry only once (no duplication), and nested in each row are the subquery results. There’s no denormalisation happening!

When calling result.formatJSON() with the appropriate formatting options, we’ll get this representation:

[
  {
    "title": "ACADEMY DINOSAUR",
    "actors": [
      {
        "first_name": "PENELOPE",
        "last_name": "GUINESS"
      },
      {
        "first_name": "CHRISTIAN",
        "last_name": "GABLE"
      },
      {
        "first_name": "LUCILLE",
        "last_name": "TRACY"
      },
      {
        "first_name": "SANDRA",
        "last_name": "PECK"
      },

      ...
    ],
    "films": [
      { "name": "Documentary" }
    ]
  },
  {
    "title": "ACE GOLDFINGER",
    "actors": [
      {
        "first_name": "BOB",
        "last_name": "FAWCETT"
      },
  ...

Calling result.formatXML() would produce this:

<result>
  <record>
    <title>ACADEMY DINOSAUR</title>
    <actors>
      <result>
        <record>
          <first_name>PENELOPE</first_name>
          <last_name>GUINESS</last_name>
        </record>
        <record>
          <first_name>CHRISTIAN</first_name>
          <last_name>GABLE</last_name>
        </record>
        <record>
          <first_name>LUCILLE</first_name>
          <last_name>TRACY</last_name>
        </record>
        <record>
          <first_name>SANDRA</first_name>
          <last_name>PECK</last_name>
        </record>
        ...
      </result>
    </actors>
    <films>
      <result>
        <record>
          <name>Documentary</name>
        </record>
      </result>
    </films>
  </record>
  <record>
    <title>ACE GOLDFINGER</title>
    <actors>
      <result>
        <record>
          <first_name>BOB</first_name>
          <last_name>FAWCETT</last_name>
        </record>
        ...

You get the idea!

What’s the Generated SQL?

Just turn on jOOQ’s DEBUG logging and observe a query like this one (in PostgreSQL):

select
  film.title,
  (
    select coalesce(
      jsonb_agg(jsonb_build_object(
        'first_name', t.first_name,
        'last_name', t.last_name
      )),
      jsonb_build_array()
    )
    from (
      select 
        alias_78509018.first_name, 
        alias_78509018.last_name
      from (
        film_actor
          join actor as alias_78509018
            on film_actor.actor_id = alias_78509018.actor_id
        )
      where film_actor.film_id = film.film_id
    ) as t
  ) as actors,
  (
    select coalesce(
      jsonb_agg(jsonb_build_object('name', t.name)),
      jsonb_build_array()
    )
    from (
      select alias_130639425.name
      from (
        film_category
          join category as alias_130639425
            on film_category.category_id = alias_130639425.category_id
        )
      where film_category.film_id = film.film_id
    ) as t
  ) as films
from film
order by film.title

The Db2, MySQL, Oracle, SQL Server versions would look similar. Just try it on your Sakila database installation. It runs fast, too.

Mapping the Results to DTOs

Now, I promised to get rid of that lengthy structural type with all the generics. Check this out!

We used to call them POJOs (Plain Old Java Objects). Then DTOs (Data Transfer Objects). Now records. Yes, let’s try some Java 16 records here. (Note, records aren’t required for these examples. Any POJOs with appropriate constructors would do).

Wouldn’t it be nice if result were of this type:

record Actor(String firstName, String lastName) {}
record Film(
  String title,
  List<Actor> actors,
  List<String> categories
) {}

List<Film> result = ...

Structural typing is essential to jOOQ and its type safe query system, but in your code, you probably don’t want to have those merge-conflict lookalike generics all the time, and even var won’t help you if your data needs to be returned from a method.

So, let’s transform our jOOQ query to one that produces List<Film>, step by step. We’re starting with the original query, untouched:

Result<Record3<
    String, 
    Result<Record2<String, String>>, 
    Result<Record1<String>>
>> result =
dsl.select(
      FILM.TITLE,
      multiset(
        select(
          FILM_ACTOR.actor().FIRST_NAME, 
          FILM_ACTOR.actor().LAST_NAME)
        .from(FILM_ACTOR)
        .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
      ).as("actors"),
      multiset(
        select(FILM_CATEGORY.category().NAME)
        .from(FILM_CATEGORY)
        .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))
      ).as("films")
   )
   .from(FILM)
   .orderBy(FILM.TITLE)
   .fetch();

Now, we’re going to map the Actor in the first MULTISET expression. This can be done as follows, using the new Field.convertFrom() convenience method, which allows you to turn a Field<T> into any read-only Field<U> for ad-hoc usage. A simple example would be this:

record Title(String title) {}

// Use this field in any query
Field<Title> title = FILM.TITLE.convertFrom(Title::new);

It’s just an easy, new way to attach a read-only Converter to a Field for single usage, instead of doing that with the code generator.

Applied to the original query:

Result<Record3<
    String, 
    List<Actor>,  // A bit nicer already
    Result<Record1<String>>
>> result =
dsl.select(
      FILM.TITLE,
      multiset(
        select(
          FILM_ACTOR.actor().FIRST_NAME, 
          FILM_ACTOR.actor().LAST_NAME)
        .from(FILM_ACTOR)
        .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))

      // Magic here: vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
      ).as("actors").convertFrom(r -> r.map(mapping(Actor::new))),

      multiset(
        select(FILM_CATEGORY.category().NAME)
        .from(FILM_CATEGORY)
        .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))
      ).as("films")
   )
   .from(FILM)
   .orderBy(FILM.TITLE)
   .fetch();

What are we doing here?

  • The method convertFrom() takes a lambda Result<Record2<String, String>> -> Actor.
  • The Result type is the usual jOOQ Result, which has a Result.map(RecordMapper<R, E>) method.
  • The mapping() method is the new Records.mapping(), which isn’t doing much, just turning a constructor reference of type Function2<String, String, Actor> into a RecordMapper, which can then be used to turn a Result<Record2<String, String>> into a List<Actor>.

And it type checks! Try it yourself. If you add a column to the multiset, you’ll get a compilation error. If you add/remove an attribute from the Actor record, you’ll get a compilation error. No reflection here, just declarative mapping of jOOQ results/records to custom List<UserType>. If you prefer the “old” reflection approach using jOOQ’s ubiquitous into() methods, you can still do that, too:

Result<Record3<
    String, 
    List<Actor>,  // A bit nicer already
    Result<Record1<String>>
>> result =
dsl.select(
      FILM.TITLE,
      multiset(
        select(
          FILM_ACTOR.actor().FIRST_NAME, 
          FILM_ACTOR.actor().LAST_NAME)
        .from(FILM_ACTOR)
        .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))

      // Magic here: vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
      ).as("actors").convertFrom(r -> r.into(Actor.class))),

      multiset(
        select(FILM_CATEGORY.category().NAME)
        .from(FILM_CATEGORY)
        .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))
      ).as("films")
   )
   .from(FILM)
   .orderBy(FILM.TITLE)
   .fetch();

The result still type checks, but the conversion from Result<Record2<String, String>> to List<Actor> no longer does; it uses reflection instead.

Let’s continue. Let’s remove the clumsy category Result<Record1<String>>. We could’ve added another record, but in this case, a List<String> will suffice.

Result<Record3<
    String, 
    List<Actor>,
    List<String>  // Begone, jOOQ structural type!
>> result =
dsl.select(
      FILM.TITLE,
      multiset(
        select(
          FILM_ACTOR.actor().FIRST_NAME, 
          FILM_ACTOR.actor().LAST_NAME)
        .from(FILM_ACTOR)
        .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
      ).as("actors").convertFrom(r -> r.map(mapping(Actor::new))),
      multiset(
        select(FILM_CATEGORY.category().NAME)
        .from(FILM_CATEGORY)
        .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))

      // Magic.     vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
      ).as("films").convertFrom(r -> r.map(Record1::value1))
   )
   .from(FILM)
   .orderBy(FILM.TITLE)
   .fetch();

And finally, the outermost Result<Record3<...>> to List<Film> conversion:

List<Film> result =
dsl.select(
      FILM.TITLE,
      multiset(
        select(
          FILM_ACTOR.actor().FIRST_NAME, 
          FILM_ACTOR.actor().LAST_NAME)
        .from(FILM_ACTOR)
        .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
      ).as("actors").convertFrom(r -> r.map(mapping(Actor::new))),
      multiset(
        select(FILM_CATEGORY.category().NAME)
        .from(FILM_CATEGORY)
        .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))
      ).as("films").convertFrom(r -> r.map(Record1::value1))
   )
   .from(FILM)
   .orderBy(FILM.TITLE)

   //     vvvvvvvvvvvvvvvvvv grande finale
   .fetch(mapping(Film::new));

This time, we don’t need the Field.convertFrom() method. Just using the Records.mapping() auxiliary will be sufficient.

A More Complex Example

The previous example showed the nesting of two independent collections, which is quite hard with classic JOIN-based SQL or ORMs. How about a much more complex example, where we nest things 2 levels deep, one of them even being an aggregation? The requirement is:

Give me all the films, the actors that played in the film, the categories that categorise the film, the customers that have rented the film, and all the payments per customer for that film

I won’t even show a JOIN-based approach. Let’s dive directly into MULTISET and the new, synthetic MULTISET_AGG aggregate function. Here’s how to do this with jOOQ. Now, check out that beautiful result type:

Result<Record4<
    String,                   // FILM.TITLE
    Result<Record2<
        String,               // ACTOR.FIRST_NAME
        String                // ACTOR.LAST_NAME
    >>,                       // "actors"
    Result<Record1<String>>,  // CATEGORY.NAME
    Result<Record4<
        String,               // CUSTOMER.FIRST_NAME
        String,               // CUSTOMER.LAST_NAME
        Result<Record2<
            LocalDateTime,    // PAYMENT.PAYMENT_DATE
            BigDecimal        // PAYMENT.AMOUNT
        >>, 
        BigDecimal            // "total"
    >>                        // "customers"
>> result = 
dsl.select(

        // Get the films
        FILM.TITLE,

        // ... and all actors that played in the film
        multiset(
            select(
                FILM_ACTOR.actor().FIRST_NAME,
                FILM_ACTOR.actor().LAST_NAME
            )
            .from(FILM_ACTOR)
            .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
        ).as("actors"),

        // ... and all categories that categorise the film
        multiset(
            select(FILM_CATEGORY.category().NAME)
            .from(FILM_CATEGORY)
            .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))
        ).as("categories"),

        // ... and all customers who rented the film, as well
        // as their payments
        multiset(
            select(
                PAYMENT.rental().customer().FIRST_NAME,
                PAYMENT.rental().customer().LAST_NAME,
                multisetAgg(
                    PAYMENT.PAYMENT_DATE,
                    PAYMENT.AMOUNT
                ).as("payments"),
                sum(PAYMENT.AMOUNT).as("total"))
            .from(PAYMENT)
            .where(PAYMENT
                .rental().inventory().FILM_ID.eq(FILM.FILM_ID))
            .groupBy(
                PAYMENT.rental().customer().CUSTOMER_ID,
                PAYMENT.rental().customer().FIRST_NAME,
                PAYMENT.rental().customer().LAST_NAME)
        ).as("customers")
    )
    .from(FILM)
    .where(FILM.TITLE.like("A%"))
    .orderBy(FILM.TITLE)
    .limit(5)
    .fetch();

You’ll be using var, of course, rather than denoting this insane type, but I wanted to spell the type out explicitly for the sake of the example.

Note again how helpful implicit joins are here. Since we wanted to aggregate all the payments per customer, we can just select from PAYMENT and group by the payment’s PAYMENT.rental().customer(), and correlate the subquery by PAYMENT.rental().inventory().FILM_ID, without any extra effort.

The executed SQL looks like this, where you can see the generated implicit joins (run it on your PostgreSQL Sakila database!):

select
  film.title,
  (
    select coalesce(
      jsonb_agg(jsonb_build_object(
        'first_name', t.first_name,
        'last_name', t.last_name
      )),
      jsonb_build_array()
    )
    from (
      select alias_78509018.first_name, alias_78509018.last_name
      from (
        film_actor
          join actor as alias_78509018
            on film_actor.actor_id = alias_78509018.actor_id
        )
      where film_actor.film_id = film.film_id
    ) as t
  ) as actors,
  (
    select coalesce(
      jsonb_agg(jsonb_build_object('name', t.name)),
      jsonb_build_array()
    )
    from (
      select alias_130639425.name
      from (
        film_category
          join category as alias_130639425
            on film_category.category_id = 
               alias_130639425.category_id
        )
      where film_category.film_id = film.film_id
    ) as t
  ) as categories,
  (
    select coalesce(
      jsonb_agg(jsonb_build_object(
        'first_name', t.first_name,
        'last_name', t.last_name,
        'payments', t.payments,
        'total', t.total
      )),
      jsonb_build_array()
    )
    from (
      select
        alias_63965917.first_name,
        alias_63965917.last_name,
        jsonb_agg(jsonb_build_object(
          'payment_date', payment.payment_date,
          'amount', payment.amount
        )) as payments,
        sum(payment.amount) as total
      from (
        payment
          join (
            rental as alias_102068213
              join customer as alias_63965917
                on alias_102068213.customer_id = 
                   alias_63965917.customer_id
              join inventory as alias_116526225
                on alias_102068213.inventory_id = 
                   alias_116526225.inventory_id
          )
            on payment.rental_id = alias_102068213.rental_id
        )
      where alias_116526225.film_id = film.film_id
      group by 
        alias_63965917.customer_id, 
        alias_63965917.first_name, 
        alias_63965917.last_name
    ) as t
  ) as customers
from film
where film.title like 'A%'
order by film.title
fetch next 5 rows only

The result, in JSON, now looks something like this:

[
  {
    "title": "ACADEMY DINOSAUR",
    "actors": [
      {
        "first_name": "PENELOPE",
        "last_name": "GUINESS"
      },
      {
        "first_name": "CHRISTIAN",
        "last_name": "GABLE"
      },
      {
        "first_name": "LUCILLE",
        "last_name": "TRACY"
      },
      ...
    ],
    "categories": [{ "name": "Documentary" }],
    "customers": [
      {
        "first_name": "SUSAN",
        "last_name": "WILSON",
        "payments": [
          {
            "payment_date": "2005-07-31T22:08:29",
            "amount": 0.99
          }
        ],
        "total": 0.99
      },
      {
        "first_name": "REBECCA",
        "last_name": "SCOTT",
        "payments": [
          {
            "payment_date": "2005-08-18T18:36:16",
            "amount": 0.99
          }
        ],
        "total": 0.99
      },
      ...

That’s it. Nesting collections in arbitrary ways is completely effortless and intuitive. No N+1, no deduplication. Just declare the results in exactly the form you require in your client.

There is no other way to pull off this complexity with such ease than to let your RDBMS do the heavy lifting of planning and running such a query, while jOOQ does the mapping.

Conclusion

We had this kind of functionality all along. We just never used it, or not enough. Why? Because client APIs did not make it accessible enough. Because RDBMS vendors didn’t agree enough on a syntax.

That’s now over. jOOQ uses the standard SQL MULTISET syntax in its DSL API and enhances it with the synthetic MULTISET_AGG aggregate function, the way all RDBMS should have implemented it (go Informix, Oracle). We can wait another 40 years for the other RDBMS to implement this, or we can just use jOOQ today.

And, I cannot stress this enough:

  • This is all type safe
  • There is no reflection
  • The nesting is done in the database using SQL (via SQL/XML or SQL/JSON for now)
  • … So, execution planners can optimise your entire query
  • … No extra columns or extra queries or other extra work is performed in the database

This works on all dialects that have either SQL/XML or SQL/JSON support (or both), including the major popular dialects:

  • MySQL
  • Oracle
  • PostgreSQL
  • SQL Server

And it is offered under the usual license terms of jOOQ. So, happy nesting of collections.

Addendum: Using SQL/XML or SQL/JSON directly

You may be tempted to use this everywhere, and rightfully so. But beware: if your SQL client consumes XML or JSON directly, there’s no need to use MULTISET. Use jOOQ’s native SQL/XML or SQL/JSON support that was introduced in jOOQ 3.14. That way, you won’t convert from JSON to jOOQ results and back to JSON, but stream the JSON (or XML) to your frontend directly.
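
For example, a minimal PostgreSQL sketch (using the vendor-specific json_agg and json_build_object functions, which is roughly what jOOQ generates anyway) that produces the final document directly in SQL:

select json_agg(json_build_object('title', f.title)) as films
from film f

The resulting JSON value can be handed to the frontend as-is, without any intermediate mapping.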