Standard SQL/JSON – The Sobering Parts

It’s been almost 1 year now since jOOQ 3.14 was released on October 19, 2020 with SQL/JSON (and SQL/XML) support. Half a year later, we’ve released jOOQ 3.15 with MULTISET support, which builds on top of these features to offer type-safe nested collections, the way every ORDBMS should implement them.

Building (dogfooding) on top of our own SQL/JSON API has revealed a lot of caveats of the various SQL/JSON implementations across vendors, and to be frank, it’s been a bit of a sobering experience. Despite there now being the ISO/IEC TR 19075-6 standard (mostly driven by Oracle this time), many vendors have already implemented some kind of JSON support, and it looks different in every dialect – to the extent that writing vendor-agnostic SQL/JSON by hand in native SQL is almost impossible. You’ll need an API like jOOQ or any other abstraction to standardise the different dialects.

In this article, I’d like to share a few of the biggest caveats encountered over the past year. There are many more; just try translating some standard SQL/JSON to various dialects with jOOQ.

JSON Type or String?

JSON documents can be seen as simple strings with formatted content. Sure, you can parse a string into a JSON document, so why complicate the SQL type system with new types?

Unfortunately, most vendors chose exactly that, and it proves to be a poor decision. PostgreSQL got this right by offering the JSON and JSONB types, the latter being quite performant even when storing and indexing JSON documents.
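
To illustrate with a quick PostgreSQL sketch (my own illustrative example): because JSONB is a real type, a JSON value stays JSON when you nest it, and never degrades into a quoted string:

select pg_typeof(jsonb_build_array(1, 2)),           -- jsonb
       jsonb_build_array(jsonb_build_array(1, 2), 3) -- [[1, 2], 3], not ["[1, 2]", 3]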

First off, and this comes as no surprise to anyone working with almost any language other than SQL: types are semantic. Stringly typed environments are bad. Or as the quote goes:

The Bitterness of Poor Quality Remains Long After the Sweetness of Low Price is Forgotten — Benjamin Franklin

With “Low Price” being the low price of the quick and dirty stringly-typed feature addition without a formal type. We’ve seen this time and again with BOOLEAN types, and dialects like MySQL that pretend that supporting this is a good idea:

SELECT *
FROM t
WHERE 1 OR 0 -- TRUE OR FALSE

Let’s look at examples of what I mean.

MariaDB and MySQL

Let’s look at MariaDB’s and MySQL’s syntax first. Here’s how to create a JSON array:

select json_array(1, 2)

Producing

[1, 2]

That’s great! And it even conforms to standard SQL syntax. A JSON array can also be nested easily:

select json_array(json_array(1, 2), 3)

To produce:

[[1, 2], 3]

Now, what if that nested array originates from a derived table?

select json_array(nested, 3)
from (select json_array(1, 2) nested) as t

It results in:

-- MySQL, MariaDB 10.6
[[1, 2], 3]

-- MariaDB 10.5
["[1, 2]", 3]

Bummer. In MariaDB 10.5, the nested array lost its “JSON type annotation” and went back to the stringly typed version of itself. As a string, it needs to be quoted. This seems to be a bug (https://jira.mariadb.org/browse/MDEV-26134), which has apparently already been fixed under a different issue in MariaDB 10.6. It’s not the only one, though. The bug tracker is full of similar issues: https://jira.mariadb.org/browse/MDEV-13701.

MySQL seems to fare a bit better by now, though there are caveats when aggregating (see further down).

The only workaround I’ve found so far for the above bugs is extremely laborious:

select json_array(json_merge_preserve('[]', nested), 3)
from (select json_array(1, 2) nested) as t

Imagine doing that every time you nest JSON. It works, but it really needs to be automated (e.g. through jOOQ).

Oracle

Oracle defined most of the SQL/JSON standard, and I really like how SQL-idiomatic its syntax feels, just like SQL/XML. Unfortunately, they waited with the introduction of the new JSON type until Oracle 21c (can’t wait to play with it). Until then, we have to choose whether to represent JSON as VARCHAR2 by default (limited to 4000 bytes on some systems, or 32kb at most!) or as CLOB. If you’re doing serious JSON, you can probably always add the RETURNING CLOB clause to every single JSON function call of yours. Effectively:

select json_array(1, 2 returning clob)
from dual

The nested version looks like this:

select json_array(nested, 3 returning clob)
from (select json_array(1, 2 returning clob) nested from dual) t;

This works much better in Oracle, but there are still many edge cases that it doesn’t get right. Try this on Oracle 18c:

select json_arrayagg(coalesce(json_object(), json_object()))
from dual

And it produces another instance of erroneously stringly-typed JSON:

["{}"]

The workaround is to add FORMAT JSON just about everywhere, just to be sure, e.g.

select json_arrayagg(
  coalesce(json_object(), json_object()) format json
)
from dual

Now, the result is as expected:

[{}]

To play it safe, you’ll probably have to write FORMAT JSON just about everywhere again, just like RETURNING CLOB.
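
For example, a defensive version of the earlier nesting query might combine both clauses (a sketch; in this particular query, FORMAT JSON isn’t strictly required):

select json_array(nested format json, 3 returning clob)
from (select json_array(1, 2 returning clob) nested from dual) t;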

Truncation

Truncation is probably the most sobering part of using JSON. Why do we still have size-limited data types in SQL? It hardly ever makes sense from an application perspective. Yet, here we are.

Oracle

Run this query in Oracle:

select json_arrayagg(owner) from all_objects;

And you’ll get:

SQL Error [40478] [99999]: ORA-40478: output value too large (maximum: 4000)

Increasing the maximum VARCHAR2 size to 32kb will only postpone the problem. There is no “reasonable” size limit to such documents, so again, you’ll have to RETURNING CLOB all the time.

select json_arrayagg(owner returning clob) from all_objects;

The price is the same as always. CLOBs are just a bit more annoying to work with than VARCHAR2, both within Oracle as well as in the client application (e.g. JDBC based), because you have to work with yet another resource when you just wanted a string. jOOQ will just add the clause everywhere for you; there’s hardly a reason not to, as the fetching of CLOB values is completely transparent to jOOQ users.

MySQL

Up until recently, you couldn’t call JSON_ARRAYAGG() in MySQL, and the MariaDB version crashed the server (https://jira.mariadb.org/browse/MDEV-21912). At the time of writing, neither implementation supported the ORDER BY clause in JSON_ARRAYAGG(), which I find quite essential, so the workaround is to use GROUP_CONCAT:

select concat('[', group_concat(id), ']')
from t_book

Of course, that is very wrong if the concatenated values are not numbers, so we need to also use JSON_QUOTE, e.g.

select concat('[', group_concat(json_quote(title)), ']')
from t_book

And, if you’re embedding that stuff in other JSON structures, you have to turn what is now really a string back into JSON, using JSON_MERGE (now deprecated) or JSON_MERGE_PRESERVE, e.g.

select json_object(
  'titles', 
  json_merge_preserve(
    '[]', 
    concat('[', group_concat(json_quote(title)), ']')
  )
)
from t_book

To produce a document like this:

{"titles": ["1984", "Animal Farm", "O Alquimista", "Brida"]}

Without that JSON_MERGE_PRESERVE, you’d be getting:

{"titles": "[\"1984\",\"Animal Farm\",\"O Alquimista\",\"Brida\"]"}

Definitely not something you can ever remember.

Anyway. This section was about truncation! What happens with large, aggregated JSON documents in MySQL? Try this:

select 
  concat('[', group_concat(json_quote(table_name)), ']')
from information_schema.tables

It produces (on my machine):

["innodb_table_stats","innodb_index_stats","CHARACTER_SETS","CHECK_CONSTRAINTS","COLLATIONS","COLLATION_CHARACTER_SET_APPLICABILITY","COLUMNS","COLUMNS_EXTENSIONS","COLUMN_STATISTICS","EVENTS","FILES","INNODB_DATAFILES","INNODB_FOREIGN","INNODB_FOREIGN_COLS","INNODB_FIELDS","INNODB_TABLESPACES_BRIEF","KEY_COLUMN_USAGE","KEYWORDS","PARAMETERS","PARTITIONS","REFERENTIAL_CONSTRAINTS","RESOURCE_GROUPS","ROUTINES","SCHEMATA","SCHEMATA_EXTENSIONS","ST_SPATIAL_REFERENCE_SYSTEMS","ST_UNITS_OF_MEASURE","ST_GEOMETRY_COLUMNS","STATISTICS","TABLE_CONSTRAINTS","TABLE_CONSTRAINTS_EXTENSIONS","TABLES","TABLES_EXTENSIONS","TABLESPACES_EXTENSIONS","TRIGGERS","VIEW_ROUTINE_USAGE","VIEW_TABLE_USAGE","VIEWS","COLUMN_PRIVILEGES","ENGINES","OPTIMIZER_TRACE","PLUGINS","PROCESSLIST","PROFILING","SCHEMA_PRIVILEGES","TABLESPACES","TABLE_PRIVILEGES","USER_PRIVILEGES","cond_instances","error_log","events_waits_current","events_waits_history","events_waits_history_long","events_waits_summary_by_host_by_event_name","events_waits_summary_by]

Wait, what is that at the end?

,"events_waits_summary_by]

Invalid JSON, because the GROUP_CONCAT string got truncated. We could set the following:

set @@group_concat_max_len = 4294967295;

And now the output is correct, and much longer:

["innodb_table_stats",...,"t_identity_pk","t_triggers"]

An API like jOOQ can set this session variable automatically for you, but with native SQL, you probably don’t want to have to think about this every time.

Data Type Support

JSON knows a few data types. Namely:

  • string
  • number
  • boolean
  • null
  • object
  • array

That’s less than SQL, but often good enough (anything can be encoded as a string, after all). But when you don’t have a BOOLEAN type in SQL (e.g. MySQL, Oracle), you have to manually map your already manual encoding to a JSON BOOLEAN.

MySQL

MySQL makes you believe that this works:

select json_array(true, false)

Because the above produces

[true, false]

But it doesn’t really work. That seems to be hard coded in the parser. As soon as your true and false values are expressions instead of literals, e.g. originating from a derived table:

select json_array(t, f)
from (select true as t, false as f) as t

You’ll get:

[1, 0]

There are different ways to emulate this. One being:

select json_array(
  json_extract(case when t = 1 then 'true' when t = 0 then 'false' end, '$'),
  json_extract(case when f = 1 then 'true' when f = 0 then 'false' end, '$')
)
from (select true as t, false as f) as t

Now, we’re getting again:

[true, false]

Oracle

Unlike MySQL / MariaDB, Oracle SQL doesn’t pretend it has a BOOLEAN type. Instead, people encode it as a NUMBER(1) or CHAR(1) or some other thing. Irrespective of the encoding, this is the solution:

select json_array(
  case when t = 1 then 'true' when t = 0 then 'false' end format json,
  case when f = 1 then 'true' when f = 0 then 'false' end format json
)
from (
  select 1 as t, 0 as f, null as n
  from dual
) t

Producing

[true,false]

NULL handling

When using NULL with SQL/JSON, there are various caveats. First off, SQL NULL is not the same thing as JSON NULL. Using PostgreSQL:

select 
  a, 
  b, 
  a = b as equal, 
  a is null as a_is_null, 
  b is null as b_is_null
from (
  select null::jsonb as a, 'null'::jsonb as b
) as t

Produces:

|a  |b   |equal|a_is_null|b_is_null|
|---|----|-----|---------|---------|
|   |null|     |true     |false    |

Where the empty cells are SQL NULL values, and the null value is a JSON null value that is not “SQL NULL”. It’s the only reasonable thing to do, really. Do dialects agree?

MySQL

select 
  a, 
  b,
  a = b as equal, 
  a is null as a_is_null, 
  b is null as b_is_null
from (
  select null as a, json_extract('null', '$') as b
) as t

Producing also:

|a  |b   |equal|a_is_null|b_is_null|
|---|----|-----|---------|---------|
|   |null|     |1        |0        |

So, yes!

Oracle

Let’s see. This could be how to create a JSON NULL value:

select 
  a, 
  b, 
  case when a = b then 1 else 0 end as equal, 
  case when a is null then 1 else 0 end as a_is_null, 
  case when b is null then 1 else 0 end as b_is_null
from (
  select null as a, json_value('[null]', '$[0]') as b
  from dual
) t

Yet, Oracle is known to have a funky relationship with NULL strings. This is the result:

|A  |B  |EQUAL|A_IS_NULL|B_IS_NULL|
|---|---|-----|---------|---------|
|   |   |0    |1        |1        |

There does not seem to be an actual JSON NULL representation! I couldn’t find a workaround for this yet. Probably I will. But this is very unfortunate, leading to many conversion edge cases.

On the other hand, Oracle is the only dialect among the ones presented in this article that introduced a very useful NULL handling clause for aggregate functions. Check this out:

select
  json_arrayagg(a),
  json_arrayagg(a absent on null),
  json_arrayagg(a null on null)
from (
  select 1 as a from dual union all
  select null from dual
) t

Producing

|A  |B  |C       |
|---|---|--------|
|[1]|[1]|[1,null]|

Note the above query doesn’t produce the correct result on Oracle 18c yet, because of a parser / optimiser bug. Use this to work around the bug for this example:

select
  json_arrayagg(a) a,
  json_arrayagg(a + 0 absent on null) b,
  json_arrayagg(a - 0 null on null) c
from (
  select 1 as a from dual union all
  select null from dual
) t

Other dialects have different opinions on how to aggregate NULL values into JSON documents. In SQL, aggregate functions tend to ignore NULL values, as Oracle does above, but with JSON, it is often essential to include the value, especially when creating a JSON_OBJECT, where an absent key is not strictly the same thing as a key with a JSON null value.
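
As a PostgreSQL sketch of that distinction (using vendor-specific functions purely for illustration), a key with a JSON null is not the same thing as a stripped / absent key:

select jsonb_build_object('a', null),                    -- {"a": null}
       jsonb_strip_nulls(jsonb_build_object('a', null)); -- {}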

Db2

Db2 has a very limited implementation of SQL/JSON. It is very standards compliant in terms of syntax, but suffers from severe bugs that make it quite unusable for now. Surely, this will improve in the near future.

SQL Server

I’ve also left out SQL Server in this article. SQL Server has had JSON and XML support for a while, but implemented it completely differently. You can’t just form arbitrary JSON objects or arrays easily, but you can transform result sets into JSON in an intuitive way.

This offers a quick win when streaming results, but doesn’t compose very well. E.g. you can’t create a JSON_ARRAY with scalar, non-object content, though a JSON_OBJECT can be created like this:

select 1 as a, 2 as b
for json path, without_array_wrapper

Producing

{"a":1,"b":2}

A limited number of JSON features can also be emulated in SQL Server, and as with all the dialects that don’t support an actual JSON type, constant escaping / unescaping may be required.
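
For instance, a JSON array of scalars can be emulated with string aggregation and manual escaping. A sketch, assuming SQL Server 2017+ (for STRING_AGG) and the Sakila FILM table:

select '[' + string_agg('"' + string_escape(title, 'json') + '"', ',') + ']'
from film;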

Conclusion

SQL/JSON was standardised relatively late, mostly by Oracle. The standard is very sound. But regrettably, a lot of dialects disagree on syntax and behaviour, and some are still quite buggy.

The soundest of all implementations is that of PostgreSQL, which introduced proper data types and offers a rich set of vendor-specific functions to manipulate JSON directly in SQL. In the near future, PostgreSQL will embrace standard SQL/JSON, just like Oracle, and enhance its own implementation. I don’t think there will be anything fundamentally new, just more portable standard syntax.
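
A small taste of that vendor-specific toolbox (a sketch with made-up values):

select doc -> 'a',                          -- {"b": 1}
       doc ->> 'missing',                   -- SQL NULL
       jsonb_set(doc, '{a,b}', '2'::jsonb)  -- {"a": {"b": 2}}
from (select '{"a": {"b": 1}}'::jsonb as doc) as t;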

There is nothing insurmountable for a library like jOOQ, which abstracts over all the dialects and offers the SQL/JSON (and SQL/XML) functionality using a uniform, standards-inspired API. As dialects improve, jOOQ will adopt new, better syntax by individual vendors, so you can implement your SQL/JSON queries forwards compatibly.

So, as always, use jOOQ to work around the subtle and boring differences of the various vendor syntaxes and start using SQL/JSON already today. Despite the caveats presented here, SQL/JSON is one of the most exciting things about modern SQL!

Read more:

jOOQ 3.15’s New Multiset Operator Will Change How You Think About SQL

This is how SQL should have been used all along.

They called it The Third Manifesto, ORDBMS, or other things. Regrettably, it never really took off. Because most vendors didn’t adopt it. And those who did, didn’t agree on syntax.

But this is about to change. Thanks to the now ubiquitous SQL/JSON support (which jOOQ 3.14 has already covered), we can now emulate the most powerful ORDBMS feature that you will want to use everywhere: Nested collections!

How We Used to do Things: With Joins

We’re going to be using the Sakila database for this example. It’s a DVD rental store with things like ACTOR, FILM, CATEGORY (of films) and other nice relational things. Let’s write a query for this requirement:

Get me all the films with their actors and their categories

Classically, we’d go ahead and use jOOQ to write:

ctx.select(
      FILM.TITLE,
      ACTOR.FIRST_NAME,
      ACTOR.LAST_NAME,
      CATEGORY.NAME
   )
   .from(ACTOR)
   .join(FILM_ACTOR)
     .on(ACTOR.ACTOR_ID.eq(FILM_ACTOR.ACTOR_ID))
   .join(FILM)
     .on(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
   .join(FILM_CATEGORY)
     .on(FILM.FILM_ID.eq(FILM_CATEGORY.FILM_ID))
   .join(CATEGORY)
     .on(FILM_CATEGORY.CATEGORY_ID.eq(CATEGORY.CATEGORY_ID))
   .orderBy(1, 2, 3, 4)
   .fetch();

And the result? Not so nice. A denormalised, flat table containing tons of repetition:

+----------------+----------+---------+-----------+
|title           |first_name|last_name|name       |
+----------------+----------+---------+-----------+
|ACADEMY DINOSAUR|CHRISTIAN |GABLE    |Documentary|
|ACADEMY DINOSAUR|JOHNNY    |CAGE     |Documentary|
|ACADEMY DINOSAUR|LUCILLE   |TRACY    |Documentary|
|ACADEMY DINOSAUR|MARY      |KEITEL   |Documentary|
|ACADEMY DINOSAUR|MENA      |TEMPLE   |Documentary|
|ACADEMY DINOSAUR|OPRAH     |KILMER   |Documentary|
|ACADEMY DINOSAUR|PENELOPE  |GUINESS  |Documentary|
|ACADEMY DINOSAUR|ROCK      |DUKAKIS  |Documentary|
|ACADEMY DINOSAUR|SANDRA    |PECK     |Documentary|
|ACADEMY DINOSAUR|WARREN    |NOLTE    |Documentary|
|ACE GOLDFINGER  |BOB       |FAWCETT  |Horror     |
|ACE GOLDFINGER  |CHRIS     |DEPP     |Horror     |
|ACE GOLDFINGER  |MINNIE    |ZELLWEGER|Horror     |
 ...

If we don’t consume this result unmodified as it is (e.g. when displaying tabular data to a user), we’d then go and de-duplicate things, shoehorning them back again into some nested data structures (e.g. for consumption by some JSON based UI), and spend hours trying to untangle the cartesian products between the two nested collections FILM -> ACTOR and FILM -> CATEGORY (because ACTOR and CATEGORY now form a cartesian product, which we didn’t want!)

In the worst case, we don’t even notice! This example database only has 1 category per film, but it is designed to support multiple categories.

jOOQ can help with that deduplication, but just look at the number of questions for jOOQ many to many on Stack Overflow! You’ll probably still have to write at least 2 queries to separate the nested collections.

ENTER the Stage: Multiset

The standard SQL <multiset value constructor> operator allows for collecting the data from a correlated subquery into a nested data structure, a MULTISET. Everything in SQL is a MULTISET, so the operator isn’t too surprising. But the nesting is where it shines. The previous query can now be re-written in jOOQ as follows:

var result =
dsl.select(
      FILM.TITLE,
      multiset(
        select(
          FILM_ACTOR.actor().FIRST_NAME, 
          FILM_ACTOR.actor().LAST_NAME)
        .from(FILM_ACTOR)
        .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
      ).as("actors"),
      multiset(
        select(FILM_CATEGORY.category().NAME)
        .from(FILM_CATEGORY)
        .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))
      ).as("films")
   )
   .from(FILM)
   .orderBy(FILM.TITLE)
   .fetch();

Note, it’s not relevant to this task, but I’m using jOOQ’s type safe implicit to-one join feature, which further helps tame the joins, syntactically. A matter of taste.

How to read this query? Easy:

  • Get all the films
  • For each FILM, get all the actors as a nested collection
  • For each FILM, get all the categories as a nested collection

You’re going to like Java 10’s var keyword even more after this :) Because what type is result? It’s of this type:

Result<Record3<
  String,                          // FILM.TITLE
  Result<Record2<String, String>>, // ACTOR.FIRST_NAME, ACTOR.LAST_NAME
  Result<Record1<String>>          // CATEGORY.NAME
>>

That’s quite something. Not too complex if you think about it. There’s a result with 3 columns:

  • TITLE
  • A first nested result that has 2 string columns: ACTOR.FIRST_NAME and ACTOR.LAST_NAME
  • A second nested result that has 1 string column: CATEGORY.NAME

Using var or other type inference mechanisms, you don’t have to denote this type. Or even better (stay tuned): We’ll type-safely map the structural type to our nominal DTO type hierarchy, with just a few additional lines of code. I’ll explain that later.

What Does the Result Look Like?

Calling toString() on the above Result type yields something like this:

+---------------------------+--------------------------------------------------+---------------+
|title                      |actors                                            |films          |
+---------------------------+--------------------------------------------------+---------------+
|ACADEMY DINOSAUR           |[(PENELOPE, GUINESS), (CHRISTIAN, GABLE), (LUCI...|[(Documentary)]|
|ACE GOLDFINGER             |[(BOB, FAWCETT), (MINNIE, ZELLWEGER), (SEAN, GU...|[(Horror)]     |
|ADAPTATION HOLES           |[(NICK, WAHLBERG), (BOB, FAWCETT), (CAMERON, ST...|[(Documentary)]|
|AFFAIR PREJUDICE           |[(JODIE, DEGENERES), (SCARLETT, DAMON), (KENNET...|[(Horror)]     |
|AFRICAN EGG                |[(GARY, PHOENIX), (DUSTIN, TAUTOU), (MATTHEW, L...|[(Family)]     |
|AGENT TRUMAN               |[(KIRSTEN, PALTROW), (SANDRA, KILMER), (JAYNE, ...|[(Foreign)]    |
 ...

Notice how we’re back to getting each FILM.TITLE entry only once (no duplication), and nested in each row are the subquery results. There’s no denormalisation happening!

When calling result.formatJSON() with the appropriate formatting options, we’ll get this representation:

[
  {
    "title": "ACADEMY DINOSAUR",
    "actors": [
      {
        "first_name": "PENELOPE",
        "last_name": "GUINESS"
      },
      {
        "first_name": "CHRISTIAN",
        "last_name": "GABLE"
      },
      {
        "first_name": "LUCILLE",
        "last_name": "TRACY"
      },
      {
        "first_name": "SANDRA",
        "last_name": "PECK"
      },

      ...
    ],
    "films": [
      { "name": "Documentary" }
    ]
  },
  {
    "title": "ACE GOLDFINGER",
    "actors": [
      {
        "first_name": "BOB",
        "last_name": "FAWCETT"
      },
  ...

Calling result.formatXML() would produce this:

<result>
  <record>
    <title>ACADEMY DINOSAUR</title>
    <actors>
      <result>
        <record>
          <first_name>PENELOPE</first_name>
          <last_name>GUINESS</last_name>
        </record>
        <record>
          <first_name>CHRISTIAN</first_name>
          <last_name>GABLE</last_name>
        </record>
        <record>
          <first_name>LUCILLE</first_name>
          <last_name>TRACY</last_name>
        </record>
        <record>
          <first_name>SANDRA</first_name>
          <last_name>PECK</last_name>
        </record>
        ...
      </result>
    </actors>
    <films>
      <result>
        <record>
          <name>Documentary</name>
        </record>
      </result>
    </films>
  </record>
  <record>
    <title>ACE GOLDFINGER</title>
    <actors>
      <result>
        <record>
          <first_name>BOB</first_name>
          <last_name>FAWCETT</last_name>
        </record>
        ...

You get the idea!

What’s the Generated SQL?

Just turn on jOOQ’s DEBUG logging and observe a query like this one (in PostgreSQL):

select
  film.title,
  (
    select coalesce(
      jsonb_agg(jsonb_build_object(
        'first_name', t.first_name,
        'last_name', t.last_name
      )),
      jsonb_build_array()
    )
    from (
      select 
        alias_78509018.first_name, 
        alias_78509018.last_name
      from (
        film_actor
          join actor as alias_78509018
            on film_actor.actor_id = alias_78509018.actor_id
        )
      where film_actor.film_id = film.film_id
    ) as t
  ) as actors,
  (
    select coalesce(
      jsonb_agg(jsonb_build_object('name', t.name)),
      jsonb_build_array()
    )
    from (
      select alias_130639425.name
      from (
        film_category
          join category as alias_130639425
            on film_category.category_id = alias_130639425.category_id
        )
      where film_category.film_id = film.film_id
    ) as t
  ) as films
from film
order by film.title

The Db2, MySQL, Oracle, SQL Server versions would look similar. Just try it on your Sakila database installation. It runs fast, too.

Mapping the Results to DTOs

Now, I promised to get rid of that lengthy structural type with all the generics. Check this out!

We used to call them POJOs (Plain Old Java Objects). Then DTOs (Data Transfer Objects). Now records. Yes, let’s try some Java 16 records here. (Note, records aren’t required for these examples. Any POJOs with appropriate constructors would do).

Wouldn’t it be nice, if result was of this type

record Actor(String firstName, String lastName) {}
record Film(
  String title,
  List<Actor> actors,
  List<String> categories
) {}

List<Film> result = ...

Structural typing is essential to jOOQ and its type safe query system, but in your code, you probably don’t want to have those merge-conflict lookalike generics all the time, and even var won’t help you if your data needs to be returned from a method.

So, let’s transform our jOOQ query to one that produces List<Film>, step by step. We’re starting with the original query, untouched:

Result<Record3<
    String, 
    Result<Record2<String, String>>, 
    Result<Record1<String>>
>> result =
dsl.select(
      FILM.TITLE,
      multiset(
        select(
          FILM_ACTOR.actor().FIRST_NAME, 
          FILM_ACTOR.actor().LAST_NAME)
        .from(FILM_ACTOR)
        .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
      ).as("actors"),
      multiset(
        select(FILM_CATEGORY.category().NAME)
        .from(FILM_CATEGORY)
        .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))
      ).as("films")
   )
   .from(FILM)
   .orderBy(FILM.TITLE)
   .fetch();

Now, we’re going to map the Actor in the first MULTISET expression. This can be done as follows, using the new Field.convertFrom() convenience method, which allows turning a Field<T> into any read-only Field<U> for ad-hoc usage. A simple example would be this:

record Title(String title) {}

// Use this field in any query
Field<Title> title = FILM.TITLE.convertFrom(Title::new);

It’s just an easy, new way to attach a read-only Converter to a Field for single usage, instead of doing that with the code generator.

Applied to the original query:

Result<Record3<
    String, 
    List<Actor>,  // A bit nicer already
    Result<Record1<String>>
>> result =
dsl.select(
      FILM.TITLE,
      multiset(
        select(
          FILM_ACTOR.actor().FIRST_NAME, 
          FILM_ACTOR.actor().LAST_NAME)
        .from(FILM_ACTOR)
        .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))

      // Magic here: vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
      ).as("actors").convertFrom(r -> r.map(mapping(Actor::new))),

      multiset(
        select(FILM_CATEGORY.category().NAME)
        .from(FILM_CATEGORY)
        .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))
      ).as("films")
   )
   .from(FILM)
   .orderBy(FILM.TITLE)
   .fetch();

What are we doing here?

  • The method convertFrom() takes a lambda Result<Record2<String, String>> -> Actor.
  • The Result type is the usual jOOQ Result, which has a Result.map(RecordMapper<R, E>) method.
  • The mapping() method is the new Records.mapping(), which isn’t doing much, just turning a constructor reference of type Function2<String, String, Actor> into a RecordMapper, which can then be used to turn a Result<Record2<String, String>> into a List<Actor>.

And it type checks! Try it yourself. If you add a column to the multiset, you’ll get a compilation error. If you add/remove an attribute from the Actor record, you’ll get a compilation error. No reflection here, just declarative mapping of jOOQ results/records to custom List<UserType>. If you prefer the “old” reflection approach using jOOQ’s ubiquitous into() methods, you can still do that, too:

Result<Record3<
    String, 
    List<Actor>,  // A bit nicer already
    Result<Record1<String>>
>> result =
dsl.select(
      FILM.TITLE,
      multiset(
        select(
          FILM_ACTOR.actor().FIRST_NAME, 
          FILM_ACTOR.actor().LAST_NAME)
        .from(FILM_ACTOR)
        .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))

      // Magic here: vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
      ).as("actors").convertFrom(r -> r.into(Actor.class))),

      multiset(
        select(FILM_CATEGORY.category().NAME)
        .from(FILM_CATEGORY)
        .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))
      ).as("films")
   )
   .from(FILM)
   .orderBy(FILM.TITLE)
   .fetch();

The result still type checks, but the conversion from Result<Record2<String, String>> to List<Actor> no longer does; it uses reflection.

Let’s continue. Let’s remove the clumsy category Result<Record1<String>>. We could’ve added another record, but in this case, a List<String> will suffice.

Result<Record3<
    String, 
    List<Actor>,
    List<String>  // Begone, jOOQ structural type!
>> result =
dsl.select(
      FILM.TITLE,
      multiset(
        select(
          FILM_ACTOR.actor().FIRST_NAME, 
          FILM_ACTOR.actor().LAST_NAME)
        .from(FILM_ACTOR)
        .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
      ).as("actors").convertFrom(r -> r.map(mapping(Actor::new))),
      multiset(
        select(FILM_CATEGORY.category().NAME)
        .from(FILM_CATEGORY)
        .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))

      // Magic.     vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
      ).as("films").convertFrom(r -> r.map(Record1::value1))
   )
   .from(FILM)
   .orderBy(FILM.TITLE)
   .fetch();

And finally, the outer-most Result<Record3<...>> to List<Film> conversion

List<Film> result =
dsl.select(
      FILM.TITLE,
      multiset(
        select(
          FILM_ACTOR.actor().FIRST_NAME, 
          FILM_ACTOR.actor().LAST_NAME)
        .from(FILM_ACTOR)
        .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
      ).as("actors").convertFrom(r -> r.map(mapping(Actor::new))),
      multiset(
        select(FILM_CATEGORY.category().NAME)
        .from(FILM_CATEGORY)
        .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))
      ).as("films").convertFrom(r -> r.map(Record1::value1))
   )
   .from(FILM)
   .orderBy(FILM.TITLE)

   //     vvvvvvvvvvvvvvvvvv grande finale
   .fetch(mapping(Film::new));

This time, we don’t need the Field.convertFrom() method. Just using the Records.mapping() auxiliary will be sufficient.

A More Complex Example

The previous example showed nesting of two independent collections, which is quite hard with classic JOIN based SQL or ORMs. How about a much more complex example, where we nest things 2 levels deep, one of which is even an aggregation? The requirement is:

Give me all the films, the actors that played in the film, the categories that categorise the film, the customers that have rented the film, and all the payments per customer for that film

I won’t even show a JOIN based approach. Let’s dive directly into MULTISET and the new synthetic MULTISET_AGG aggregate function. Here’s how to do this with jOOQ. Now, check out that beautiful result type:

Result<Record4<
    String,                   // FILM.TITLE
    Result<Record2<
        String,               // ACTOR.FIRST_NAME
        String                // ACTOR.LAST_NAME
    >>,                       // "actors"
    Result<Record1<String>>,  // CATEGORY.NAME
    Result<Record4<
        String,               // CUSTOMER.FIRST_NAME
        String,               // CUSTOMER.LAST_NAME
        Result<Record2<
            LocalDateTime,    // PAYMENT.PAYMENT_DATE
            BigDecimal        // PAYMENT.AMOUNT
        >>, 
        BigDecimal            // "total"
    >>                        // "customers"
>> result = 
dsl.select(

        // Get the films
        FILM.TITLE,

        // ... and all actors that played in the film
        multiset(
            select(
                FILM_ACTOR.actor().FIRST_NAME,
                FILM_ACTOR.actor().LAST_NAME
            )
            .from(FILM_ACTOR)
            .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
        ).as("actors"),

        // ... and all categories that categorise the film
        multiset(
            select(FILM_CATEGORY.category().NAME)
            .from(FILM_CATEGORY)
            .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))
        ).as("categories"),

        // ... and all customers who rented the film, as well
        // as their payments
        multiset(
            select(
                PAYMENT.rental().customer().FIRST_NAME,
                PAYMENT.rental().customer().LAST_NAME,
                multisetAgg(
                    PAYMENT.PAYMENT_DATE,
                    PAYMENT.AMOUNT
                ).as("payments"),
                sum(PAYMENT.AMOUNT).as("total"))
            .from(PAYMENT)
            .where(PAYMENT
                .rental().inventory().FILM_ID.eq(FILM.FILM_ID))
            .groupBy(
                PAYMENT.rental().customer().CUSTOMER_ID,
                PAYMENT.rental().customer().FIRST_NAME,
                PAYMENT.rental().customer().LAST_NAME)
        ).as("customers")
    )
    .from(FILM)
    .where(FILM.TITLE.like("A%"))
    .orderBy(FILM.TITLE)
    .limit(5)
    .fetch();

You’ll be using var, of course, rather than denoting this insane type, but I wanted to denote the type explicitly for the sake of the example.

Note again how implicit joins were really helpful here. Since we wanted to aggregate all the payments per customer, we can just select from PAYMENT and group by the payment’s PAYMENT.rental().customer(), as well as correlate the subquery by PAYMENT.rental().inventory().FILM_ID, without any extra effort.

The executed SQL looks like this, where you can see the generated implicit joins (run it on your PostgreSQL Sakila database!):

select
  film.title,
  (
    select coalesce(
      jsonb_agg(jsonb_build_object(
        'first_name', t.first_name,
        'last_name', t.last_name
      )),
      jsonb_build_array()
    )
    from (
      select alias_78509018.first_name, alias_78509018.last_name
      from (
        film_actor
          join actor as alias_78509018
            on film_actor.actor_id = alias_78509018.actor_id
        )
      where film_actor.film_id = film.film_id
    ) as t
  ) as actors,
  (
    select coalesce(
      jsonb_agg(jsonb_build_object('name', t.name)),
      jsonb_build_array()
    )
    from (
      select alias_130639425.name
      from (
        film_category
          join category as alias_130639425
            on film_category.category_id = 
               alias_130639425.category_id
        )
      where film_category.film_id = film.film_id
    ) as t
  ) as categories,
  (
    select coalesce(
      jsonb_agg(jsonb_build_object(
        'first_name', t.first_name,
        'last_name', t.last_name,
        'payments', t.payments,
        'total', t.total
      )),
      jsonb_build_array()
    )
    from (
      select
        alias_63965917.first_name,
        alias_63965917.last_name,
        jsonb_agg(jsonb_build_object(
          'payment_date', payment.payment_date,
          'amount', payment.amount
        )) as payments,
        sum(payment.amount) as total
      from (
        payment
          join (
            rental as alias_102068213
              join customer as alias_63965917
                on alias_102068213.customer_id = 
                   alias_63965917.customer_id
              join inventory as alias_116526225
                on alias_102068213.inventory_id = 
                   alias_116526225.inventory_id
          )
            on payment.rental_id = alias_102068213.rental_id
        )
      where alias_116526225.film_id = film.film_id
      group by 
        alias_63965917.customer_id, 
        alias_63965917.first_name, 
        alias_63965917.last_name
    ) as t
  ) as customers
from film
where film.title like 'A%'
order by film.title
fetch next 5 rows only

The result, in JSON, now looks something like this:

[
  {
    "title": "ACADEMY DINOSAUR",
    "actors": [
      {
        "first_name": "PENELOPE",
        "last_name": "GUINESS"
      },
      {
        "first_name": "CHRISTIAN",
        "last_name": "GABLE"
      },
      {
        "first_name": "LUCILLE",
        "last_name": "TRACY"
      },
      ...
    ],
    "categories": [{ "name": "Documentary" }],
    "customers": [
      {
        "first_name": "SUSAN",
        "last_name": "WILSON",
        "payments": [
          {
            "payment_date": "2005-07-31T22:08:29",
            "amount": 0.99
          }
        ],
        "total": 0.99
      },
      {
        "first_name": "REBECCA",
        "last_name": "SCOTT",
        "payments": [
          {
            "payment_date": "2005-08-18T18:36:16",
            "amount": 0.99
          }
        ],
        "total": 0.99
      },
      ...

That’s it. Nesting collections in arbitrary ways is completely effortless and intuitive. No N+1, no deduplication. Just declare the results in exactly the form you require in your client.

There is no other way to pull off this complexity with such ease than by letting your RDBMS do the heavy lifting of planning and running such a query, and letting jOOQ do the mapping.

Conclusion

We had this kind of functionality all along. We just never used it, or not enough. Why? Because client APIs did not make it accessible enough. Because RDBMS didn’t agree on syntax enough.

That’s now over. jOOQ uses standard SQL MULTISET syntax in its DSL API, enhances it with the synthetic MULTISET_AGG aggregate function, the way all RDBMS should have implemented it (go Informix, Oracle). We can wait for another 40 years for the other RDBMS to implement this, or we just use jOOQ today.

And, I cannot stress this enough:

  • This is all type safe
  • There is no reflection
  • The nesting is done in the database using SQL (via SQL/XML or SQL/JSON for now)
  • … So, execution planners can optimise your entire query
  • … No extra columns or extra queries or other extra work is performed in the database

This works on all dialects that have either SQL/XML or SQL/JSON support (or both), including the major popular dialects:

  • MySQL
  • Oracle
  • PostgreSQL
  • SQL Server

And it is offered under the usual license terms of jOOQ. So, happy nesting of collections.

Addendum: Using SQL/XML or SQL/JSON directly

You may be tempted to use this everywhere. And you rightfully do so. But beware: if your SQL client is consuming XML or JSON directly, there’s no need to use MULTISET. Use jOOQ’s native SQL/XML or SQL/JSON support that was introduced in jOOQ 3.14. That way, you won’t convert from JSON to jOOQ results to JSON, but stream the JSON (or XML) to your frontend directly.

Nesting Collections With jOOQ 3.14’s SQL/XML or SQL/JSON support

One of the main features of ORMs is M as in Mapping. Libraries like jOOQ help auto-map flat or nested database records onto Java classes that have the same structure as the SQL result set.

The following has always been possible in jOOQ, assuming PostgreSQL’s INFORMATION_SCHEMA (using the generated code from the jOOQ-meta module):

class Column {
    String tableSchema;
    String tableName;
    String columnName;
}

for (Column c :
    ctx.select(
            COLUMNS.TABLE_SCHEMA, 
            COLUMNS.TABLE_NAME, 
            COLUMNS.COLUMN_NAME)
       .from(COLUMNS)
       .where(COLUMNS.TABLE_NAME.eq("t_author"))
       .orderBy(COLUMNS.ORDINAL_POSITION)
       .fetchInto(Column.class))
    System.out.println(
        c.tableSchema + "." + c.tableName + "." + c.columnName
    );

The above resulting in something like:

public.t_author.id
public.t_author.first_name
public.t_author.last_name
public.t_author.date_of_birth
public.t_author.year_of_birth
public.t_author.address

The mapping is straightforward, as explained in jOOQ’s DefaultRecordMapper.

Nested mappings

A lesser-known feature that we’ve offered for a while is using a dot notation to emulate nesting records into nested Java classes. Assuming you want to use a re-usable data type description in your columns and elsewhere:

class Type {
    String name;
    int precision;
    int scale;
    int length;
}

class Column {
    String tableSchema;
    String tableName;
    String columnName;
    Type type;
}

You can now write this query where you’ll alias some columns using the dot notation to type.name, for example (several nesting levels are possible):

for (Column c :
    ctx.select(
            COLUMNS.TABLE_SCHEMA,
            COLUMNS.TABLE_NAME,
            COLUMNS.COLUMN_NAME,
            COLUMNS.DATA_TYPE.as("type.name"),
            COLUMNS.NUMERIC_PRECISION.as("type.precision"),
            COLUMNS.NUMERIC_SCALE.as("type.scale"),
            COLUMNS.CHARACTER_MAXIMUM_LENGTH.as("type.length")
       )
       .from(COLUMNS)
       .where(COLUMNS.TABLE_NAME.eq("t_author"))
       .orderBy(COLUMNS.ORDINAL_POSITION)
       .fetchInto(Column.class))

    System.out.println(String.format(
        "%1$-30s: %2$s",
        c.tableSchema + "." + c.tableName + "." + c.columnName,
        c.type.name + (c.type.precision != 0
               ? "(" + c.type.precision + ", " + c.type.scale + ")"
               :       c.type.length != 0
               ? "(" + c.type.length + ")"
               : "")
    ));

The above will print:

public.t_author.id            : integer(32, 0)
public.t_author.first_name    : character varying(50)
public.t_author.last_name     : character varying(50)
public.t_author.date_of_birth : date
public.t_author.year_of_birth : integer(32, 0)
public.t_author.address       : USER-DEFINED

Using XML or JSON

Using XML or JSON, starting from jOOQ 3.14, you can also nest collections in your result set mapping very easily. First, let’s look again at how to write a JSON query using jOOQ, e.g. to find all columns per table:

for (Record1<JSON> record :
    ctx.select(
            jsonObject(
                key("tableSchema").value(COLUMNS.TABLE_SCHEMA),
                key("tableName").value(COLUMNS.TABLE_NAME),
                key("columns").value(jsonArrayAgg(
                    jsonObject(
                        key("columnName").value(COLUMNS.COLUMN_NAME),
                        key("type").value(jsonObject(
                            "name", COLUMNS.DATA_TYPE)
                        )
                    )
                ).orderBy(COLUMNS.ORDINAL_POSITION))
            )
       )
       .from(COLUMNS)
       .where(COLUMNS.TABLE_NAME.in("t_author", "t_book"))
       .groupBy(COLUMNS.TABLE_SCHEMA, COLUMNS.TABLE_NAME)
       .orderBy(COLUMNS.TABLE_SCHEMA, COLUMNS.TABLE_NAME)
       .fetch())
    System.out.println(record.value1());

The following JSON documents are returned:

{
  "tableSchema": "public", 
  "tableName": "t_author", 
  "columns": [{
    "columnName": "id", 
    "type": {"name": "integer"}
  }, {
    "columnName": "first_name", 
    "type": {"name": "character varying"}
  }, {...}]
}

{
  "tableSchema": "public", 
  "tableName": "t_book", 
  "columns": [{...}, ...]
}

That’s already awesome, isn’t it? We’ve blogged about this before. Starting with jOOQ 3.14, you can remove all the other middleware and mapping and what not, and produce your XML or JSON documents directly from your database using the standard SQL/XML or SQL/JSON APIs!

But that’s not all!

Maybe you don’t actually need the JSON document; you just want to use JSON to allow for nesting data structures, mapping them back to Java.

What about these nested Java classes:

public static class Type {
    public String name;
}

public static class Column {
    public String columnName;
    public Type type;
}

public static class Table {
    public String tableSchema;
    public String tableName;

    public List<Column> columns;
}

Assuming you have gson or Jackson or JAXB on your classpath (or you configure them directly), you can write the exact same query as before, and use jOOQ’s DefaultRecordMapper using the fetchInto(Table.class) call:

for (Table t :
    ctx.select(
            jsonObject(
                key("tableSchema").value(COLUMNS.TABLE_SCHEMA),
                key("tableName").value(COLUMNS.TABLE_NAME),
                key("columns").value(jsonArrayAgg(
                    jsonObject(
                        key("columnName").value(COLUMNS.COLUMN_NAME),
                        key("type").value(jsonObject(
                            "name", COLUMNS.DATA_TYPE)
                        )
                    )
                ).orderBy(COLUMNS.ORDINAL_POSITION))
            )
       )
       .from(COLUMNS)
       .where(COLUMNS.TABLE_NAME.in("t_author", "t_book"))
       .groupBy(COLUMNS.TABLE_SCHEMA, COLUMNS.TABLE_NAME)
       .orderBy(COLUMNS.TABLE_SCHEMA, COLUMNS.TABLE_NAME)
       .fetchInto(Table.class))
    System.out.println(t.tableName + ":\n" + t.columns
       .stream()
       .map(c -> c.columnName + " (" + c.type.name + ")")
       .collect(joining("\n  ")));

The output being:

t_author:
  id (integer)
  first_name (character varying)
  last_name (character varying)
  date_of_birth (date)
  year_of_birth (integer)
  address (USER-DEFINED)
t_book:
  id (integer)
  author_id (integer)
  co_author_id (integer)
  details_id (integer)
  title (character varying)
  published_in (integer)
  language_id (integer)
  content_text (text)
  content_pdf (bytea)
  status (USER-DEFINED)
  rec_version (integer)
  rec_timestamp (timestamp without time zone)

No join magic. No cartesian products. No data deduplication. Just SQL-native nested collections, using an intuitive, declarative approach to creating the document data structure, combined with the usual awesomeness of SQL.

Using this without the jOOQ DSL

Of course, this also works without the jOOQ API, e.g. using our parser. Check out our translator tool. Plug in this native SQL beauty:

SELECT
  json_object(
    KEY 'tableSchema' VALUE columns.table_schema,
    KEY 'tableName' VALUE columns.table_name,
    KEY 'columns' VALUE json_arrayagg(
      json_object(
        KEY 'columnName' VALUE columns.column_name,
        KEY 'type' VALUE json_object(
          KEY 'name' VALUE columns.data_type
        )
      )
    )
  )
FROM columns
WHERE columns.table_name IN ('t_author', 't_book')
GROUP BY columns.table_schema, columns.table_name
ORDER BY columns.table_schema, columns.table_name

And, because the devil of SQL agnosticity and translation is in the detail, out comes the vendor-specific version, e.g. for PostgreSQL:

SELECT json_build_object(
  'tableSchema', columns.table_schema,
  'tableName', columns.table_name,
  'columns', json_agg(json_build_object(
    'columnName', columns.column_name,
    'type', json_build_object('name', columns.data_type)
  ))
)
FROM columns
WHERE columns.table_name IN (
  't_author', 't_book'
)
GROUP BY
  columns.table_schema,
  columns.table_name
ORDER BY
  columns.table_schema,
  columns.table_name

You might need to run this first:

SET search_path = 'information_schema'

Conclusion

We’ve waited way too long with this game changing feature. I truly think this approach will change how we perceive ORMs in the future. The database first approach, where we can use SQL and only SQL to map SQL data onto any hierarchical data structure, is very compelling.

On the jOOQ side, we’re far from done yet. What if we can auto-generate some of the JSON document declaration from other types of meta data for you? What if you can do that, yourself? E.g. to map a GraphQL specification to jOOQ API based JSON queries? On all the SQL dialects that support these features!

The future of mapping nested data structures from SQL to any client, XML, JSON, objects is bright. jOOQ 3.14 is around the corner and will be released within the next 2 weeks. You can already build it from github: https://github.com/jOOQ/jOOQ, or if you have a license, download a nightly build from here: https://www.jooq.org/download/versions

Looking forward to your feedback.

Using SQL Server FOR XML and FOR JSON Syntax on Other RDBMS With jOOQ

SQL Server supports transforming flat tabular SQL result sets into hierarchical structures by convention using the convenient FOR XML or FOR JSON syntaxes. This is really convenient and less verbose than the standard SQL/XML or SQL/JSON APIs – although the standard ones are more powerful.

In this blog post, I’d like to show a few core features of the SQL Server syntax, and what they correspond to in standard SQL. jOOQ 3.14 will support both SQL Server’s syntax and the standard syntax, and will be able to translate from one to the other, such that you can use SQL Server syntax also on Db2, MariaDB, MySQL, Oracle, PostgreSQL. You can play around with the current state of development on our website here.

As always, using the Sakila database, here’s a simple example as a teaser:

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
FOR XML RAW;

-- Db2, Oracle, PostgreSQL
SELECT xmlagg(xmlelement(
  NAME row,
  xmlattributes(
    t.first_name AS first_name,
    t.last_name AS last_name,
    t.title AS title
  )
))
FROM (
  -- Original query here
  SELECT a.first_name, a.last_name, f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
) AS t

Producing in both cases something like:

<row first_name="PENELOPE" last_name="GUINESS" title="OKLAHOMA JUMANJI"/>
<row first_name="PENELOPE" last_name="GUINESS" title="RULES HUMAN"/>
<row first_name="PENELOPE" last_name="GUINESS" title="SPLASH GUMP"/>
<row first_name="PENELOPE" last_name="GUINESS" title="VERTIGO NORTHWEST"/>
<row first_name="PENELOPE" last_name="GUINESS" title="WESTWARD SEABISCUIT"/>
<row first_name="PENELOPE" last_name="GUINESS" title="WIZARD COLDBLOODED"/>
<row first_name="NICK" last_name="WAHLBERG" title="ADAPTATION HOLES"/>
<row first_name="NICK" last_name="WAHLBERG" title="APACHE DIVINE"/>

FOR XML and FOR JSON concepts

As can be seen in the above teaser, the SQL Server syntax is far more concise and less verbose, and it seems to produce a reasonable default behaviour, whereas the Db2, Oracle, PostgreSQL (and SQL standard) SQL/XML APIs are more verbose, but also more powerful. For example, it is possible to map column a to attribute x and column b to a nested XML element y very easily.
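
For instance, a sketch in standard SQL/XML (PostgreSQL-flavoured; the column and element names are made up for illustration), mapping column a to attribute x and column b to a nested element y:

SELECT xmlelement(
  NAME film,
  xmlattributes(t.a AS x),
  xmlelement(NAME y, t.b)
)
FROM (VALUES (1, 'hello')) AS t(a, b)

Producing something like:

<film x="1"><y>hello</y></film>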

The advantages of both approaches are clear. SQL Server’s approach is much more usable in the general case. But what is the general case? Let’s summarise a few key parallels between SQL result sets, and XML/JSON data structures:


Tables are XML elements or JSON arrays

Tables (i.e. sets of data) are not a foreign concept to either XML or JSON documents. The most natural way to represent a set of data in XML is a set of elements using the same element name, optionally wrapped by a wrapper element. For example:

<!-- With wrapper element -->
<films>
  <film title="OKLAHOMA JUMANJI"/>
  <film title="RULES HUMAN"/>
  <film title="SPLASH GUMP"/>
</films>

<!-- Without wrapper element -->
<film title="OKLAHOMA JUMANJI"/>
<film title="RULES HUMAN"/>
<film title="SPLASH GUMP"/>

The distinction of whether a wrapper element is added is mostly significant when nesting data.

With JSON, the obvious choice of data structure to represent a table is an array. For example:

[
  {"title": "OKLAHOMA JUMANJI"},
  {"title": "RULES HUMAN"},
  {"title": "SPLASH GUMP"}
]


Rows are XML elements or JSON objects

As we’ve already seen above, a SQL row is represented in XML using an element.

<film title="OKLAHOMA JUMANJI"/>

The question is only what the element name should be. It can usually be any of:

  • A standard name, such as “row”
  • The name of the table the row stems from
  • A custom name

In JSON, it is an object.

{"title": "OKLAHOMA JUMANJI"}

Unlike in XML, there is no such thing as an element name, so the row is “anonymous”. The row type is defined by what table / array the JSON object is contained in.


Column values are XML elements or attributes, or JSON attributes

We have a few more choices for how to represent SQL column values in XML. Mainly two:

  • Represent values as attributes
  • Represent values as elements

Scalar values can easily be represented as attributes. If a value needs further nesting (e.g. an array, user defined type, etc.), then elements are a better choice. In most cases, the choice is not relevant, so we can pick both:

<!-- Using attributes -->
<film film_id="635" title="OKLAHOMA JUMANJI"/>

<!-- Using elements from table and column names -->
<film>
  <film_id>635</film_id>
  <title>OKLAHOMA JUMANJI</title>
</film>

<!-- Using standard element names
<row>
  <value name="film_id" value="635"/>
  <value name="title" value="OKLAHOMA JUMANJI"/>
</row>

There are a few other reasonable default options for the representation of a column value in XML.

In JSON, on the other hand, there are two main reasonable approaches. In most cases, an object will be chosen, where column values are identified by column name. But just like SQL records are a mixture between “structs” and “tuples”, we could imagine a representation that maps column values to array indexes as well:

// Using objects
{"film_id": 635, "title": "OKLAHOMA JUMANJI"}

// Using arrays
[635, "OKLAHOMA JUMANJI"]


GROUP BY and ORDER BY can be seen as a way to nest data

So far, all data was represented in a flat way, just like the SQL table. There was some nesting when wrapping XML elements or JSON arrays in some wrapper element or object, or when representing XML data with elements rather than attributes, but the data was still always tabular.

Very often, we want to consume data in a hierarchical form, though. An actor played in films, so we’d like to group the films by actor, rather than repeating the actor information for every film. In general, operations like GROUP BY or ORDER BY serve this purpose. GROUP BY allows for aggregating all data into nested data structures per group (e.g. into strings, arrays, XML elements, JSON arrays, JSON objects). ORDER BY does the same, “visually” – perhaps a bit less formally. When we look at this set of XML elements, we can see visually that they’re “grouped” (i.e. ordered) by actor:

<row first_name="PENELOPE" last_name="GUINESS" title="OKLAHOMA JUMANJI"/>
<row first_name="PENELOPE" last_name="GUINESS" title="RULES HUMAN"/>
<row first_name="PENELOPE" last_name="GUINESS" title="SPLASH GUMP"/>
<row first_name="PENELOPE" last_name="GUINESS" title="VERTIGO NORTHWEST"/>
<row first_name="PENELOPE" last_name="GUINESS" title="WESTWARD SEABISCUIT"/>
<row first_name="PENELOPE" last_name="GUINESS" title="WIZARD COLDBLOODED"/>
<row first_name="NICK" last_name="WAHLBERG" title="ADAPTATION HOLES"/>
<row first_name="NICK" last_name="WAHLBERG" title="APACHE DIVINE"/>

SQL Server supports such grouping in at least two ways:

  • Implicitly by convention, using ORDER BY
  • Explicitly by creating correlated subqueries

The implicit approach could transform the above flat representation into something like this:

<a first_name="PENELOPE" last_name="GUINESS">
    <f title="OKLAHOMA JUMANJI"/>
    <f title="RULES HUMAN"/>
    <f title="SPLASH GUMP"/>
    <f title="VERTIGO NORTHWEST"/>
    <f title="WESTWARD SEABISCUIT"/>
    <f title="WIZARD COLDBLOODED"/>
</a>
<a first_name="NICK" last_name="WAHLBERG">
    <f title="ADAPTATION HOLES"/>
    <f title="APACHE DIVINE"/>
</a>

… where “a” and “f” are the table names in the query (actor a and film f).
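
The explicit approach, only hinted at here, would use a correlated FOR XML subquery per nested collection. A rough sketch (illustrative; the exact element wrapping of the nested XML depends on the chosen directives and column aliases):

-- SQL Server
SELECT a.first_name, a.last_name,
  (SELECT f.title
   FROM film_actor fa
   JOIN film f ON fa.film_id = f.film_id
   WHERE fa.actor_id = a.actor_id
   FOR XML RAW('f'), TYPE) AS films
FROM actor a
FOR XML RAW('a');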

How do FOR XML and FOR JSON work in detail?

There are several features that can be combined in SQL Server. The complete picture can be seen in the docs. We’ll omit a few features in this blog post.

  • The transformation algorithm RAW (flat results, only in XML), AUTO (hierarchical, automatic results), PATH (hierarchical, explicit results)
  • The “root” name, which corresponds to an XML wrapper element, or a JSON wrapper object
  • XML only: Whether values should be placed in ELEMENTS or attributes
  • JSON only: INCLUDE_NULL_VALUES specifies whether NULL values are explicit, or implicit (absent from the JSON object).
  • JSON only: WITHOUT_ARRAY_WRAPPER specifies whether the set of JSON objects should be listed as a JSON array, or a comma separated list of objects (which could be combined with other queries)

This is not complete, there are more flags and features, but instead of discussing them in theory, let’s look at a few examples:

FOR XML RAW

Producing flat results with attributes for values

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML RAW;

-- Standard SQL
SELECT xmlagg(xmlelement(
  NAME row,
  xmlattributes(
    t.first_name AS first_name,
    t.last_name AS last_name,
    t.title AS title
  )
))
FROM (
  SELECT a.first_name, a.last_name, f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) AS t

This produces

<row first_name="NICK" last_name="WAHLBERG" title="SMILE EARRING"/>
<row first_name="NICK" last_name="WAHLBERG" title="WARDROBE PHANTOM"/>
<row first_name="PENELOPE" last_name="GUINESS" title="ACADEMY DINOSAUR"/>
<row first_name="PENELOPE" last_name="GUINESS" title="ANACONDA CONFESSIONS"/>

FOR XML RAW, ROOT

Producing flat results with attributes for values, and a root element to wrap the listed elements

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML RAW, ROOT('rows');

-- Standard SQL
SELECT xmlelement(
  NAME rows,
  xmlagg(xmlelement(
    NAME row,
    xmlattributes(
      t.first_name AS first_name,
      t.last_name AS last_name,
      t.title AS title
    )
  ))
)
FROM (
  SELECT a.first_name, a.last_name, f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) AS t

This produces

<rows>
  <row first_name="NICK" last_name="WAHLBERG" title="SMILE EARRING"/>
  <row first_name="NICK" last_name="WAHLBERG" title="WARDROBE PHANTOM"/>
  <row first_name="PENELOPE" last_name="GUINESS" title="ACADEMY DINOSAUR"/>
  <row first_name="PENELOPE" last_name="GUINESS" title="ANACONDA CONFESSIONS"/>
</rows>

FOR XML RAW, ELEMENTS

Producing flat results with elements for values.

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML RAW, ELEMENTS;

-- Standard SQL
SELECT xmlagg(xmlelement(
  NAME row,
  xmlelement(
    NAME first_name,
    first_name
  ),
  xmlelement(
    NAME last_name,
    last_name
  ),
  xmlelement(
    NAME title,
    title
  )
))
FROM (
  SELECT a.first_name, a.last_name, f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) AS t

This produces

<row>
    <first_name>NICK</first_name>
    <last_name>WAHLBERG</last_name>
    <title>SMILE EARRING</title>
</row>
<row>
    <first_name>NICK</first_name>
    <last_name>WAHLBERG</last_name>
    <title>WARDROBE PHANTOM</title>
</row>
<row>
    <first_name>PENELOPE</first_name>
    <last_name>GUINESS</last_name>
    <title>ACADEMY DINOSAUR</title>
</row>
<row>
    <first_name>PENELOPE</first_name>
    <last_name>GUINESS</last_name>
    <title>ANACONDA CONFESSIONS</title>
</row>

This could also be combined with ROOT, which we’re omitting for brevity.

FOR XML/JSON AUTO

This approach derives results completely automatically from your query structure. Mainly:

  • The SELECT clause defines in what order XML or JSON data is nested.
  • The FROM clause defines the table names (via aliasing), which are translated to XML element or JSON object attribute names.
  • The ORDER BY clause produces the “grouping”, which is translated to nesting XML elements or JSON objects.

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML AUTO;

-- Standard SQL
SELECT xmlagg(e)
FROM (
  SELECT xmlelement(
    NAME a,
    xmlattributes(
      t.first_name AS first_name,
      t.last_name AS last_name
    ),
    xmlagg(xmlelement(
      NAME f,
      xmlattributes(t.title AS title)
    ))
  ) AS e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) AS t
  GROUP BY
    first_name,
    last_name
) AS t

Notice how this emulation requires two steps of XMLAGG with GROUP BY. It gets more hairy with more tables being joined and projected! I won’t add more complex examples here, but try it online!

This produces

<a first_name="NICK" last_name="WAHLBERG">
    <f title="SMILE EARRING"/>
    <f title="WARDROBE PHANTOM"/>
</a>
<a first_name="PENELOPE" last_name="GUINESS">
    <f title="ACADEMY DINOSAUR"/>
    <f title="ANACONDA CONFESSIONS"/>
</a>

Let’s try the same thing again with JSON:

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR JSON AUTO;

-- Standard SQL
SELECT json_arrayagg(e)
FROM (
  SELECT JSON_OBJECT(
    KEY 'first_name' VALUE first_name,
    KEY 'last_name' VALUE last_name,
    KEY 'f' VALUE JSON_ARRAYAGG(JSON_OBJECT(
      KEY 'title' VALUE title
      ABSENT ON NULL
    ))
    ABSENT ON NULL
  ) e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) t
  GROUP BY
    first_name,
    last_name
) t

The result being:

[
    {
        "first_name": "NICK",
        "last_name": "WAHLBERG",
        "f": [
            {
                "title": "SMILE EARRING"
            },
            {
                "title": "WARDROBE PHANTOM"
            }
        ]
    },
    {
        "first_name": "PENELOPE",
        "last_name": "GUINESS",
        "f": [
            {
                "title": "ACADEMY DINOSAUR"
            },
            {
                "title": "ANACONDA CONFESSIONS"
            }
        ]
    }
]

FOR XML/JSON AUTO, ROOT

Like before, we could wrap this in a root XML element or a root JSON object if need be.

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML AUTO, ROOT;

-- Standard SQL
SELECT xmlelement(
  NAME join,
  xmlagg(e)
)
FROM (
  SELECT xmlelement(
    NAME a,
    xmlattributes(
      t.first_name AS first_name,
      t.last_name AS last_name
    ),
    xmlagg(xmlelement(
      NAME f,
      xmlattributes(t.title AS title)
    ))
  ) e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) t
  GROUP BY
    first_name,
    last_name
) t

This does the same thing as before, but just wraps the previous root XMLAGG() element in another XMLELEMENT() function call.

This produces

<root>
    <a first_name="NICK" last_name="WAHLBERG">
        <f title="SMILE EARRING"/>
        <f title="WARDROBE PHANTOM"/>
    </a>
    <a first_name="PENELOPE" last_name="GUINESS">
        <f title="ACADEMY DINOSAUR"/>
        <f title="ANACONDA CONFESSIONS"/>
    </a>
</root>

Let’s try the same thing again with JSON:

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR JSON AUTO, ROOT;

-- Standard SQL
SELECT JSON_OBJECT(KEY 'a' VALUE json_arrayagg(e))
FROM (
  SELECT JSON_OBJECT(
    KEY 'first_name' VALUE first_name,
    KEY 'last_name' VALUE last_name,
    KEY 'f' VALUE JSON_ARRAYAGG(JSON_OBJECT(
      KEY 'title' VALUE title
      ABSENT ON NULL
    ))
    ABSENT ON NULL
  ) e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) t
  GROUP BY
    first_name,
    last_name
) t

The result being:

{
    "a": [
        {
            "first_name": "NICK",
            "last_name": "WAHLBERG",
            "f": [
                {
                    "title": "SMILE EARRING"
                },
                {
                    "title": "WARDROBE PHANTOM"
                }
            ]
        },
        {
            "first_name": "PENELOPE",
            "last_name": "GUINESS",
            "f": [
                {
                    "title": "ACADEMY DINOSAUR"
                },
                {
                    "title": "ANACONDA CONFESSIONS"
                }
            ]
        }
    ]
}

FOR XML AUTO, ELEMENTS

Like before, instead of producing attributes, we might decide to produce elements instead (in XML only):

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML AUTO, ELEMENTS;

-- Standard SQL
SELECT xmlagg(e)
FROM (
  SELECT xmlelement(
    NAME a,
    xmlelement(
      NAME first_name,
      first_name
    ),
    xmlelement(
      NAME last_name,
      last_name
    ),
    xmlagg(xmlelement(
      NAME f,
      xmlelement(
        NAME title,
        title
      )
    ))
  ) e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) t
  GROUP BY
    first_name,
    last_name
) t

Not much has changed, except that a set of XMLELEMENT() calls is made, rather than XMLATTRIBUTES() calls.

This produces

<a>
    <first_name>NICK</first_name>
    <last_name>WAHLBERG</last_name>
    <f>
        <title>SMILE EARRING</title>
    </f>
    <f>
        <title>WARDROBE PHANTOM</title>
    </f>
</a>
<a>
    <first_name>PENELOPE</first_name>
    <last_name>GUINESS</last_name>
    <f>
        <title>ACADEMY DINOSAUR</title>
    </f>
    <f>
        <title>ANACONDA CONFESSIONS</title>
    </f>
</a>

FOR XML/JSON PATH

The PATH strategy is my personal favourite. It is used to create nested XML or JSON path structures more explicitly, and also allows for additional nesting levels when grouping projections together. This is best shown by example. Notice how I'm now using aliases for my columns, and how the alias looks like an XPath expression using '/' (slashes):

-- SQL Server
SELECT 
  a.first_name AS [author/first_name], 
  a.last_name AS [author/last_name], 
  f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML PATH;

-- Standard SQL
SELECT xmlagg(xmlelement(
  NAME row,
  xmlelement(
    NAME author,
    xmlelement(
      NAME first_name,
      "author/first_name"
    ),
    xmlelement(
      NAME last_name,
      "author/last_name"
    )
  ),
  xmlelement(
    NAME title,
    title
  )
))
FROM (
  SELECT 
    a.first_name AS "author/first_name", 
    a.last_name AS "author/last_name", 
    f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) t

Check out how by convention, we’re now getting an additional level of nesting for author related columns under the row/author element:

<row>
    <author>
        <first_name>NICK</first_name>
        <last_name>WAHLBERG</last_name>
    </author>
    <title>SMILE EARRING</title>
</row>
<row>
    <author>
        <first_name>NICK</first_name>
        <last_name>WAHLBERG</last_name>
    </author>
    <title>WARDROBE PHANTOM</title>
</row>
<row>
    <author>
        <first_name>PENELOPE</first_name>
        <last_name>GUINESS</last_name>
    </author>
    <title>ACADEMY DINOSAUR</title>
</row>
<row>
    <author>
        <first_name>PENELOPE</first_name>
        <last_name>GUINESS</last_name>
    </author>
    <title>ANACONDA CONFESSIONS</title>
</row>

This is really neat! The SQL Server syntax is definitely much more convenient for this common use-case.

Let’s try the same thing again with JSON. The only thing we change is we now use a JSON-path-ish syntax using dots ('.') rather than slashes ('/'):

-- SQL Server
SELECT 
  a.first_name AS [author.first_name], 
  a.last_name AS [author.last_name], 
  f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR JSON PATH;

-- Standard SQL
SELECT JSON_ARRAYAGG(JSON_OBJECT(
  KEY 'author' VALUE JSON_OBJECT(
    KEY 'first_name' VALUE "author.first_name",
    KEY 'last_name' VALUE "author.last_name"
  ),
  KEY 'title' VALUE title
  ABSENT ON NULL
))
FROM (
  SELECT 
    a.first_name AS "author.first_name", 
    a.last_name AS "author.last_name", 
    f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) t

The result being (again, with nested objects):

[
    {
        "author": {
            "first_name": "NICK",
            "last_name": "WAHLBERG"
        },
        "title": "SMILE EARRING"
    },
    {
        "author": {
            "first_name": "NICK",
            "last_name": "WAHLBERG"
        },
        "title": "WARDROBE PHANTOM"
    },
    {
        "author": {
            "first_name": "PENELOPE",
            "last_name": "GUINESS"
        },
        "title": "ACADEMY DINOSAUR"
    },
    {
        "author": {
            "first_name": "PENELOPE",
            "last_name": "GUINESS"
        },
        "title": "ANACONDA CONFESSIONS"
    }
]

For more sophisticated nesting, including nesting of collections, a correlated subquery is needed in SQL Server, also with a FOR XML or FOR JSON syntax.
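
For example, nesting each actor's films as a JSON array might look something like the following sketch (same Sakila tables; the column alias is made up):

-- SQL Server
SELECT
  a.first_name,
  a.last_name,
  (
    -- Correlated subquery producing a nested JSON array per actor
    SELECT f.title
    FROM film f
    JOIN film_actor fa ON f.film_id = fa.film_id
    WHERE fa.actor_id = a.actor_id
    FOR JSON PATH
  ) AS films
FROM actor a
FOR JSON PATH;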

Conclusion

XML and JSON are popular document formats outside and inside of the database. SQL Server has some of the most convenient syntax for most cases, while standard SQL supports much more basic, and thus more powerful, constructs. In standard SQL, almost any kind of XML or JSON projection is possible, and with XMLTABLE() and JSON_TABLE(), the documents can be transformed back to SQL tables, as well. In many applications, using these XML or JSON features natively would lead to much less boilerplate code, as many applications do not need middleware between the database and some client, just to transform data between formats.

Most ORMs don’t expose this functionality for a variety of reasons, the main one being that the devil is in the details. While both XML and JSON are nicely standardised, the implementations differ greatly:

  • The SQL/XML standard is implemented mostly by DB2, Oracle, and PostgreSQL. Many dialects offer some XML capabilities, but nothing as impressive as the standard as implemented by those three. SQL Server has FOR XML, which is very powerful for standard XML serialisations, but may be a bit difficult to use for edge cases
  • The SQL/JSON standard was added late and is implemented again to a large extent by DB2 and Oracle, but increasingly also by MariaDB and MySQL. PostgreSQL (and by consequence, compatible dialects, like CockroachDB) had their own proprietary functions and APIs, which are not compatible with the standard. And again, SQL Server has FOR JSON, which works well for standard serialisations, but a bit less well for edge cases

These technologies are poorly adopted in clients because of the many subtle differences. jOOQ has been leveling out these minor differences for many years without hiding the core functionality. SQL/XML and SQL/JSON are perfect use-cases for jOOQ 3.14 (due in Q2 2020), which now allows for using both the standard SQL/XML and SQL/JSON syntaxes as well as the SQL Server FOR XML and FOR JSON syntax in the jOOQ Professional and Enterprise Editions.

Before jOOQ 3.14 is out, you can already play with the current functionality on our website: https://www.jooq.org/translate

Stop Mapping Stuff in Your Middleware. Use SQL’s XML or JSON Operators Instead

It’s been a while since I’ve ranted on this blog, but I was recently challenged by a reddit thread to write about this topic, so here goes…

So, you’re writing a service that produces some JSON from your database model. What do you need? Let’s see:

  • Read a book on DDD
  • Read another book on DDD
  • Write some entities, DTOs, factories, and factory builders
  • Discuss whether your entities, DTOs, factories, and factory builders should be immutable, and use Lombok, Autovalue, or Immutables to ease the pain of construction of said objects
  • Discuss whether you want to use standard JPA, or Hibernate specific features for your mapping
  • Plug in Jackson, the XML and JSON mapper library, because you’ve read a nice blog post about it
  • Debug 1-2 problems arising from combining Jackson, JAXB, Lombok, and JPA annotations. Minor thing
  • Debug 1-2 N+1 cases

STOP IT

No, seriously. Just stop it right there!

What you needed was this kind of JSON structure, exported from your favourite Sakila database:

[{
  "first_name": "PENELOPE",
  "last_name": "GUINESS",
  "categories": [{
    "name": "Animation",
    "films": [{
      "title": "ANACONDA CONFESSIONS"
    }]
   }, {
    "name": "Family",
    "films": [{
      "title": "KING EVOLUTION"
    }, {
      "title": "SPLASH GUMP"
    }]
  }]
}, {
   ...

In English: We need a list of actors, and the film categories they played in, and grouped in each category, the individual films they played in.

Let me show you how easy this is with SQL Server SQL (all other database dialects can do it these days, I just happen to have a SQL Server example ready):

-- 1) Produce actors
SELECT
  a.first_name,
  a.last_name, (

    -- 2) Nest categories in each actor
    SELECT
      c.name, (

        -- 3) Nest films in each category
        SELECT title
        FROM film AS f
        JOIN film_category AS fc ON f.film_id = fc.film_id
        JOIN film_actor AS fa ON fc.film_id = fa.film_id
        WHERE fc.category_id = c.category_id
        AND a.actor_id = fa.actor_id
        FOR JSON PATH -- 4) Turn into JSON
      ) AS films
    FROM category AS c
    JOIN film_category AS fc ON c.category_id = fc.category_id
    JOIN film_actor AS fa ON fc.film_id = fa.film_id
    WHERE fa.actor_id = a.actor_id
    GROUP BY c.category_id, c.name
    FOR JSON PATH -- 4) Turn into JSON
  ) AS categories
FROM
  actor AS a 
FOR JSON PATH, ROOT ('actors') -- 4) Turn into JSON

That’s it. That’s all there is to it. Only basic SQL-92, enhanced with some vendor-specific JSON export syntax. (There are also SQL standard JSON APIs as implemented in other RDBMS). Let’s discuss it quickly:

  1. The outermost query produces a set of actors, as you would have expected
  2. For each actor, a correlated subquery produces a nested JSON array of categories
  3. For each category, another correlated subquery finds all the films per actor and category
  4. Finally, turn all the result structures into JSON

That’s it.

Want to change the result structure? Super easy. Just modify the query accordingly. No need to modify:

  • Whatever you thought your DDD “root aggregate” was
  • Your gazillion entities, DTOs, factories, and factory builders
  • Your gazillion Lombok, Autovalue, or Immutables annotations
  • Your hacks and workarounds to get this stuff through your standard JPA, or Hibernate specific features for your mapping
  • Your gazillion Jackson (the XML and JSON mapper library) annotations
  • Debugging another 1-2 problems arising from combining Jackson, JAXB, Lombok, and JPA annotations
  • Debugging another 1-2 N+1 cases

No! No need! It’s so simple. Just stream the JSON directly from the database to the client using whatever SQL API of your preference: JDBC, jOOQ, JdbcTemplate, MyBatis, or even JPA native query. Just don’t go mapping that stuff in the middleware if you’re not consuming it in the middleware. Let me repeat that for emphasis:

Don’t go mapping that stuff in the middleware if you’re not consuming it in the middleware.

Oh, want to switch to XML? Easy. In SQL Server, this amounts to almost nothing but replacing JSON by XML:

SELECT
  a.first_name,
  a.last_name, (
    SELECT
      c.name, (
        SELECT title
        FROM film AS f
        JOIN film_category AS fc ON f.film_id = fc.film_id
        JOIN film_actor AS fa ON fc.film_id = fa.film_id
        WHERE fc.category_id = c.category_id
        AND a.actor_id = fa.actor_id
        FOR XML PATH ('film'), TYPE
      ) AS films
    FROM category AS c
    JOIN film_category AS fc ON c.category_id = fc.category_id
    JOIN film_actor AS fa ON fc.film_id = fa.film_id
    WHERE fa.actor_id = a.actor_id
    GROUP BY c.category_id, c.name
    FOR XML PATH ('category'), TYPE
  ) AS categories
FROM
  actor AS a 
FOR XML PATH ('actor'), ROOT ('actors')

And now, you’re getting:

<actors>
  <actor>
    <first_name>PENELOPE</first_name>
    <last_name>GUINESS</last_name>
    <categories>
      <category>
        <name>Animation</name>
        <films>
          <film>
            <title>ANACONDA CONFESSIONS</title>
          </film>
        </films>
      </category>
      <category>
        <name>Family</name>
        <films>
          <film>
            <title>KING EVOLUTION</title>
          </film>
          <film>
            <title>SPLASH GUMP</title>
          </film>
        </films>
      </category>
      ...

It’s so easy with SQL!

Want to support both without rewriting too much logic? Produce XML and use XSLT to automatically generate the JSON. Whatever.

FAQ, Q&A

But my favourite Java SQL API can’t handle it

So what. Write a view and query that instead. By the way, jOOQ 3.14 supports all of it.

But this doesn’t fit our architecture

Then fix the architecture

But SQL is bad

No, it’s great. It’s based on relational algebra and augments it in many many useful ways. It’s a declarative 4GL, the optimiser produces way better execution plans than you could ever imagine (see my talk), and it’s way more fun than your gazillion 3GL mapping libraries.

But SQL is evil because of Oracle

Then use PostgreSQL. It can do JSON.

But what about testing

Just spin up a test database with https://www.testcontainers.org, install your schema with some migration framework like Flyway or Liquibase in it, fill in some sample data, and write your simple integration tests.

But mocking is better

It is not. The more you mock away the database, the more you’re writing your own database.

But I’m paid by the lines of code

Well, good riddance, then.

But what if we have to change the RDBMS

So what? Your management paid tens of millions for the new licensing. They can pay you tens of hundreds to spend 20 minutes rewriting your 5-10 SQL queries. You already wrote the integration tests above.

Anyway. It won’t happen. And if it will, then those few JSON queries will not be your biggest problem.

What was that talk of yours again?

Here, highly recommended:

But we’ve already spent so many person years implementing our middleware

It has a name

But I’ve read this other blog post…

And now you’ve read mine.

But that’s like 90s style 2 tier architecture

So what? You’ve spent 5% of the time to implement it. That’s 95% more time adding value to your customers, rather than bikeshedding mapping technology. I call that a feature.

What about ingestion? We need abstraction over ingestion

No, you don’t. You can send the JSON directly into your database, and transform / normalise it from there, using the same technique. You don’t need middleware abstraction and mapping, you just want middleware abstraction and mapping.
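
A sketch of what that can look like in SQL Server, using OPENJSON to normalise an incoming payload (assuming the actor table's remaining columns have defaults; the payload is made up):

DECLARE @payload NVARCHAR(MAX) = N'[
  {"first_name": "PENELOPE", "last_name": "GUINESS"},
  {"first_name": "NICK",     "last_name": "WAHLBERG"}
]';

-- Expand the JSON array into rows and insert them into the normalised table
INSERT INTO actor (first_name, last_name)
SELECT first_name, last_name
FROM OPENJSON(@payload)
WITH (
  first_name VARCHAR(45) '$.first_name',
  last_name  VARCHAR(45) '$.last_name'
);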

How to Aggregate an Archive Log’s Deltas into a Snapshot with SQL

A customer of my popular SQL training (which you should book!) has recently challenged me to optimise a hierarchical query that merges an archive log’s deltas in order to obtain a snapshot of some record at a given point in time. In this article, I will reproduce their problem statement in a simplified version and show how this can be done with SQL Server, using a few cool SQL features:

All of these are topics covered in the training, which were immediately applicable to this problem statement.

The problem statement

This was their archive design. They designed for uncertainty, meaning that for some entities in their system, they did not know what kinds of attributes would be part of the entity in the future. Given their application design, users could even add their own custom attributes to an entity.

This kind of thing is typically solved with the EAV (Entity Attribute Value) model, a “workaround” to denormalise data sets in SQL databases in the event of such schema uncertainty.

EAV can be implemented in several ways:

Through classic SQL tables only

An example implementation is this:

CREATE TABLE eav_classic (
  entity_type     VARCHAR (100) NOT NULL,
  entity_id       BIGINT        NOT NULL,
  attribute_name  VARCHAR (100) NOT NULL,
  attribute_type  VARCHAR (100) NOT NULL,
  attribute_value VARCHAR (100)     NULL,

  CONSTRAINT eav_classic_pk 
    PRIMARY KEY (entity_type, entity_id, attribute_name)
);

The drawbacks of this non-normalised design are immediately obvious. Most notably, there is no simple way to establish referential integrity. But this may be totally OK, especially for archive logs, and for smaller databases (Datomic does something similar).
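
For illustration only, the Person entity used later in this article could be stored in this model as follows (the attribute_type values are made up):

INSERT INTO eav_classic (entity_type, entity_id, attribute_name, attribute_type, attribute_value)
VALUES ('Person', 1, 'first_name', 'string', 'John'),
       ('Person', 1, 'last_name' , 'string', 'Doe'),
       ('Person', 1, 'age'       , 'number', '37');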

Through tables containing JSON or XML data

Whenever you have schema-on-read data, JSON or XML data types may be appropriate, so this is a perfectly valid alternative:

CREATE TABLE eav_json (
  entity_type     VARCHAR (100)   NOT NULL,
  entity_id       BIGINT          NOT NULL,
  attributes      VARCHAR (10000) NOT NULL 
    CHECK (ISJSON(attributes) = 1),

  CONSTRAINT eav_json_pk 
    PRIMARY KEY (entity_type, entity_id)
);

If your database supports a JSON data type, you will obviously prefer that over the above emulation.

For the rest of this article, I will use the JSON representation.

Versioning the EAV table

Versioning data in an EAV model is quite a bit easier than in a normalised schema. We can just add a version number and/or timestamp to the record. In their case, something like this may make sense:

CREATE TABLE history (
  id          BIGINT IDENTITY (1, 1) NOT NULL PRIMARY KEY,
  ts          DATETIME               NOT NULL,
  entity_type VARCHAR(100)           NOT NULL,
  entity_id   BIGINT                 NOT NULL,
  delta       VARCHAR(8000)          NOT NULL 
    CHECK (ISJSON(delta) = 1)
);

INSERT INTO history (entity_type, entity_id, ts, delta)
VALUES ('Person', 1, '2000-01-01 00:00:00', '{"first_name": "John", "last_name": "Doe"}'),
       ('Person', 1, '2000-01-01 01:00:00', '{"age": 37}'),
       ('Person', 1, '2000-01-01 02:00:00', '{"age": 38}'),
       ('Person', 1, '2000-01-01 03:00:00', '{"city": "New York"}'),
       ('Person', 1, '2000-01-01 04:00:00', '{"city": "Zurich", "age": null}')
;

This table now contains a set of deltas applied to the Person entity with ID = 1. It corresponds to the following sequence of SQL statements on an ordinary entity:

INSERT INTO person (id, first_name, last_name) 
  VALUES (1, 'John', 'Doe');
UPDATE person SET age = 37 WHERE id = 1;
UPDATE person SET age = 38 WHERE id = 1;
UPDATE person SET city = 'New York' WHERE id = 1;
UPDATE person SET city = 'Zurich', age = null WHERE id = 1;

You could even see their hand-written log as a kind of transaction log of the database system, similar to what you can extract using products like Golden Gate or Debezium. If you think of the transaction log as an event stream, the RDBMS’s current data representation is like a snapshot that you can get when applying any number of deltas to your tables.

Sometimes, you don’t want to completely change your architecture and go full “event sourcing”, but just need this kind of log for a specific set of auditable entities. And for reasons like still having to support very old SQL Server versions, or other databases as well, you may also choose not to use the SQL:2011 temporal table feature, which SQL Server implements since SQL Server 2016.
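
For reference, this is roughly what the built-in feature looks like in SQL Server 2016 and later (a sketch; the column names are made up):

-- SQL Server 2016+
CREATE TABLE person (
  id         BIGINT       NOT NULL PRIMARY KEY,
  first_name VARCHAR(100)     NULL,
  last_name  VARCHAR(100)     NULL,
  age        INT              NULL,
  city       VARCHAR(100)     NULL,

  -- System versioning columns maintained automatically by the database
  valid_from DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
  valid_to   DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
  PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
)
WITH (SYSTEM_VERSIONING = ON);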

With that out of the way…

How to access any arbitrary snapshot version?

When we visually process our HISTORY table, we can see that Person ID = 1 had the following values at any given time:

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York
04:00:00    John          Doe                 Zurich

Remember, this is always the same record of Person ID = 1, its snapshots represented at different times in the time axis. The goal here is to be able to find the record of John Doe at any given time.

Again, if we had been using the SQL:2011 temporal table feature, we could write

-- SQL Server
SELECT * 
FROM Person
FOR SYSTEM_TIME AS OF '2000-01-01 02:00:00.0000000'; 

-- Oracle (flashback query)
SELECT *
FROM Person
AS OF TIMESTAMP TIMESTAMP '2000-01-01 02:00:00'

Side note: Oracle’s flashback query needs to be properly configured:

  • Not all data is “flashbackable”
  • DDL tends to destroy the archive
  • Proper grants are needed to access the flashback archive

Similar limitations may apply in SQL Server.

What if the RDBMS can’t help us?

If again for some reason, we cannot use the RDBMS’s temporal table features, we’ll roll our own as we’ve seen. So, our query in SQL Server to access the snapshot at any given time may be this:

SELECT 
  '{' 
+ string_agg(
    CASE type WHEN 0 THEN NULL ELSE 
      '"' + [key] + '": ' + 
      CASE type WHEN 1 THEN '"' + value + '"' ELSE value END
    END, ', ') 
+ '}'
FROM (
  SELECT *, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history
  OUTER APPLY openjson(delta)
  
  -- Apply all deltas prior to any given snapshot
  WHERE ts <= '2000-01-01 02:00:00'
) t
WHERE rn = 1;

What does this query do? Consider again our deltas at 04:00:00:

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York
04:00:00    John          Doe          -      Zurich

Observe how each value has some color encoding:

  • Strong, red: The current snapshot’s attribute value, when the last delta was applied to any given attribute
  • Strong, black: A previous snapshot’s attribute value, when a previous, superseded delta was applied to any given attribute
  • Light grey: A previous snapshot’s attribute value that was inherited from another previous delta

For any given snapshot, we want to find the Strong, red values. E.g. at a previous snapshot time, the color encoding would have been:

At 03:00:00

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York

04:00:00    John          Doe          -      Zurich

At 02:00:00

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38

03:00:00    John          Doe          38     New York
04:00:00    John          Doe          -      Zurich

So, our query needs to find the delta that was applied last for any given attribute.

With SQL, we can find that easily. We can assign a row number to each delta per attribute in reverse order, something like this:

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John (1)      Doe (1)
01:00:00    John          Doe          37 (3)
02:00:00    John          Doe          38 (2)
03:00:00    John          Doe          38     New York (2)
04:00:00    John          Doe          - (1)  Zurich (1)

Once we have that row number, we just filter out only those deltas whose row number is 1. Something like:

SELECT [key], value, row_number() OVER (
  PARTITION BY [key] ORDER BY ts DESC) rn
FROM history OUTER APPLY openjson(delta)
ORDER BY [key], ts;

Notice the OUTER APPLY openjson(delta) syntax. This just expands the JSON structure into key/value/type columns, which we can use more easily in a SQL query. Other database systems may have similar syntax for similar purposes. The result of the above query is:

key        |value    |rn 
-----------|---------|---
age        |37       |3  
age        |38       |2  
age        |         |1  
city       |New York |2  
city       |Zurich   |1  
first_name |John     |1  
last_name  |Doe      |1  

Filtering the ones whose row number is 1:

SELECT [key], value
FROM (
  SELECT ts, [key], value, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history OUTER APPLY openjson(delta)
) t
WHERE rn = 1
ORDER BY ts, [key]

This yields:

key        |value  
-----------|-------
first_name |John   
last_name  |Doe    
age        |       
city       |Zurich 

Exactly the data we wanted, in key/value form. Notice that this filtering step could have been done with DISTINCT ON in PostgreSQL, or with KEEP (DENSE_RANK FIRST ORDER BY ..) in Oracle – an exercise which I shall leave to the reader (feel free to leave the solution in the comments!)
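
For reference, a rough PostgreSQL sketch of that filtering step, assuming an equivalent history table whose delta column is of type jsonb:

-- PostgreSQL (sketch): keep only the latest delta per attribute
SELECT DISTINCT ON (key) key, value
FROM history
CROSS JOIN LATERAL jsonb_each_text(delta) AS t (key, value)
WHERE ts <= TIMESTAMP '2000-01-01 04:00:00'
ORDER BY key, ts DESC;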

And now, finally, just re-assemble the JSON using SQL Server 2017’s STRING_AGG. PostgreSQL would offer us JSON_OBJECT_AGG here, Oracle has JSON_OBJECTAGG. With STRING_AGG, you have to take care of manually escaping all values according to JSON syntax rules, which is bad. In my example, I just replaced " by \". Other characters need escaping too, so if there is a built-in feature, use that instead of string processing.

The STRING_AGG function aggregates a CASE expression which translates different JSON data types into different formats, where:

  • 0 is NULL (and nulls are not aggregated)
  • 1 is string
  • everything else can be taken at its value for simplicity, e.g. numbers or booleans

Every value (except nulls) is prefixed by the JSON object’s attribute name (“key”).

SELECT 
  '{' 
+ string_agg(
    CASE type WHEN 0 THEN NULL ELSE 
      '"' + replace([key], '"', '\"') + '": ' + 
      CASE type WHEN 1 THEN '"' + replace(value, '"', '\"') + '"' ELSE value END
    END, ', ') 
+ '}'
FROM (
  SELECT *, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history
  OUTER APPLY openjson(delta)
  
  -- Apply all deltas prior to any given snapshot
  WHERE ts <= '2000-01-01 04:00:00'
) t
WHERE rn = 1;

This produces

{"city": "Zurich", "first_name": "John", "last_name": "Doe"}

A final query that gets us the entire history of snapshots (watch the performance on this one, it could definitely be optimised):

SELECT ts, (
  SELECT 
    '{' 
  + string_agg(
      CASE type WHEN 0 THEN NULL ELSE 
        '"' + replace([key], '"', '\"') + '": ' + 
        CASE type WHEN 1 THEN '"' + replace(value, '"', '\"') + '"' ELSE value END
      END, ', ') 
  + '}'
  FROM (
    SELECT *, row_number() OVER (
      PARTITION BY [key] ORDER BY ts DESC) rn
    FROM history
    OUTER APPLY openjson(delta)
    
    -- Apply all deltas prior to any given snapshot
    WHERE ts <= x.ts
  ) t
  WHERE rn = 1
)
FROM history x
GROUP BY ts;

It yields:

ts       |                                                                          
---------|--------------------------------------------------------------------------
00:00:00 |{"first_name": "John", "last_name": "Doe"}                                
01:00:00 |{"age": 37, "first_name": "John", "last_name": "Doe"}                     
02:00:00 |{"age": 38, "first_name": "John", "last_name": "Doe"}                     
03:00:00 |{"age": 38, "city": "New York", "first_name": "John", "last_name": "Doe"} 
04:00:00 |{"city": "Zurich", "first_name": "John", "last_name": "Doe"}              

So, the complete history of all the snapshot versions of the Person with ID = 1.

Very cool, and definitely good enough for their archive / audit query requirements.

jOOQ Tuesdays: Gerald Sangudi and Keshav Murthy Reveal the Secrets of N1QL (SQL on JSON)

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

I’m very excited to feature today Gerald Sangudi and Keshav Murthy, creators of the N1QL query language, CouchBase’s SQL-based JSON querying language.

Hi Gerald and Keshav – It’s great to have two people on the jOOQ Tuesdays interview at once! How did you guys meet?

Keshav: Gerald interviewed me for my job at Couchbase.

Gerald: Best decision we ever made.

You’re both working at CouchBase, one of the leading “JSON databases” in the market. What drove you towards JSON?

Keshav: I worked for IBM, specifically on the Informix database. In 2013, Websphere teams needed JSON support. I was skeptical of JSON for data management except for managing metadata.  Years before, I had successfully opposed implementing XML and XQuery inside Informix. Then I went to a NoSQL conference where I saw enterprises using JSON in real-world applications.  I saw enterprises like eBay, Cisco developing their enterprise applications on NoSQL and JSON.

RDBMS customers had been asking for ALTER TABLE ONLINE when they needed to add new columns. They typically would have additional unused columns which they’d rename when needed. Obviously, this wasn’t an optimal or a good solution, because you could run out of these columns, and their types or structure weren’t flexible.

For years, customers had been asking for ALTER TABLE while applications were online. I realized JSON was a good way to provide it. That’s when we added JSON as a type, added sharding, and started to extend SQL to support JSON. A good relational database can deliver a lot of value with JSON. So, we developed a mechanism to exploit JSON to provide schema flexibility on relational tables, and we received a patent for it: http://bit.ly/2og1uXe.

At Couchbase, data is stored as JSON and N1QL was designed as SQL for JSON.  So, we keep calm and JSON.

Gerald: I am not at Couchbase any more, but I was there for four years. Couchbase has its roots in memcached and CouchDB, so Couchbase was already storing and managing JSON data before N1QL and before we arrived. N1QL was developed to enable our users to query and manipulate their JSON data.

You’re both working on, or have worked on N1QL, CouchBase’s innovative SQL-esque query language for JSON. Has the SQL language come full circle? Why is the SQL syntax such a great fit for you?

Keshav: Don Chamberlin, co-inventor of SQL said this. During XQuery discussion, he argued against using SQL for manipulating XML. Now, with N1QL and SQL++ on JSON, he sees enormous possibilities for SQL to manipulate JSON effectively.  So, you could say, it has come a full circle.

From the application point of view, the requirement for SQL to support complex data models has been there. SQL-99 added the structured type into the language, but didn’t recognize the need for schema flexibility.

What Gerald has done very nicely is to inherit SQL and extend not just the language, but the underlying boolean logic.

SQL has three valued logic: TRUE, FALSE, NULL.  N1QL has 4-valued logic: TRUE, FALSE, NULL and MISSING.  SQL has the select-join-project operations. N1QL has those and adds NEST and UNNEST operations for handling arrays.  Once we have the logic and operations, the type system, rest of the expressions for handling nested objects and arrays can be added.

There is another important change in N1QL compared to SQL.  SQL  is the query language to manipulate data.  N1QL can discover the document metadata (names, structure and types) dynamically and operate on it.

Gerald: SQL is the greatest and most successful query language of all time. Our job was to enable our users to query data that is sometimes different from what standard SQL expects. We hope N1QL does that.

As an aside, the “N1” in N1QL stands for non-first normal form. SQL geeks like you Lukas may know that “non-first” is one primary difference between JSON data and relational data. The other primary differences are schema and uniformity.

Will your relational competition steal features from N1QL? Or will you steal more features from them?

Keshav: I do hope relational databases steal features from N1QL. Having common approaches to solving problems makes it easier for customers to choose the right database for the right problem. I do hope they choose Couchbase more often than RDBMS!

Gerald always says, we don’t differ from SQL unless there is a good reason. In that sense, we’ve taken a lot of the features from relational databases already. In addition, we learn from successful models in relational databases for things like index design, query optimization, security and monitoring. We stand on the shoulders of giants.

Gerald: We hope that both relational and non-relational vendors steal features from each other. There is a collaborative effort on something called SQL++, which is a superset of both SQL and N1QL. We hope SQL++ is the convergence point. One lesson from the success of SQL is that standards are great for both users and vendors.

CouchBase doesn’t have what you call a “static schema”. The biggest advantage of a static schema for the database optimiser is the many ways such a schema can be used to predict performance and choose optimal execution plans. How does optimisation work in a “schema-less” database?

Keshav: Actually, a static schema gives you the structure, but not the data distribution, which is a major factor for calculating the execution cost. N1QL uses the information within the query and available indexes to glean the structure and decide on the plan. For example, if you have a query with a predicate: WHERE state = “CA” and zipcode = “94040”, and there is an index on either state, zipcode or both, we’d assume these key-values exist in the document and push down the predicates to the index scan. We give further details in an article on DZone: https://dzone.com/articles/a-deep-dive-into-couchbase-n1ql-query-optimization

Separately, we do have a mechanism to INFER the schema by sampling. Right now, we show the inferred schema so users can understand the structure. We also use the inferred schema to make query editing easier with hints and validations within the workbench. We do have plans to use this and collect additional statistics to improve decisions in the optimizer.

Gerald: The N1QL optimizer is one of the joys of working on N1QL. As Keshav said, most of the SQL optimization techniques carry over to N1QL. The data may not have a static schema, but the query has an implicit schema, and the indexes have implicit schemas. That is, the query has predicates and other characteristics, and each index has keys and other characteristics. That is enough to keep an optimizer busy.

Working for a newer database vendor, you don’t have as much legacy, so you can innovate faster and more freely. What will be the next big thing in the database market?

Keshav: Our metric for innovation is the progress a customer is trying to make in their business. So, we innovate within the constraint of a customer job. That helps us to innovate and measure its success from the customer’s point of view.

We see customers deploying NoSQL databases for newer patterns like systems of engagement.

When you plan your vacation, you search a lot before you buy. Search for places, hotels, airlines, things-to-do. Then you compare costs and ratings before you make the final purchase. For a travel company, these require significant database infrastructure to support a high number of queries with low latencies and at a very low cost.

We see customers deploying NoSQL databases to support a lot of the customer engagement and information requirement use cases. RDBMS is still used as the system of record for buy-sell-cancel-check-in-etc. operations, but will integrate with system-of-engagement databases to enhance customer experience.

Doing this effectively requires innovations in every area: data platform designs, scale out, query processing, index designs and manageability.

Gerald: I read Jeff Bezos’ latest annual letter to shareholders.  He says that the things that don’t change are more important than the things that do change. Databases should store your data reliably and give you answers and updates quickly. If database vendors continue to do that, we’ll be ok.

When to Choose SQL and When to Choose NoSQL

Some people make architecture decisions purely based on the loudest consultant:

For most others, however, decisions are not as simple as this. For instance: When should we start evaluating NoSQL storage systems as an alternative to RDBMS?

RDBMS as the hammer for all nails

This question obviously assumes that you’re starting out with an RDBMS, which is classically the database system that solves pretty much any problem decently enough not to be replaced easily. What does this mean? Simply put:

  • RDBMS have been around forever, so they have a huge advantage compared to “newcomers” in the market, who don’t have all the excellent tooling, community, support, maturity yet
  • E.F. Codd’s work may have been the single biggest influence on our whole industry. There has hardly been anything as revolutionary as the relational model ever since. It’s hard for an alternative database to be equally universal, i.e. they’re mostly solving niche problems

Having said so, sometimes you do have a niche problem. For instance a graph database problem. In fact, a graph is nothing fundamentally different from what you can represent in the relational model. It is easy to model a graph with a many-to-many relationship table.
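
A directed graph, for instance, needs nothing more than a node table and an edge table (a sketch; the names are made up):

CREATE TABLE node (
  id   BIGINT       NOT NULL PRIMARY KEY,
  name VARCHAR(100) NOT NULL
);

CREATE TABLE edge (
  -- Each row is one directed edge between two nodes
  from_id BIGINT NOT NULL REFERENCES node (id),
  to_id   BIGINT NOT NULL REFERENCES node (id),

  PRIMARY KEY (from_id, to_id)
);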

The same is true for XML/JSON in the database (don’t forget, JSON is just XML with less syntax and less features, so it’s less awesome than XML). Sometimes you need to actually store document structures (hierarchical data) in their hierarchical form in your database, instead of normalising them. Sure, you could normalise a document, but that’s a lot of work and not necessarily beneficial. This was actually the question on Twitter by Gareth, that triggered the subsequent discussion:

Most modern RDBMS offer means to store (and much more importantly, query) XML/JSON data structures in simple forms. This includes PostgreSQL, Oracle, DB2, SQL Server and others.
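
A sketch of what such querying can look like, assuming a hypothetical person table with a JSON attributes column (JSON_VALUE is available in SQL Server and Oracle; PostgreSQL would use its ->> operator or jsonb functions instead):

SELECT JSON_VALUE(attributes, '$.address.city') AS city
FROM person
WHERE JSON_VALUE(attributes, '$.last_name') = 'Doe';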

So, when do we decide to switch?

As developers, we may be inclined to switch quite quickly. E.g. when working with graphs, we’d love to use Neo4j which implements the awesome Cypher query language. When working with JSON, we’d love to use something like Couchbase which implements the interesting N1QL query language. Both of these query languages are heavily inspired by SQL, which were wise vendor choices in my opinion (as opposed to MongoDB’s weird JSON-based query language), because in the end, the SQL language is still the most powerful and popular 4GL that was ever created.

But as developers, we shouldn’t make that decision lightly. Yes, at first, these specialised databases seem like a better fit. But the additional cost for operations teams to maintain, monitor, patch, tune production systems must not be underestimated. This is even true within the RDBMS ecosystem. A prominent recent example was Uber’s switch (back!) from PostgreSQL to MySQL:

Why Uber Engineering Switched from Postgres to MySQL

Do note that they switched the other way round before, only to regret things. Truth is, there are tons of reasons why your operations teams prefer to always use the same database, even if that’s quite expensive in terms of licensing. In many cases, however, it would be even more expensive to:

  • Engage in additional license and/or support contracts with new database vendors
  • Find skilled DBA for the new technology (can be very hard with niche databases)
  • Maintain two data silos and possibly sustain the cost of keeping them in sync

Ultimately, there’s a threshold:

In terms of using JSON in the database, it’s simple:

  • Occasional JSON storage: Stick with RDBMS.
  • Everything is JSON: Perhaps not RDBMS.

The same is true for graph problems. SQL can totally handle graphs and traverse them with recursive SQL. Here’s a funky statement that recursively calculates the subset sum problem (the full query is in the article linked below):

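As a toy sketch of the recursive idea only (not the actual query from that article; the item values and the target of 42 are made up):

-- SQL Server
WITH items (id, val) AS (
  SELECT * FROM (VALUES (1, 10), (2, 17), (3, 25), (4, 8)) AS v (id, val)
),
sums (last_id, total) AS (
  -- Anchor: subsets containing a single item
  SELECT id, val FROM items
  UNION ALL
  -- Recursion: extend each subset only with "later" items to avoid duplicates
  SELECT i.id, s.total + i.val
  FROM sums AS s
  JOIN items AS i ON i.id > s.last_id
)
SELECT TOP 1 total
FROM sums
ORDER BY ABS(total - 42);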

Full article here: How to Find the Closest Subset Sum with SQL

If you only have a few tree/graph traversals to calculate (e.g. for a simple menu structure), don’t jump ship from your RDBMS yet. If graphs are your business and that’s all you do, then RDBMS probably aren’t the right choice.

Conclusion

Whatever problem you’re solving. Do remember. Yes, if all you have is a hammer, every problem starts looking like a nail. But don’t dismiss RDBMS as a silly hammer. It is a very mighty hammer and in 2016, it can do a lot of non-relational niche things decently well.

RDBMS is still the best default choice for all sorts of data problems. Only once you have gone above a certain threshold (or if you can foresee doing that), then you should look for alternatives. Because those alternatives will give you a much rougher time when you go outside of that niche (JSON, graphs, etc.) back to your “ordinary” relational business.