Using SQL Server FOR XML and FOR JSON Syntax on Other RDBMS With jOOQ

SQL Server supports transforming flat tabular SQL result sets into hierarchical structures by convention using the convenient FOR XML or FOR JSON syntaxes. This is really convenient and less verbose than the standard SQL/XML or SQL/JSON APIs – although the standard ones are more powerful.

In this blog post, I’d like to show a few core features of the SQL Server syntax, and what they correspond to in standard SQL. jOOQ 3.14 will support both SQL Server’s syntax and the standard syntax, and will be able to translate from one to the other, such that you can use SQL Server syntax also on Db2, MariaDB, MySQL, Oracle, PostgreSQL. You can play around with the current state of development on our website here.

As always, using the Sakila database, here’s a simple example as a teaser:

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
FOR XML RAW;

-- Db2, Oracle, PostgreSQL
SELECT xmlagg(xmlelement(
  NAME row,
  xmlattributes(
    t.first_name AS first_name,
    t.last_name AS last_name,
    t.title AS title
  )
))
FROM (
  -- Original query here
  SELECT a.first_name, a.last_name, f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
) AS t

Producing in both cases something like:

<row first_name="PENELOPE" last_name="GUINESS" title="OKLAHOMA JUMANJI"/>
<row first_name="PENELOPE" last_name="GUINESS" title="RULES HUMAN"/>
<row first_name="PENELOPE" last_name="GUINESS" title="SPLASH GUMP"/>
<row first_name="PENELOPE" last_name="GUINESS" title="VERTIGO NORTHWEST"/>
<row first_name="PENELOPE" last_name="GUINESS" title="WESTWARD SEABISCUIT"/>
<row first_name="PENELOPE" last_name="GUINESS" title="WIZARD COLDBLOODED"/>
<row first_name="NICK" last_name="WAHLBERG" title="ADAPTATION HOLES"/>
<row first_name="NICK" last_name="WAHLBERG" title="APACHE DIVINE"/>

FOR XML and FOR JSON concepts

As could be seen in the above teaser, the SQL Server syntax is far less verbose and concise, and it seems to produce a reasonable default behaviour, where the Db2, Oracle, PostgreSQL (and SQL Standard) SQL/XML APIs are more verbose, but also more powerful. For example, it is possible to map column a to attribute x and column b to a nested XML element y very easily.

The advantages of both approaches are clear. SQL Server’s approach is much more usable in the general case. But what is the general case? Let’s summarise a few key parallels between SQL result sets, and XML/JSON data structures:


Tables are XML elements or JSON arrays

Tables (i.e. sets of data) are not a foreign concept to both XML and JSON documents. The most natural way to represent a set of data in XML is a set of elements using the same element name, optionally wrapped by a wrapper element. For example:

<!-- With wrapper element -->
<films>
  <film title="OKLAHOMA JUMANJI"/>
  <film title="RULES HUMAN"/>
  <film title="SPLASH GUMP"/>
</films>

<!-- Without wrapper element -->
<film title="OKLAHOMA JUMANJI"/>
<film title="RULES HUMAN"/>
<film title="SPLASH GUMP"/>

The distinction of whether a wrapper element is added is mostly significant when nesting data.

With JSON, the obvious choice of data structure to represent a table is an array. For example:

[
  {"title": "OKLAHOMA JUMANJI"},
  {"title": "RULES HUMAN"},
  {"title": "SPLASH GUMP"}
]


Rows are XML elements or JSON objects

As we’ve already seen above, a SQL row is represented in XML using an element.

<film title="OKLAHOMA JUMANJI"/>

The question is only what the element name should be. It can usually be any of:

  • A standard name, such as “row”
  • The name of the table the row stems from
  • A custom name

In JSON, it is an object.

{"title": "OKLAHOMA JUMANJI"}

Unlike in XML, there is no such thing as an element name, so the row is “anonymous”. The row type is defined by what table / array the JSON object is contained in.


Column values are XML elements or attributes, or JSON attributes

We have a bit more choices of how to represent SQL column values in XML. Mainly two choices:

  • Represent values as attributes
  • Represent values as elements

Scalar values can easily be represented as attributes. If a value needs further nesting (e.g. an array, user defined type, etc.), then elements are a better choice. In most cases, the choice is not relevant, so we can pick both:

<!-- Using attributes -->
<film film_id="635" title="OKLAHOMA JUMANJI"/>

<!-- Using elements from table and column names -->
<film>
  <film_id>635</film_id>
  <title>OKLAHOMA JUMANJI</title>
</film>

<!-- Using standard element names
<row>
  <value name="film_id" value="635"/>
  <value name="title" value="OKLAHOMA JUMANJI"/>
</row>

There are a few other reasonable default options for the representation of a column value in XML.

In JSON, on the other hand, there are two main reasonable approaches. In most cases, an object will be chosen, where column values are identified by column name. But just like SQL records are a mixture between “structs” and “tuples”, we could imagine a representation that maps column values to array indexes as well:

// Using objects
{"film_id": 635, "title": "OKLAHOMA JUMANJI"}

// Using arrays
[635, "OKLAHOMA JUMANJI"]


GROUP BY and ORDER BY can be seen as a way to nest data

So far, all data was represented in a flat way, just like the SQL table. There was some nesting when wrapping XML elements of JSON arrays in some wrapper element or object, or when representing XML data with more elements rather than attributes, but the data was still always tabular.

Very often, we want to consume data in a hierarchical form, though. An actor played in films, so we’d like to group the films by actor, rather than repeating the actor information for every film. In general, operations like GROUP BY or ORDER BY serve this purpose. GROUP BY allows for aggregating all data into nested data structures per group (e.g. into strings, arrays, XML elements, JSON arrays, JSON objects). ORDER BY does the same, “visually” – perhaps a bit less formally. When we look at this set of XML elements, we can see visually that they’re “grouped” (i.e. ordered) by actor:

<row first_name="PENELOPE" last_name="GUINESS" title="OKLAHOMA JUMANJI"/>
<row first_name="PENELOPE" last_name="GUINESS" title="RULES HUMAN"/>
<row first_name="PENELOPE" last_name="GUINESS" title="SPLASH GUMP"/>
<row first_name="PENELOPE" last_name="GUINESS" title="VERTIGO NORTHWEST"/>
<row first_name="PENELOPE" last_name="GUINESS" title="WESTWARD SEABISCUIT"/>
<row first_name="PENELOPE" last_name="GUINESS" title="WIZARD COLDBLOODED"/>
<row first_name="NICK" last_name="WAHLBERG" title="ADAPTATION HOLES"/>
<row first_name="NICK" last_name="WAHLBERG" title="APACHE DIVINE"/>

SQL Server supports such grouping in at least two ways:

  • Implicitly by convention, using ORDER BY
  • Explicity by creating correlated subqueries

The implicit approach could transform the above flat representation into something like this:

<a first_name="PENELOPE" last_name="GUINESS">
    <f title="OKLAHOMA JUMANJI"/>
    <f title="RULES HUMAN"/>
    <f title="SPLASH GUMP"/>
    <f title="VERTIGO NORTHWEST"/>
    <f title="WESTWARD SEABISCUIT"/>
    <f title="WIZARD COLDBLOODED"/>
</a>
<a first_name="NICK" last_name="WAHLBERG">
    <f title="ADAPTATION HOLES"/>
    <f title="APACHE DIVINE"/>
</a>

… where “a” and “f” are the table names in the query (actor a and film f).

How do FOR XML and FOR JSON work in detail?

There are several features that can be combined in SQL Server. The complete picture can be seen from the docs. We’ll omit a few features in this blog post here.

  • The transformation algorithm RAW (flat results, only in XML), AUTO (hierarchical, automatic results), PATH (hierarchical, explicit results)
  • The “root” name, which corresponds to an XML wrapper element, or a JSON wrapper object
  • XML only: Whether values should be placed in ELEMENTS or attributes
  • JSON only: INCLUDE_NULL_VALUES specifies whether NULL values are explicit, or implicit (absent from the JSON object).
  • JSON only: WITHOUT_ARRAY_WRAPPER specifies whether the set of JSON objects should be listed as a JSON array, or a comma separated list of objects (which could be combined with other queries)

This is not complete, there are more flags and features, but instead of discussing them in theory, let’s look at a few examples:

FOR XML RAW

Producing flat results with attributes for values

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML RAW;

-- Standard SQL
SELECT xmlagg(xmlelement(
  NAME row,
  xmlattributes(
    t.first_name AS first_name,
    t.last_name AS last_name,
    t.title AS title
  )
))
FROM (
  SELECT a.first_name, a.last_name, f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) AS t

This produces

<row first_name="NICK" last_name="WAHLBERG" title="SMILE EARRING"/>
<row first_name="NICK" last_name="WAHLBERG" title="WARDROBE PHANTOM"/>
<row first_name="PENELOPE" last_name="GUINESS" title="ACADEMY DINOSAUR"/>
<row first_name="PENELOPE" last_name="GUINESS" title="ANACONDA CONFESSIONS"/>

FOR XML RAW, ROOT

Producing flat results with attributes for values, and a root element to wrap the listed elements

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML RAW, ROOT('rows');

-- Standard SQL
SELECT xmlelement(
  NAME rows,
  xmlagg(xmlelement(
    NAME row,
    xmlattributes(
      t.first_name AS first_name,
      t.last_name AS last_name,
      t.title AS title
    )
  ))
)
FROM (
  SELECT a.first_name, a.last_name, f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) AS t

This produces

<rows>
  <row first_name="NICK" last_name="WAHLBERG" title="SMILE EARRING"/>
  <row first_name="NICK" last_name="WAHLBERG" title="WARDROBE PHANTOM"/>
  <row first_name="PENELOPE" last_name="GUINESS" title="ACADEMY DINOSAUR"/>
  <row first_name="PENELOPE" last_name="GUINESS" title="ANACONDA CONFESSIONS"/>
</rows>

FOR XML RAW, ELEMENTS

Producing flat results with elements for values.

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML RAW, ELEMENTS;

-- Standard SQL
SELECT xmlagg(xmlelement(
  NAME row,
  xmlelement(
    NAME first_name,
    first_name
  ),
  xmlelement(
    NAME last_name,
    last_name
  ),
  xmlelement(
    NAME title,
    title
  )
))
FROM (
  SELECT a.first_name, a.last_name, f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
  FOR XML RAW, ELEMENTS
) AS t

This produces

<row>
    <first_name>NICK</first_name>
    <last_name>WAHLBERG</last_name>
    <title>SMILE EARRING</title>
</row>
<row>
    <first_name>NICK</first_name>
    <last_name>WAHLBERG</last_name>
    <title>WARDROBE PHANTOM</title>
</row>
<row>
    <first_name>PENELOPE</first_name>
    <last_name>GUINESS</last_name>
    <title>ACADEMY DINOSAUR</title>
</row>
<row>
    <first_name>PENELOPE</first_name>
    <last_name>GUINESS</last_name>
    <title>ANACONDA CONFESSIONS</title>
</row>

This could also be combined with ROOT, which we’re omitting for brevity.

FOR XML/JSON AUTO

This approach derives results completely automatically from your query structure. Mainly:

  • The SELECT clause defines in what order XML or JSON data is nested.
  • The FROM clause defines the table names (via aliasing), which are translated to XML element or JSON object attribute names.
  • The ORDER BY clause produces the “grouping”, which is translated to nesting XML elements or JSON objects.
-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML AUTO;

-- Standard SQL
SELECT xmlagg(e)
FROM (
  SELECT xmlelement(
    NAME a,
    xmlattributes(
      t.first_name AS first_name,
      t.last_name AS last_name
    ),
    xmlagg(xmlelement(
      NAME f,
      xmlattributes(t.title AS title)
    ))
  ) AS e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) AS t
  GROUP BY
    first_name,
    last_name
) AS t

Notice how this emulation requires two steps of XMLAGG with GROUP BY. It gets more hairy with more tables being joined and projected! I won’t add more complex examples here, but try it online!

This produces

<a first_name="NICK" last_name="WAHLBERG">
    <f title="SMILE EARRING"/>
    <f title="WARDROBE PHANTOM"/>
</a>
<a first_name="PENELOPE" last_name="GUINESS">
    <f title="ACADEMY DINOSAUR"/>
    <f title="ANACONDA CONFESSIONS"/>
</a>

Let’s try the same thing again with JSON:

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR JSON AUTO;

-- Standard SQL
SELECT json_arrayagg(e)
FROM (
  SELECT JSON_OBJECT(
    KEY 'FIRST_NAME' VALUE first_name,
    KEY 'LAST_NAME' VALUE last_name,
    KEY 'F' VALUE JSON_ARRAYAGG(JSON_OBJECT(
      KEY 'TITLE' VALUE title
      ABSENT ON NULL
    ))
    ABSENT ON NULL
  ) e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) t
  GROUP BY
    first_name,
    last_name
) t

The result being:

[
    {
        "first_name": "NICK",
        "last_name": "WAHLBERG",
        "f": [
            {
                "title": "SMILE EARRING"
            },
            {
                "title": "WARDROBE PHANTOM"
            }
        ]
    },
    {
        "first_name": "PENELOPE",
        "last_name": "GUINESS",
        "f": [
            {
                "title": "ACADEMY DINOSAUR"
            },
            {
                "title": "ANACONDA CONFESSIONS"
            }
        ]
    }
]

FOR XML/JSON AUTO, ROOT

Like before, we could wrap this in a root XML element or a root JSON object if need be.

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML AUTO, ROOT;

-- Standard SQL
SELECT xmlelement(
  NAME join,
  xmlagg(e)
)
FROM (
  SELECT xmlelement(
    NAME a,
    xmlattributes(
      t.first_name AS first_name,
      t.last_name AS last_name
    ),
    xmlagg(xmlelement(
      NAME f,
      xmlattributes(t.title AS title)
    ))
  ) e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) t
  GROUP BY
    first_name,
    last_name
) t

This does the same thing as before, but just wraps the previous root XMLAGG() element in another XMLELEMENT() function call.

This produces

<root>
    <a first_name="NICK" last_name="WAHLBERG">
        <f title="SMILE EARRING"/>
        <f title="WARDROBE PHANTOM"/>
    </a>
    <a first_name="PENELOPE" last_name="GUINESS">
        <f title="ACADEMY DINOSAUR"/>
        <f title="ANACONDA CONFESSIONS"/>
    </a>
</root>

Let’s try the same thing again with JSON:

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR JSON AUTO, ROOT;

-- Standard SQL
SELECT JSON_OBJECT(KEY 'a' VALUE json_arrayagg(e))
FROM (
  SELECT JSON_OBJECT(
    KEY 'FIRST_NAME' VALUE first_name,
    KEY 'LAST_NAME' VALUE last_name,
    KEY 'F' VALUE JSON_ARRAY_AGG(JSON_OBJECT(
      KEY 'TITLE' VALUE title
      ABSENT ON NULL
    ))
    ABSENT ON NULL
  ) e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) t
  GROUP BY
    first_name,
    last_name
) t

The result being:

{
    "a": [
        {
            "first_name": "NICK",
            "last_name": "WAHLBERG",
            "f": [
                {
                    "title": "SMILE EARRING"
                },
                {
                    "title": "WARDROBE PHANTOM"
                }
            ]
        },
        {
            "first_name": "PENELOPE",
            "last_name": "GUINESS",
            "f": [
                {
                    "title": "ACADEMY DINOSAUR"
                },
                {
                    "title": "ANACONDA CONFESSIONS"
                }
            ]
        }
    ]
}

FOR XML AUTO, ELEMENTS

Like before, instead of producing attributes, we might decide to produce elements instead (in XML only):

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML AUTO, ELEMENTS;

-- Standard SQL
SELECT xmlagg(e)
FROM (
  SELECT xmlelement(
    NAME a,
    xmlelement(
      NAME first_name,
      first_name
    ),
    xmlelement(
      NAME last_name,
      last_name
    ),
    xmlagg(xmlelement(
      NAME f,
      xmlelement(
        NAME title,
        title
      )
    ))
  ) e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) t
  GROUP BY
    first_name,
    last_name
) t

Not much has changed, except for the fact that a set of XMLELEMENT() calls are made, rather than XMLATTRIBUTES() callse.

This produces

<a>
    <first_name>NICK</first_name>
    <last_name>WAHLBERG</last_name>
    <f>
        <title>SMILE EARRING</title>
    </f>
    <f>
        <title>WARDROBE PHANTOM</title>
    </f>
</a>
<a>
    <first_name>PENELOPE</first_name>
    <last_name>GUINESS</last_name>
    <f>
        <title>ACADEMY DINOSAUR</title>
    </f>
    <f>
        <title>ANACONDA CONFESSIONS</title>
    </f>
</a>

FOR XML/JSON PATH

The PATH strategy is my personal favourite. It is used to create nested XML or JSON path structures more explicitly, and also allows for additional nesting levels when grouping projections together. This is best shown by example. Notice, how I’m now using aliases for my columns, and the alias looks like an XPath expression using '/' (slashes):

-- SQL Server
SELECT 
  a.first_name AS [author/first_name], 
  a.last_name AS [author/last_name], 
  f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML PATH;

-- Standard SQL
SELECT xmlagg(xmlelement(
  NAME row,
  xmlelement(
    NAME author,
    xmlelement(
      NAME first_name,
      "author/first_name"
    ),
    xmlelement(
      NAME last_name,
      "author/last_name"
    )
  ),
  xmlelement(
    NAME title,
    title
  )
))
FROM (
  SELECT 
    a.first_name AS "author/first_name", 
    a.last_name AS "author/last_name", 
    f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) t

Check out how by convention, we’re now getting an additional level of nesting for author related columns under the row/author element:

<row>
    <author>
        <first_name>NICK</first_name>
        <last_name>WAHLBERG</last_name>
    </author>
    <title>SMILE EARRING</title>
</row>
<row>
    <author>
        <first_name>NICK</first_name>
        <last_name>WAHLBERG</last_name>
    </author>
    <title>WARDROBE PHANTOM</title>
</row>
<row>
    <author>
        <first_name>PENELOPE</first_name>
        <last_name>GUINESS</last_name>
    </author>
    <title>ACADEMY DINOSAUR</title>
</row>
<row>
    <author>
        <first_name>PENELOPE</first_name>
        <last_name>GUINESS</last_name>
    </author>
    <title>ANACONDA CONFESSIONS</title>
</row>

This is really neat! The SQL Server syntax is definitely much more convenient for this common use-case.

Let’s try the same thing again with JSON. The only thing we change is we now use a JSON-path-ish syntax using dots ('.') rather than slashes ('/'):

-- SQL Server
SELECT 
  a.first_name AS [author.first_name], 
  a.last_name AS [author.last_name], 
  f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR JSON PATH;

-- Standard SQL
SELECT JSON_ARRAYAGG(JSON_OBJECT(
  KEY 'author' VALUE JSON_OBJECT(
    KEY 'first_name' VALUE author.first_name,
    KEY 'last_name' VALUE author.last_name
  ),
  KEY 'TITLE' VALUE title
  ABSENT ON NULL
))
FROM (
  SELECT 
    a.first_name AS "author.first_name", 
    a.last_name AS "author.last_name", 
    f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) t

The result being (again, with nested objects):

[
    {
        "author": {
            "first_name": "NICK",
            "last_name": "WAHLBERG"
        },
        "title": "SMILE EARRING"
    },
    {
        "author": {
            "first_name": "NICK",
            "last_name": "WAHLBERG"
        },
        "title": "WARDROBE PHANTOM"
    },
    {
        "author": {
            "first_name": "PENELOPE",
            "last_name": "GUINESS"
        },
        "title": "ACADEMY DINOSAUR"
    },
    {
        "author": {
            "first_name": "PENELOPE",
            "last_name": "GUINESS"
        },
        "title": "ANACONDA CONFESSIONS"
    }
]

For more sophisticated nesting, including nesting of collections, a correlated subquery is needed in SQL Server, also with a FOR XML or FOR JSON syntax.

Conclusion

XML and JSON are popular document formats outside and inside of the database. SQL Server has some of the most conventient syntax for most cases, while standard SQL supports much more basic, and thus more powerful constructs. In standard SQL, almost any kind of XML or JSON projection is possible, and with XMLTABLE() and JSON_TABLE(), the documents can be transformed back to SQL tables, as well. In many applications, using these XML or JSON features natively would lead to much less boilerplate code, as many applications do not need middleware between the database and some client, just to transform data between formats.

Most ORMs don’t expose this functionality for a variety of reasons, the main one being that the devil is in the details. While both XML and JSON are nicely standardised, the implementations differ greatly:

  • The SQL/XML standard is implemented mostly by DB2, Oracle, and PostgreSQL. Many dialects offer some XML capabilities, but not as impressive as the standard and the previous three. SQL Server has FOR XML which is very powerful for standard XML serialisations, but may be a bit difficult to use for edge cases
  • The SQL/JSON standard was added late and is implemented again to large extents by DB2 and Oracle, but inceasingly also by MariaDB and MySQL. PostgreSQL (and by consequence, compatible dialects, like CockroachDB) had their own proprietary functions and APIs, which are not compatible with the standard. And again, SQL Server has FOR JSON which works well for standard serialisations, but a bit less well for edge cases

These technologies are poorly adopted in clients because of the many subtle differences. jOOQ has been leveling out these minor differences for many years without hiding the core functionality. SQL/XML and SQL/JSON are perfect use-cases for jOOQ 3.14 (due in Q2 2020), which now allows for using both the standard SQL/XML and SQL/JSON syntaxes as well as the SQL Server FOR XML and FOR JSON syntax in the jOOQ Professional and Enterprise Editions.

Before jOOQ 3.14 is out, you can already play with the current functionality on our website: https://www.jooq.org/translate

The Many Flavours of the Arcane SQL MERGE Statement

The SQL MERGE statement is a device whose mystery is only exceeded by its power.

A simple example shows its full power according to standard SQL. Imagine you have a production table for product prices, and a staging table from which you want to load the latest prices. For once, I’m using the Db2 LuW MERGE syntax, because that’s the most standards compliant syntax out there (among the dialects we support in jOOQ):

DROP TABLE IF EXISTS prices;
DROP TABLE IF EXISTS staging;

CREATE TABLE prices (
  product_id BIGINT NOT NULL PRIMARY KEY,
  price DECIMAL(10, 2) NOT NULL,
  price_date DATE NOT NULL,
  update_count BIGINT NOT NULL
);

CREATE TABLE staging (
  product_id BIGINT NOT NULL PRIMARY KEY,
  price DECIMAL(10, 2) NOT NULL
);

DELETE FROM prices;
DELETE FROM staging;
INSERT INTO staging 
VALUES (1, 100.00),
       (2, 125.00),
       (3, 150.00);

So, we’ve loaded a few records in our staging table, which we now want to merge over to the prices table. We could just insert them, easily, but we’ll stage more prices later on, e.g. these ones:

DELETE FROM staging;
INSERT INTO staging 
VALUES (1, 100.00),
       (2,  99.00),
       (4, 300.00);

We want our logic to be this:

  • All new IDs in the staging table should just be inserted into the prices table.
  • Existing IDs should be updated if and only if the price has changed. In that case, the update_count should increase.
  • Prices that are no longer encountered in the staging table should be deleted from the prices table, to implement a full sync, rather than a delta sync, for the sake of the example. We could also add a “command” column that contains the instruction on whether data should be updated or deleted, to implement a delta sync.

So, this is the Db2 (and standards compliant) MERGE statement that we use for the job:

MERGE INTO prices AS p
USING (
  SELECT COALESCE(p.product_id, s.product_id) AS product_id, s.price
  FROM prices AS p
  FULL JOIN staging AS s ON p.product_id = s.product_id
) AS s
ON (p.product_id = s.product_id)
WHEN MATCHED AND s.price IS NULL THEN DELETE
WHEN MATCHED AND p.price != s.price THEN UPDATE SET 
  price = s.price,
  price_date = CURRENT_DATE,
  update_count = update_count + 1
WHEN NOT MATCHED THEN INSERT 
  (product_id, price, price_date, update_count)
VALUES 
  (s.product_id, s.price, CURRENT_DATE, 0);

Easy eh?

Hm, not so simple if you haven’t written MERGE statements all your life. If so, don’t be scared. As most of SQL, the scary part is syntax (keywords, UPPER CASE, etc.). The underlying concepts are simpler than they may seem at first. Let’s go through it step by step. It has 4 parts:

1. The target table

Just like with INSERT statements, we can define where we want to MERGE the data INTO. This is the simple part:

MERGE INTO prices AS p
-- ...

2. The source table

The USING keyword wraps a source table that we want to merge. We could have just placed the staging table here, but I wanted to enrich the source data with some additional data first. I’m using a FULL JOIN to produce the matching between old data (prices) and new data (staging). If, after filling the staging table for the second time, but before running the MERGE statement, we ran the USING clause alone (with some minor modifications for illustration):

SELECT 
  COALESCE(p.product_id, s.product_id) AS product_id, 
  p.price AS old_price, 
  s.price AS new_price
FROM prices AS p
FULL JOIN staging AS s ON p.product_id = s.product_id
ORDER BY product_id

Then we would get this:

PRODUCT_ID|OLD_PRICE|NEW_PRICE|
----------|---------|---------|
         1|   100.00|   100.00| <-- same price
         2|   125.00|    99.00| <-- updated price
         3|   150.00|         | <-- deleted price
         4|         |   300.00| <-- added price

Neat!

3. The ON clause

Next, we RIGHT JOIN the target table and source tables using an ON clause, just like with an ordinary JOIN:

ON (p.product_id = s.product_id)

MERGE always uses RIGHT JOIN semantics, which is why I placed a FULL JOIN in the source table, the USING clause. It is totally possible to write things a bit differently, such that we can avoid accessing the prices table twice, but I want to show the full power of this statement. Notice, SQL Server joins source and target tables using a FULL JOIN, as I’ll explain further down. I’ll also explain why RIGHT JOIN, right away.

4. The WHEN clauses

Now comes the interesting part! There can either be a match between the two tables (target and source), like when we get the result of an INNER JOIN, or no such match because the source table contains a row that is not matched by the target table (the RIGHT JOIN semantics). In our example, PRODUCT_ID IN (1, 2, 3) will produce a match (contained in both source and target tables), whereas PRODUCT_ID = 4 will not produce a match (not contained in the target table yet).

Colouring our source data set:

PRODUCT_ID|OLD_PRICE|NEW_PRICE|
----------|---------|---------|
         1|   100.00|   100.00| <-- matched (but ignored)
         2|   125.00|    99.00| <-- matched
         3|   150.00|         | <-- matched
         4|         |   300.00| <-- not matched

The following are a sequence of matching instructions that will be executed in the order of appearance, for each row resulting from the previous RIGHT JOIN:

-- With my FULL JOIN, I've produced NULL price values
-- whenever a PRODUCT_ID is in the target table, but not
-- in the source table. These rows, we want to DELETE
WHEN MATCHED AND s.price IS NULL THEN DELETE

-- When there is a price change (and only then), we 
-- want to update the price information in the target table.
WHEN MATCHED AND p.price != s.price THEN UPDATE SET 
  price = s.price,
  price_date = CURRENT_DATE,
  update_count = update_count + 1

-- Finally, when we don't have a match, i.e. a row is
-- in the source table, but not in the target table, then
-- we simply insert it into the target table.
WHEN NOT MATCHED THEN INSERT 
  (product_id, price, price_date, update_count)
VALUES 
  (s.product_id, s.price, CURRENT_DATE, 0);

It’s not too complicated, just a lot of keywords and syntax. So, after running this MERGE on the second set of data of the staging table, we’ll get this result in the price table:

PRODUCT_ID|PRICE |PRICE_DATE|UPDATE_COUNT|
----------|------|----------|------------|
         1|100.00|2020-04-09|           0|
         2| 99.00|2020-04-09|           1|
         4|300.00|2020-04-09|           0|

The way I expressed this MERGE statement, it is idempotent, i.e. I can run it again on the same staging table content, and it won’t modify any data in the price table – because none of the WHEN clauses applies. Idempotence is not a property of MERGE, I just wrote my statement this way.

Dialect specifics

A few dialects support MERGE. Among the ones that jOOQ 3.13 supports, there are at least:

  • Db2
  • Derby
  • Firebird
  • H2
  • HSQLDB
  • Oracle
  • SQL Server
  • Sybase SQL Anywhere
  • Teradata
  • Vertica

For once, regrettably, this list does not include PostgreSQL. But even the dialects in this list do not all agree on what MERGE really is. The SQL standard specifies 3 features, each one optional:

  • F312 MERGE statement
  • F313 Enhanced MERGE statement
  • F314 MERGE statement with DELETE branch

But instead of looking at the standards and what they require, let’s look at what the dialects offer, and how it can be emulated if something is not available.

The AND clause

As you may have noticed, this post uses the syntax:

WHEN MATCHED AND <some predicate> THEN

It’s also possible to specify

WHEN NOT MATCHED AND <some predicate> THEN

With the exception of Teradata, most dialects support these AND clauses (Oracle has a specific syntax using WHERE for this, which I’ll get to later).

The point of these clauses is to be able to have several of these WHEN MATCHED or WHEN NOT MATCHED clauses, in fact an arbitrary number of them. Unfortunately, this isn’t supported by all dialects. Some dialects only support one clause of each type (INSERT, UPDATE, DELETE). It isn’t strictly necessary to support several clauses, but it is a lot more convenient as we’ll see below.

These dialects do not support multiple WHEN MATCHED or WHEN NOT MATCHED clauses:

  • HSQLDB
  • Oracle
  • SQL Server
  • Teradata

If a dialect does not support AND, or if it does not support multiple WHEN MATCHED clauses, simply translate the clauses to case expressions. Instead of our previous WHEN clauses, we’d get:

-- The DELETE clause doesn't make much sense without AND,
-- So there's not much we can do about this emulation in Teradata.
WHEN MATCHED AND s.price IS NULL THEN DELETE

-- Repeat the AND clause in every branch of the CASE
-- Expression where it applies
WHEN MATCHED THEN UPDATE SET 
  price = CASE

    -- Update the price if the AND clause applies
    WHEN p.price != s.price THEN s.price,

    -- Otherwise, leave it untouched
    ELSE p.price
  END

  -- Repeat for all columns
  price_date = CASE
    WHEN p.price != s.price THEN CURRENT_DATE
    ELSE p.price_date
  END,
  update_count = CASE
    WHEN p.price != s.price THEN update_count + 1
    ELSE p.update_count
  END

-- Unchanged, in this case
WHEN NOT MATCHED THEN INSERT 
  (product_id, price, price_date, update_count)
VALUES 
  (s.product_id, s.price, CURRENT_DATE, 0);

The formalism is this one:

If there is no AND, add AND

These are the same:

WHEN MATCHED THEN [ UPDATE | DELETE ]
WHEN MATCHED AND 1 = 1 THEN [ UPDATE | DELETE ]

This replacement may be needed in Firebird (which has a bug in this area) and SQL Server (which does not allow for WHEN MATCHED clauses after a WHEN MATCHED clause without AND clause, which is kind of a linting error). Instead of emulating stuff, you can just skip all the subsequent WHEN MATCHED branches, because they will not apply. Every row is updated only once, i.e. by only one WHEN clause:

Every row is updated only once

Make sure no row is updated more than once in the emulation, as required by the standard. When writing this:

WHEN MATCHED AND p1 THEN UPDATE SET c1 = 1
WHEN MATCHED AND p2 THEN DELETE
WHEN MATCHED AND p3 THEN UPDATE SET c1 = 3, c2 = 3
WHEN MATCHED AND p4 THEN DELETE

This really means the same as:

WHEN MATCHED AND p1 THEN UPDATE SET c1 = 1
WHEN MATCHED AND NOT p1 AND p2 THEN DELETE
WHEN MATCHED AND NOT p1 AND NOT p2 AND p3 THEN UPDATE SET c1 = 3,c2 = 3
WHEN MATCHED AND NOT p1 AND NOT p2 AND NOT p3 AND p4 THEN DELETE

To emulate the above, write this instead:

WHEN MATCHED AND 
  p1 OR
  NOT p1 AND NOT p2 AND p3
THEN UPDATE SET 
  c1 = CASE 
    WHEN p1                       THEN 1
    WHEN NOT p1 AND NOT p2 AND p3 THEN 3
                                  ELSE c1
  END,
  c2 = CASE
    WHEN NOT p1 AND NOT p2 AND p3 THEN 3
                                  ELSE c2
  END
WHEN MATCHED AND 
  NOT p1 AND p2 OR
  NOT p1 AND NOT p2 AND NOT p3 AND p4
THEN DELETE

Quite laborious, but that’s how it is.

H2 and HSQLDB

Notice that both H2 and HSQLDB do not follow the “every row is updated only once” rule. I’ve reported this to H2 already: https://github.com/h2database/h2database/issues/2552. If you want to be standards compliant (jOOQ 3.14 will emulate this for you, don’t worry), then you have to do the above CASE expression madness in these dialects, or, in H2 (HSQLDB doesn’t support multiple WHEN MATCHED clauses of the same type) enhance all the WHEN MATCHED AND clauses as I’ve illustrated before:

WHEN MATCHED AND p1 THEN UPDATE SET c1 = 1
WHEN MATCHED AND NOT p1 AND p2 THEN DELETE
WHEN MATCHED AND NOT p1 AND NOT p2 AND p3 THEN UPDATE SET c1 = 3,c2 = 3
WHEN MATCHED AND NOT p1 AND NOT p2 AND NOT p3 AND p4 THEN DELETE

Oracle

Oracle doesn’t support AND here, but some interesting vendor specific syntax. It looks reasonable at first, but it’s really funky.

  • After UPDATE, you can add a WHERE clause, which is the same thing as AND. So far so good.
  • You can also add a DELETE WHERE clause, but only together with an UPDATE. So you cannot DELETE without updating. Fine, we weren’t going to, in our example.
  • However, the interesting thing is that the UPDATE / DELETE command is executed together, and DELETE happens after UPDATE. So the same row gets processed twice. If you use WHERE in UPDATE, only rows included in the UPDATE can also be included in DELETE. I mean, why would you update the rows first, prior to deletion?

This means that our standard clauses:

WHEN MATCHED AND p1 THEN UPDATE SET c1 = 1
WHEN MATCHED AND p2 THEN DELETE
WHEN MATCHED AND p3 THEN UPDATE SET c1 = 3, c2 = 3
WHEN MATCHED AND p4 THEN DELETE

Would need to be emulated like this:

WHEN MATCHED 
THEN UPDATE SET 
  c1 = CASE 
    WHEN p1 THEN 1  -- Normal update for WHEN MATCHED AND p1 clause
    WHEN p2 THEN c1 -- "Touch" record for later deletion
    WHEN p3 THEN 3  -- Normal update for WHEN MATCHED AND p3 clause
    WHEN p4 THEN c1 -- "Touch" record for later deletion
            ELSE c1
  END,
  c2 = CASE
    WHEN p1 THEN c2 -- p1 is not affecting c2
    WHEN p2 THEN c2 -- "Touch" record for later deletion
    WHEN p3 THEN 3  -- Normal update for WHEN MATCHED AND p3 clause
    WHEN p4 THEN c2 -- "Touch" record for later deletion
            ELSE c2
  END

-- Any predicate from any AND clause, regardless if UPDATE or DELETE
WHERE p1 OR p2 OR p3 OR p4

-- Repeat the predicates required for deletion
DELETE WHERE 
  NOT p1 AND p2 OR
  NOT p1 AND NOT p2 AND NOT p3 AND p4

It was just a simple MERGE statement in standard SQL syntax!

There is an additional level of trickiness here, which I will not cover in this blog post (but we might handle it in jOOQ). In Oracle, the DELETE WHERE clause can already see the updates performed by the UPDATE clause. This means that if, for example, p2 depends on the value of c1:

  c1 = CASE 
    ...
    WHEN p2 THEN c1 -- "Touch" record for later deletion
    ...
  END,

Then the evaluation of p2 in DELETE WHERE will be affected by this

DELETE WHERE 
  NOT p1 AND p2 OR
  NOT p1 AND NOT p2 AND NOT p3 AND p4

The c1 in these p2 expressions will not be the same c1 as the c1 in the UPDATE clause. It’s obviously possible to manage this as well to some extent, through variable substitution.

SQL Server BY SOURCE and BY TARGET

SQL Server has a useful extension to the WHEN NOT MATCHED clause, which I think belongs in the SQL standard!

With this extension, you can specify whether you want to perform INSERT actions WHEN NOT MATCHED [ BY TARGET ] (the default that everyone else supports as well), or WHEN NOT MATCHED BY SOURCE (in case of which you can perform another UPDATE or DELETE action.

The BY TARGET clause means that we found a row in the source table but not in the target table. The BY SOURCE clause means that we found a row in the target table but not in the source table. This means that in SQL Server, the target and source tables are FULL OUTER JOINed, not RIGHT OUTER JOINed, which would mean that our original statement can be greatly simplified:

MERGE INTO prices AS p
USING staging AS s
ON (p.product_id = s.product_id)
WHEN NOT MATCHED BY SOURCE THEN DELETE
WHEN MATCHED AND p.price != s.price THEN UPDATE SET 
  price = s.price,
  price_date = getdate(),
  update_count = update_count + 1
WHEN NOT MATCHED BY TARGET THEN INSERT 
  (product_id, price, price_date, update_count)
VALUES 
  (s.product_id, s.price, getdate(), 0);

We could colour again the rows encountered here:

PRODUCT_ID|  P.PRICE|  S.PRICE|
----------|---------|---------|
         1|   100.00|   100.00| <-- matched (but ignored)
         2|   125.00|    99.00| <-- matched
         3|   150.00|         | <-- not matched by source
         4|         |   300.00| <-- not matched by target

As can be seen, this is really just how a FULL OUTER JOIN works.

The emulation of these clauses back into standard SQL are laborious too, as we’d have to emulate this FULL OUTER JOIN explicitly. I think it’s possible, but we might not implement it in jOOQ.

Vertica

Only Vertica seems to not support the DELETE branch, meaning you cannot use a MERGE statement to DELETE data from your target table. You can use it only to INSERT or UPDATE data, which is good enough in almost all cases. Curiously, Teradata supports DELETE, but not AND, which seems kinda pointless, as DELETE and UPDATE cannot be combined this way.

Conclusion

The MERGE statement is a device whose mystery is only exceeded by its power. In a simple form (no AND or WHERE clauses, no DELETE clauses), all dialects pretty much agree, and that’s already a very useful feature set, which jOOQ has supported for a long time. Starting from jOOQ 3.14, we’re tackling also all the other features listed in this article to help you write complex, vendor agnostic MERGE statements and emulate them on all the dialects that have some MERGE support.

Want to play around with it already now? Check out our free online SQL translation tool.

What’s a “String” in the jOOQ API?

One of jOOQ’s biggest strength is the fact that it is a type safe SQL API. “Type safe”, in this context, means that every object that you put in a jOOQ query has a well defined type, such as:

These can be used in jOOQ in a type safe way as such:

ctx.select(T.A)      // A Field
   .from(T)          // A Table
   .where(T.B.eq(1)) // A Condition
   .fetch();

No Strings here, no string concatenation.

However, there are a few cases where you want to bypass type safety, for whatever reason, including to extend jOOQ e.g. with plain SQL templating. In those cases, you would pass a “String” object to the jOOQ API. But not every such String object is the same. What different types of strings are there? There are 4 main types of String in the jOOQ API:

1. The bind value

The most obvious type of string is the bind value or literal. You can create those explicitly using:

// As always, this static import is implied
import static org.jooq.impl.DSL.*;

Field<String> bind = val("abc");
Field<String> literal = inline("xyz");

By default, the first value will produce a bind parameter marker "?" in the generated SQL, whereas the second one will produce an escaped string literal 'xyz'. You’re probably already doing this implicitly. Whenever you pass a String value where the jOOQ API expects a T type, you’re implicitly wrapping your String value using DSL.val(T):

ctx.select(T.A)
   .from(T)
   .where(T.C.eq("xyz")) // Implicit bind value
   .fetch();

This is still type safe usage of a String value, as it is really wrapped as a Field<String>

2. The plain SQL template

Whenever jOOQ is missing some vendor specific functionality, the backdoor is to use plain SQL templating. You can create plain SQL templates explicitly for the most popular types as shown above, like this:

Field<Integer> field = field("(1 + 2)", SQLDataType.INTEGER);
Table<?> table       = table("generate_series(1, 10)");
Condition condition  = condition("some_function() = 1");

These expressions can now be embedded in your queries, like any other:

ctx.select(field)
   .from(table)
   .where(condition)
   .fetch();

Alternatively, there also exists convenience overloads on some query methods to make this simpler:

ctx.select(field("(1 + 2)", SQLDataType.INTEGER)) // Not on SELECT
   .from("generate_series(1, 10)")
   .where("some_function() = 1")
   .fetch();

Notice that the select() method doesn’t have such convenience API yet, as of jOOQ 3.13

Important disclaimer: Using these APIs, you’re going to expose yourself to the usual SQL injection risk that is present also with JDBC or JPQL usage, when composing SQL from strings. NEVER concatenate plain SQL templates, nor use user input in these strings. Use the templating language instead, and turn every user input into bind variables. Examples:

ctx.select(...)
   .from(...)
   .where("some_function() = ?", 1) // Bind variable
   .fetch();

ctx.select(...)
   .from(...)
   .where("some_function() = {0}", val(1)) // Templating
   .fetch();

If you encounter String types on most query API in jOOQ, it’s for plain SQL templating. All of this API is annotated with @org.jooq.PlainSQL for additional documentation purposes, and to be preprocessed with a static checker that can be used to disallow such API usage by default, for added security.

3. Names (identifiers)

But on some query API in jOOQ, Strings are not convenience for plain SQL templating, but for names and identifiers. Namely, all DDL statements use strings this way in their API.

You can create qualified or unqualified identifiers explicitly as follows:

// Unqualified table identifier
Name table = name("t");

// Qualified column identifier
Name field = name("t", "col");

And then, use those identifiers in your DDL statements, e.g. to create the table:

ctx.createTable(table)
   .column(field, SQLDataType.INTEGER)
   .execute();

Depending on the context, qualification is necessary or not. In this case, field qualification would not have been necessary.

For convenience, you can also just use String types on the createTable(String) API:

ctx.createTable("t")
   .column("col", SQLDataType.INTEGER)
   .execute();

These strings will simply be wrapped in DSL.name(String) as illustrated before.

Beware: In jOOQ, all identifiers are quoted by default (RenderQuotedNames.EXPLICIT_DEFAULT_QUOTED). This has two benefits:

  • Special characters, and more importantly, keyword conflicts are handled correctly out of the box
  • Quoting prevents SQL injection
  • Case sensitivity is handled correctly in those dialects that support it for quoted identifiers

The price to pay for this convenience is that quoted identifiers might turn case sensitive, when this is undesired. To work around this, you can turn off quoting using your settings, e.g. by setting RenderQuotedNames.EXPLICIT_DEFAULT_UNQUOTED. But beware, this will put you again at SQL injection risk if you don’t sanitise your identifier names first!

4. Keywords

Keywords are strings too, in jOOQ. In rare cases, you may want to wrap your keyword-as-string representation in a org.jooq.Keyword type. The main benefit (as of jOOQ 3.13) is a consistent keyword style. There is no convenience API for this, as it is very rare for client code to use this feature. There’s only DSL.keyword(String):

Keyword current = keyword("current");
Keyword current = keyword("time");

You can now use keywords in plain SQL templates:

Field<Time> currentTime = field(
  "{0} {1}", 
  SQLDataType.TIME, 
  current, time
);

Create Empty Optional SQL Clauses with jOOQ

When using jOOQ to create dynamic SQL statements (one of jOOQ’s core value propositions), it is often necessary to add query elements conditionally, with a default “No-op” behaviour. For first time users, this default “no-op” behaviour is not always obvious as the jOOQ API is vast, and as with any vast API, there are many different options to do similar things.

How not to do it

A common pitfall is to be tempted to work with the many XYZStep types. What types are these? They are usually invisible to the developer as developers use jOOQ’s DSL API in a fluent fashion, just like the JDK Stream API. For example:

DSLContext ctx = ...;

Result<?> result =
ctx.select(T.A, T.B)
   .from(T)
   .where(T.C.eq(1))
   .and(T.D.eq(2))
   .fetch();

Let’s decompose the above query to see what happens in the API. We could assign every method result to a local variable:

SelectFromStep<?> s1 = ctx.select(T.A, T.B);
SelectWhereStep<?> s2 = s1.from(T);
SelectConditionStep<?> s3 = s2.where(T.C.eq(1));
SelectConditionStep<?> s4 = s3.and(T.D.eq(2))

Result<?> result = s4.fetch();

Our previous fluent API design blog post explains this API design technique.

This is not what people usually do with “static SQL” statements, but they might be tempted to do this if they wanted to add the last predicate (T.D = 2) conditionally, e.g:

DSLContext ctx = ...;

SelectConditionStep<?> c =
ctx.select(T.A, T.B)
   .from(T)
   .where(T.C.eq(1));

if (something)
    c = c.and(T.D.eq(2));

Result<?> result = c.fetch();

This is perfectly valid API usage, but we do not recommend it because it is very messy and leads to difficult to maintain client code. Also, it is absolutely unnecessary, because there is a better way:

Composing queries from its parts

The problem with the above approach is that it is trying to use an imperative approach of adding things to a query step by step. This is how many developers tend to structure their code, but with SQL (and by consequence, jOOQ) that can turn out to be difficult to get right. A functional approach tends to work better.

Notice that not only the entire DSL structure could be assigned to local variables, but also the individual SELECT clause arguments. For example:

DSLContext ctx = ...;

List<SelectField<?>> select = Arrays.asList(T.A, T.B);
Table<?> from = T;
Condition where = T.C.eq(1).and(T.D.eq(2));

Result<?> result =
ctx.select(select)
   .from(from)
   .where(where)
   .fetch();

In fact, every jOOQ query is a dynamic SQL query. Many queries just happen to look like static queries, due to jOOQ’s API design.

Again, we wouldn’t be assigning every SELECT clause argument to a local variable, only the truly dynamic ones. For example:

DSLContext ctx = ...;

Condition where = T.C.eq(1);

if (something)
    where = where.and(T.D.eq(2));

Result<?> result =
ctx.select(T.A, T.B)
   .from(T)
   .where(where)
   .fetch();

This already looks quite decent.

Avoid breaking readability

A lot of people aren’t happy with this approach either, because it breaks a query’s readability by making its components non-local. The predicates in the query are declared up front, away from the query itself. This isn’t how many people like to reason about SQL.

And you don’t have to! It is totally possible to embed the condition directly in the WHERE clause like this:

DSLContext ctx = ...;

Result<?> result =
ctx.select(T.A, T.B)
   .from(T)

   // We always need this predicate
   .where(T.C.eq(1))

   // This is only added conditionally
   .and(something
      ? T.D.eq(2)
      : DSL.noCondition())
   .fetch();

The magic is in the above usage of DSL.noCondition, which is a pseudo predicate that does not generate any content. It is a placeholder where an org.jooq.Condition type is required without actually materialising one.

There is also:

… but that requires having to think about these identities and the reductions all the time. Also, if you append many of these trueCondition() or falseCondition() to a query, the resulting SQL tends to be quite ugly, for example for people having to analyse performance in production. noCondition() just never generates any content at all.

Note that noCondition() does not act as an identity! If your noCondition() is the only predicate left in a WHERE clause, there will not be any WHERE clause, regardless if you work with AND predicates or OR predicates.

No-op expressions in jOOQ

When using dynamic SQL like this, and adding things conditionally to queries, such “no-op expressions” become mandatory. In the previous example, we’ve seen how to add a “no-op predicate” to a WHERE clause (the same approach would obviously work with HAVING and all other clauses that work with boolean expressions).

The three most important jOOQ query types are:

Users may want to add all of these conditionally to queries.

org.jooq.Condition

We’ve already seen how to do this with org.jooq.Condition.

org.jooq.Field

What about dynamic column expressions in the projection (the SELECT clause)? Assuming you want to project columns only in certain cases. In our example, the T.B column is something we don’t always need. That’s easy! The same approach can be used (assuming T.B is a string column):

DSLContext ctx = ...;

Result<Record2<String, String>> result =
ctx.select(T.A, something ? T.B : DSL.inline("").as(T.B))
   .from(T)
   .where(T.C.eq(1))
   .and(T.D.eq(2))
   .fetch();

Using inlined parameters via DSL.inline(), you can easily produce a no-op value in your projection if you don’t want to modify the projection’s row type. The advantage is that you can now use this subquery in a union that expects two columns:

DSLContext ctx = ...;

Result<Record2<String, String>> result =

// First union subquery has a conditionally projected column
ctx.select(T.A, something ? T.B : DSL.inline("").as(T.B))
   .from(T)
   .where(T.C.eq(1))
   .and(T.D.eq(2))

   .union(

// Second union subquery has no such conditions
    select(U.A, U.B)
   .from(U))
   .fetch();

You can take this one step further, and make an entire union subquery conditional this way!

DSLContext ctx = ...;

Result<Record2<String, String>> result =

// First union subquery has a conditionally projected column
ctx.select(T.A, T.B)
   .from(T)
   .union(
      something
        ? select(U.A, U.B).from(U)
        : select(inline(""), inline("")).where(falseCondition())
   )
   .fetch();

This is a bit more syntactic work, but it’s nice to see how easy it is to add something conditionally to a jOOQ query without making the query completely unreadable. Everything is local to where it is being used. No local variables are needed, no imperative control flow is invoked.

And because everything is now an expression (and not a statement / no control flow), we can factor out parts of this query into auxiliary methods, that can be made reusable.

org.jooq.Table

Conditional table expressions usually appear when doing conditional joins. This is usually not done in isolation, but together with other conditional elements in a query. E.g. if some columns are projected conditionally, those columns may require an additional join, as they originate from another table than the tables that are used unconditionally. For example:

DSLContext ctx = ...;

Result<?> result =
ctx.select(
      T.A, 
      T.B, 
      something ? U.X : inline(""))
   .from(
      something
      ? T.join(U).on(T.Y.eq(U.Y))
      : T)
   .where(T.C.eq(1))
   .and(T.D.eq(2))
   .fetch();

There isn’t a more simple way to produce the conditional JOIN expression, because JOIN and ON need to be provided independently. For simple cases as shown above, this is perfectly fine. In more complex cases, some auxiliary methods may be needed, or views.

Conclusion

There are two important messages here in this post:

  1. The XYZStep types are auxiliary types only. They are there to make your dynamically constructed SQL statement look like static SQL. But you should never feel the need to assign them to local variables, or return them from methods. While it is not wrong to do so, there is almost always a better way to write dynamic SQL.
  2. In jOOQ, every query is a dynamic query. This is the benefit of composing SQL queries using an expression tree like the one that is used in jOOQ’s internals. You may not see the expression tree because the jOOQ DSL API mimicks static SQL statement syntax. But behind the scenes, you’re effectively building this expression tree. Every part of the expression tree can be produced dynamically, from local variables, methods, or expressions, such as conditional expressions. I’m looking forward to using the new JEP 361 switch expressions in dynamic SQL. Just like a SQL CASE expression, some SQL statement parts can be constructed dynamically in the client, prior to passing them to the server.

Once these two things are internalised, you can write very fancy dynamic SQL, including using FP approaches to constructing data structures, such as a jOOQ query object.

Using Java 13+ Text Blocks for Plain SQL with jOOQ

Most jOOQ users use the jOOQ DSL API, which provides compile time type safety and an easy way to write dynamic SQL.

But occasionally, this DSL get in the way, because it might be

  • Overkill for some very simple quick and dirty SQL queries
  • Too limited when running very advanced vendor specific SQL, such as Oracle’s MODEL or MATCH_RECOGNIZE clauses

In such cases, you can still benefit from jOOQ’s many secondary features, including for example its nice integration with the Stream API, export functionality, and much more. Think of jOOQ as an improved JDBC!

Starting with Java 13, when enabling preview features, you can now make use of text blocks, or “multi line strings”, which are very useful for embedding static SQL strings in Java code (and XML, JSON, regular expressions, etc). There are two main use cases for using text blocks in jOOQ:

Plain SQL

The main use case is to use plain SQL and also plain SQL templating. For example, to run a quick and dirty jOOQ query like this on H2 or PostgreSQL, for example:

System.out.println(ctx.fetch("""
        SELECT table_schema, count(*)
        FROM information_schema.tables
        GROUP BY table_schema
        ORDER BY table_schema
        """));

(unfortunately, this blog’s syntax highlighter is not quite there yet…)

The output is jOOQ’s nicely formatted text result:

+------------------+--------+
|TABLE_SCHEMA      |COUNT(*)|
+------------------+--------+
|INFORMATION_SCHEMA|      33|
|MCVE              |       2|
|PUBLIC            |       1|
+------------------+--------+

The above is a simple plain SQL query, but we can also make use of plain SQL templating with text blocks! What if we wanted to have a dynamic GROUP BY clause?

Always assuming this static import:

import static org.jooq.impl.DSL.*;

We can now write:

Stream.of(
        field("table_schema"),
        list(field("table_schema"), field("table_type")))
    .forEach(q -> {
        System.out.println(ctx.fetch("""
          SELECT {0}, count(*), row_number() OVER (ORDER BY {0}) AS rn
          FROM information_schema.tables
          GROUP BY {0}
          ORDER BY {0}
          """, q));
    });

We’ve written a dynamic SQL query with jOOQ, but without using much type safety (we could still use some type safe, generated schema object references if we wanted to). jOOQ’s plain SQL templating capabilities are a much underestimated feature set. People have been using MyBatis for many years precisely for these purposes, or maybe built home grown, velocity template based frameworks to produce dynamic SQL using native SQL strings.

You can also use jOOQ for this, and profit from all of the jOOQ API, including, again the formatted output from the above:

+------------------+--------+----+
|TABLE_SCHEMA      |COUNT(*)|  RN|
+------------------+--------+----+
|INFORMATION_SCHEMA|      33|   1|
|MCVE              |       2|   2|
|PUBLIC            |       1|   3|
+------------------+--------+----+

+------------------+------------+--------+----+
|TABLE_SCHEMA      |TABLE_TYPE  |COUNT(*)|  RN|
+------------------+------------+--------+----+
|INFORMATION_SCHEMA|SYSTEM TABLE|      33|   1|
|MCVE              |TABLE       |       1|   2|
|MCVE              |VIEW        |       1|   3|
|PUBLIC            |TABLE       |       1|   4|
+------------------+------------+--------+----+

All of this was available before text blocks, but with text blocks, it makes even more sense to use these features. There is not just one way to use jOOQ, but this approach is really very underestimated!

The parser

Another very important jOOQ feature that works with string based SQL is the jOOQ parser. There are a variety of use cases for using jOOQ’s parser, as we’re adding more and more features to it. One use-case is simple formatting of SQL. Why would you want to do this? One example is to standardise your SQL strings for better execution plan cache usage. As you may know, some database vendors (e.g. Oracle) treat these two SQL strings as different:

SELECT 1 FROM dual
SELECT  1 FROM dual

Observe the irrelevant whitespace difference. Oracle will parse a new SQL string, produce a new SQL_ID, and yet come up with the same execution plan. You don’t want this to happen too often, because it puts a lot of contention on the plan cache and its locks and latches, and creates extra work.

This can happen, nonetheless, with string based dynamic SQL. Using jOOQ’s parser, you can normalise all sorts of SQL strings (including removing comments)

System.out.println(
    ctx.parser()
       .parseResultQuery("""
            SELECT table_schema, count(*)
            FROM information_schema.tables
            GROUP BY table_schema
            -- Order by column index!
            ORDER BY 1
            """)
       .fetch()
);

The SQL string that gets sent to the JDBC driver is this:

select table_schema, count(*) from information_schema.tables group by table_schema order by 1

Alternatively, you could specify some formatting in jOOQ:

DSLContext ctx = DSL.using(connection, 
    new Settings().withRenderFormatted(true));

And now, the SQL string that gets sent to the JDBC driver is this:

select 
  table_schema, 
  count(*)
from information_schema.tables
group by table_schema
order by 1

You can tweak the various supported formatting rules, and use this for logging your hand written SQL strings in a human readable way, for example, rather than executing the SQL through jOOQ. The options are endless. You could also translate parser output to some other SQL dialect. Assuming you run your queries against SQL Server, but want to translate them to H2. No problem! The following SQL Server specific query can be run easily on H2 as well:

System.out.println(
    ctx.parser()
       .parseResultQuery("""
            SELECT TOP 1 table_schema, count(*)
            FROM information_schema.tables
            GROUP BY table_schema
            ORDER BY count(*) DESC
            """)
       .fetch()
);

The translated SQL that gets executed on H2 is this:

select 
  table_schema, 
  count(*)
from information_schema.tables
group by table_schema
order by count(*) desc
limit 1

Or, because I like showing off:

System.out.println(
    ctx.parser()
       .parseResultQuery("""
            SELECT TOP 1 WITH TIES table_schema, count(*)
            FROM information_schema.tables
            GROUP BY table_schema
            ORDER BY count(*) DESC
            """)
       .fetch()
);

Producing this in H2:

select 
  TABLE_SCHEMA, 
  count(*)
from INFORMATION_SCHEMA.TABLES
group by TABLE_SCHEMA
order by 2 desc
fetch next 1 rows with ties

Or this, in PostgreSQL:

select 
  "v0" as table_schema, 
  "v1" as "count"
from (
  select 
    table_schema as "v0", 
    count(*) as "v1", 
    rank() over (order by 2 desc) as "rn"
  from information_schema.tables
  group by table_schema
) "x"
where "rn" > 0
and "rn" <= (0 + 1)
order by "rn"

I know, right?

Parser-derived features

The parser is used in more and more of jOOQ’s many features. Including for example in jOOQ 3.13’s schema diff and management utility.

Run this with jOOQ:

System.out.println(
    ctx.meta("""
    create table t (
      i int
    )
    """).apply("""
    alter table t
      add j int;
    alter table t
      add constraint t_pk primary key (i)
    """)
);

I.e. just copy paste your plain SQL schema definitions and increments (as used in Flyway, for example, or hand written while developing them against your database), and start using the entire tool chain on it. For example, the above prints:

create table T(
  I int null,
  J int null,
  constraint T_PK
    primary key (I)
);

Instead of applying increments to a schema, you can also calculate the diff between two schema versions:

System.out.println(
    ctx.meta("""
    create table t (
      i int
    )
    """).migrateTo(ctx.meta("""
    create table t (
      i int,
      j int,
      constraint t_pk primary key (i)
    )
    """))
);

And now, the output is the increment we’ve applied earlier

alter table T
  add J int null;
alter table T
  add constraint T_PK
    primary key (I);

Conclusion

jOOQ’s DSL is very powerful because:

  • It offers type safety
  • It is compile time checked
  • It helps with auto completion

But sometimes, it gets in the way and plain SQL strings work better, especially when using text blocks. In that case, jOOQ is still also very powerful. The entirety of the jOOQ tool chain is still available to you behind the scenes. You’re just not using the DSL as an API anymore.

Never Concatenate Strings With jOOQ

jOOQ supports a vast amount of SQL syntax out of the box. As such, most users will not think of resorting to string concatenation like in the old days when writing dynamic SQL with JDBC.

But every now and then, a vendor specific feature is not supported by jOOQ (yes, it happens). In that case, jOOQ supports a variety of “plain SQL” API, which can be used to construct almost all types of jOOQ API elements, such as:

// Static import is implied, as always
import static org.jooq.impl.DSL.*;

// Column expressions
Field<String> f = field("cool_function(1, 2, 3)", String.class);

// Predicates
Condition c = condition("col1 <fancy operator> col2");

// Tables
Table<?> t = table("wicked_table_valued_function(x, y)");

But then, sometimes, you need to pass an argument to such a function dynamically, such as another column expression. And you want to do that in a type safe way, because the jOOQ code generator already produced type safe column expressions. So you might be inclined to concatenate, nonetheless:

field("cool_function(1, " + MY_TABLE.MY_COLUMN + ", 3)");

Never do this!

For these reasons:

  1. Despite jOOQ being very SQL injection safe in general, this is where you can in fact introduce a plain SQL injection vulnerability nonetheless. Not in this case, as the column is generated code, but maybe, you will concatenate user input. Note that in order to increase SQL injection protection, plain SQL usage can be prevented globally, and allowed only locally when needed by adding our PlainSQL checker, using the checker framework or Google ErrorProne.
  2. As always with string concatenation, you’re prone to SQL syntax errors. In this case, the generated SQL is not specific to any dialect, as MY_TABLE.MY_COLUMN.toString() is being called, without any contextual information, such as the SQLDialect and all the other configuration flags.

Instead, use jOOQ’s plain SQL templating mini language, which allows for template placeholders like {0}, {1}, {2}:

field("cool_function(1, {0}, 3)", MY_TABLE.MY_COLUMN);

And if you’re doing this more often, you can factor out this call in your own mini DSL:

public static Field<String> coolFunction(Field<?> field) {
    field("cool_function(1, {0}, 3)", field);
}

And now, call it like this:

coolFunction(MY_TABLE.MY_COLUMN)

As a rule of thumb:

With jOOQ, you should never need to resort to SQL string concatenation

You can always use either:

  • The type safe jOOQ DSL API
  • The plain SQL templating API (and ideally hide such usage behind your own type safe DSL API)

5 Ways to Better Understand SQL by Adding Optional Parentheses

It appears that our recent beginner SQL articles explaining SQL syntax were quite popular. These include:

Today, I want to tackle the problem of understanding and learning SQL syntax from a new angle.

SQL’s COBOL-esque syntactic heritage

Unlike most C-style languages, SQL syntax is a messy mixture of keywords, special characters, and both with a lot of overloaded meanings. I’d call myself rather experienced with SQL, yet I’m still often unsure about whether something is a keyword or an identifier (and yes, it does matter):

From my past SQL training sessions, however, I have learned that one of the best tools to understand what’s really going on in SQL syntax is to add parentheses at random places. Parentheses can be noisy, but they can definitely help understand the syntactic structure of a statement.

A trivial example from arithmetics, or logic:

a + b * c
a OR b AND c

In almost all languages, operator precedence dictates that the above be equivalent to this:

a + (b * c)
a OR (b AND c)

And already, we’ve introduced some syntactic clarity to readers who might not be aware of operator precedence. While the parentheses are not necessary in these cases, they can help with readability if the expressions are much more complex than the above.

Did you know you can place parentheses at many other locations in SQL? Here’s a list of interesting places, where most people usually omit parentheses. I’m going to use the Sakila database for examples.

1. Row value expressions

In every day language, we don’t call them “row value expressions”. We call them ordinary “column expressions”. For example, we might look for a specific actor by name

SELECT * 
FROM actor
WHERE first_name = 'SUSAN'
AND last_name = 'DAVIS';

We will get:

|actor_id   |first_name|last_name |
|-----------|----------|----------|
|101        |SUSAN     |DAVIS     |
|110        |SUSAN     |DAVIS     |

But what we really did was this:

-- Works in many dialects
SELECT * 
FROM actor
WHERE (first_name) = ('SUSAN')
AND (last_name) = ('DAVIS');

Or also:

-- Works in a few dialects, including e.g. PostgreSQL
SELECT * 
FROM actor
WHERE ROW (first_name) = ROW ('SUSAN')
AND ROW (last_name) = ROW ('DAVIS');

The parentheses are somewhat ambiguous in this case, the ROW constructor makes it more obvious. The SQL standard specifies a “row value expression special case”, i.e. the case where a row value expression (i.e. a tuple expression) is of degree 1. In case of which most people will “colloquially” omit the parentheses, thinking we’re comparing column expressions.

In this example, we could rewrite the query to this:

SELECT * 
FROM actor
WHERE (first_name, last_name) = ('SUSAN', 'DAVIS');

Or, in funky Oracle, double parentheses are required for no good reason on the right hand side.

SELECT * 
FROM actor
WHERE (first_name, last_name) = (('SUSAN', 'DAVIS'));

But I was not going to confuse you, I was going to clarify things ;-)

In fact, once you get a hang of row value expressions, you will quickly want to use them for less trivial cases than avoiding two AND connected predicates. For example, you can use this approach also with the IN predicate (and if you’re daring enough, with the NOT IN predicate, but beware of NULLS):

SELECT * 
FROM actor
WHERE (first_name, last_name) IN (
  ('SUSAN', 'DAVIS'),
  ('NICK' , 'WAHLBERG')
)

Wow! Result:

|actor_id   |first_name|last_name |
|-----------|----------|----------|
|2          |NICK      |WAHLBERG  |
|101        |SUSAN     |DAVIS     |
|110        |SUSAN     |DAVIS     |

And not only that. You can also use the same approach with IN and subqueries:

-- Actors that have the same name as a customer
SELECT *
FROM actor
WHERE (first_name, last_name) IN (
  SELECT first_name, last_name
  FROM customer
)

And indeed:

|actor_id   |first_name|last_name |
|-----------|----------|----------|
|4          |JENNIFER  |DAVIS     |

Yes! The “colloquial” IN predicate that you all wrote without the parentheses around the row value expression, in reality, had a row value expression of degree 1 on the left side, AND on the right side. Increase that degree to 2, and now you need to write a subquery with 2 columns as well.

It’s quite idiomatic, and readable, and nifty. Quite a few dialects support this.

2. JOINs

Joins are very misunderstood, in SQL. Very very misunderstood. One reason is because people confuse them with other set operations, such as UNION, and thus illustrate them with Venn Diagrams. I’ve already covered that. That article explains that JOINs are just (filtered) cartesian products.

The simplest JOIN type is CROSS JOIN. It does not have an ON or USING clause. Much more common, however, is INNER JOIN, or just JOIN (INNER being an optional keyword).

SELECT a.first_name, a.last_name, f.title
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)

We’ve all typed millions of these to obtain:

|first_name|last_name |title                         |
|----------|----------|------------------------------|
|...       |...       |...                           |
|PENELOPE  |GUINESS   |SPLASH GUMP                   |
|PENELOPE  |GUINESS   |VERTIGO NORTHWEST             |
|PENELOPE  |GUINESS   |WESTWARD SEABISCUIT           |
|PENELOPE  |GUINESS   |WIZARD COLDBLOODED            |
|NICK      |WAHLBERG  |ADAPTATION HOLES              |
|NICK      |WAHLBERG  |APACHE DIVINE                 |
|NICK      |WAHLBERG  |BABY HALL                     |
|...       |...       |...                           |

But what are these JOIN things? They are not SELECT clauses! They are operators like + or * or AND or OR.

And like any operator, the messy laws of associativity apply (as all the different join types have the same operator precedence, at least in the SQL standard. Some wonky implementations may disagree).

Luckily, left associativity applies with JOIN (right associativity would be madness). So, our magic parentheses come into play again. Our query from before is really this:

SELECT a.first_name, a.last_name, f.title
FROM (
  actor AS a
    JOIN film_actor AS fa 
      USING (actor_id)
)
JOIN film AS f 
  USING (film_id)

Can you see it? The readability problem here is the USING (or ON) clause. Without it, I could have written:

(actor JOIN film_actor) JOIN film

So, join FILM_ACTOR to ACTOR and then join the FILM table to the product.

Because INNER JOIN is associative (both left and right), we could write the following equivalent expression:

actor JOIN (film_actor JOIN film)

So, join FILM to FILM_ACTOR first, and then join the product to ACTOR. Quite a different beast, logically (the optimiser shouldn’t make a difference in this case).

… or in full SQL

SELECT a.first_name, a.last_name, f.title
FROM actor AS a 
  JOIN (
    film_actor
      JOIN film AS f 
        USING (film_id)
  )
USING (actor_id)

Egh. Don’t be clever. Use this only if you really have to. I find it quite unreadable. Careful formatting is essential here. But it works.

With more than 3 tables, you could build entire JOIN trees. While, in order to stay sane, most people will not actually build JOIN trees (yes, a list is also a tree), it is totally possible to do this confusing thing:

SELECT a.first_name, a.last_name, sum(p.amount)
FROM (
  actor AS a
    JOIN film_actor AS fa 
      USING (actor_id)
)
JOIN (
  film AS f
    JOIN (
      inventory AS i 
        JOIN rental AS r
          USING (inventory_id)
    ) USING (film_id)
) USING (film_id)
JOIN payment AS p 
  USING (rental_id)
GROUP BY a.actor_id

Instead of this nice clean SQL:

SELECT a.first_name, a.last_name, sum(p.amount)
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
JOIN inventory AS i USING (film_id)
JOIN rental AS r USING (inventory_id)
JOIN payment AS p USING (rental_id)
GROUP BY a.actor_id

… to get the revenue of all films we made for any given actor. Something like this:

|first_name|last_name |sum         |
|----------|----------|------------|
|ADAM      |GRANT     |974.19      |
|ADAM      |HOPPER    |1532.21     |
|AL        |GARLAND   |1525.87     |
|ALAN      |DREYFUSS  |1850.29     |
|ALBERT    |JOHANSSON |2202.78     |
|ALBERT    |NOLTE     |2183.75     |

But nevertheless. It is good to know how JOIN works, syntactically, because there will be that edge case where you want to give precedence to an INNER JOIN and then LEFT JOIN the entire product, or something like that. In that case, parentheses are your friend.

3. DISTINCT

There are no parentheses around the “DISTINCT arguments”. Despite a lot of people thinking there are. This is not what some people think it is:

SELECT DISTINCT (actor_id), first_name, last_name
FROM actor

The parentheses can be omitted, and it is the exact same thing.

The other nice thing about SQL syntax is that you never know whether parentheses are part of the language, or part of an expression. In fact, the above (id) is a row value expression again. If you (accidentally) wrapped more than one column in those parentheses:

SELECT DISTINCT (actor_id, first_name), last_name
FROM actor

Then you’d still get the same behaviour (in PostgreSQL, which supports nesting records like that), but the result is not what you wanted:

|row              |last_name|
|-----------------|---------|
|(1,PENELOPE)     |GUINESS  |
|(2,NICK)         |WAHLBERG |
|(3,ED)           |CHASE    |
|(4,JENNIFER)     |DAVIS    |

Funky eh?

4. UNION, INTERSECT, EXCEPT

The nice thing about SQL is that you can never derive a rule from another rule. While all types of JOIN have the same precedence, the set operators do not. Duh. Luckily, most people don’t use INTERSECT or EXCEPT at all, only UNION [ ALL ], so they don’t run into this problem.

In case there is any doubt, INTERSECT has a higher precedence, whereas UNION and EXCEPT have the same precedence (in the SQL standard). I’ll prove it by quoting it. From ISO/IEC 9075-2:2016(E) 7.17 <query expression> (I’ve removed some irrelevant bits)

<query expression body> ::=
  <query term>
| <query expression body> UNION [ ALL ] <query term>
| <query expression body> EXCEPT [ ALL ] <query term>

<query term> ::=
  <query primary>
| <query term> INTERSECT [ ALL ] <query primary>

See. INTERSECT has a higher precedence. Let’s try it (in PostgreSQL):

SELECT 2 AS a, 3 AS b
UNION
SELECT 1 AS a, 2 AS b
INTERSECT
SELECT 1 AS a, 2 AS b

We get:

|a          |b          |
|-----------|-----------|
|1          |2          |
|2          |3          |

So, what we’ve done effectively (and the parenthesis are totally allowed, because <query term> allows it), is this

SELECT 2 AS a, 3 AS b
UNION
(
  SELECT 1 AS a, 2 AS b
  INTERSECT
  SELECT 1 AS a, 2 AS b
)

Now, for the life of me, I really don’t know how to nicely format parenthesised set operations in SQL. It looks awkward. But here we are. The above is a totally different thing than this:

(
  SELECT 2 AS a, 3 AS b
  UNION
  SELECT 1 AS a, 2 AS b
)
INTERSECT
SELECT 1 AS a, 2 AS b

If we do the union first, and then the intersection, the only resulting record is:

|a          |b          |
|-----------|-----------|
|1          |2          |

A similar effect can be seen when combining UNION and UNION ALL, which have the same precedence and are … left associative. Lucky again. So, this query here:

SELECT 2 AS a, 3 AS b
UNION
SELECT 1 AS a, 2 AS b
UNION ALL
SELECT 1 AS a, 2 AS b

Has a UNION that does not get to remove any duplicates, and then a UNION ALL that adds a duplicate. It is equivalent to this (remember, left associative):

(
  SELECT 2 AS a, 3 AS b
  UNION
  SELECT 1 AS a, 2 AS b
)
UNION ALL
SELECT 1 AS a, 2 AS b

The result is:

|a          |b          |
|-----------|-----------|
|1          |2          |
|2          |3          |
|1          |2          |

It is quite a different beast from this one:

SELECT 2 AS a, 3 AS b
UNION
(
  SELECT 1 AS a, 2 AS b
  UNION ALL
  SELECT 1 AS a, 2 AS b
)

Now, we’ve used parentheses to force the query to do the UNION ALL operator first (producing duplicates), but then, the UNION removes them agian. The result is now:

|a          |b          |
|-----------|-----------|
|1          |2          |
|2          |3          |

Bonus: What about the ORDER BY clause

Ah, yes. The miracles of SQL syntax. In theory (and in the SQL standard), there is an optional ORDER BY clause over all set operations. E.g. like this:

SELECT 2 AS a, 3 AS b
UNION
SELECT 1 AS a, 2 AS b
INTERSECT
SELECT 1 AS a, 2 AS b
ORDER BY a DESC

We’re getting:

|a          |b          |
|-----------|-----------|
|2          |3          |
|1          |2          |

Think of it this way:

(
  SELECT 2 AS a, 3 AS b
  UNION
  SELECT 1 AS a, 2 AS b
  INTERSECT
  SELECT 1 AS a, 2 AS b
)
ORDER BY a DESC

Some dialects may or may not allow you to actually place those parentheses. E.g. PostgreSQL does.

5. Subqueries

When you write subqueries, or specifically:

  • Scalar subqueries (subqueries in SELECT or WHERE, etc.)
  • Derived tables (subqueries in FROM)

… then you must put them in parentheses. For example, a correlated (scalar) subquery:

SELECT 
  first_name, 
  last_name, (
    -- Correlated subquery here
    SELECT count(*) 
    FROM film_actor AS fa 
    WHERE fa.actor_id = a.actor_id
  ) c
FROM actor AS a

These parentheses are not optional. Neither are those around the derived table:

SELECT 
  first_name, 
  last_name, 
  c
FROM actor AS a
JOIN (
  -- Derived table here
  SELECT actor_id, count(*) AS c
  FROM film_actor
  GROUP BY actor_id
) fa USING (actor_id)

Conclusion

SQL (and its syntax) is a device whose mystery is only exceeded by its power. With parentheses (as with other syntactic tokens), you can never really know if they are part of the language, of some clause, or part of an expression.

With some operators, parentheses help better understand the syntactic structure of the language.

  • Row value expressions of degree 1 have optional parentheses. We never write them, but we could have!
  • JOINs are really trees, not lists. We can nest them arbitrarily to get associativity under control.
  • DISTINCT doesn’t have parentheses. Don’t be fooled by your coworkers’ coding style!
  • Set operations have two levels of precedence: 1) INTERSECT and 2) UNION and EXCEPT (assuming standards compliance).
  • Subqueries (outside of set operations) have mandatory parentheses

I’m curious about your own greatest SQL syntax confusions. Please leave them in the comments.

SQL DISTINCT is not a function

A very common misconception I often encounter with SQL users is the idea that DISTINCT is something like a function, and that it can take parenthesised arguments. Just recently, I’ve seen this Stack Overflow question where the OP was looking for a way to express this in jOOQ:

SELECT DISTINCT (emp.id), emp.fname, emp.name
FROM employee emp;

Notice the parentheses around (emp.id), which look as though this is some special kind of DISTINCT usage, which is akin to a DISTINCT function. The idea is often that:

  • The behaviour is somewhat different from omitting the parentheses
  • The performance is faster, because only the ID needs to be considered for distinctness

This is incorrect

These claims are incorrect, of course. There is no semantic or performance difference between the two. The parentheses are merely parentheses around a column expression, in a similar way as you would use parentheses to influence operator precedence. Think of it this way:

SELECT DISTINCT (emp.id + 1) * 2, emp.fname, emp.name
FROM employee emp;

In the above example, we do not apply a “DISTINCT function” to the expression emp.id + 1. We merely placed parentheses around a column expression emp.id + 1 to make sure the addition happens before the multiplication. The DISTINCT operator happens after the projection, always. If SQL had used a more logical syntax, rather than following English grammar (it was originally called Structured English QUEry Language, or SEQUEL), then we would write the OP’s statement like this:

FROM employee
SELECT id, fname, name
DISTINCT

Again, the DISTINCT operation always happens after the projection (SELECT clause content), and is applied to the entirety of the projection. There is no way in standard SQL to apply distinctness only to parts of the projection (there is in PostgreSQL, see further down).

To clarify this a bit more, I recommend reading our previous blog post about the logical order of operations in SQL, and how DISTINCT and ORDER BY are related.

What would it mean anyway?

We can revert the question back to the OP and ask ourselves, what would it mean for a DISTINCT operation to apply to only one column, anyway? Let’s assume this data set:

|id |fname|name|
|---|-----|----|
|1  |A    |A   |
|1  |B    |B   |

If we applied DISTINCT only to the ID column (and didn’t project anything else), clearly, we’d only get one row as a result:

SELECT DISTINCT id FROM employee
|id |
|---|
|1  |

But if we wanted to also project FNAME and NAME, which row would “win”? Would we display the first or the second row, or any random row? The behaviour would be undefined, and SQL doesn’t like undefined behaviour, so this is not possible. The only reasonable application of DISTINCT is always on the entire projection.

Exception: PostgreSQL

Fortunately (or to add more to syntactic confusion: unfortunately), PostgreSQL has implemented an extension to the SQL standard. With DISTINCT ON, it is effectively possible to apply distinctness only to parts of the projection:

WITH emp (id, fname, name) AS (
  VALUES (1, 'A', 'A'),
         (1, 'B', 'B')
)
SELECT DISTINCT ON (id) id, fname, name
FROM emp
ORDER BY id, fname, name

The output is now what the OP desired (but couldn’t use, because they were using MySQL):

|id         |fname|name |
|-----------|-----|-----|
|1          |A    |A    |

I personally don’t like DISTINCT ON. While it is very useful, no doubt, it makes something that is already very difficult to explain to SQL beginners even more complicated. With a “more reasonable” syntax, the query would be written like this:

FROM emp
SELECT id, fname, name
ORDER BY id, fname, name
DISTINCT ON (id) 

With this syntactic order of operation, there would be no doubt about the semantics of DISTINCT or DISTINCT ON.

Use the jOOQ-Refaster Module for Automatic Migration off of Deprecated jOOQ API

Starting with jOOQ 3.13, we’re offering a new module called jOOQ Refaster, which provides refaster templates for automatic API migration.

What’s Refaster?

Refaster is a lesser known sub project of the popular Google Error Prone library, a library for static code analysis, implemented as a Java compiler plugin. We already support a few such static code analysis tools through the jOOQ Checker module, which can be used with both the Checker Framework and with Google’s Error Prone.

Refaster allows for finding “bad” API usage patterns in your code, and produce .patch files automatically, which fix such patterns, producing better code.

What are we offering in jOOQ?

Over the years, we have deprecated quite a bit of API, as new API came along. For example, there is the ResultQuery.fetchLazy(int) method, where the int parameter corresponds to the JDBC fetchSize that is being passed along to JDBC.

Long ago, we introduced better API for passing the fetchSize. You can:

  • Specify a Settings value with your configuration for a global fetchSize to apply
  • Specify ResultQuery.fetchSize(int), which can then be combined with fetchLazy(), or with any type of fetch() call, as this isn’t really specific to “lazy” fetching

The Javadoc says:

Deprecated. [#2811] – 3.3.0 – Use fetchSize(int) and fetchLazy() instead.

…and your IDE will probably warn you about deprecated API usage, but who has time to fix all these warnings and read all these Javadoc?

Instead, you can now download the refaster file for your jOOQ version from our website https://www.jooq.org/refaster, configure your Java compiler to include ErrorProne’s refaster library (as documented in the jOOQ manual), and get a patch right away.

For example, if your code reads:

public void runQuery() {
    try (Cursor<?> cursor = ctx.select(T.A)
       .from(T)
       .fetchLazy(42)) {
        // Do something
    }
}

You might be getting a patch file like this one:

--- ..\..\src\main\java\org\jooq\refaster\test\RefasterTests.java
+++ ..\..\src\main\java\org\jooq\refaster\test\RefasterTests.java
@@ -62,6 +62,5 @@
 public void runQuery() {
     try (Cursor<?> cursor = ctx.select(T.A)
-       .from(T)
-       .fetchLazy(42)) {
+       .from(T).fetchSize(42).fetchLazy()) {
         // Do something
     }

And you’ll get the improved version of your code automatically:

public void runQuery() {
    try (Cursor<?> cursor = ctx.select(T.A)
       .from(T).fetchSize(42).fetchLazy()) {
        // Do something
    }
}

In our opinion, all libraries should offer such refaster files (or another technology doing something similar) to help users upgrade more easily between versions.

What’s next?

In addition to upgrading deprecated API usage (which is a very obvious use-case for Refaster), we’re also experimenting with some other static code analysis and improvement approaches that deal with our understanding of the SQL language, and best practices. For example, as we’ve blogged before, it is not recommended to use COUNT(*) queries in SQL (or in any language), when a simple EXISTS() check would suffice.

jOOQ 3.13 already offers an introspection that replaces calls like these:

ctx.fetchCount(table) != 0
ctx.fetchCount(table) >= 1 // And many more

By equivalent, but likely much more performant expressions:

ctx.fetchExists(table)

So, stay tuned for more such goodies in the future!

jOOQ 3.13 Released with More API and Tooling for DDL Management

jOOQ 3.13 has been released with CockroachDB support, much more API and tooling for DDL management, and SQL:2011 temporal table support

Starting with this release, we will further embrace our support for parsing,
translating, executing, and now also interpreting DDL statements. The driving
force is better code generation support, but in the future, also better
integration with database change management systems like Flyway or Liquibase.

We’ve added support for a very promising new dialect: CockroachDB.

We’re starting to support one of SQL:2011’s most interesting features: temporal
tables for system and application time versioning, and we’ve generally improved
the way we build and distribute jOOQ to our paying customers.

SQL Dialects

We’ve worked together with a team of great engineers at CockroachDB to speed up
our newly added, formal CockroachDB support in the jOOQ Professional and
Enterprise Editions. While many users have successfully used the POSTGRES
dialect with CockroachDB for simple integrations, this new dialect is now fully
integration tested and supported for future jOOQ and CockroachDB versions. The
integration fixed a variety of issues and minor differences that we have
encountered in the code generator and the runtime library.

We’ve added support for a variety of features now available in Firebird 3.0+,
MariaDB 10.3+, MySQL 8+, and PostgreSQL 10+.

DDL interpretation

Our biggest investment has been made in the area of DDL interpretation. Since
jOOQ 3.9, we can parse more and more SQL syntax. A logical next step is to allow
for interpreting parsed (or ordinary, DSL constructed) DDL statements in order
to maintain an incremental, in-memory meta model of a database. This has a
variety of benefits, which we will take advantage of in this release and in the
future.

The main out of the box benefits in this version include:

  • Being able to create a diff between two versions of a schema, programmatically through Meta.migrateTo(Meta), on the command line through DiffCLI (https://www.jooq.org/doc/latest/manual/sql-building/schema-diff-cli), or online, through our website: https://www.jooq.org/diff.
  • Such a diff or the meta model itself can be used for ad-hoc introspections, or formally, for automated migrations by tools built on top of jOOQ. Future versions will offer such migrations out of the box.
  • Unfortunately, we have not yet replaced our H2 backed DDLDatabase by our new DDL interpretation capabilities. This is a high priority for jOOQ 3.14, though

In the context of these investments, we have improved a variety of
infrastructure that were needed for these features and that users can benefit as
well. These improvements include:

  • A new LiquibaseDatabase, which uses Liquibase behind the scenes to simulate a schema migration in memory (similar to the DDLDatabase for Flyway), to reverse engineer that for the code generator. If you’re using Liquibase, chances are that you no longer need to connect to an actual database to use jOOQ’s code generator. More info here: https://blog.jooq.org/2019/10/16/how-to-simulate-a-liquibase-migration-using-h2/
  • We now support the inclusion of SQL text in generated code and other parts of the runtime meta model, including for CHECK constraints and views
  • A lot of improvements to the JDBC DatabaseMetaData backed org.jooq.Meta API have been implemented.
  • A lot of additional DDL statement and syntax support has been added throughout the API, parser, and other tooling.
  • The runtime meta model has been enhanced with new information, such as sequence flags, check constraints, table type information, and more.

Temporal table support

Perhaps the SQL:2011 standard’s most exciting “new” feature are temporal tables,
an enhancement on top of Oracle’s historic flashback query feature, and much
more.

In many systems, single, or bi temporal versioning of data is essential. The two
dimensions of versioning that are now supported in jOOQ are:

  • System versioning, useful for backups and auditing, where every mutating operation results in a backup copy of the data in a managed backup table.
  • Application versioning, useful for data that has a timely limited validity, such as a price for a product, which could have a temporary discount.

jOOQ 3.13 supports org.jooq.Period and related API in Select, Update, and Delete
to manipulate and query temporal tables in dialects that ship with native
support for these features. To different degrees, these dialects include:

  • DB2
  • MariaDB
  • Oracle
  • SQL Server

Emulations of incompletely supported syntax in the above dialects, as well as in
all other dialects will be added in future jOOQ versions, making temporal
validity a problem of the past.

More details here:
https://www.jooq.org/doc/3.13/manual/sql-building/table-expressions/temporal-tables/

This is a commercially licensed only feature.

Build improvements

We’ve invested a lot of time in improving our build automation using newer
Jenkins features that allow us to publish builds more frequently, especially
snapshot builds of upcoming minor releases for early adopters, as well as patch
releases for customers needing urgent fixes:

https://www.jooq.org/download/versions

These new releases also include trial versions for Java 6, 8, 11, without the
need to register for extended trials.

In addition to the above, we’ve improved and continue to improve our
documentation of features per edition, in the future.

Procedural API improvement

In jOOQ 3.12, we’ve introduced support for procedural languages, which we are
improving in every new release. jOOQ 3.13 adds support for additional
procedural statements, including:

  • EXECUTE for dynamic SQL execution embedded in procedural blocks
  • RETURN to return from functions
  • Plain SQL statement support

In the future, we’re planning to support also the generation of functions,
procedures, and triggers through standard jOOQ API.

More details here:
https://www.jooq.org/doc/3.13/manual/sql-building/procedural-statements/

This is a commercially licensed only feature.

jOOQ-Refaster

Some of our users have had a lot of success using our checker-framework or
ErrorProne integrations, which allow for some static API usage analysis in
client code (e.g. plain SQL / SQLi risk, dialect compatiblity, etc.)

Starting from jOOQ 3.13, we’re also adding support for ErrorProne’s sub project
“refaster”, which allows for automatically rewriting bad API usage to something
better, such as for example:

  • Migrate deprecated API usage like field.trim(), to its replacement trim(field)
  • Migrate inefficient SQL usage like ctx.fetchCount(select) != 0 to ctx.fetchExists(select)

The complete list of changes can be found on our website:
https://www.jooq.org/notes