Automatically Transform Oracle Style Implicit Joins to ANSI JOIN using jOOQ

While jOOQ is mostly being used as an internal SQL DSL for embedded, dynamic SQL in Java, where it offers the best solution on the market, jOOQ is increasingly also used for one of its secondary features: Its parser.

Having been introduced in jOOQ 3.9 primarly for the purpose of being able to parse DDL statements, e.g. to reverse engineer your schema for code generation purposes, we’ve added an increasing amount of features and SQL transformation capabilities that allow for the parser to be used as a standalone product, through the command line interface, our website, or through the ordinary jOOQ API.

One feature that has been added to jOOQ 3.14, which is mostly useful to those using jOOQ as a parser, is the capability of transforming old Oracle style implicit joins to ANSI JOIN.

Why avoid “implicit joins”?

The old Oracle style implicit join syntax is supported and properly optimised by most RDBMS vendors. In the past, prior to SQL-92, this is how we used to inner join tables, e.g. when querying the Sakila database:

SELECT *
FROM actor a, film_actor fa, film f
WHERE a.actor_id = fa.actor_id
AND fa.film_id = f.film_id

Granted, the syntax is kind of intuitive. Just declare all the tables you want to fetch data from, and then make sure the proper data is retained only by filtering for matching primary key / foreign key values.

Of course, this can go terribly wrong. For the many obvious reasons, e.g. when you forget a join predicate after adding a table. If the query is complex, this may be hard to debug. The solution is ANSI JOIN. Starting from SQL-92 (almost 30 years now!), this is how we join in most RDBMS:

SELECT *
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id

While it’s still possible to define wrong join predicates, at least it’s no longer possible to forget a predicate, because this is syntactically incorrect (except for MySQL, where, regrettably, the ON clause is optional):

SELECT *
FROM actor a
JOIN film_actor fa -- Syntax error
JOIN film f -- Syntax error

jOOQ’s implicit JOIN

Notice that it is common to refer to the above syntax as “implicit join”, while JPQL and jOOQ recycled the term for another kind of “implicit join”, which is foreign key path based, and even less error prone than the ANSI SQL syntax. With jOOQ, the above query can be written as follows:

ctx.select(
      FILM_ACTOR.actor().asterisk(), 
      FILM_ACTOR.asterisk(),
      FILM_ACTOR.film().asterisk())
   .from(FILM_ACTOR)
   .fetch();

The mere presence of these to-one relationship paths in the query will implicitly add the appropriate LEFT JOIN or INNER JOIN to the FROM clause. This is merely convenience on top of ordinary ANSI JOINs, not a replacement.

Transforming Oracle implicit joins

When you have an old code base that you wish to upgrade and transform all your queries to using ANSI JOIN, use jOOQ for that. You can use jOOQ’s programmatic capabilities (as mentioned before), or the free website https://www.jooq.org/translate.

On the website, just pick the “Oracle style to ANSI JOIN” option, place the following SQL on the left:

Input

SELECT
  a.first_name, 
  a.last_name, 
  count(c.category_id)
FROM
  actor a, 
  film_actor fa, 
  film f,
  film_category fc, 
  category c
WHERE a.actor_id = fa.actor_id
AND fa.film_id = f.film_id
AND fc.category_id = c.category_id
GROUP BY 
  a.actor_id,
  a.first_name,
  a.last_name

Output

SELECT
  a.first_name,
  a.last_name,
  count(c.category_id)
FROM actor a
  JOIN film_actor fa
    ON a.actor_id = fa.actor_id
  JOIN film f
    ON fa.film_id = f.film_id
  CROSS JOIN (
    film_category fc
      JOIN category c
        ON fc.category_id = c.category_id
  )
GROUP BY
  a.actor_id,
  a.first_name,
  a.last_name

And… whoopsies. The output correctly displays the resulting, undesired CROSS JOIN because one of the join predicates was missing:

Yep, the tool already helped! Let’s fix the input query:

Fixed input

SELECT
  a.first_name, 
  a.last_name, 
  count(c.category_id)
FROM
  actor a, 
  film_actor fa, 
  film f,
  film_category fc, 
  category c
WHERE a.actor_id = fa.actor_id
AND fa.film_id = f.film_id
AND f.film_id = fc.film_id -- This was missing
AND fc.category_id = c.category_id
GROUP BY 
  a.actor_id,
  a.first_name,
  a.last_name

Fixed output

SELECT
  a.first_name,
  a.last_name,
  count(c.category_id)
FROM actor a
  JOIN film_actor fa
    ON a.actor_id = fa.actor_id
  JOIN film f
    ON fa.film_id = f.film_id
  JOIN film_category fc
    ON f.film_id = fc.film_id
  JOIN category c
    ON fc.category_id = c.category_id
GROUP BY
  a.actor_id,
  a.first_name,
  a.last_name

This also works if you were using Oracle’s arcane outer join syntax using (+) (or SQL Server’s *=, which has been unsupported for a while now). You might have this input:

Input

SELECT 
  a.first_name, 
  a.last_name, 
  count(c.category_id)
FROM 
  actor a, 
  film_actor fa, 
  film f,
  film_category fc, 
  category c
WHERE a.actor_id = fa.actor_id(+)
AND fa.film_id = f.film_id(+)
AND f.film_id = fc.film_id(+)
AND fc.category_id(+) = c.category_id
GROUP BY 
  a.actor_id,
  a.first_name,
  a.last_name

Producing this output

SELECT
  a.first_name,
  a.last_name,
  count(c.category_id)
FROM actor a
  LEFT OUTER JOIN film_actor fa
    ON a.actor_id = fa.actor_id
  LEFT OUTER JOIN film f
    ON fa.film_id = f.film_id
  LEFT OUTER JOIN (
    film_category fc
      RIGHT OUTER JOIN category c
        ON fc.category_id = c.category_id
  )
    ON f.film_id = fc.film_id
GROUP BY
  a.actor_id,
  a.first_name,
  a.last_name

Err, wat. Whoopsies again! One of the (+) symbols was on the wrong side, which is why we got that RIGHT OUTER JOIN. Again, the tool has shown that the old syntax was quite error prone. Let’s fix it.

Fixed input

SELECT 
  a.first_name, 
  a.last_name, 
  count(c.category_id)
FROM 
  actor a, 
  film_actor fa, 
  film f,
  film_category fc, 
  category c
WHERE a.actor_id = fa.actor_id(+)
AND fa.film_id = f.film_id(+)
AND f.film_id = fc.film_id(+)
AND fc.category_id = c.category_id(+)
GROUP BY 
  a.actor_id,
  a.first_name,
  a.last_name

Fixed output

SELECT
  a.first_name,
  a.last_name,
  count(c.category_id)
FROM actor a
  LEFT OUTER JOIN film_actor fa
    ON a.actor_id = fa.actor_id
  LEFT OUTER JOIN film f
    ON fa.film_id = f.film_id
  LEFT OUTER JOIN film_category fc
    ON f.film_id = fc.film_id
  LEFT OUTER JOIN category c
    ON fc.category_id = c.category_id
GROUP BY
  a.actor_id,
  a.first_name,
  a.last_name

Conclusion

Play around with it and tell us what you think! https://www.jooq.org/translate

Nesting Collections With jOOQ 3.14’s SQL/XML or SQL/JSON support

One of the main features of ORMs is M as in Mapping. Libraries like jOOQ help auto-mapping flat or nested database records onto Java classes that have the same structure as the SQL result set.

The following has always been possible in jOOQ, assuming PostgreSQL’s INFORMATION_SCHEMA (using the generated code from the jOOQ-meta module):

class Column {
    String tableSchema;
    String tableName;
    String columnName;
}

for (Column c :
    ctx.select(
            COLUMNS.TABLE_SCHEMA, 
            COLUMNS.TABLE_NAME, 
            COLUMNS.COLUMN_NAME)
       .from(COLUMNS)
       .where(COLUMNS.TABLE_NAME.eq("t_author"))
       .orderBy(COLUMNS.ORDINAL_POSITION)
       .fetchInto(Column.class))
    System.out.println(
        c.tableSchema + "." + c.tableName + "." + c.columnName
    );

The above resulting in something like:

public.t_author.id
public.t_author.first_name
public.t_author.last_name
public.t_author.date_of_birth
public.t_author.year_of_birth
public.t_author.address

The mapping is straight forward, as explained in jOOQ’s DefaultRecordMapper.

Nested mappings

A lesser known feature that we’ve offered for a while was to use a dot notation to emulate nesting records into nested Java classes. Assuming you want to use a re-usable data type description in your columns and elsewhere:

class Type {
    String name;
    int precision;
    int scale;
    int length;
}

class Column {
    String tableSchema;
    String tableName;
    String columnName;
    Type type;
}

You can now write this query where you’ll alias some columns using the dot notation to type.name, for example (several nesting levels are possible):

for (Column c :
    ctx.select(
            COLUMNS.TABLE_SCHEMA,
            COLUMNS.TABLE_NAME,
            COLUMNS.COLUMN_NAME,
            COLUMNS.DATA_TYPE.as("type.name"),
            COLUMNS.NUMERIC_PRECISION.as("type.precision"),
            COLUMNS.NUMERIC_SCALE.as("type.scale"),
            COLUMNS.CHARACTER_MAXIMUM_LENGTH.as("type.length")
       )
       .from(COLUMNS)
       .where(COLUMNS.TABLE_NAME.eq("t_author"))
       .orderBy(COLUMNS.ORDINAL_POSITION)
       .fetchInto(Column.class))

    System.out.println(String.format(
        "%1$-30s: %2$s",
        c.tableSchema + "." + c.tableName + "." + c.columnName,
        c.type.name + (c.type.precision != 0
               ? "(" + c.type.precision + ", " + c.type.scale + ")"
               :       c.type.length != 0
               ? "(" + c.type.length + ")"
               : "")
    ));

The above will print:

public.t_author.id            : integer(32, 0)
public.t_author.first_name    : character varying(50)
public.t_author.last_name     : character varying(50)
public.t_author.date_of_birth : date
public.t_author.year_of_birth : integer(32, 0)
public.t_author.address       : USER-DEFINED

Using XML or JSON

Using XML or JSON, starting from jOOQ 3.14, you can also nest collections in your result set mapping very easily. First, let’s look again at how to use a JSON query using jOOQ, e.g. to find all columns per table:

for (Record1<JSON> record :
    ctx.select(
            jsonObject(
                key("tableSchema").value(COLUMNS.TABLE_SCHEMA),
                key("tableName").value(COLUMNS.TABLE_NAME),
                key("columns").value(jsonArrayAgg(
                    jsonObject(
                        key("columnName").value(COLUMNS.COLUMN_NAME),
                        key("type").value(jsonObject(
                            "name", COLUMNS.DATA_TYPE)
                        )
                    )
                ).orderBy(COLUMNS.ORDINAL_POSITION))
            )
       )
       .from(COLUMNS)
       .where(COLUMNS.TABLE_NAME.in("t_author", "t_book"))
       .groupBy(COLUMNS.TABLE_SCHEMA, COLUMNS.TABLE_NAME)
       .orderBy(COLUMNS.TABLE_SCHEMA, COLUMNS.TABLE_NAME)
       .fetch())
    System.out.println(record.value1());

The following JSON documents are returned:

{
  "tableSchema": "public", 
  "tableName": "t_author", 
  "columns": [{
    "columnName": "id", 
    "type": {"name": "integer"}
  }, {
    "columnName": "first_name", 
    "type": {"name": "character varying"}
  }, {...}]
}

{
  "tableSchema": "public", 
  "tableName": "t_book", 
  "columns": [{...}, ...]
}

That’s already awesome, isn’t it? We’ve blogged about this previously here and here. Starting with jOOQ 3.14, you can remove all the other middleware and mapping and what not, and produce your XML or JSON documents directly from your database using standard SQL/XML or SQL/JSON API!

But that’s not all!

Maybe, you don’t actually need the JSON document, you just want to use JSON to allow for nesting data structures, mapping them back to Java.

What about these nested Java classes:

public static class Type {
    public String name;
}

public static class Column {
    public String columnName;
    public Type type;
}

public static class Table {
    public String tableSchema;
    public String tableName;

    public List<Column> columns;
}

Assuming you have gson or Jackson or JAXB on your classpath (or you configure them directly), you can write the exact same query as before, and use jOOQ’s DefaultRecordMapper using the fetchInto(Table.class) call:

for (Table t :
    ctx.select(
            jsonObject(
                key("tableSchema").value(COLUMNS.TABLE_SCHEMA),
                key("tableName").value(COLUMNS.TABLE_NAME),
                key("columns").value(jsonArrayAgg(
                    jsonObject(
                        key("columnName").value(COLUMNS.COLUMN_NAME),
                        key("type").value(jsonObject(
                            "name", COLUMNS.DATA_TYPE)
                        )
                    )
                ).orderBy(COLUMNS.ORDINAL_POSITION))
            )
       )
       .from(COLUMNS)
       .where(COLUMNS.TABLE_NAME.in("t_author", "t_book"))
       .groupBy(COLUMNS.TABLE_SCHEMA, COLUMNS.TABLE_NAME)
       .orderBy(COLUMNS.TABLE_SCHEMA, COLUMNS.TABLE_NAME)
       .fetchInto(Table.class))
    System.out.println(t.tableName + ":\n" + t.columns
       .stream()
       .map(c -> c.columnName + " (" + c.type.name + ")")
       .collect(joining("\n  ")));

The output being:

t_author:
  id (integer)
  first_name (character varying)
  last_name (character varying)
  date_of_birth (date)
  year_of_birth (integer)
  address (USER-DEFINED)
t_book:
  id (integer)
  author_id (integer)
  co_author_id (integer)
  details_id (integer)
  title (character varying)
  published_in (integer)
  language_id (integer)
  content_text (text)
  content_pdf (bytea)
  status (USER-DEFINED)
  rec_version (integer)
  rec_timestamp (timestamp without time zone)

No join magic. No cartesian products. No data deduplication. Just SQL-native nested collections, using an intuitive, declarative approach to creating the document data structure, combined with the usual awesomeness of SQL.

Using this without the jOOQ DSL

Of course, this also works without the jOOQ API, e.g. using our parser. Check out our translator tool. Plug in this native SQL beauty:

SELECT
  json_object(
    KEY 'tableSchema' VALUE columns.table_schema,
    KEY 'tableName' VALUE columns.table_name,
    KEY 'columns' VALUE json_arrayagg(
      json_object(
        KEY 'columnName' VALUE columns.column_name,
        KEY 'type' VALUE json_object(
          KEY 'name' VALUE columns.data_type
        )
      )
    )
  )
FROM columns
WHERE columns.table_name IN ('t_author', 't_book')
GROUP BY columns.table_schema, columns.table_name
ORDER BY columns.table_schema, columns.table_name

And, because the devil of SQL agnosticity and translation is in the detail, take out the vendor-specific version, e.g. for PostgreSQL:

SELECT json_build_object(
  'tableSchema', columns.table_schema,
  'tableName', columns.table_name,
  'columns', json_agg(json_build_object(
    'columnName', columns.column_name,
    'type', json_build_object('name', columns.data_type)
  ))
)
FROM columns
WHERE columns.table_name IN (
  't_author', 't_book'
)
GROUP BY
  columns.table_schema,
  columns.table_name
ORDER BY
  columns.table_schema,
  columns.table_name

You might need to run this, before:

SET search_path = 'information_schema'

Conclusion

We’ve waited way too long with this game changing feature. I truly think this approach will change how we perceive ORMs in the future. The database first approach, where we can use SQL and only SQL to map SQL data onto any hierarchical data structure is very commpelling.

On the jOOQ side, we’re far from done yet. What if we can auto-generate some of the JSON document declaration from other types of meta data for you? What if you can do that, yourself? E.g. to map a GraphQL specification to jOOQ API based JSON queries? On all the SQL dialects that support these features!

The future of mapping nested data structures from SQL to any client, XML, JSON, objects is bright. jOOQ 3.14 is around the corner and will be released within the next 2 weeks. You can already build it from github: https://github.com/jOOQ/jOOQ, or if you have a license, download a nightly build from here: https://www.jooq.org/download/versions

Looking forward to your feedback.

Having “constant” columns in foreign keys

I was asked a very interesting question on Twitter just now:

Can we have “constant” foreign key columns in (PostgreSQL) tables? Luckily, yes, we can. Using a nice standard feature that is “computed columns” or “generated columns”

Sometimes, you cannot completely normalise your schema for whatever reason. There may be a case where you have a table with a composite primary key like this:

CREATE TABLE t1 (
  a int,
  b int,
  t1 int,
  PRIMARY KEY (a, b)
)

And in a referencing table t2, you will always reference one of the primary key columns by value, say, 1. Of course, you could create a table t2 with a CHECK constraint ensuring b = 1:

CREATE TABLE t2 (
  a int,
  b int NOT NULL DEFAULT 1 CHECK (b = 1),
  t2 int,
  FOREIGN KEY (a, b) REFERENCES t1
)

But why not use a generated column instead?

CREATE TABLE t2 (
  a int,
  b int GENERATED ALWAYS AS (1) STORED,
  t2 int,
  FOREIGN KEY (a, b) REFERENCES t1
)

In my opinion, this is even more powerful. As of PostgreSQL 12, only STORED is supported (meaning the value is stored on disk), when in this case VIRTUAL would be even better (meaning the value is produced only when reading the row).

Inserting some test data:

INSERT INTO t1 (a, b, t1) 
VALUES(1, 1, 1), (1, 2, 2), (2, 1, 3);

INSERT INTO t2 (a, t2) 
VALUES (1, 11), (2, 12);

SELECT * 
FROM t1
NATURAL LEFT JOIN t2

Produces the expected result. We can only insert (b = 1) into t2:

a|b|t1|t2|
-|-|--|--|
1|1| 1|11|
2|1| 3|12|
1|2| 2|  |

A nice trick to keep up one’s sleeve.

Computed or generated columns are available in a variety of RDBMS, including at least:

  • Db2
  • MySQL
  • Oracle
  • PostgreSQL
  • SQL Server

Use NATURAL FULL JOIN to compare two tables in SQL

There are a few ways to compare two similar tables in SQL. Assuming PostgreSQL syntax, we might have this schema:

CREATE TABLE t1 (a INT, b INT, c INT);
CREATE TABLE t2 (a INT, b INT, c INT);
INSERT INTO t1 VALUES (1, 2, 3), (4, 5, 6), (7, 8, 9);
INSERT INTO t2 VALUES            (4, 5, 6), (7, 8, 9), (10, 11, 12);

It is now possible to use UNION and EXCEPT as suggested by Chris Saxon:

In PostgreSQL, we can write:

(TABLE t1 EXCEPT TABLE t2) 
UNION
(TABLE t2 EXCEPT TABLE t1)
ORDER BY a, b, c

Notice how TABLE x is just standard SQL, and PostgreSQL, syntax sugar for SELECT * FROM x.

And we’ll get:

a |b |c |
--|--|--|
 1| 2| 3|
10|11|12|

Unfortunately, this requires two accesses to each table. Can we do it with a single access?

Using NATURAL FULL JOIN

Yes! Using NATURAL FULL JOIN, another rare use-case for this esoteric operator.

Assuming there are no NULL values, we can write this:

SELECT *
FROM (
  SELECT 't1' AS t1, t1.* FROM t1
) t1 NATURAL FULL JOIN (
  SELECT 't2' AS t2, t2.* FROM t2
) t2 
WHERE NOT (t1, t2) IS NOT NULL;

This produces:

a |b |c |t1|t2|
--|--|--|--|--|
 1| 2| 3|t1|  |
10|11|12|  |t2|

Why? Because a NATURAL JOIN is syntax sugar for joining using all the shared column names of the two tables, and the FULL JOIN makes sure we can retrieve also the columns that are not matched by the join predicate. Another way to write this is:

-- Use JOIN .. USING, instead of NATURAL JOIN
SELECT *
FROM (
  SELECT 't1' AS t1, t1.* FROM t1
) t1 FULL JOIN (
  SELECT 't2' AS t2, t2.* FROM t2
) t2 USING (a, b, c)
WHERE NOT (t1, t2) IS NOT NULL;

Or:

-- Use JOIN .. ON, instead of JOIN .. USING
SELECT 
  coalesce(t1.a, t2.a) AS a,
  coalesce(t1.b, t2.b) AS b,
  coalesce(t1.c, t2.c) AS c,
  t1.t1,
  t2.t2
FROM (
  SELECT 't1' AS t1, t1.* FROM t1
) t1 FULL JOIN (
  SELECT 't2' AS t2, t2.* FROM t2
) t2 ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) 
WHERE NOT (t1, t2) IS NOT NULL;

Unfortunately, as of PostgreSQL 12, this produces an error:

ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join conditions

Pros and cons

Pros and cons compared to the set operator solution using UNION and EXCEPT:

Pros

  • Each table is accessed only once
  • Comparison is now name based, not column index based, i.e. it can still work if only parts of the columns are the shared

Cons

  • If index based column comparison was desired (because the tables are the same structurally, but do not share the exact same column names), then we’d have to rename each individual column to a common column name.
  • If there’s duplicate data, there’s going to be a cartesian product, which might make this solution quite slower
  • UNION and EXCEPT treat NULL values as “not distinct”. This isn’t the case with NATURAL JOIN. See workaround below

When there are NULL values in the data

In the presence of NULL values, we can no longer use NATURAL JOIN or JOIN .. USING. We could use the DISTINCT predicate:

SELECT 
  coalesce(t1.a, t2.a) AS a,
  coalesce(t1.b, t2.b) AS b,
  coalesce(t1.c, t2.c) AS c,
  t1.t1,
  t2.t2
FROM (
  SELECT 't1' AS t1, t1.* FROM t1
) t1 FULL JOIN (
  SELECT 't2' AS t2, t2.* FROM t2
) t2 ON (t1.a, t1.b, t1.c) IS NOT DISTINCT FROM (t2.a, t2.b, t2.c) 
WHERE NOT (t1, t2) IS NOT NULL;

Row value expression NULL predicate

Observe the usage of the esoteric NULL predicate for row value expressions, which uses the following truth table:

+-----------------------+-----------+---------------+---------------+-------------------+
| Expression            | R IS NULL | R IS NOT NULL | NOT R IS NULL | NOT R IS NOT NULL |
+-----------------------+-----------+---------------+---------------+-------------------+
| degree 1: null        | true      | false         | false         |  true             |
| degree 1: not null    | false     | true          | true          |  false            |
| degree > 1: all null  | true      | false         | false         |  true             |
| degree > 1: some null | false     | false         | true          |  true             |
| degree > 1: none null | false     | true          | true          |  false            |
+-----------------------+-----------+---------------+---------------+-------------------+

Yes. R IS NULL and NOT R IS NOT NULL are not the same thing in SQL…

It’s just another way of writing:

SELECT *
FROM (
  SELECT 't1' AS t1, t1.* FROM t1
) t1 NATURAL FULL JOIN (
  SELECT 't2' AS t2, t2.* FROM t2
) t2 
WHERE t1 IS NULL
OR t2 IS NULL;

Using SQL Server FOR XML and FOR JSON Syntax on Other RDBMS With jOOQ

SQL Server supports transforming flat tabular SQL result sets into hierarchical structures by convention using the convenient FOR XML or FOR JSON syntaxes. This is really convenient and less verbose than the standard SQL/XML or SQL/JSON APIs – although the standard ones are more powerful.

In this blog post, I’d like to show a few core features of the SQL Server syntax, and what they correspond to in standard SQL. jOOQ 3.14 will support both SQL Server’s syntax and the standard syntax, and will be able to translate from one to the other, such that you can use SQL Server syntax also on Db2, MariaDB, MySQL, Oracle, PostgreSQL. You can play around with the current state of development on our website here.

As always, using the Sakila database, here’s a simple example as a teaser:

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
FOR XML RAW;

-- Db2, Oracle, PostgreSQL
SELECT xmlagg(xmlelement(
  NAME row,
  xmlattributes(
    t.first_name AS first_name,
    t.last_name AS last_name,
    t.title AS title
  )
))
FROM (
  -- Original query here
  SELECT a.first_name, a.last_name, f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
) AS t

Producing in both cases something like:

<row first_name="PENELOPE" last_name="GUINESS" title="OKLAHOMA JUMANJI"/>
<row first_name="PENELOPE" last_name="GUINESS" title="RULES HUMAN"/>
<row first_name="PENELOPE" last_name="GUINESS" title="SPLASH GUMP"/>
<row first_name="PENELOPE" last_name="GUINESS" title="VERTIGO NORTHWEST"/>
<row first_name="PENELOPE" last_name="GUINESS" title="WESTWARD SEABISCUIT"/>
<row first_name="PENELOPE" last_name="GUINESS" title="WIZARD COLDBLOODED"/>
<row first_name="NICK" last_name="WAHLBERG" title="ADAPTATION HOLES"/>
<row first_name="NICK" last_name="WAHLBERG" title="APACHE DIVINE"/>

FOR XML and FOR JSON concepts

As could be seen in the above teaser, the SQL Server syntax is far less verbose and concise, and it seems to produce a reasonable default behaviour, where the Db2, Oracle, PostgreSQL (and SQL Standard) SQL/XML APIs are more verbose, but also more powerful. For example, it is possible to map column a to attribute x and column b to a nested XML element y very easily.

The advantages of both approaches are clear. SQL Server’s approach is much more usable in the general case. But what is the general case? Let’s summarise a few key parallels between SQL result sets, and XML/JSON data structures:


Tables are XML elements or JSON arrays

Tables (i.e. sets of data) are not a foreign concept to both XML and JSON documents. The most natural way to represent a set of data in XML is a set of elements using the same element name, optionally wrapped by a wrapper element. For example:

<!-- With wrapper element -->
<films>
  <film title="OKLAHOMA JUMANJI"/>
  <film title="RULES HUMAN"/>
  <film title="SPLASH GUMP"/>
</films>

<!-- Without wrapper element -->
<film title="OKLAHOMA JUMANJI"/>
<film title="RULES HUMAN"/>
<film title="SPLASH GUMP"/>

The distinction of whether a wrapper element is added is mostly significant when nesting data.

With JSON, the obvious choice of data structure to represent a table is an array. For example:

[
  {"title": "OKLAHOMA JUMANJI"},
  {"title": "RULES HUMAN"},
  {"title": "SPLASH GUMP"}
]


Rows are XML elements or JSON objects

As we’ve already seen above, a SQL row is represented in XML using an element.

<film title="OKLAHOMA JUMANJI"/>

The question is only what the element name should be. It can usually be any of:

  • A standard name, such as “row”
  • The name of the table the row stems from
  • A custom name

In JSON, it is an object.

{"title": "OKLAHOMA JUMANJI"}

Unlike in XML, there is no such thing as an element name, so the row is “anonymous”. The row type is defined by what table / array the JSON object is contained in.


Column values are XML elements or attributes, or JSON attributes

We have a bit more choices of how to represent SQL column values in XML. Mainly two choices:

  • Represent values as attributes
  • Represent values as elements

Scalar values can easily be represented as attributes. If a value needs further nesting (e.g. an array, user defined type, etc.), then elements are a better choice. In most cases, the choice is not relevant, so we can pick both:

<!-- Using attributes -->
<film film_id="635" title="OKLAHOMA JUMANJI"/>

<!-- Using elements from table and column names -->
<film>
  <film_id>635</film_id>
  <title>OKLAHOMA JUMANJI</title>
</film>

<!-- Using standard element names
<row>
  <value name="film_id" value="635"/>
  <value name="title" value="OKLAHOMA JUMANJI"/>
</row>

There are a few other reasonable default options for the representation of a column value in XML.

In JSON, on the other hand, there are two main reasonable approaches. In most cases, an object will be chosen, where column values are identified by column name. But just like SQL records are a mixture between “structs” and “tuples”, we could imagine a representation that maps column values to array indexes as well:

// Using objects
{"film_id": 635, "title": "OKLAHOMA JUMANJI"}

// Using arrays
[635, "OKLAHOMA JUMANJI"]


GROUP BY and ORDER BY can be seen as a way to nest data

So far, all data was represented in a flat way, just like the SQL table. There was some nesting when wrapping XML elements of JSON arrays in some wrapper element or object, or when representing XML data with more elements rather than attributes, but the data was still always tabular.

Very often, we want to consume data in a hierarchical form, though. An actor played in films, so we’d like to group the films by actor, rather than repeating the actor information for every film. In general, operations like GROUP BY or ORDER BY serve this purpose. GROUP BY allows for aggregating all data into nested data structures per group (e.g. into strings, arrays, XML elements, JSON arrays, JSON objects). ORDER BY does the same, “visually” – perhaps a bit less formally. When we look at this set of XML elements, we can see visually that they’re “grouped” (i.e. ordered) by actor:

<row first_name="PENELOPE" last_name="GUINESS" title="OKLAHOMA JUMANJI"/>
<row first_name="PENELOPE" last_name="GUINESS" title="RULES HUMAN"/>
<row first_name="PENELOPE" last_name="GUINESS" title="SPLASH GUMP"/>
<row first_name="PENELOPE" last_name="GUINESS" title="VERTIGO NORTHWEST"/>
<row first_name="PENELOPE" last_name="GUINESS" title="WESTWARD SEABISCUIT"/>
<row first_name="PENELOPE" last_name="GUINESS" title="WIZARD COLDBLOODED"/>
<row first_name="NICK" last_name="WAHLBERG" title="ADAPTATION HOLES"/>
<row first_name="NICK" last_name="WAHLBERG" title="APACHE DIVINE"/>

SQL Server supports such grouping in at least two ways:

  • Implicitly by convention, using ORDER BY
  • Explicity by creating correlated subqueries

The implicit approach could transform the above flat representation into something like this:

<a first_name="PENELOPE" last_name="GUINESS">
    <f title="OKLAHOMA JUMANJI"/>
    <f title="RULES HUMAN"/>
    <f title="SPLASH GUMP"/>
    <f title="VERTIGO NORTHWEST"/>
    <f title="WESTWARD SEABISCUIT"/>
    <f title="WIZARD COLDBLOODED"/>
</a>
<a first_name="NICK" last_name="WAHLBERG">
    <f title="ADAPTATION HOLES"/>
    <f title="APACHE DIVINE"/>
</a>

… where “a” and “f” are the table names in the query (actor a and film f).

How do FOR XML and FOR JSON work in detail?

There are several features that can be combined in SQL Server. The complete picture can be seen from the docs. We’ll omit a few features in this blog post here.

  • The transformation algorithm RAW (flat results, only in XML), AUTO (hierarchical, automatic results), PATH (hierarchical, explicit results)
  • The “root” name, which corresponds to an XML wrapper element, or a JSON wrapper object
  • XML only: Whether values should be placed in ELEMENTS or attributes
  • JSON only: INCLUDE_NULL_VALUES specifies whether NULL values are explicit, or implicit (absent from the JSON object).
  • JSON only: WITHOUT_ARRAY_WRAPPER specifies whether the set of JSON objects should be listed as a JSON array, or a comma separated list of objects (which could be combined with other queries)

This is not complete, there are more flags and features, but instead of discussing them in theory, let’s look at a few examples:

FOR XML RAW

Producing flat results with attributes for values

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML RAW;

-- Standard SQL
SELECT xmlagg(xmlelement(
  NAME row,
  xmlattributes(
    t.first_name AS first_name,
    t.last_name AS last_name,
    t.title AS title
  )
))
FROM (
  SELECT a.first_name, a.last_name, f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) AS t

This produces

<row first_name="NICK" last_name="WAHLBERG" title="SMILE EARRING"/>
<row first_name="NICK" last_name="WAHLBERG" title="WARDROBE PHANTOM"/>
<row first_name="PENELOPE" last_name="GUINESS" title="ACADEMY DINOSAUR"/>
<row first_name="PENELOPE" last_name="GUINESS" title="ANACONDA CONFESSIONS"/>

FOR XML RAW, ROOT

Producing flat results with attributes for values, and a root element to wrap the listed elements

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML RAW, ROOT('rows');

-- Standard SQL
SELECT xmlelement(
  NAME rows,
  xmlagg(xmlelement(
    NAME row,
    xmlattributes(
      t.first_name AS first_name,
      t.last_name AS last_name,
      t.title AS title
    )
  ))
)
FROM (
  SELECT a.first_name, a.last_name, f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) AS t

This produces

<rows>
  <row first_name="NICK" last_name="WAHLBERG" title="SMILE EARRING"/>
  <row first_name="NICK" last_name="WAHLBERG" title="WARDROBE PHANTOM"/>
  <row first_name="PENELOPE" last_name="GUINESS" title="ACADEMY DINOSAUR"/>
  <row first_name="PENELOPE" last_name="GUINESS" title="ANACONDA CONFESSIONS"/>
</rows>

FOR XML RAW, ELEMENTS

Producing flat results with elements for values.

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML RAW, ELEMENTS;

-- Standard SQL
SELECT xmlagg(xmlelement(
  NAME row,
  xmlelement(
    NAME first_name,
    first_name
  ),
  xmlelement(
    NAME last_name,
    last_name
  ),
  xmlelement(
    NAME title,
    title
  )
))
FROM (
  SELECT a.first_name, a.last_name, f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
  FOR XML RAW, ELEMENTS
) AS t

This produces

<row>
    <first_name>NICK</first_name>
    <last_name>WAHLBERG</last_name>
    <title>SMILE EARRING</title>
</row>
<row>
    <first_name>NICK</first_name>
    <last_name>WAHLBERG</last_name>
    <title>WARDROBE PHANTOM</title>
</row>
<row>
    <first_name>PENELOPE</first_name>
    <last_name>GUINESS</last_name>
    <title>ACADEMY DINOSAUR</title>
</row>
<row>
    <first_name>PENELOPE</first_name>
    <last_name>GUINESS</last_name>
    <title>ANACONDA CONFESSIONS</title>
</row>

This could also be combined with ROOT, which we’re omitting for brevity.

FOR XML/JSON AUTO

This approach derives results completely automatically from your query structure. Mainly:

  • The SELECT clause defines in what order XML or JSON data is nested.
  • The FROM clause defines the table names (via aliasing), which are translated to XML element or JSON object attribute names.
  • The ORDER BY clause produces the “grouping”, which is translated to nesting XML elements or JSON objects.
-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML AUTO;

-- Standard SQL
SELECT xmlagg(e)
FROM (
  SELECT xmlelement(
    NAME a,
    xmlattributes(
      t.first_name AS first_name,
      t.last_name AS last_name
    ),
    xmlagg(xmlelement(
      NAME f,
      xmlattributes(t.title AS title)
    ))
  ) AS e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) AS t
  GROUP BY
    first_name,
    last_name
) AS t

Notice how this emulation requires two steps of XMLAGG with GROUP BY. It gets more hairy with more tables being joined and projected! I won’t add more complex examples here, but try it online!

This produces

<a first_name="NICK" last_name="WAHLBERG">
    <f title="SMILE EARRING"/>
    <f title="WARDROBE PHANTOM"/>
</a>
<a first_name="PENELOPE" last_name="GUINESS">
    <f title="ACADEMY DINOSAUR"/>
    <f title="ANACONDA CONFESSIONS"/>
</a>

Let’s try the same thing again with JSON:

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR JSON AUTO;

-- Standard SQL
SELECT json_arrayagg(e)
FROM (
  SELECT JSON_OBJECT(
    KEY 'FIRST_NAME' VALUE first_name,
    KEY 'LAST_NAME' VALUE last_name,
    KEY 'F' VALUE JSON_ARRAYAGG(JSON_OBJECT(
      KEY 'TITLE' VALUE title
      ABSENT ON NULL
    ))
    ABSENT ON NULL
  ) e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) t
  GROUP BY
    first_name,
    last_name
) t

The result being:

[
    {
        "first_name": "NICK",
        "last_name": "WAHLBERG",
        "f": [
            {
                "title": "SMILE EARRING"
            },
            {
                "title": "WARDROBE PHANTOM"
            }
        ]
    },
    {
        "first_name": "PENELOPE",
        "last_name": "GUINESS",
        "f": [
            {
                "title": "ACADEMY DINOSAUR"
            },
            {
                "title": "ANACONDA CONFESSIONS"
            }
        ]
    }
]

FOR XML/JSON AUTO, ROOT

Like before, we could wrap this in a root XML element or a root JSON object if need be.

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML AUTO, ROOT;

-- Standard SQL
SELECT xmlelement(
  NAME join,
  xmlagg(e)
)
FROM (
  SELECT xmlelement(
    NAME a,
    xmlattributes(
      t.first_name AS first_name,
      t.last_name AS last_name
    ),
    xmlagg(xmlelement(
      NAME f,
      xmlattributes(t.title AS title)
    ))
  ) e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) t
  GROUP BY
    first_name,
    last_name
) t

This does the same thing as before, but just wraps the previous root XMLAGG() element in another XMLELEMENT() function call.

This produces

<root>
    <a first_name="NICK" last_name="WAHLBERG">
        <f title="SMILE EARRING"/>
        <f title="WARDROBE PHANTOM"/>
    </a>
    <a first_name="PENELOPE" last_name="GUINESS">
        <f title="ACADEMY DINOSAUR"/>
        <f title="ANACONDA CONFESSIONS"/>
    </a>
</root>

Let’s try the same thing again with JSON:

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR JSON AUTO, ROOT;

-- Standard SQL
SELECT JSON_OBJECT(KEY 'a' VALUE json_arrayagg(e))
FROM (
  SELECT JSON_OBJECT(
    KEY 'FIRST_NAME' VALUE first_name,
    KEY 'LAST_NAME' VALUE last_name,
    KEY 'F' VALUE JSON_ARRAY_AGG(JSON_OBJECT(
      KEY 'TITLE' VALUE title
      ABSENT ON NULL
    ))
    ABSENT ON NULL
  ) e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) t
  GROUP BY
    first_name,
    last_name
) t

The result being:

{
    "a": [
        {
            "first_name": "NICK",
            "last_name": "WAHLBERG",
            "f": [
                {
                    "title": "SMILE EARRING"
                },
                {
                    "title": "WARDROBE PHANTOM"
                }
            ]
        },
        {
            "first_name": "PENELOPE",
            "last_name": "GUINESS",
            "f": [
                {
                    "title": "ACADEMY DINOSAUR"
                },
                {
                    "title": "ANACONDA CONFESSIONS"
                }
            ]
        }
    ]
}

FOR XML AUTO, ELEMENTS

Like before, instead of producing attributes, we might decide to produce elements instead (in XML only):

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML AUTO, ELEMENTS;

-- Standard SQL
SELECT xmlagg(e)
FROM (
  SELECT xmlelement(
    NAME a,
    xmlelement(
      NAME first_name,
      first_name
    ),
    xmlelement(
      NAME last_name,
      last_name
    ),
    xmlagg(xmlelement(
      NAME f,
      xmlelement(
        NAME title,
        title
      )
    ))
  ) e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) t
  GROUP BY
    first_name,
    last_name
) t

Not much has changed, except for the fact that a set of XMLELEMENT() calls are made, rather than XMLATTRIBUTES() callse.

This produces

<a>
    <first_name>NICK</first_name>
    <last_name>WAHLBERG</last_name>
    <f>
        <title>SMILE EARRING</title>
    </f>
    <f>
        <title>WARDROBE PHANTOM</title>
    </f>
</a>
<a>
    <first_name>PENELOPE</first_name>
    <last_name>GUINESS</last_name>
    <f>
        <title>ACADEMY DINOSAUR</title>
    </f>
    <f>
        <title>ANACONDA CONFESSIONS</title>
    </f>
</a>

FOR XML/JSON PATH

The PATH strategy is my personal favourite. It is used to create nested XML or JSON path structures more explicitly, and also allows for additional nesting levels when grouping projections together. This is best shown by example. Notice, how I’m now using aliases for my columns, and the alias looks like an XPath expression using '/' (slashes):

-- SQL Server
SELECT 
  a.first_name AS [author/first_name], 
  a.last_name AS [author/last_name], 
  f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML PATH;

-- Standard SQL
SELECT xmlagg(xmlelement(
  NAME row,
  xmlelement(
    NAME author,
    xmlelement(
      NAME first_name,
      "author/first_name"
    ),
    xmlelement(
      NAME last_name,
      "author/last_name"
    )
  ),
  xmlelement(
    NAME title,
    title
  )
))
FROM (
  SELECT 
    a.first_name AS "author/first_name", 
    a.last_name AS "author/last_name", 
    f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) t

Check out how by convention, we’re now getting an additional level of nesting for author related columns under the row/author element:

<row>
    <author>
        <first_name>NICK</first_name>
        <last_name>WAHLBERG</last_name>
    </author>
    <title>SMILE EARRING</title>
</row>
<row>
    <author>
        <first_name>NICK</first_name>
        <last_name>WAHLBERG</last_name>
    </author>
    <title>WARDROBE PHANTOM</title>
</row>
<row>
    <author>
        <first_name>PENELOPE</first_name>
        <last_name>GUINESS</last_name>
    </author>
    <title>ACADEMY DINOSAUR</title>
</row>
<row>
    <author>
        <first_name>PENELOPE</first_name>
        <last_name>GUINESS</last_name>
    </author>
    <title>ANACONDA CONFESSIONS</title>
</row>

This is really neat! The SQL Server syntax is definitely much more convenient for this common use-case.

Let’s try the same thing again with JSON. The only thing we change is we now use a JSON-path-ish syntax using dots ('.') rather than slashes ('/'):

-- SQL Server
SELECT 
  a.first_name AS [author.first_name], 
  a.last_name AS [author.last_name], 
  f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR JSON PATH;

-- Standard SQL
SELECT JSON_ARRAYAGG(JSON_OBJECT(
  KEY 'author' VALUE JSON_OBJECT(
    KEY 'first_name' VALUE author.first_name,
    KEY 'last_name' VALUE author.last_name
  ),
  KEY 'TITLE' VALUE title
  ABSENT ON NULL
))
FROM (
  SELECT 
    a.first_name AS "author.first_name", 
    a.last_name AS "author.last_name", 
    f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) t

The result being (again, with nested objects):

[
    {
        "author": {
            "first_name": "NICK",
            "last_name": "WAHLBERG"
        },
        "title": "SMILE EARRING"
    },
    {
        "author": {
            "first_name": "NICK",
            "last_name": "WAHLBERG"
        },
        "title": "WARDROBE PHANTOM"
    },
    {
        "author": {
            "first_name": "PENELOPE",
            "last_name": "GUINESS"
        },
        "title": "ACADEMY DINOSAUR"
    },
    {
        "author": {
            "first_name": "PENELOPE",
            "last_name": "GUINESS"
        },
        "title": "ANACONDA CONFESSIONS"
    }
]

For more sophisticated nesting, including nesting of collections, a correlated subquery is needed in SQL Server, also with a FOR XML or FOR JSON syntax.

Conclusion

XML and JSON are popular document formats outside and inside of the database. SQL Server has some of the most conventient syntax for most cases, while standard SQL supports much more basic, and thus more powerful constructs. In standard SQL, almost any kind of XML or JSON projection is possible, and with XMLTABLE() and JSON_TABLE(), the documents can be transformed back to SQL tables, as well. In many applications, using these XML or JSON features natively would lead to much less boilerplate code, as many applications do not need middleware between the database and some client, just to transform data between formats.

Most ORMs don’t expose this functionality for a variety of reasons, the main one being that the devil is in the details. While both XML and JSON are nicely standardised, the implementations differ greatly:

  • The SQL/XML standard is implemented mostly by DB2, Oracle, and PostgreSQL. Many dialects offer some XML capabilities, but not as impressive as the standard and the previous three. SQL Server has FOR XML which is very powerful for standard XML serialisations, but may be a bit difficult to use for edge cases
  • The SQL/JSON standard was added late and is implemented again to large extents by DB2 and Oracle, but inceasingly also by MariaDB and MySQL. PostgreSQL (and by consequence, compatible dialects, like CockroachDB) had their own proprietary functions and APIs, which are not compatible with the standard. And again, SQL Server has FOR JSON which works well for standard serialisations, but a bit less well for edge cases

These technologies are poorly adopted in clients because of the many subtle differences. jOOQ has been leveling out these minor differences for many years without hiding the core functionality. SQL/XML and SQL/JSON are perfect use-cases for jOOQ 3.14 (due in Q2 2020), which now allows for using both the standard SQL/XML and SQL/JSON syntaxes as well as the SQL Server FOR XML and FOR JSON syntax in the jOOQ Professional and Enterprise Editions.

Before jOOQ 3.14 is out, you can already play with the current functionality on our website: https://www.jooq.org/translate

The Many Flavours of the Arcane SQL MERGE Statement

The SQL MERGE statement is a device whose mystery is only exceeded by its power.

A simple example shows its full power according to standard SQL. Imagine you have a production table for product prices, and a staging table from which you want to load the latest prices. For once, I’m using the Db2 LuW MERGE syntax, because that’s the most standards compliant syntax out there (among the dialects we support in jOOQ):

DROP TABLE IF EXISTS prices;
DROP TABLE IF EXISTS staging;

CREATE TABLE prices (
  product_id BIGINT NOT NULL PRIMARY KEY,
  price DECIMAL(10, 2) NOT NULL,
  price_date DATE NOT NULL,
  update_count BIGINT NOT NULL
);

CREATE TABLE staging (
  product_id BIGINT NOT NULL PRIMARY KEY,
  price DECIMAL(10, 2) NOT NULL
);

DELETE FROM prices;
DELETE FROM staging;
INSERT INTO staging 
VALUES (1, 100.00),
       (2, 125.00),
       (3, 150.00);

So, we’ve loaded a few records in our staging table, which we now want to merge over to the prices table. We could just insert them, easily, but we’ll stage more prices later on, e.g. these ones:

DELETE FROM staging;
INSERT INTO staging 
VALUES (1, 100.00),
       (2,  99.00),
       (4, 300.00);

We want our logic to be this:

  • All new IDs in the staging table should just be inserted into the prices table.
  • Existing IDs should be updated if and only if the price has changed. In that case, the update_count should increase.
  • Prices that are no longer encountered in the staging table should be deleted from the prices table, to implement a full sync, rather than a delta sync, for the sake of the example. We could also add a “command” column that contains the instruction on whether data should be updated or deleted, to implement a delta sync.

So, this is the Db2 (and standards compliant) MERGE statement that we use for the job:

MERGE INTO prices AS p
USING (
  SELECT COALESCE(p.product_id, s.product_id) AS product_id, s.price
  FROM prices AS p
  FULL JOIN staging AS s ON p.product_id = s.product_id
) AS s
ON (p.product_id = s.product_id)
WHEN MATCHED AND s.price IS NULL THEN DELETE
WHEN MATCHED AND p.price != s.price THEN UPDATE SET 
  price = s.price,
  price_date = CURRENT_DATE,
  update_count = update_count + 1
WHEN NOT MATCHED THEN INSERT 
  (product_id, price, price_date, update_count)
VALUES 
  (s.product_id, s.price, CURRENT_DATE, 0);

Easy eh?

Hm, not so simple if you haven’t written MERGE statements all your life. If so, don’t be scared. As most of SQL, the scary part is syntax (keywords, UPPER CASE, etc.). The underlying concepts are simpler than they may seem at first. Let’s go through it step by step. It has 4 parts:

1. The target table

Just like with INSERT statements, we can define where we want to MERGE the data INTO. This is the simple part:

MERGE INTO prices AS p
-- ...

2. The source table

The USING keyword wraps a source table that we want to merge. We could have just placed the staging table here, but I wanted to enrich the source data with some additional data first. I’m using a FULL JOIN to produce the matching between old data (prices) and new data (staging). If, after filling the staging table for the second time, but before running the MERGE statement, we ran the USING clause alone (with some minor modifications for illustration):

SELECT 
  COALESCE(p.product_id, s.product_id) AS product_id, 
  p.price AS old_price, 
  s.price AS new_price
FROM prices AS p
FULL JOIN staging AS s ON p.product_id = s.product_id
ORDER BY product_id

Then we would get this:

PRODUCT_ID|OLD_PRICE|NEW_PRICE|
----------|---------|---------|
         1|   100.00|   100.00| <-- same price
         2|   125.00|    99.00| <-- updated price
         3|   150.00|         | <-- deleted price
         4|         |   300.00| <-- added price

Neat!

3. The ON clause

Next, we RIGHT JOIN the target table and source tables using an ON clause, just like with an ordinary JOIN:

ON (p.product_id = s.product_id)

MERGE always uses RIGHT JOIN semantics, which is why I placed a FULL JOIN in the source table, the USING clause. It is totally possible to write things a bit differently, such that we can avoid accessing the prices table twice, but I want to show the full power of this statement. Notice, SQL Server joins source and target tables using a FULL JOIN, as I’ll explain further down. I’ll also explain why RIGHT JOIN, right away.

4. The WHEN clauses

Now comes the interesting part! There can either be a match between the two tables (target and source), like when we get the result of an INNER JOIN, or no such match because the source table contains a row that is not matched by the target table (the RIGHT JOIN semantics). In our example, PRODUCT_ID IN (1, 2, 3) will produce a match (contained in both source and target tables), whereas PRODUCT_ID = 4 will not produce a match (not contained in the target table yet).

Colouring our source data set:

PRODUCT_ID|OLD_PRICE|NEW_PRICE|
----------|---------|---------|
         1|   100.00|   100.00| <-- matched (but ignored)
         2|   125.00|    99.00| <-- matched
         3|   150.00|         | <-- matched
         4|         |   300.00| <-- not matched

The following are a sequence of matching instructions that will be executed in the order of appearance, for each row resulting from the previous RIGHT JOIN:

-- With my FULL JOIN, I've produced NULL price values
-- whenever a PRODUCT_ID is in the target table, but not
-- in the source table. These rows, we want to DELETE
WHEN MATCHED AND s.price IS NULL THEN DELETE

-- When there is a price change (and only then), we 
-- want to update the price information in the target table.
WHEN MATCHED AND p.price != s.price THEN UPDATE SET 
  price = s.price,
  price_date = CURRENT_DATE,
  update_count = update_count + 1

-- Finally, when we don't have a match, i.e. a row is
-- in the source table, but not in the target table, then
-- we simply insert it into the target table.
WHEN NOT MATCHED THEN INSERT 
  (product_id, price, price_date, update_count)
VALUES 
  (s.product_id, s.price, CURRENT_DATE, 0);

It’s not too complicated, just a lot of keywords and syntax. So, after running this MERGE on the second set of data of the staging table, we’ll get this result in the price table:

PRODUCT_ID|PRICE |PRICE_DATE|UPDATE_COUNT|
----------|------|----------|------------|
         1|100.00|2020-04-09|           0|
         2| 99.00|2020-04-09|           1|
         4|300.00|2020-04-09|           0|

The way I expressed this MERGE statement, it is idempotent, i.e. I can run it again on the same staging table content, and it won’t modify any data in the price table – because none of the WHEN clauses applies. Idempotence is not a property of MERGE, I just wrote my statement this way.

Dialect specifics

A few dialects support MERGE. Among the ones that jOOQ 3.13 supports, there are at least:

  • Db2
  • Derby
  • Firebird
  • H2
  • HSQLDB
  • Oracle
  • SQL Server
  • Sybase SQL Anywhere
  • Teradata
  • Vertica

For once, regrettably, this list does not include PostgreSQL. But even the dialects in this list do not all agree on what MERGE really is. The SQL standard specifies 3 features, each one optional:

  • F312 MERGE statement
  • F313 Enhanced MERGE statement
  • F314 MERGE statement with DELETE branch

But instead of looking at the standards and what they require, let’s look at what the dialects offer, and how it can be emulated if something is not available.

The AND clause

As you may have noticed, this post uses the syntax:

WHEN MATCHED AND <some predicate> THEN

It’s also possible to specify

WHEN NOT MATCHED AND <some predicate> THEN

With the exception of Teradata, most dialects support these AND clauses (Oracle has a specific syntax using WHERE for this, which I’ll get to later).

The point of these clauses is to be able to have several of these WHEN MATCHED or WHEN NOT MATCHED clauses, in fact an arbitrary number of them. Unfortunately, this isn’t supported by all dialects. Some dialects only support one clause of each type (INSERT, UPDATE, DELETE). It isn’t strictly necessary to support several clauses, but it is a lot more convenient as we’ll see below.

These dialects do not support multiple WHEN MATCHED or WHEN NOT MATCHED clauses:

  • HSQLDB
  • Oracle
  • SQL Server
  • Teradata

If a dialect does not support AND, or if it does not support multiple WHEN MATCHED clauses, simply translate the clauses to case expressions. Instead of our previous WHEN clauses, we’d get:

-- The DELETE clause doesn't make much sense without AND,
-- So there's not much we can do about this emulation in Teradata.
WHEN MATCHED AND s.price IS NULL THEN DELETE

-- Repeat the AND clause in every branch of the CASE
-- Expression where it applies
WHEN MATCHED THEN UPDATE SET 
  price = CASE

    -- Update the price if the AND clause applies
    WHEN p.price != s.price THEN s.price,

    -- Otherwise, leave it untouched
    ELSE p.price
  END

  -- Repeat for all columns
  price_date = CASE
    WHEN p.price != s.price THEN CURRENT_DATE
    ELSE p.price_date
  END,
  update_count = CASE
    WHEN p.price != s.price THEN update_count + 1
    ELSE p.update_count
  END

-- Unchanged, in this case
WHEN NOT MATCHED THEN INSERT 
  (product_id, price, price_date, update_count)
VALUES 
  (s.product_id, s.price, CURRENT_DATE, 0);

The formalism is this one:

If there is no AND, add AND

These are the same:

WHEN MATCHED THEN [ UPDATE | DELETE ]
WHEN MATCHED AND 1 = 1 THEN [ UPDATE | DELETE ]

This replacement may be needed in Firebird (which has a bug in this area) and SQL Server (which does not allow for WHEN MATCHED clauses after a WHEN MATCHED clause without AND clause, which is kind of a linting error). Instead of emulating stuff, you can just skip all the subsequent WHEN MATCHED branches, because they will not apply. Every row is updated only once, i.e. by only one WHEN clause:

Every row is updated only once

Make sure no row is updated more than once in the emulation, as required by the standard. When writing this:

WHEN MATCHED AND p1 THEN UPDATE SET c1 = 1
WHEN MATCHED AND p2 THEN DELETE
WHEN MATCHED AND p3 THEN UPDATE SET c1 = 3, c2 = 3
WHEN MATCHED AND p4 THEN DELETE

This really means the same as:

WHEN MATCHED AND p1 THEN UPDATE SET c1 = 1
WHEN MATCHED AND NOT p1 AND p2 THEN DELETE
WHEN MATCHED AND NOT p1 AND NOT p2 AND p3 THEN UPDATE SET c1 = 3,c2 = 3
WHEN MATCHED AND NOT p1 AND NOT p2 AND NOT p3 AND p4 THEN DELETE

To emulate the above, write this instead:

WHEN MATCHED AND 
  p1 OR
  NOT p1 AND NOT p2 AND p3
THEN UPDATE SET 
  c1 = CASE 
    WHEN p1                       THEN 1
    WHEN NOT p1 AND NOT p2 AND p3 THEN 3
                                  ELSE c1
  END,
  c2 = CASE
    WHEN NOT p1 AND NOT p2 AND p3 THEN 3
                                  ELSE c2
  END
WHEN MATCHED AND 
  NOT p1 AND p2 OR
  NOT p1 AND NOT p2 AND NOT p3 AND p4
THEN DELETE

Quite laborious, but that’s how it is.

H2 and HSQLDB

Notice that both H2 and HSQLDB do not follow the “every row is updated only once” rule. I’ve reported this to H2 already: https://github.com/h2database/h2database/issues/2552. If you want to be standards compliant (jOOQ 3.14 will emulate this for you, don’t worry), then you have to do the above CASE expression madness in these dialects, or, in H2 (HSQLDB doesn’t support multiple WHEN MATCHED clauses of the same type) enhance all the WHEN MATCHED AND clauses as I’ve illustrated before:

WHEN MATCHED AND p1 THEN UPDATE SET c1 = 1
WHEN MATCHED AND NOT p1 AND p2 THEN DELETE
WHEN MATCHED AND NOT p1 AND NOT p2 AND p3 THEN UPDATE SET c1 = 3,c2 = 3
WHEN MATCHED AND NOT p1 AND NOT p2 AND NOT p3 AND p4 THEN DELETE

Oracle

Oracle doesn’t support AND here, but some interesting vendor specific syntax. It looks reasonable at first, but it’s really funky.

  • After UPDATE, you can add a WHERE clause, which is the same thing as AND. So far so good.
  • You can also add a DELETE WHERE clause, but only together with an UPDATE. So you cannot DELETE without updating. Fine, we weren’t going to, in our example.
  • However, the interesting thing is that the UPDATE / DELETE command is executed together, and DELETE happens after UPDATE. So the same row gets processed twice. If you use WHERE in UPDATE, only rows included in the UPDATE can also be included in DELETE. I mean, why would you update the rows first, prior to deletion?

This means that our standard clauses:

WHEN MATCHED AND p1 THEN UPDATE SET c1 = 1
WHEN MATCHED AND p2 THEN DELETE
WHEN MATCHED AND p3 THEN UPDATE SET c1 = 3, c2 = 3
WHEN MATCHED AND p4 THEN DELETE

Would need to be emulated like this:

WHEN MATCHED 
THEN UPDATE SET 
  c1 = CASE 
    WHEN p1 THEN 1  -- Normal update for WHEN MATCHED AND p1 clause
    WHEN p2 THEN c1 -- "Touch" record for later deletion
    WHEN p3 THEN 3  -- Normal update for WHEN MATCHED AND p3 clause
    WHEN p4 THEN c1 -- "Touch" record for later deletion
            ELSE c1
  END,
  c2 = CASE
    WHEN p1 THEN c2 -- p1 is not affecting c2
    WHEN p2 THEN c2 -- "Touch" record for later deletion
    WHEN p3 THEN 3  -- Normal update for WHEN MATCHED AND p3 clause
    WHEN p4 THEN c2 -- "Touch" record for later deletion
            ELSE c2
  END

-- Any predicate from any AND clause, regardless if UPDATE or DELETE
WHERE p1 OR p2 OR p3 OR p4

-- Repeat the predicates required for deletion
DELETE WHERE 
  NOT p1 AND p2 OR
  NOT p1 AND NOT p2 AND NOT p3 AND p4

It was just a simple MERGE statement in standard SQL syntax!

There is an additional level of trickiness here, which I will not cover in this blog post (but we might handle it in jOOQ). In Oracle, the DELETE WHERE clause can already see the updates performed by the UPDATE clause. This means that if, for example, p2 depends on the value of c1:

  c1 = CASE 
    ...
    WHEN p2 THEN c1 -- "Touch" record for later deletion
    ...
  END,

Then the evaluation of p2 in DELETE WHERE will be affected by this

DELETE WHERE 
  NOT p1 AND p2 OR
  NOT p1 AND NOT p2 AND NOT p3 AND p4

The c1 in these p2 expressions will not be the same c1 as the c1 in the UPDATE clause. It’s obviously possible to manage this as well to some extent, through variable substitution.

SQL Server BY SOURCE and BY TARGET

SQL Server has a useful extension to the WHEN NOT MATCHED clause, which I think belongs in the SQL standard!

With this extension, you can specify whether you want to perform INSERT actions WHEN NOT MATCHED [ BY TARGET ] (the default that everyone else supports as well), or WHEN NOT MATCHED BY SOURCE (in case of which you can perform another UPDATE or DELETE action.

The BY TARGET clause means that we found a row in the source table but not in the target table. The BY SOURCE clause means that we found a row in the target table but not in the source table. This means that in SQL Server, the target and source tables are FULL OUTER JOINed, not RIGHT OUTER JOINed, which would mean that our original statement can be greatly simplified:

MERGE INTO prices AS p
USING staging AS s
ON (p.product_id = s.product_id)
WHEN NOT MATCHED BY SOURCE THEN DELETE
WHEN MATCHED AND p.price != s.price THEN UPDATE SET 
  price = s.price,
  price_date = getdate(),
  update_count = update_count + 1
WHEN NOT MATCHED BY TARGET THEN INSERT 
  (product_id, price, price_date, update_count)
VALUES 
  (s.product_id, s.price, getdate(), 0);

We could colour again the rows encountered here:

PRODUCT_ID|  P.PRICE|  S.PRICE|
----------|---------|---------|
         1|   100.00|   100.00| <-- matched (but ignored)
         2|   125.00|    99.00| <-- matched
         3|   150.00|         | <-- not matched by source
         4|         |   300.00| <-- not matched by target

As can be seen, this is really just how a FULL OUTER JOIN works.

The emulation of these clauses back into standard SQL are laborious too, as we’d have to emulate this FULL OUTER JOIN explicitly. I think it’s possible, but we might not implement it in jOOQ.

Vertica

Only Vertica seems to not support the DELETE branch, meaning you cannot use a MERGE statement to DELETE data from your target table. You can use it only to INSERT or UPDATE data, which is good enough in almost all cases. Curiously, Teradata supports DELETE, but not AND, which seems kinda pointless, as DELETE and UPDATE cannot be combined this way.

Conclusion

The MERGE statement is a device whose mystery is only exceeded by its power. In a simple form (no AND or WHERE clauses, no DELETE clauses), all dialects pretty much agree, and that’s already a very useful feature set, which jOOQ has supported for a long time. Starting from jOOQ 3.14, we’re tackling also all the other features listed in this article to help you write complex, vendor agnostic MERGE statements and emulate them on all the dialects that have some MERGE support.

Want to play around with it already now? Check out our free online SQL translation tool.

Create Empty Optional SQL Clauses with jOOQ

When using jOOQ to create dynamic SQL statements (one of jOOQ’s core value propositions), it is often necessary to add query elements conditionally, with a default “No-op” behaviour. For first time users, this default “no-op” behaviour is not always obvious as the jOOQ API is vast, and as with any vast API, there are many different options to do similar things.

How not to do it

A common pitfall is to be tempted to work with the many XYZStep types. What types are these? They are usually invisible to the developer as developers use jOOQ’s DSL API in a fluent fashion, just like the JDK Stream API. For example:

DSLContext ctx = ...;

Result<?> result =
ctx.select(T.A, T.B)
   .from(T)
   .where(T.C.eq(1))
   .and(T.D.eq(2))
   .fetch();

Let’s decompose the above query to see what happens in the API. We could assign every method result to a local variable:

SelectFromStep<?> s1 = ctx.select(T.A, T.B);
SelectWhereStep<?> s2 = s1.from(T);
SelectConditionStep<?> s3 = s2.where(T.C.eq(1));
SelectConditionStep<?> s4 = s3.and(T.D.eq(2))

Result<?> result = s4.fetch();

Our previous fluent API design blog post explains this API design technique.

This is not what people usually do with “static SQL” statements, but they might be tempted to do this if they wanted to add the last predicate (T.D = 2) conditionally, e.g:

DSLContext ctx = ...;

SelectConditionStep<?> c =
ctx.select(T.A, T.B)
   .from(T)
   .where(T.C.eq(1));

if (something)
    c = c.and(T.D.eq(2));

Result<?> result = c.fetch();

This is perfectly valid API usage, but we do not recommend it because it is very messy and leads to difficult to maintain client code. Also, it is absolutely unnecessary, because there is a better way:

Composing queries from its parts

The problem with the above approach is that it is trying to use an imperative approach of adding things to a query step by step. This is how many developers tend to structure their code, but with SQL (and by consequence, jOOQ) that can turn out to be difficult to get right. A functional approach tends to work better.

Notice that not only the entire DSL structure could be assigned to local variables, but also the individual SELECT clause arguments. For example:

DSLContext ctx = ...;

List<SelectField<?>> select = Arrays.asList(T.A, T.B);
Table<?> from = T;
Condition where = T.C.eq(1).and(T.D.eq(2));

Result<?> result =
ctx.select(select)
   .from(from)
   .where(where)
   .fetch();

In fact, every jOOQ query is a dynamic SQL query. Many queries just happen to look like static queries, due to jOOQ’s API design.

Again, we wouldn’t be assigning every SELECT clause argument to a local variable, only the truly dynamic ones. For example:

DSLContext ctx = ...;

Condition where = T.C.eq(1);

if (something)
    where = where.and(T.D.eq(2));

Result<?> result =
ctx.select(T.A, T.B)
   .from(T)
   .where(where)
   .fetch();

This already looks quite decent.

Avoid breaking readability

A lot of people aren’t happy with this approach either, because it breaks a query’s readability by making its components non-local. The predicates in the query are declared up front, away from the query itself. This isn’t how many people like to reason about SQL.

And you don’t have to! It is totally possible to embed the condition directly in the WHERE clause like this:

DSLContext ctx = ...;

Result<?> result =
ctx.select(T.A, T.B)
   .from(T)

   // We always need this predicate
   .where(T.C.eq(1))

   // This is only added conditionally
   .and(something
      ? T.D.eq(2)
      : DSL.noCondition())
   .fetch();

The magic is in the above usage of DSL.noCondition, which is a pseudo predicate that does not generate any content. It is a placeholder where an org.jooq.Condition type is required without actually materialising one.

There is also:

… but that requires having to think about these identities and the reductions all the time. Also, if you append many of these trueCondition() or falseCondition() to a query, the resulting SQL tends to be quite ugly, for example for people having to analyse performance in production. noCondition() just never generates any content at all.

Note that noCondition() does not act as an identity! If your noCondition() is the only predicate left in a WHERE clause, there will not be any WHERE clause, regardless if you work with AND predicates or OR predicates.

No-op expressions in jOOQ

When using dynamic SQL like this, and adding things conditionally to queries, such “no-op expressions” become mandatory. In the previous example, we’ve seen how to add a “no-op predicate” to a WHERE clause (the same approach would obviously work with HAVING and all other clauses that work with boolean expressions).

The three most important jOOQ query types are:

Users may want to add all of these conditionally to queries.

org.jooq.Condition

We’ve already seen how to do this with org.jooq.Condition.

org.jooq.Field

What about dynamic column expressions in the projection (the SELECT clause)? Assuming you want to project columns only in certain cases. In our example, the T.B column is something we don’t always need. That’s easy! The same approach can be used (assuming T.B is a string column):

DSLContext ctx = ...;

Result<Record2<String, String>> result =
ctx.select(T.A, something ? T.B : DSL.inline("").as(T.B))
   .from(T)
   .where(T.C.eq(1))
   .and(T.D.eq(2))
   .fetch();

Using inlined parameters via DSL.inline(), you can easily produce a no-op value in your projection if you don’t want to modify the projection’s row type. The advantage is that you can now use this subquery in a union that expects two columns:

DSLContext ctx = ...;

Result<Record2<String, String>> result =

// First union subquery has a conditionally projected column
ctx.select(T.A, something ? T.B : DSL.inline("").as(T.B))
   .from(T)
   .where(T.C.eq(1))
   .and(T.D.eq(2))

   .union(

// Second union subquery has no such conditions
    select(U.A, U.B)
   .from(U))
   .fetch();

You can take this one step further, and make an entire union subquery conditional this way!

DSLContext ctx = ...;

Result<Record2<String, String>> result =

// First union subquery has a conditionally projected column
ctx.select(T.A, T.B)
   .from(T)
   .union(
      something
        ? select(U.A, U.B).from(U)
        : select(inline(""), inline("")).where(falseCondition())
   )
   .fetch();

This is a bit more syntactic work, but it’s nice to see how easy it is to add something conditionally to a jOOQ query without making the query completely unreadable. Everything is local to where it is being used. No local variables are needed, no imperative control flow is invoked.

And because everything is now an expression (and not a statement / no control flow), we can factor out parts of this query into auxiliary methods, that can be made reusable.

org.jooq.Table

Conditional table expressions usually appear when doing conditional joins. This is usually not done in isolation, but together with other conditional elements in a query. E.g. if some columns are projected conditionally, those columns may require an additional join, as they originate from another table than the tables that are used unconditionally. For example:

DSLContext ctx = ...;

Result<?> result =
ctx.select(
      T.A, 
      T.B, 
      something ? U.X : inline(""))
   .from(
      something
      ? T.join(U).on(T.Y.eq(U.Y))
      : T)
   .where(T.C.eq(1))
   .and(T.D.eq(2))
   .fetch();

There isn’t a more simple way to produce the conditional JOIN expression, because JOIN and ON need to be provided independently. For simple cases as shown above, this is perfectly fine. In more complex cases, some auxiliary methods may be needed, or views.

Conclusion

There are two important messages here in this post:

  1. The XYZStep types are auxiliary types only. They are there to make your dynamically constructed SQL statement look like static SQL. But you should never feel the need to assign them to local variables, or return them from methods. While it is not wrong to do so, there is almost always a better way to write dynamic SQL.
  2. In jOOQ, every query is a dynamic query. This is the benefit of composing SQL queries using an expression tree like the one that is used in jOOQ’s internals. You may not see the expression tree because the jOOQ DSL API mimicks static SQL statement syntax. But behind the scenes, you’re effectively building this expression tree. Every part of the expression tree can be produced dynamically, from local variables, methods, or expressions, such as conditional expressions. I’m looking forward to using the new JEP 361 switch expressions in dynamic SQL. Just like a SQL CASE expression, some SQL statement parts can be constructed dynamically in the client, prior to passing them to the server.

Once these two things are internalised, you can write very fancy dynamic SQL, including using FP approaches to constructing data structures, such as a jOOQ query object.

5 Ways to Better Understand SQL by Adding Optional Parentheses

It appears that our recent beginner SQL articles explaining SQL syntax were quite popular. These include:

Today, I want to tackle the problem of understanding and learning SQL syntax from a new angle.

SQL’s COBOL-esque syntactic heritage

Unlike most C-style languages, SQL syntax is a messy mixture of keywords, special characters, and both with a lot of overloaded meanings. I’d call myself rather experienced with SQL, yet I’m still often unsure about whether something is a keyword or an identifier (and yes, it does matter):

From my past SQL training sessions, however, I have learned that one of the best tools to understand what’s really going on in SQL syntax is to add parentheses at random places. Parentheses can be noisy, but they can definitely help understand the syntactic structure of a statement.

A trivial example from arithmetics, or logic:

a + b * c
a OR b AND c

In almost all languages, operator precedence dictates that the above be equivalent to this:

a + (b * c)
a OR (b AND c)

And already, we’ve introduced some syntactic clarity to readers who might not be aware of operator precedence. While the parentheses are not necessary in these cases, they can help with readability if the expressions are much more complex than the above.

Did you know you can place parentheses at many other locations in SQL? Here’s a list of interesting places, where most people usually omit parentheses. I’m going to use the Sakila database for examples.

1. Row value expressions

In every day language, we don’t call them “row value expressions”. We call them ordinary “column expressions”. For example, we might look for a specific actor by name

SELECT * 
FROM actor
WHERE first_name = 'SUSAN'
AND last_name = 'DAVIS';

We will get:

|actor_id   |first_name|last_name |
|-----------|----------|----------|
|101        |SUSAN     |DAVIS     |
|110        |SUSAN     |DAVIS     |

But what we really did was this:

-- Works in many dialects
SELECT * 
FROM actor
WHERE (first_name) = ('SUSAN')
AND (last_name) = ('DAVIS');

Or also:

-- Works in a few dialects, including e.g. PostgreSQL
SELECT * 
FROM actor
WHERE ROW (first_name) = ROW ('SUSAN')
AND ROW (last_name) = ROW ('DAVIS');

The parentheses are somewhat ambiguous in this case, the ROW constructor makes it more obvious. The SQL standard specifies a “row value expression special case”, i.e. the case where a row value expression (i.e. a tuple expression) is of degree 1. In case of which most people will “colloquially” omit the parentheses, thinking we’re comparing column expressions.

In this example, we could rewrite the query to this:

SELECT * 
FROM actor
WHERE (first_name, last_name) = ('SUSAN', 'DAVIS');

Or, in funky Oracle, double parentheses are required for no good reason on the right hand side.

SELECT * 
FROM actor
WHERE (first_name, last_name) = (('SUSAN', 'DAVIS'));

But I was not going to confuse you, I was going to clarify things ;-)

In fact, once you get a hang of row value expressions, you will quickly want to use them for less trivial cases than avoiding two AND connected predicates. For example, you can use this approach also with the IN predicate (and if you’re daring enough, with the NOT IN predicate, but beware of NULLS):

SELECT * 
FROM actor
WHERE (first_name, last_name) IN (
  ('SUSAN', 'DAVIS'),
  ('NICK' , 'WAHLBERG')
)

Wow! Result:

|actor_id   |first_name|last_name |
|-----------|----------|----------|
|2          |NICK      |WAHLBERG  |
|101        |SUSAN     |DAVIS     |
|110        |SUSAN     |DAVIS     |

And not only that. You can also use the same approach with IN and subqueries:

-- Actors that have the same name as a customer
SELECT *
FROM actor
WHERE (first_name, last_name) IN (
  SELECT first_name, last_name
  FROM customer
)

And indeed:

|actor_id   |first_name|last_name |
|-----------|----------|----------|
|4          |JENNIFER  |DAVIS     |

Yes! The “colloquial” IN predicate that you all wrote without the parentheses around the row value expression, in reality, had a row value expression of degree 1 on the left side, AND on the right side. Increase that degree to 2, and now you need to write a subquery with 2 columns as well.

It’s quite idiomatic, and readable, and nifty. Quite a few dialects support this.

2. JOINs

Joins are very misunderstood, in SQL. Very very misunderstood. One reason is because people confuse them with other set operations, such as UNION, and thus illustrate them with Venn Diagrams. I’ve already covered that. That article explains that JOINs are just (filtered) cartesian products.

The simplest JOIN type is CROSS JOIN. It does not have an ON or USING clause. Much more common, however, is INNER JOIN, or just JOIN (INNER being an optional keyword).

SELECT a.first_name, a.last_name, f.title
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)

We’ve all typed millions of these to obtain:

|first_name|last_name |title                         |
|----------|----------|------------------------------|
|...       |...       |...                           |
|PENELOPE  |GUINESS   |SPLASH GUMP                   |
|PENELOPE  |GUINESS   |VERTIGO NORTHWEST             |
|PENELOPE  |GUINESS   |WESTWARD SEABISCUIT           |
|PENELOPE  |GUINESS   |WIZARD COLDBLOODED            |
|NICK      |WAHLBERG  |ADAPTATION HOLES              |
|NICK      |WAHLBERG  |APACHE DIVINE                 |
|NICK      |WAHLBERG  |BABY HALL                     |
|...       |...       |...                           |

But what are these JOIN things? They are not SELECT clauses! They are operators like + or * or AND or OR.

And like any operator, the messy laws of associativity apply (as all the different join types have the same operator precedence, at least in the SQL standard. Some wonky implementations may disagree).

Luckily, left associativity applies with JOIN (right associativity would be madness). So, our magic parentheses come into play again. Our query from before is really this:

SELECT a.first_name, a.last_name, f.title
FROM (
  actor AS a
    JOIN film_actor AS fa 
      USING (actor_id)
)
JOIN film AS f 
  USING (film_id)

Can you see it? The readability problem here is the USING (or ON) clause. Without it, I could have written:

(actor JOIN film_actor) JOIN film

So, join FILM_ACTOR to ACTOR and then join the FILM table to the product.

Because INNER JOIN is associative (both left and right), we could write the following equivalent expression:

actor JOIN (film_actor JOIN film)

So, join FILM to FILM_ACTOR first, and then join the product to ACTOR. Quite a different beast, logically (the optimiser shouldn’t make a difference in this case).

… or in full SQL

SELECT a.first_name, a.last_name, f.title
FROM actor AS a 
  JOIN (
    film_actor
      JOIN film AS f 
        USING (film_id)
  )
USING (actor_id)

Egh. Don’t be clever. Use this only if you really have to. I find it quite unreadable. Careful formatting is essential here. But it works.

With more than 3 tables, you could build entire JOIN trees. While, in order to stay sane, most people will not actually build JOIN trees (yes, a list is also a tree), it is totally possible to do this confusing thing:

SELECT a.first_name, a.last_name, sum(p.amount)
FROM (
  actor AS a
    JOIN film_actor AS fa 
      USING (actor_id)
)
JOIN (
  film AS f
    JOIN (
      inventory AS i 
        JOIN rental AS r
          USING (inventory_id)
    ) USING (film_id)
) USING (film_id)
JOIN payment AS p 
  USING (rental_id)
GROUP BY a.actor_id

Instead of this nice clean SQL:

SELECT a.first_name, a.last_name, sum(p.amount)
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
JOIN inventory AS i USING (film_id)
JOIN rental AS r USING (inventory_id)
JOIN payment AS p USING (rental_id)
GROUP BY a.actor_id

… to get the revenue of all films we made for any given actor. Something like this:

|first_name|last_name |sum         |
|----------|----------|------------|
|ADAM      |GRANT     |974.19      |
|ADAM      |HOPPER    |1532.21     |
|AL        |GARLAND   |1525.87     |
|ALAN      |DREYFUSS  |1850.29     |
|ALBERT    |JOHANSSON |2202.78     |
|ALBERT    |NOLTE     |2183.75     |

But nevertheless. It is good to know how JOIN works, syntactically, because there will be that edge case where you want to give precedence to an INNER JOIN and then LEFT JOIN the entire product, or something like that. In that case, parentheses are your friend.

3. DISTINCT

There are no parentheses around the “DISTINCT arguments”. Despite a lot of people thinking there are. This is not what some people think it is:

SELECT DISTINCT (actor_id), first_name, last_name
FROM actor

The parentheses can be omitted, and it is the exact same thing.

The other nice thing about SQL syntax is that you never know whether parentheses are part of the language, or part of an expression. In fact, the above (id) is a row value expression again. If you (accidentally) wrapped more than one column in those parentheses:

SELECT DISTINCT (actor_id, first_name), last_name
FROM actor

Then you’d still get the same behaviour (in PostgreSQL, which supports nesting records like that), but the result is not what you wanted:

|row              |last_name|
|-----------------|---------|
|(1,PENELOPE)     |GUINESS  |
|(2,NICK)         |WAHLBERG |
|(3,ED)           |CHASE    |
|(4,JENNIFER)     |DAVIS    |

Funky eh?

4. UNION, INTERSECT, EXCEPT

The nice thing about SQL is that you can never derive a rule from another rule. While all types of JOIN have the same precedence, the set operators do not. Duh. Luckily, most people don’t use INTERSECT or EXCEPT at all, only UNION [ ALL ], so they don’t run into this problem.

In case there is any doubt, INTERSECT has a higher precedence, whereas UNION and EXCEPT have the same precedence (in the SQL standard). I’ll prove it by quoting it. From ISO/IEC 9075-2:2016(E) 7.17 <query expression> (I’ve removed some irrelevant bits)

<query expression body> ::=
  <query term>
| <query expression body> UNION [ ALL ] <query term>
| <query expression body> EXCEPT [ ALL ] <query term>

<query term> ::=
  <query primary>
| <query term> INTERSECT [ ALL ] <query primary>

See. INTERSECT has a higher precedence. Let’s try it (in PostgreSQL):

SELECT 2 AS a, 3 AS b
UNION
SELECT 1 AS a, 2 AS b
INTERSECT
SELECT 1 AS a, 2 AS b

We get:

|a          |b          |
|-----------|-----------|
|1          |2          |
|2          |3          |

So, what we’ve done effectively (and the parenthesis are totally allowed, because <query term> allows it), is this

SELECT 2 AS a, 3 AS b
UNION
(
  SELECT 1 AS a, 2 AS b
  INTERSECT
  SELECT 1 AS a, 2 AS b
)

Now, for the life of me, I really don’t know how to nicely format parenthesised set operations in SQL. It looks awkward. But here we are. The above is a totally different thing than this:

(
  SELECT 2 AS a, 3 AS b
  UNION
  SELECT 1 AS a, 2 AS b
)
INTERSECT
SELECT 1 AS a, 2 AS b

If we do the union first, and then the intersection, the only resulting record is:

|a          |b          |
|-----------|-----------|
|1          |2          |

A similar effect can be seen when combining UNION and UNION ALL, which have the same precedence and are … left associative. Lucky again. So, this query here:

SELECT 2 AS a, 3 AS b
UNION
SELECT 1 AS a, 2 AS b
UNION ALL
SELECT 1 AS a, 2 AS b

Has a UNION that does not get to remove any duplicates, and then a UNION ALL that adds a duplicate. It is equivalent to this (remember, left associative):

(
  SELECT 2 AS a, 3 AS b
  UNION
  SELECT 1 AS a, 2 AS b
)
UNION ALL
SELECT 1 AS a, 2 AS b

The result is:

|a          |b          |
|-----------|-----------|
|1          |2          |
|2          |3          |
|1          |2          |

It is quite a different beast from this one:

SELECT 2 AS a, 3 AS b
UNION
(
  SELECT 1 AS a, 2 AS b
  UNION ALL
  SELECT 1 AS a, 2 AS b
)

Now, we’ve used parentheses to force the query to do the UNION ALL operator first (producing duplicates), but then, the UNION removes them agian. The result is now:

|a          |b          |
|-----------|-----------|
|1          |2          |
|2          |3          |

Bonus: What about the ORDER BY clause

Ah, yes. The miracles of SQL syntax. In theory (and in the SQL standard), there is an optional ORDER BY clause over all set operations. E.g. like this:

SELECT 2 AS a, 3 AS b
UNION
SELECT 1 AS a, 2 AS b
INTERSECT
SELECT 1 AS a, 2 AS b
ORDER BY a DESC

We’re getting:

|a          |b          |
|-----------|-----------|
|2          |3          |
|1          |2          |

Think of it this way:

(
  SELECT 2 AS a, 3 AS b
  UNION
  SELECT 1 AS a, 2 AS b
  INTERSECT
  SELECT 1 AS a, 2 AS b
)
ORDER BY a DESC

Some dialects may or may not allow you to actually place those parentheses. E.g. PostgreSQL does.

5. Subqueries

When you write subqueries, or specifically:

  • Scalar subqueries (subqueries in SELECT or WHERE, etc.)
  • Derived tables (subqueries in FROM)

… then you must put them in parentheses. For example, a correlated (scalar) subquery:

SELECT 
  first_name, 
  last_name, (
    -- Correlated subquery here
    SELECT count(*) 
    FROM film_actor AS fa 
    WHERE fa.actor_id = a.actor_id
  ) c
FROM actor AS a

These parentheses are not optional. Neither are those around the derived table:

SELECT 
  first_name, 
  last_name, 
  c
FROM actor AS a
JOIN (
  -- Derived table here
  SELECT actor_id, count(*) AS c
  FROM film_actor
  GROUP BY actor_id
) fa USING (actor_id)

Conclusion

SQL (and its syntax) is a device whose mystery is only exceeded by its power. With parentheses (as with other syntactic tokens), you can never really know if they are part of the language, of some clause, or part of an expression.

With some operators, parentheses help better understand the syntactic structure of the language.

  • Row value expressions of degree 1 have optional parentheses. We never write them, but we could have!
  • JOINs are really trees, not lists. We can nest them arbitrarily to get associativity under control.
  • DISTINCT doesn’t have parentheses. Don’t be fooled by your coworkers’ coding style!
  • Set operations have two levels of precedence: 1) INTERSECT and 2) UNION and EXCEPT (assuming standards compliance).
  • Subqueries (outside of set operations) have mandatory parentheses

I’m curious about your own greatest SQL syntax confusions. Please leave them in the comments.

SQL DISTINCT is not a function

A very common misconception I often encounter with SQL users is the idea that DISTINCT is something like a function, and that it can take parenthesised arguments. Just recently, I’ve seen this Stack Overflow question where the OP was looking for a way to express this in jOOQ:

SELECT DISTINCT (emp.id), emp.fname, emp.name
FROM employee emp;

Notice the parentheses around (emp.id), which look as though this is some special kind of DISTINCT usage, which is akin to a DISTINCT function. The idea is often that:

  • The behaviour is somewhat different from omitting the parentheses
  • The performance is faster, because only the ID needs to be considered for distinctness

This is incorrect

These claims are incorrect, of course. There is no semantic or performance difference between the two. The parentheses are merely parentheses around a column expression, in a similar way as you would use parentheses to influence operator precedence. Think of it this way:

SELECT DISTINCT (emp.id + 1) * 2, emp.fname, emp.name
FROM employee emp;

In the above example, we do not apply a “DISTINCT function” to the expression emp.id + 1. We merely placed parentheses around a column expression emp.id + 1 to make sure the addition happens before the multiplication. The DISTINCT operator happens after the projection, always. If SQL had used a more logical syntax, rather than following English grammar (it was originally called Structured English QUEry Language, or SEQUEL), then we would write the OP’s statement like this:

FROM employee
SELECT id, fname, name
DISTINCT

Again, the DISTINCT operation always happens after the projection (SELECT clause content), and is applied to the entirety of the projection. There is no way in standard SQL to apply distinctness only to parts of the projection (there is in PostgreSQL, see further down).

To clarify this a bit more, I recommend reading our previous blog post about the logical order of operations in SQL, and how DISTINCT and ORDER BY are related.

What would it mean anyway?

We can revert the question back to the OP and ask ourselves, what would it mean for a DISTINCT operation to apply to only one column, anyway? Let’s assume this data set:

|id |fname|name|
|---|-----|----|
|1  |A    |A   |
|1  |B    |B   |

If we applied DISTINCT only to the ID column (and didn’t project anything else), clearly, we’d only get one row as a result:

SELECT DISTINCT id FROM employee
|id |
|---|
|1  |

But if we wanted to also project FNAME and NAME, which row would “win”? Would we display the first or the second row, or any random row? The behaviour would be undefined, and SQL doesn’t like undefined behaviour, so this is not possible. The only reasonable application of DISTINCT is always on the entire projection.

Exception: PostgreSQL

Fortunately (or to add more to syntactic confusion: unfortunately), PostgreSQL has implemented an extension to the SQL standard. With DISTINCT ON, it is effectively possible to apply distinctness only to parts of the projection:

WITH emp (id, fname, name) AS (
  VALUES (1, 'A', 'A'),
         (1, 'B', 'B')
)
SELECT DISTINCT ON (id) id, fname, name
FROM emp
ORDER BY id, fname, name

The output is now what the OP desired (but couldn’t use, because they were using MySQL):

|id         |fname|name |
|-----------|-----|-----|
|1          |A    |A    |

I personally don’t like DISTINCT ON. While it is very useful, no doubt, it makes something that is already very difficult to explain to SQL beginners even more complicated. With a “more reasonable” syntax, the query would be written like this:

FROM emp
SELECT id, fname, name
ORDER BY id, fname, name
DISTINCT ON (id) 

With this syntactic order of operation, there would be no doubt about the semantics of DISTINCT or DISTINCT ON.

Stop Mapping Stuff in Your Middleware. Use SQL’s XML or JSON Operators Instead

It’s been a while since I’ve ranted on this blog, but I was recently challenged by a reddit thread to write about this topic, so here goes…

So, you’re writing a service that produces some JSON from your database model. What do you need? Let’s see:

  • Read a book on DDD
  • Read another book on DDD
  • Write some entities, DTOs, factories, and factory builders
  • Discuss whether your entities, DTOs, factories, and factory builders should be immutable, and use Lombok, Autovalue, or Immutables to ease the pain of construction of said objects
  • Discuss whether you want to use standard JPA, or Hibernate specific features for your mapping
  • Plug in Jackson, the XML and JSON mapper library, because you’ve read a nice blog post about it
  • Debug 1-2 problems arising from combining Jackson, JAXB, Lombok, and JPA annotations. Minor thing
  • Debug 1-2 N+1 cases

STOP IT

No, seriously. Just stop it right there!

What you needed was this kind of JSON structure, exported form your favourite Sakila database:

[{
  "first_name": "PENELOPE",
  "last_name": "GUINESS",
  "categories": [{
    "name": "Animation",
    "films": [{
      "title": "ANACONDA CONFESSIONS"
    }]
   }, {
    "name": "Family",
    "films": [{
      "title": "KING EVOLUTION"
    }, {
      "title": "SPLASH GUMP"
    }]
  }]
}, {
   ...

In English: We need a list of actors, and the film categories they played in, and grouped in each category, the individual films they played in.

Let me show you how easy this is with SQL Server SQL (all other database dialects can do it these days, I just happen to have a SQL Server example ready:

-- 1) Produce actors
SELECT
  a.first_name,
  a.last_name, (

    -- 2) Nest categories in each actor
    SELECT
      c.name, (

        -- 3) Nest films in each category
        SELECT title
        FROM film AS f
        JOIN film_category AS fc ON f.film_id = fc.film_id
        JOIN film_actor AS fa ON fc.film_id = fa.film_id
        WHERE fc.category_id = c.category_id
        AND a.actor_id = fa.actor_id
        FOR JSON PATH -- 4) Turn into JSON
      ) AS films
    FROM category AS c
    JOIN film_category AS fc ON c.category_id = fc.category_id
    JOIN film_actor AS fa ON fc.film_id = fa.film_id
    WHERE fa.actor_id = a.actor_id
    GROUP BY c.category_id, c.name
    FOR JSON PATH -- 4) Turn into JSON
  ) AS categories
FROM
  actor AS a 
FOR JSON PATH, ROOT ('actors') -- 4) Turn into JSON

That’s it. That’s all there is to it. Only basic SQL-92, enhanced with some vendor-specific JSON export syntax. (There are also SQL standard JSON APIs as implemented in other RDBMS). Let’s discuss it quickly:

  1. The outer most query produces a set of actors. As you would have expected
  2. For each actor, a correlated subquery produces a nested JSON array of categories
  3. For each category, another correlated subquery finds all the films per actor and category
  4. Finally, turn all the result structures into JSON

That’s it.

Want to change the result structure? Super easy. Just modify the query accordingly. No need to modify:

  • Whatever you thought your DDD “root aggregate was”
  • Your gazillion entities, DTOs, factories, and factory builders
  • Your gazillion Lombok, Autovalue, or Immutables annotations
  • Your hacks and workarounds to get this stuff through your standard JPA, or Hibernate specific features for your mapping
  • Your gazilion Jackson, the XML and JSON mapper library annotations
  • Debugging another 1-2 problems arising from combining Jackson, JAXB, Lombok, and JPA annotations
  • Debugging another 1-2 N+1 cases

No! No need! It’s so simple. Just stream the JSON directly from the database to the client using whatever SQL API of your preference: JDBC, jOOQ, JdbcTemplate, MyBatis, or even JPA native query. Just don’t go mapping that stuff in the middleware if you’re not consuming it in the middleware. Let me repeat that for emphasis:

Don’t go mapping that stuff in the middleware if you’re not consuming it in the middleware.

Oh, want to switch to XML? Easy. In SQL Server, this amounts to almost nothing but replacing JSON by XML:

SELECT
  a.first_name,
  a.last_name, (
    SELECT
      c.name, (
	    SELECT title
	    FROM film AS f
	    JOIN film_category AS fc ON f.film_id = fc.film_id
	    JOIN film_actor AS fa ON fc.film_id = fa.film_id
	    WHERE fc.category_id = c.category_id
	    AND a.actor_id = fa.actor_id
	    FOR XML PATH ('film'), TYPE
      ) AS films
    FROM category AS c
    JOIN film_category AS fc ON c.category_id = fc.category_id
    JOIN film_actor AS fa ON fc.film_id = fa.film_id
    WHERE fa.actor_id = a.actor_id
    GROUP BY c.category_id, c.name
    FOR XML PATH ('category'), TYPE
  ) AS categories
FROM
  actor AS a 
FOR XML PATH ('actor'), ROOT ('actors')

And now, you’re getting:

<actors>
  <actor>
    <first_name>PENELOPE</first_name>
    <last_name>GUINESS</last_name>
    <categories>
      <category>
        <name>Animation</name>
        <films>
          <film>
            <title>ANACONDA CONFESSIONS</title>
          </film>
        </films>
      </category>
      <category>
        <name>Family</name>
        <films>
          <film>
            <title>KING EVOLUTION</title>
          </film>
          <film>
            <title>SPLASH GUMP</title>
          </film>
        </films>
      </category>
      ...

It’s so easy with SQL!

Want to support both without rewriting too much logic? Produce XML and use XSLT to automatically generate the JSON. Whatever.

FAQ, Q&A

But my favourite Java SQL API can’t handle it

So what. Write a view and query that instead. By the way, jOOQ 3.14 supports all of it.

But this doesn’t fit our architecture

Then fix the architecture

But SQL is bad

No, it’s great. It’s based on relational algebra and augments it in many many useful ways. It’s a declarative 4GL, the optimiser produces way better execution plans than you could ever imagine (see my talk), and it’s way more fun than your gazillion 3GL mapping libraries.

But SQL is evil because of Oracle

Then use PostgreSQL. It can do JSON.

But what about testing

Just spin up a test database with https://www.testcontainers.org, install your schema with some migration framework like Flyway or Liquibase in it, fill in some sample data, and write your simple integration tests.

But mocking is better

It is not. The more you mock away the database, the more you’re writing your own database.

But I’m paid by the lines of code

Well, good riddance, then.

But what if we have to change the RDBMS

So what? Your management paid tens of millions for the new licensing. They can pay you tens of hundreds to spend 20 minutes rewriting your 5-10 SQL queries. You already wrote the integration tests above.

Anyway. It won’t happen. And if it will, then those few JSON queries will not be your biggest problem.

What was that talk of yours again?

Here, highly recommended:

But we’ve already spent so many person years implementing our middleware

It has a name

But I’ve read this other blog post…

And now you’ve read mine.

But that’s like 90s style 2 tier architecture

So what? You’ve spent 5% the time to implement it. That’s 95% more time adding value to your customers, rather than bikeshedding mapping technology. I call that a feature.

What about ingestion? We need abstraction over ingestion

No, you don’t. You can send the JSON directly into your database, and transform / normalise it from there, using the same technique. You don’t need middleware abstraction and mapping, you just want middleware abstraction and mapping.