Having “constant” columns in foreign keys

I was asked a very interesting question on Twitter just now:

Can we have “constant” foreign key columns in (PostgreSQL) tables? Luckily, yes, we can, using a nice standard feature called “computed columns” or “generated columns”.

Sometimes, you cannot completely normalise your schema for whatever reason. There may be a case where you have a table with a composite primary key like this:

CREATE TABLE t1 (
  a int,
  b int,
  t1 int,
  PRIMARY KEY (a, b)
)

And in a referencing table t2, you will always reference one of the primary key columns by value, say, 1. Of course, you could create a table t2 with a CHECK constraint ensuring b = 1:

CREATE TABLE t2 (
  a int,
  b int NOT NULL DEFAULT 1 CHECK (b = 1),
  t2 int,
  FOREIGN KEY (a, b) REFERENCES t1
)

But why not use a generated column instead?

CREATE TABLE t2 (
  a int,
  b int GENERATED ALWAYS AS (1) STORED,
  t2 int,
  FOREIGN KEY (a, b) REFERENCES t1
)

In my opinion, this is even more powerful. As of PostgreSQL 12, only STORED is supported (meaning the value is stored on disk), whereas in this case, VIRTUAL would be even better (meaning the value is produced only when reading the row).

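A nice side effect: the generated column cannot be overridden at all. In PostgreSQL, an explicit value for b is rejected outright (the exact error wording may vary between versions):

INSERT INTO t2 (a, b, t2)
VALUES (1, 2, 13);

-- ERROR:  cannot insert into column "b"
-- DETAIL: Column "b" is a generated column.
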
Inserting some test data:

INSERT INTO t1 (a, b, t1) 
VALUES (1, 1, 1), (1, 2, 2), (2, 1, 3);

INSERT INTO t2 (a, t2) 
VALUES (1, 11), (2, 12);

SELECT * 
FROM t1
NATURAL LEFT JOIN t2

This produces the expected result, as we can only ever have b = 1 in t2:

a|b|t1|t2|
-|-|--|--|
1|1| 1|11|
2|1| 3|12|
1|2| 2|  |

A nice trick to keep up one’s sleeve.

Computed or generated columns are available in a variety of RDBMS, including at least:

  • Db2
  • MySQL
  • Oracle
  • PostgreSQL
  • SQL Server

Using SQL Server FOR XML and FOR JSON Syntax on Other RDBMS With jOOQ

SQL Server supports transforming flat tabular SQL result sets into hierarchical structures by convention using the convenient FOR XML or FOR JSON syntaxes. This is really convenient and less verbose than the standard SQL/XML or SQL/JSON APIs – although the standard ones are more powerful.

In this blog post, I’d like to show a few core features of the SQL Server syntax, and what they correspond to in standard SQL. jOOQ 3.14 will support both SQL Server’s syntax and the standard syntax, and will be able to translate from one to the other, such that you can use SQL Server syntax also on Db2, MariaDB, MySQL, Oracle, PostgreSQL. You can play around with the current state of development on our website here.

As always, using the Sakila database, here’s a simple example as a teaser:

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
FOR XML RAW;

-- Db2, Oracle, PostgreSQL
SELECT xmlagg(xmlelement(
  NAME row,
  xmlattributes(
    t.first_name AS first_name,
    t.last_name AS last_name,
    t.title AS title
  )
))
FROM (
  -- Original query here
  SELECT a.first_name, a.last_name, f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
) AS t

Producing in both cases something like:

<row first_name="PENELOPE" last_name="GUINESS" title="OKLAHOMA JUMANJI"/>
<row first_name="PENELOPE" last_name="GUINESS" title="RULES HUMAN"/>
<row first_name="PENELOPE" last_name="GUINESS" title="SPLASH GUMP"/>
<row first_name="PENELOPE" last_name="GUINESS" title="VERTIGO NORTHWEST"/>
<row first_name="PENELOPE" last_name="GUINESS" title="WESTWARD SEABISCUIT"/>
<row first_name="PENELOPE" last_name="GUINESS" title="WIZARD COLDBLOODED"/>
<row first_name="NICK" last_name="WAHLBERG" title="ADAPTATION HOLES"/>
<row first_name="NICK" last_name="WAHLBERG" title="APACHE DIVINE"/>

FOR XML and FOR JSON concepts

As can be seen in the above teaser, the SQL Server syntax is far less verbose and more concise, and it seems to produce a reasonable default behaviour, whereas the Db2, Oracle, PostgreSQL (and SQL standard) SQL/XML APIs are more verbose, but also more powerful. For example, it is possible to map column a to attribute x and column b to a nested XML element y very easily.

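For example, a quick sketch of exactly that mapping in standard SQL/XML (assuming a table t with columns a and b):

SELECT xmlelement(
  NAME row,
  xmlattributes(t.a AS x),
  xmlelement(NAME y, t.b)
)
FROM t
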
The advantages of both approaches are clear. SQL Server’s approach is much more usable in the general case. But what is the general case? Let’s summarise a few key parallels between SQL result sets, and XML/JSON data structures:


Tables are XML elements or JSON arrays

Tables (i.e. sets of data) are not a foreign concept to both XML and JSON documents. The most natural way to represent a set of data in XML is a set of elements using the same element name, optionally wrapped by a wrapper element. For example:

<!-- With wrapper element -->
<films>
  <film title="OKLAHOMA JUMANJI"/>
  <film title="RULES HUMAN"/>
  <film title="SPLASH GUMP"/>
</films>

<!-- Without wrapper element -->
<film title="OKLAHOMA JUMANJI"/>
<film title="RULES HUMAN"/>
<film title="SPLASH GUMP"/>

The distinction of whether a wrapper element is added is mostly significant when nesting data.

With JSON, the obvious choice of data structure to represent a table is an array. For example:

[
  {"title": "OKLAHOMA JUMANJI"},
  {"title": "RULES HUMAN"},
  {"title": "SPLASH GUMP"}
]


Rows are XML elements or JSON objects

As we’ve already seen above, a SQL row is represented in XML using an element.

<film title="OKLAHOMA JUMANJI"/>

The question is only what the element name should be. It can usually be any of:

  • A standard name, such as “row”
  • The name of the table the row stems from
  • A custom name

In JSON, it is an object.

{"title": "OKLAHOMA JUMANJI"}

Unlike in XML, there is no such thing as an element name, so the row is “anonymous”. The row type is defined by what table / array the JSON object is contained in.


Column values are XML elements or attributes, or JSON attributes

We have a few more choices for how to represent SQL column values in XML. Mainly two:

  • Represent values as attributes
  • Represent values as elements

Scalar values can easily be represented as attributes. If a value needs further nesting (e.g. an array, user defined type, etc.), then elements are a better choice. In most cases, the choice is not relevant, so we can pick either:

<!-- Using attributes -->
<film film_id="635" title="OKLAHOMA JUMANJI"/>

<!-- Using elements from table and column names -->
<film>
  <film_id>635</film_id>
  <title>OKLAHOMA JUMANJI</title>
</film>

<!-- Using standard element names -->
<row>
  <value name="film_id" value="635"/>
  <value name="title" value="OKLAHOMA JUMANJI"/>
</row>

There are a few other reasonable default options for the representation of a column value in XML.

In JSON, on the other hand, there are two main reasonable approaches. In most cases, an object will be chosen, where column values are identified by column name. But just like SQL records are a mixture between “structs” and “tuples”, we could imagine a representation that maps column values to array indexes as well:

// Using objects
{"film_id": 635, "title": "OKLAHOMA JUMANJI"}

// Using arrays
[635, "OKLAHOMA JUMANJI"]


GROUP BY and ORDER BY can be seen as a way to nest data

So far, all data was represented in a flat way, just like the SQL table. There was some nesting when wrapping XML elements or JSON arrays in some wrapper element or object, or when representing XML data with more elements rather than attributes, but the data was still always tabular.

Very often, we want to consume data in a hierarchical form, though. An actor played in films, so we’d like to group the films by actor, rather than repeating the actor information for every film. In general, operations like GROUP BY or ORDER BY serve this purpose. GROUP BY allows for aggregating all data into nested data structures per group (e.g. into strings, arrays, XML elements, JSON arrays, JSON objects). ORDER BY does the same, “visually” – perhaps a bit less formally. When we look at this set of XML elements, we can see visually that they’re “grouped” (i.e. ordered) by actor:

<row first_name="PENELOPE" last_name="GUINESS" title="OKLAHOMA JUMANJI"/>
<row first_name="PENELOPE" last_name="GUINESS" title="RULES HUMAN"/>
<row first_name="PENELOPE" last_name="GUINESS" title="SPLASH GUMP"/>
<row first_name="PENELOPE" last_name="GUINESS" title="VERTIGO NORTHWEST"/>
<row first_name="PENELOPE" last_name="GUINESS" title="WESTWARD SEABISCUIT"/>
<row first_name="PENELOPE" last_name="GUINESS" title="WIZARD COLDBLOODED"/>
<row first_name="NICK" last_name="WAHLBERG" title="ADAPTATION HOLES"/>
<row first_name="NICK" last_name="WAHLBERG" title="APACHE DIVINE"/>

SQL Server supports such grouping in at least two ways:

  • Implicitly by convention, using ORDER BY
  • Explicitly, by creating correlated subqueries (a sketch follows below)

The implicit approach could transform the above flat representation into something like this:

<a first_name="PENELOPE" last_name="GUINESS">
    <f title="OKLAHOMA JUMANJI"/>
    <f title="RULES HUMAN"/>
    <f title="SPLASH GUMP"/>
    <f title="VERTIGO NORTHWEST"/>
    <f title="WESTWARD SEABISCUIT"/>
    <f title="WIZARD COLDBLOODED"/>
</a>
<a first_name="NICK" last_name="WAHLBERG">
    <f title="ADAPTATION HOLES"/>
    <f title="APACHE DIVINE"/>
</a>

… where “a” and “f” are the table aliases in the query (actor a and film f).
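
The explicit approach nests a correlated subquery with its own FOR XML clause. A sketch of what this can look like in SQL Server (the element and attribute names are illustrative):

-- SQL Server
SELECT 
  a.first_name AS [@first_name],
  a.last_name  AS [@last_name],
  (
    SELECT f.title AS [@title]
    FROM film_actor fa
    JOIN film f ON fa.film_id = f.film_id
    WHERE fa.actor_id = a.actor_id
    FOR XML PATH('f'), TYPE
  )
FROM actor a
FOR XML PATH('a');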

How do FOR XML and FOR JSON work in detail?

There are several features that can be combined in SQL Server. The complete picture can be seen in the docs. We'll omit a few features in this blog post.

  • The transformation algorithm RAW (flat results, only in XML), AUTO (hierarchical, automatic results), PATH (hierarchical, explicit results)
  • The “root” name, which corresponds to an XML wrapper element, or a JSON wrapper object
  • XML only: Whether values should be placed in ELEMENTS or attributes
  • JSON only: INCLUDE_NULL_VALUES specifies whether NULL values are explicit, or implicit (absent from the JSON object).
  • JSON only: WITHOUT_ARRAY_WRAPPER specifies whether the set of JSON objects should be listed as a JSON array, or a comma separated list of objects (which could be combined with other queries)

This list is not complete; there are more flags and features. But instead of discussing them in theory, let's look at a few examples:

FOR XML RAW

Producing flat results with attributes for values

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML RAW;

-- Standard SQL
SELECT xmlagg(xmlelement(
  NAME row,
  xmlattributes(
    t.first_name AS first_name,
    t.last_name AS last_name,
    t.title AS title
  )
))
FROM (
  SELECT a.first_name, a.last_name, f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) AS t

This produces

<row first_name="NICK" last_name="WAHLBERG" title="SMILE EARRING"/>
<row first_name="NICK" last_name="WAHLBERG" title="WARDROBE PHANTOM"/>
<row first_name="PENELOPE" last_name="GUINESS" title="ACADEMY DINOSAUR"/>
<row first_name="PENELOPE" last_name="GUINESS" title="ANACONDA CONFESSIONS"/>

FOR XML RAW, ROOT

Producing flat results with attributes for values, and a root element to wrap the listed elements

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML RAW, ROOT('rows');

-- Standard SQL
SELECT xmlelement(
  NAME rows,
  xmlagg(xmlelement(
    NAME row,
    xmlattributes(
      t.first_name AS first_name,
      t.last_name AS last_name,
      t.title AS title
    )
  ))
)
FROM (
  SELECT a.first_name, a.last_name, f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) AS t

This produces

<rows>
  <row first_name="NICK" last_name="WAHLBERG" title="SMILE EARRING"/>
  <row first_name="NICK" last_name="WAHLBERG" title="WARDROBE PHANTOM"/>
  <row first_name="PENELOPE" last_name="GUINESS" title="ACADEMY DINOSAUR"/>
  <row first_name="PENELOPE" last_name="GUINESS" title="ANACONDA CONFESSIONS"/>
</rows>

FOR XML RAW, ELEMENTS

Producing flat results with elements for values.

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML RAW, ELEMENTS;

-- Standard SQL
SELECT xmlagg(xmlelement(
  NAME row,
  xmlelement(
    NAME first_name,
    first_name
  ),
  xmlelement(
    NAME last_name,
    last_name
  ),
  xmlelement(
    NAME title,
    title
  )
))
FROM (
  SELECT a.first_name, a.last_name, f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) AS t

This produces

<row>
    <first_name>NICK</first_name>
    <last_name>WAHLBERG</last_name>
    <title>SMILE EARRING</title>
</row>
<row>
    <first_name>NICK</first_name>
    <last_name>WAHLBERG</last_name>
    <title>WARDROBE PHANTOM</title>
</row>
<row>
    <first_name>PENELOPE</first_name>
    <last_name>GUINESS</last_name>
    <title>ACADEMY DINOSAUR</title>
</row>
<row>
    <first_name>PENELOPE</first_name>
    <last_name>GUINESS</last_name>
    <title>ANACONDA CONFESSIONS</title>
</row>

This could also be combined with ROOT, which we’re omitting for brevity.

FOR XML/JSON AUTO

This approach derives results completely automatically from your query structure. Mainly:

  • The SELECT clause defines in what order XML or JSON data is nested.
  • The FROM clause defines the table names (via aliasing), which are translated to XML element or JSON object attribute names.
  • The ORDER BY clause produces the “grouping”, which is translated to nesting XML elements or JSON objects.

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML AUTO;

-- Standard SQL
SELECT xmlagg(e)
FROM (
  SELECT xmlelement(
    NAME a,
    xmlattributes(
      t.first_name AS first_name,
      t.last_name AS last_name
    ),
    xmlagg(xmlelement(
      NAME f,
      xmlattributes(t.title AS title)
    ))
  ) AS e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) AS t
  GROUP BY
    first_name,
    last_name
) AS t

Notice how this emulation requires two steps of XMLAGG with GROUP BY. It gets more hairy with more tables being joined and projected! I won’t add more complex examples here, but try it online!

This produces

<a first_name="NICK" last_name="WAHLBERG">
    <f title="SMILE EARRING"/>
    <f title="WARDROBE PHANTOM"/>
</a>
<a first_name="PENELOPE" last_name="GUINESS">
    <f title="ACADEMY DINOSAUR"/>
    <f title="ANACONDA CONFESSIONS"/>
</a>

Let’s try the same thing again with JSON:

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR JSON AUTO;

-- Standard SQL
SELECT json_arrayagg(e)
FROM (
  SELECT JSON_OBJECT(
    KEY 'first_name' VALUE first_name,
    KEY 'last_name' VALUE last_name,
    KEY 'f' VALUE JSON_ARRAYAGG(JSON_OBJECT(
      KEY 'title' VALUE title
      ABSENT ON NULL
    ))
    ABSENT ON NULL
  ) e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) t
  GROUP BY
    first_name,
    last_name
) t

The result being:

[
    {
        "first_name": "NICK",
        "last_name": "WAHLBERG",
        "f": [
            {
                "title": "SMILE EARRING"
            },
            {
                "title": "WARDROBE PHANTOM"
            }
        ]
    },
    {
        "first_name": "PENELOPE",
        "last_name": "GUINESS",
        "f": [
            {
                "title": "ACADEMY DINOSAUR"
            },
            {
                "title": "ANACONDA CONFESSIONS"
            }
        ]
    }
]

FOR XML/JSON AUTO, ROOT

Like before, we could wrap this in a root XML element or a root JSON object if need be.

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML AUTO, ROOT;

-- Standard SQL
SELECT xmlelement(
  NAME root,
  xmlagg(e)
)
FROM (
  SELECT xmlelement(
    NAME a,
    xmlattributes(
      t.first_name AS first_name,
      t.last_name AS last_name
    ),
    xmlagg(xmlelement(
      NAME f,
      xmlattributes(t.title AS title)
    ))
  ) e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) t
  GROUP BY
    first_name,
    last_name
) t

This does the same thing as before, but just wraps the previous root XMLAGG() element in another XMLELEMENT() function call.

This produces

<root>
    <a first_name="NICK" last_name="WAHLBERG">
        <f title="SMILE EARRING"/>
        <f title="WARDROBE PHANTOM"/>
    </a>
    <a first_name="PENELOPE" last_name="GUINESS">
        <f title="ACADEMY DINOSAUR"/>
        <f title="ANACONDA CONFESSIONS"/>
    </a>
</root>

Let’s try the same thing again with JSON:

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR JSON AUTO, ROOT;

-- Standard SQL
SELECT JSON_OBJECT(KEY 'a' VALUE json_arrayagg(e))
FROM (
  SELECT JSON_OBJECT(
    KEY 'first_name' VALUE first_name,
    KEY 'last_name' VALUE last_name,
    KEY 'f' VALUE JSON_ARRAYAGG(JSON_OBJECT(
      KEY 'title' VALUE title
      ABSENT ON NULL
    ))
    ABSENT ON NULL
  ) e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) t
  GROUP BY
    first_name,
    last_name
) t

The result being:

{
    "a": [
        {
            "first_name": "NICK",
            "last_name": "WAHLBERG",
            "f": [
                {
                    "title": "SMILE EARRING"
                },
                {
                    "title": "WARDROBE PHANTOM"
                }
            ]
        },
        {
            "first_name": "PENELOPE",
            "last_name": "GUINESS",
            "f": [
                {
                    "title": "ACADEMY DINOSAUR"
                },
                {
                    "title": "ANACONDA CONFESSIONS"
                }
            ]
        }
    ]
}

FOR XML AUTO, ELEMENTS

Like before, instead of producing attributes, we might decide to produce elements instead (in XML only):

-- SQL Server
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML AUTO, ELEMENTS;

-- Standard SQL
SELECT xmlagg(e)
FROM (
  SELECT xmlelement(
    NAME a,
    xmlelement(
      NAME first_name,
      first_name
    ),
    xmlelement(
      NAME last_name,
      last_name
    ),
    xmlagg(xmlelement(
      NAME f,
      xmlelement(
        NAME title,
        title
      )
    ))
  ) e
  FROM (
    SELECT a.first_name, a.last_name, f.title
    FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    JOIN film f ON fa.film_id = f.film_id
    ORDER BY 1, 2, 3
  ) t
  GROUP BY
    first_name,
    last_name
) t

Not much has changed, except for the fact that a set of XMLELEMENT() calls is made, rather than XMLATTRIBUTES() calls.

This produces

<a>
    <first_name>NICK</first_name>
    <last_name>WAHLBERG</last_name>
    <f>
        <title>SMILE EARRING</title>
    </f>
    <f>
        <title>WARDROBE PHANTOM</title>
    </f>
</a>
<a>
    <first_name>PENELOPE</first_name>
    <last_name>GUINESS</last_name>
    <f>
        <title>ACADEMY DINOSAUR</title>
    </f>
    <f>
        <title>ANACONDA CONFESSIONS</title>
    </f>
</a>

FOR XML/JSON PATH

The PATH strategy is my personal favourite. It is used to create nested XML or JSON path structures more explicitly, and also allows for additional nesting levels when grouping projections together. This is best shown by example. Notice how I'm now using aliases for my columns, and the alias looks like an XPath expression using '/' (slashes):

-- SQL Server
SELECT 
  a.first_name AS [author/first_name], 
  a.last_name AS [author/last_name], 
  f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR XML PATH;

-- Standard SQL
SELECT xmlagg(xmlelement(
  NAME row,
  xmlelement(
    NAME author,
    xmlelement(
      NAME first_name,
      "author/first_name"
    ),
    xmlelement(
      NAME last_name,
      "author/last_name"
    )
  ),
  xmlelement(
    NAME title,
    title
  )
))
FROM (
  SELECT 
    a.first_name AS "author/first_name", 
    a.last_name AS "author/last_name", 
    f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) t

Check out how, by convention, we're now getting an additional level of nesting for author-related columns under the row/author element:

<row>
    <author>
        <first_name>NICK</first_name>
        <last_name>WAHLBERG</last_name>
    </author>
    <title>SMILE EARRING</title>
</row>
<row>
    <author>
        <first_name>NICK</first_name>
        <last_name>WAHLBERG</last_name>
    </author>
    <title>WARDROBE PHANTOM</title>
</row>
<row>
    <author>
        <first_name>PENELOPE</first_name>
        <last_name>GUINESS</last_name>
    </author>
    <title>ACADEMY DINOSAUR</title>
</row>
<row>
    <author>
        <first_name>PENELOPE</first_name>
        <last_name>GUINESS</last_name>
    </author>
    <title>ANACONDA CONFESSIONS</title>
</row>

This is really neat! The SQL Server syntax is definitely much more convenient for this common use-case.

Let's try the same thing again with JSON. The only thing we change is that we now use a JSON-path-ish syntax with dots ('.') rather than slashes ('/'):

-- SQL Server
SELECT 
  a.first_name AS [author.first_name], 
  a.last_name AS [author.last_name], 
  f.title
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
ORDER BY 1, 2, 3
FOR JSON PATH;

-- Standard SQL
SELECT JSON_ARRAYAGG(JSON_OBJECT(
  KEY 'author' VALUE JSON_OBJECT(
    KEY 'first_name' VALUE "author.first_name",
    KEY 'last_name' VALUE "author.last_name"
  ),
  KEY 'title' VALUE title
  ABSENT ON NULL
))
FROM (
  SELECT 
    a.first_name AS "author.first_name", 
    a.last_name AS "author.last_name", 
    f.title
  FROM actor a
  JOIN film_actor fa ON a.actor_id = fa.actor_id
  JOIN film f ON fa.film_id = f.film_id
  ORDER BY 1, 2, 3
) t

The result being (again, with nested objects):

[
    {
        "author": {
            "first_name": "NICK",
            "last_name": "WAHLBERG"
        },
        "title": "SMILE EARRING"
    },
    {
        "author": {
            "first_name": "NICK",
            "last_name": "WAHLBERG"
        },
        "title": "WARDROBE PHANTOM"
    },
    {
        "author": {
            "first_name": "PENELOPE",
            "last_name": "GUINESS"
        },
        "title": "ACADEMY DINOSAUR"
    },
    {
        "author": {
            "first_name": "PENELOPE",
            "last_name": "GUINESS"
        },
        "title": "ANACONDA CONFESSIONS"
    }
]

For more sophisticated nesting, including nesting of collections, a correlated subquery is needed in SQL Server, also with a FOR XML or FOR JSON syntax.
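
A sketch of what such a correlated subquery can look like for JSON (the "films" key is illustrative):

-- SQL Server
SELECT 
  a.first_name,
  a.last_name,
  (
    SELECT f.title
    FROM film_actor fa
    JOIN film f ON fa.film_id = f.film_id
    WHERE fa.actor_id = a.actor_id
    FOR JSON PATH
  ) AS films
FROM actor a
FOR JSON PATH;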

Conclusion

XML and JSON are popular document formats outside and inside of the database. SQL Server has some of the most convenient syntax for most cases, while standard SQL supports more basic, and thus more powerful, constructs. In standard SQL, almost any kind of XML or JSON projection is possible, and with XMLTABLE() and JSON_TABLE(), the documents can be transformed back to SQL tables, as well. In many applications, using these XML or JSON features natively would lead to much less boilerplate code, as many applications do not need middleware between the database and some client, just to transform data between formats.

Most ORMs don’t expose this functionality for a variety of reasons, the main one being that the devil is in the details. While both XML and JSON are nicely standardised, the implementations differ greatly:

  • The SQL/XML standard is implemented mostly by Db2, Oracle, and PostgreSQL. Many dialects offer some XML capabilities, but none as impressive as the standard and the previous three. SQL Server has FOR XML, which is very powerful for standard XML serialisations, but may be a bit difficult to use for edge cases
  • The SQL/JSON standard was added late and is implemented again to a large extent by Db2 and Oracle, but increasingly also by MariaDB and MySQL. PostgreSQL (and by consequence, compatible dialects like CockroachDB) has its own proprietary functions and APIs, which are not compatible with the standard. And again, SQL Server has FOR JSON, which works well for standard serialisations, but a bit less well for edge cases

These technologies are poorly adopted in clients because of the many subtle differences. jOOQ has been leveling out these minor differences for many years without hiding the core functionality. SQL/XML and SQL/JSON are perfect use-cases for jOOQ 3.14 (due in Q2 2020), which now allows for using both the standard SQL/XML and SQL/JSON syntaxes as well as the SQL Server FOR XML and FOR JSON syntax in the jOOQ Professional and Enterprise Editions.

Before jOOQ 3.14 is out, you can already play with the current functionality on our website: https://www.jooq.org/translate

What’s Faster? COUNT(*) or COUNT(1)?

One of the biggest and undead myths in SQL is that COUNT(*) is faster than COUNT(1). Or was it that COUNT(1) is faster than COUNT(*)? Impossible to remember, because there’s really no reason at all why one should be faster than the other. But is the myth justified?

Let’s measure!

How does COUNT(…) work?

But first, let's look into some theory. The two ways to count things are not exactly the same. Why?

  • COUNT(*) counts all the tuples in a group
  • COUNT(<expr>) counts all the tuples in a group for which <expr> evaluates to something that IS NOT NULL

This distinction can be quite useful. Most of the time, we'll simply COUNT(*) for convenience, but there are (at least) two cases where we don't want that:

When outer joining

Imagine that in the Sakila database, we have some actors that did not play in any films. Making sure such an actor actually exists:

INSERT INTO actor (actor_id, first_name, last_name)
VALUES (201, 'SUSAN', 'DAVIS');

When inner joining, we might write the following (using PostgreSQL syntax):

SELECT actor_id, a.first_name, a.last_name, count(*) AS c
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
GROUP BY actor_id
ORDER BY c ASC, actor_id ASC;

And we won’t get the newly added SUSAN DAVIS, because of the nature of inner join:

actor_id|first_name |last_name   |c |
--------|-----------|------------|--|
     148|EMILY      |DEE         |14|
      35|JUDY       |DEAN        |15|
     199|JULIA      |FAWCETT     |15|
     186|JULIA      |ZELLWEGER   |16|
      31|SISSY      |SOBIESKI    |18|
      71|ADAM       |GRANT       |18|
       1|PENELOPE   |GUINESS     |19|
      30|SANDRA     |PECK        |19|

So, we might change our query to use LEFT JOIN instead:

SELECT actor_id, a.first_name, a.last_name, count(*) AS c
FROM actor AS a
LEFT JOIN film_actor AS fa USING (actor_id)
LEFT JOIN film AS f USING (film_id)
GROUP BY actor_id
ORDER BY c ASC, actor_id ASC;

There she is now, but oops, wrong count! She doesn’t have any films, which we have proven before with the INNER JOIN query. Yet we get 1:

actor_id|first_name |last_name   |c |
--------|-----------|------------|--|
     201|SUSAN      |DAVIS       | 1|
     148|EMILY      |DEE         |14|
      35|JUDY       |DEAN        |15|
     199|JULIA      |FAWCETT     |15|
     186|JULIA      |ZELLWEGER   |16|
      31|SISSY      |SOBIESKI    |18|
      71|ADAM       |GRANT       |18|
       1|PENELOPE   |GUINESS     |19|
      30|SANDRA     |PECK        |19|

Her COUNT(*) value is 1, because we do get 1 film tuple for her in the group, with all columns being NULL. The solution is to count the FILM_ID instead, which, being a primary key, cannot be NULL in the table; it can only be NULL in our result because of the LEFT JOIN:

SELECT actor_id, a.first_name, a.last_name, count(film_id) AS c
FROM actor AS a
LEFT JOIN film_actor AS fa USING (actor_id)
LEFT JOIN film AS f USING (film_id)
GROUP BY actor_id
ORDER BY c ASC, actor_id ASC;

Notice that we could count things other than the primary key, but with the primary key, we're quite certain we don't get any “accidental” nulls in our groups, which we did not want to exclude from the count value.

Now, we’re getting the correct result:

actor_id|first_name |last_name   |c |
--------|-----------|------------|--|
     201|SUSAN      |DAVIS       | 0|
     148|EMILY      |DEE         |14|
      35|JUDY       |DEAN        |15|
     199|JULIA      |FAWCETT     |15|
     186|JULIA      |ZELLWEGER   |16|
      31|SISSY      |SOBIESKI    |18|
      71|ADAM       |GRANT       |18|
       1|PENELOPE   |GUINESS     |19|
      30|SANDRA     |PECK        |19|

When counting subsets of a group

An even more powerful application of counting only non-null evaluations of an expression is counting only subsets of a group. We’ve already blogged about this technique in our previous post about aggregating several expressions in one single query.

For example, counting in a single query:

  • All actors
  • Actors with their first_name starting with A
  • Actors with their first_name ending with A
  • Actors with their first_name containing A

In SQL:

SELECT 
  count(*),
  count(CASE WHEN first_name LIKE 'A%' THEN 1 END),
  count(CASE WHEN first_name LIKE '%A' THEN 1 END),
  count(CASE WHEN first_name LIKE '%A%' THEN 1 END)
FROM actor;

This yields:

count|count|count|count|
-----|-----|-----|-----|
  201|   13|   30|  105|

This is very useful when pivoting data sets (see also Oracle/SQL Server PIVOT clause).

Notice that PostgreSQL supports the SQL standard FILTER clause for this, which is more convenient and more readable. The above query can be written like this, in PostgreSQL:

SELECT 
  count(*),
  count(*) FILTER (WHERE first_name LIKE 'A%'),
  count(*) FILTER (WHERE first_name LIKE '%A'),
  count(*) FILTER (WHERE first_name LIKE '%A%')
FROM actor;

Back to COUNT(*) vs COUNT(1)

Now that we know the theory behind these COUNT expressions, what's the difference between COUNT(*) and COUNT(1)? There is none, effectively. The 1 expression in COUNT(1) evaluates a constant expression for each row in the group, and it can be proven that this constant expression will never evaluate to NULL, so effectively, we're running COUNT(*), counting ALL the rows in the group again.

There should be no difference, and parsers / optimisers should be able to recognise this and not do the extra work of checking every expression evaluation for NULL-ness.

I recently saw a discussion on Twitter, though, where Vik Fearing looked up the PostgreSQL sources, showing that PostgreSQL does do the extra work instead of optimising this.

So, I was curious to see if it mattered. I ran a benchmark on the 4 most popular RDBMS, with these results:

  • MySQL: Doesn’t matter. Sometimes COUNT(1) was faster, sometimes COUNT(*) was faster, so all differences were only benchmark artifacts
  • Oracle: Doesn’t matter. Like MySQL
  • PostgreSQL: Does matter (!). COUNT(*) was consistently faster by around 10% on 1M rows, that’s much more than I had expected
  • SQL Server: Doesn’t matter. Like MySQL

The benchmark code can be found in a set of gists linked from the original post.

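For PostgreSQL, the benchmark followed roughly this shape (a sketch, not the exact gist code; it assumes the 1M row table t, and the other databases used their equivalent procedural dialects):

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 100;
  rec RECORD;
BEGIN
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    -- Statement 1: COUNT(*), repeated 100 times
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (SELECT count(*) FROM t) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', r, (clock_timestamp() - v_ts);
    v_ts := clock_timestamp();

    -- Statement 2: COUNT(1), repeated 100 times
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (SELECT count(1) FROM t) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', r, (clock_timestamp() - v_ts);
  END LOOP;
END$$;
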
The results are below. Each benchmark run repeated SELECT COUNT(*) FROM t or SELECT COUNT(1) FROM t 100 times on a 1M row table, and then the benchmark was repeated 5 times to mitigate any warmup penalties and be fair with respect to caching.

The times displayed are relative to the fastest run per database product. This removes any distraction that may be caused by interpreting actual execution times as we do not want to compare database products against each other.

The database versions I’ve used are:

  • MySQL 8.0.16 (in Docker)
  • Oracle 18c XE (in Docker)
  • PostgreSQL 11.3 (in Docker)
  • SQL Server 2017 Express (in Windows)

MySQL

No relevant difference, nor a clear winner:

RUN     STMT    RELATIVE_TIME
-----------------------------
0       1       1.0079
0       2       1.0212
1       1       1.0229
1       2       1.0256
2       1       1.0009
2       2       1.0031
3       1       1.0291
3       2       1.0256
4       1       1.0618
4       2       1.0000

Oracle

No relevant difference, nor a clear winner

Run 1, Statement 1 : 1.06874
Run 1, Statement 2 : 1.01982
Run 2, Statement 1 : 1.09175
Run 2, Statement 2 : 1.0301
Run 3, Statement 1 : 1.00308
Run 3, Statement 2 : 1.02499
Run 4, Statement 1 : 1.02503
Run 4, Statement 2 : 1
Run 5, Statement 1 : 1.01259
Run 5, Statement 2 : 1.05828

PostgreSQL

A significant, consistent difference of almost 10%:

RUN 1, Statement 1: 1.00134
RUN 1, Statement 2: 1.09538
RUN 2, Statement 1: 1.00190
RUN 2, Statement 2: 1.09115
RUN 3, Statement 1: 1.00000
RUN 3, Statement 2: 1.09858
RUN 4, Statement 1: 1.00266
RUN 4, Statement 2: 1.09260
RUN 5, Statement 1: 1.00454
RUN 5, Statement 2: 1.09694

Again, I’m surprised by the order of magnitude of this difference. I would have expected it to be less. Curious to hear about your own results in the comments, or further ideas why this is so significant in PostgreSQL.

SQL Server

No relevant difference, nor a clear winner

Run 1, Statement 1: 1.00442
Run 1, Statement 2: 1.00702
Run 2, Statement 1: 1.00468
Run 2, Statement 2: 1.00000
Run 3, Statement 1: 1.00208
Run 3, Statement 2: 1.00624
Run 4, Statement 1: 1.00780
Run 4, Statement 2: 1.00364
Run 5, Statement 1: 1.00468
Run 5, Statement 2: 1.00702

Conclusion

As it is now in 2019, given the database versions mentioned above, unfortunately, there is a significant difference between COUNT(*) and COUNT(1) in PostgreSQL. Luckily (and this is rare in SQL), all the other dialects don't care, and thus consistently using COUNT(*) rather than COUNT(1) is a slightly better choice for ALL measured database products in this article.

Do note that the benchmark only tried a very simple query! The results might be different when using joins, unions, or any other SQL constructs, or in other edge cases, e.g. when using COUNT() in HAVING or ORDER BY or with window functions, etc.

In any case, there shouldn’t be any difference, and I’m sure that a future PostgreSQL version will optimise the constant expression in the COUNT(<expr>) aggregate function directly in the parser to avoid the extra work.

For other interesting optimisations that do not depend on the cost model, see this article here.

Using DISTINCT ON in Non-PostgreSQL Databases

A nice little gem in PostgreSQL’s SQL syntax is the DISTINCT ON clause, which is as powerful as it is esoteric.

In a previous post, we’ve blogged about some caveats to think of when DISTINCT and ORDER BY are used together. The bigger picture can be seen in our article about the logical order of operations in SQL SELECT.

The PostgreSQL documentation explains it well:

SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY. […] For example:

SELECT DISTINCT ON (location) location, time, report
    FROM weather_reports
    ORDER BY location, time DESC;

retrieves the most recent weather report for each location. […]

Again, this is quite esoteric as the distinct-ness is now decided only based on the columns listed separately in parentheses. For all the other columns, only the first row according to ORDER BY is projected. The SQL language is probably one of the only ones where the syntactic order of operations has absolutely nothing to do with the logical order of operations, and this DISTINCT ON syntax isn’t helping. In a more straightforward language design, the above statement could read, instead:

FROM weather_reports
WINDOW w AS (PARTITION BY location ORDER BY time DESC)
SELECT 
  location, 
  FIRST_VALUE (time) OVER w AS time,
  FIRST_VALUE (report) OVER w AS report
DISTINCT
ORDER BY location

In other words, we would execute these operations in the following logical order:

  1. FROM: Access the weather_reports table
  2. WINDOW: Specify “windows” or “groups” in our data set, grouping by location and ordering contents of each group by time, descendingly (using the term “WINDOW” is no accident as we’ll see afterwards)
  3. SELECT: Project the location (which is the grouping) and per group, the first values of time / report ordered by time
  4. DISTINCT: Remove all the duplicates, because the above operation will produce the same time and report value for each record that shares the same location
  5. ORDER BY: Finally, order the results per grouping

That’s what really happens, and incidentally, the above synthetic SQL syntax matches the actual logical order of operations in the SQL language, so translating it back to an actual SQL statement would yield:

SELECT DISTINCT
  location, 
  FIRST_VALUE (time) OVER w AS time,
  FIRST_VALUE (report) OVER w AS report
FROM weather_reports
WINDOW w AS (PARTITION BY location ORDER BY time DESC)
ORDER BY location

If your database doesn’t support the WINDOW clause, just expand it into the individual window functions. E.g. in Oracle, write:

SELECT DISTINCT
  location, 
  FIRST_VALUE (time) OVER (PARTITION BY location ORDER BY time DESC),
  FIRST_VALUE (report) OVER (PARTITION BY location ORDER BY time DESC)
FROM weather_reports
ORDER BY location

From a readability perspective, I would definitely prefer the standard SQL syntax over DISTINCT ON.

Want to play around with it? Here’s some sample data:

create table weather_reports (location text, time date, report text);
insert into weather_reports values ('X', DATE '2000-01-01', 'X1');
insert into weather_reports values ('X', DATE '2000-01-02', 'X2');
insert into weather_reports values ('X', DATE '2000-01-03', 'X3');
insert into weather_reports values ('Y', DATE '2000-01-03', 'Y1');
insert into weather_reports values ('Y', DATE '2000-01-05', 'Y2');
insert into weather_reports values ('Z', DATE '2000-01-04', 'Z1');

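Running the DISTINCT ON query from the documentation example against this data:

SELECT DISTINCT ON (location) location, time, report
FROM weather_reports
ORDER BY location, time DESC;
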
The result being:

|location|time      |report|
|--------|----------|------|
|X       |2000-01-03|X3    |
|Y       |2000-01-05|Y2    |
|Z       |2000-01-04|Z1    |

Notice that jOOQ already supports PostgreSQL DISTINCT ON and in the future, we might emulate it for other dialects using the above technique: https://github.com/jOOQ/jOOQ/issues/3564

Calculate Percentiles to Learn About Data Set Skew in SQL

B-tree indexes are perfect when your data is uniformly distributed. They are not really useful when you have skewed data. I'll explain later why this is the case, but let's first learn how to detect “skew”.

What is skew?

Skew is a statistics term for a normal distribution that is not symmetric. The example given on Wikipedia shows such an asymmetric distribution.

In RDBMS, we sometimes use the term skew colloquially to mean the same thing as non-uniform distribution, i.e. a normal distribution would also be skewed. We simply mean that some values appear more often than others. Thus, I will put the term “skew” in double quotes in this article. While your RDBMS’s statistics contain this information once they are calculated, we can also detect such “skew” manually in ad-hoc queries using percentiles, which are defined in the SQL standard and supported in a variety of databases, as ordinary aggregate functions, including:

  • Oracle
  • PostgreSQL
  • SQL Server (regrettably, only as window functions)

Uniform distribution

Let’s look at the FILM_ID values in the Sakila database:

SELECT
  percentile_disc(0.0) WITHIN GROUP (ORDER BY film_id) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY film_id) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY film_id) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY film_id) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY film_id) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY film_id) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY film_id) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY film_id) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY film_id) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY film_id) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY film_id) AS "100%"
FROM film;

What are we calculating here? We’re trying to find 11 different values for which we can say that:

  • 0% of the film_ids are lower than the “0%” value
  • 10% of the film_ids are lower than the “10%” value, etc.

Or in other words:

  • 0% is the MIN(film_id) value
  • 50% is the MEDIAN(film_id) value
  • 100% is the MAX(film_id) value

The result shows an unsurprisingly uniform distribution:

0% |10% |20% |30% |40% |50% |60% |70% |80% |90% |100% |
---|----|----|----|----|----|----|----|----|----|-----|
1  |100 |200 |300 |400 |500 |600 |700 |800 |900 |1000 |

We can plot this in Microsoft Excel or some other tool; the plot is an almost perfectly straight, diagonal line.

This is not surprising, as the IDs are just consecutive values, which is a desired property of surrogate keys.

“Skewed” distribution

It’s a different story when we look at the distribution of amounts in the payment table:

SELECT
  percentile_disc(0.0) WITHIN GROUP (ORDER BY amount) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY amount) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY amount) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY amount) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY amount) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY amount) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY amount) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY amount) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY amount) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY amount) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY amount) AS "100%"
FROM payment;

We’re now getting:

0%   |10%  |20%  |30%  |40%  |50%  |60%  |70%  |80%  |90%  |100% 
-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----
0.00 |0.99 |1.99 |2.99 |2.99 |3.99 |4.99 |4.99 |5.99 |6.99 |11.99

This looks … “skewed”, although clearly the bias is mainly caused by the fact that this data is generated.

When we plot the above, the slope is less steep at the beginning of the curve, which essentially means that more values exist at the lower end of the range than at the upper end. We can validate this with another query:

SELECT amount, count(*)
FROM (
  SELECT trunc(amount) AS amount
  FROM payment
) t 
GROUP BY amount
ORDER BY amount;

… which yields:

amount |count |
-------|------|
0      |3003  |
1      |641   |
2      |3542  |
3      |1117  |
4      |3789  |
5      |1306  |
6      |1119  |
7      |675   |
8      |486   |
9      |257   |
10     |104   |
11     |10    |

When plotting this, we can see that there are more amounts in the lower half of the range than in the upper half, which leads to percentiles growing slower.

Correlations

This technique can also be applied to detect correlations in data. We can, for instance, try to find the percentiles of the length of films, and group the data sets by rating. I'm using a GROUPING SETS feature here, the ROLLUP() clause, to calculate the grand total as well. Just check out the query and its results, and you'll see:

SELECT
  rating,
  count(*),
  percentile_disc(0.0) WITHIN GROUP (ORDER BY length) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY length) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY length) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY length) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY length) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY length) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY length) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY length) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY length) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY length) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY length) AS "100%"
FROM film
GROUP BY ROLLUP(rating);

This yields:

rating |count |0% |10% |20% |30% |40% |50% |60% |70% |80% |90% |100% |
-------|------|---|----|----|----|----|----|----|----|----|----|-----|
G      |178   |47 |57  |67  |80  |93  |107 |121 |138 |156 |176 |185  |
PG     |194   |46 |58  |72  |85  |99  |113 |122 |137 |151 |168 |185  |
PG-13  |223   |46 |61  |76  |92  |110 |125 |138 |150 |162 |176 |185  |
R      |195   |49 |68  |82  |90  |104 |115 |129 |145 |160 |173 |185  |
NC-17  |210   |46 |58  |74  |84  |97  |112 |125 |138 |153 |174 |184  |
       |1000  |46 |60  |74  |86  |102 |114 |128 |142 |156 |173 |185  |

So, the GROUP BY clause produced one row per rating, plus an additional grand total row at the bottom. For illustration purposes, I've added the COUNT(*) column, to show how many films are in each group. The first 5 rows sum up to 1000, which is again the grand total at the bottom.

When we plot the percentiles as line or bar charts, we can “see” that there is no strong correlation between the two data points. Both data sets are close to uniformly distributed, quite independently of the rating, with the exception of PG-13, which is just slightly skewed towards longer film lengths.

Again, this isn’t terribly interesting as the data set was generated, probably using some randomness to avoid perfectly uniform distribution. In real world scenarios, the above data would have been more “skewed”.

How does this help with performance?

A balanced tree index is very useful when data is quite uniformly distributed, because in that case, it can help access data points or ranges of data in O(log(N)) time. This is quite a useful property for queries that look for film_id values, e.g.

SELECT *
FROM film
WHERE film_id = 1

When accessing “skewed” data, some values are more equal than others. This means that for example if we’re looking for amounts in the payment table, these two queries are not the same:

-- A lot of rows returned (3644)
SELECT * FROM payment WHERE amount BETWEEN 0 AND 2;

-- Few rows returned (361)
SELECT * FROM payment WHERE amount BETWEEN 9 AND 11;

An index on the amount column could have been useful for the second query, but maybe not for the first one.

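We can observe this effect directly (a sketch, using PostgreSQL; the index name is illustrative, and the chosen plans depend on the gathered statistics):

CREATE INDEX idx_payment_amount ON payment (amount);

-- Probably a sequential scan: a large fraction of the table matches
EXPLAIN SELECT * FROM payment WHERE amount BETWEEN 0 AND 2;

-- Probably an index (or bitmap) scan: only few rows match
EXPLAIN SELECT * FROM payment WHERE amount BETWEEN 9 AND 11;
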
There are several things we can do to make sure optimal index usage is being applied for all sorts of queries. In case of uniformly distributed data, we usually don’t have to do anything as SQL developers. In case of “skewed” data sets, it may be worth thinking about:

  • Using histogram statistics (see the sketch after this list)
  • Hinting the optimiser (in Oracle or SQL Server)
  • Avoiding bind variables (only in extreme cases)
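
For instance, in Oracle, a histogram on the “skewed” column can be requested explicitly when gathering statistics (a sketch; the schema name and bucket count are illustrative):

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => 'SAKILA',
    tabname    => 'PAYMENT',
    method_opt => 'FOR COLUMNS amount SIZE 254'
  );
END;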

Conclusion

Not all data sets are equal. They are often “skewed”. By “skewed”, in SQL, we don’t mean the statistical meaning of a normal distribution being skewed asymmetrically. We mean that a distribution is not uniform, so even a normal distribution is “skewed”. When it is, then some values appear way more often than others. Some examples are:

Uniform distribution

  • Surrogate keys generated from sequences (consecutive)
  • Surrogate keys generated from UUIDs (random)
  • Foreign keys on one-to-one relationships

Slight “skew”

Possibly significant “skew”

This really depends on the actual data set, but do expect significant “skew” in these data types:

  • Foreign keys on to-many relationships (e.g. some customers have more assets than others)
  • Numeric values (e.g. amount)
  • Codes and other discrete values (e.g. film rating, payment settlement codes, etc.)

This article has shown how we can use simple SQL aggregate functions, including the percentiles, to calculate and visualise such “skew”.

How to Reduce Syntactic Overhead Using the SQL WINDOW Clause

SQL is a verbose language, and one of the most verbose features are window functions.

In a Stack Overflow question that I encountered recently, someone asked how to calculate the difference between the first and the last value in a time series for any given day:

Input

volume  tstamp
---------------------------
29011   2012-12-28 09:00:00
28701   2012-12-28 10:00:00
28830   2012-12-28 11:00:00
28353   2012-12-28 12:00:00
28642   2012-12-28 13:00:00
28583   2012-12-28 14:00:00
28800   2012-12-29 09:00:00
28751   2012-12-29 10:00:00
28670   2012-12-29 11:00:00
28621   2012-12-29 12:00:00
28599   2012-12-29 13:00:00
28278   2012-12-29 14:00:00

Desired output

first  last   difference  date
------------------------------------
29011  28583  428         2012-12-28
28800  28278  522         2012-12-29

How to write the query?

Notice that the volume does not simply decrease as the timestamp increases, as it may appear from the sample data. So, there is no rule that if Timestamp2 > Timestamp1 then Value2 < Value1. Otherwise, this simple query would work (using PostgreSQL syntax):

SELECT 
  max(volume)               AS first,
  min(volume)               AS last,
  max(volume) - min(volume) AS difference,
  CAST(tstamp AS DATE)      AS date
FROM t
GROUP BY CAST(tstamp AS DATE);

There are several ways to find the first and last values within a group that do not involve window functions. For example:

  • In Oracle, you can use the FIRST and LAST functions, which for some arcane reason are not written FIRST(...) WITHIN GROUP (ORDER BY ...) or LAST(...) WITHIN GROUP (ORDER BY ...), like other sorted set aggregate functions, but some_aggregate_function(...) KEEP (DENSE_RANK FIRST ORDER BY ...). Go figure (a sketch follows after this list)
  • In PostgreSQL, you could use the DISTINCT ON syntax along with ORDER BY and LIMIT

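Since the Oracle syntax is the hardest to remember, here's a sketch of what it could look like for our problem (assuming the input table is called t, as before):

-- Oracle
SELECT
  max(volume) KEEP (DENSE_RANK FIRST ORDER BY tstamp) AS "first",
  max(volume) KEEP (DENSE_RANK LAST  ORDER BY tstamp) AS "last",
  max(volume) KEEP (DENSE_RANK FIRST ORDER BY tstamp)
    - max(volume) KEEP (DENSE_RANK LAST ORDER BY tstamp) AS "difference",
  trunc(tstamp) AS "date"
FROM t
GROUP BY trunc(tstamp);
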
More details about the various approaches can be found here:
https://blog.jooq.org/2017/09/22/how-to-write-efficient-top-n-queries-in-sql

The best performing approach would be to use an aggregate function like Oracle’s, but few databases have this function. So, we’ll resort to using the FIRST_VALUE and LAST_VALUE window functions:

SELECT DISTINCT
  first_value(volume) OVER (
    PARTITION BY CAST(tstamp AS DATE) 
    ORDER BY tstamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS first,
  last_value(volume) OVER (
    PARTITION BY CAST(tstamp AS DATE) 
    ORDER BY tstamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS last,
  first_value(volume) OVER (
    PARTITION BY CAST(tstamp AS DATE) 
    ORDER BY tstamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) 
  - last_value(volume) OVER (
    PARTITION BY CAST(tstamp AS DATE) 
    ORDER BY tstamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS diff,
  CAST(tstamp AS DATE) AS date
FROM t
ORDER BY CAST(tstamp AS DATE)

Oops 🤔

That doesn’t look too readable. But it will yield the correct result. Granted, we could wrap the definition for the columns FIRST and LAST in a derived table, but that would still leave us with two repetitions of the window definition:

PARTITION BY CAST(tstamp AS DATE) 
ORDER BY tstamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

WINDOW clause to the rescue

Luckily, at least 3 databases have implemented the SQL standard WINDOW clause:

  • MySQL
  • PostgreSQL
  • Sybase SQL Anywhere

The above query can be refactored to this one:

SELECT DISTINCT
  first_value(volume) OVER w AS first,
  last_value(volume) OVER w AS last,
  first_value(volume) OVER w 
    - last_value(volume) OVER w AS diff,
  CAST(tstamp AS DATE) AS date
FROM t
WINDOW w AS (
  PARTITION BY CAST(tstamp AS DATE) 
  ORDER BY tstamp
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
ORDER BY CAST(tstamp AS DATE)

Notice how I can specify a window name with a window specification in a similar way as I can define a common table expression (WITH clause):

WINDOW 
    <window-name> AS (<window-specification>)
{  ,<window-name> AS (<window-specification>)... }

Not only can I reuse entire specifications, I could also build a specification from a partial specification, and reuse only parts. My previous query could have been rewritten as such:

SELECT DISTINCT
  first_value(volume) OVER w3 AS first,
  last_value(volume) OVER w3 AS last,
  first_value(volume) OVER w3 
    - last_value(volume) OVER w3 AS diff,
  CAST(tstamp AS DATE) AS date
FROM t
WINDOW 
  w1 AS (PARTITION BY CAST(tstamp AS DATE)),
  w2 AS (w1 ORDER BY tstamp),
  w3 AS (w2 ROWS BETWEEN UNBOUNDED PRECEDING 
                     AND UNBOUNDED FOLLOWING)
ORDER BY CAST(tstamp AS DATE)

Each window specification can be created from scratch, or be based on a previously defined window specification. Note this is also true when referencing the window definition. If I wanted to reuse the PARTITION BY clause and the ORDER BY clause, but change the FRAME clause (ROWS ...), then I could have written this:

SELECT DISTINCT
  first_value(volume) OVER (
    w2 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS first,
  last_value(volume) OVER (
    w2 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  ) AS last,
  first_value(volume) OVER (
    w2 ROWS UNBOUNDED PRECEDING
  ) - last_value(volume) OVER (
    w2 ROWS BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING
  ) AS diff,
  CAST(tstamp AS DATE) AS date
FROM t
WINDOW 
  w1 AS (PARTITION BY CAST(tstamp AS DATE)),
  w2 AS (w1 ORDER BY tstamp)
ORDER BY CAST(tstamp AS DATE)

What if my database doesn’t support the WINDOW clause?

In that case, you have to either manually write the window specification on each window function, or use a SQL builder like jOOQ, which can emulate the window clause.

You can try this translation online on our website: https://www.jooq.org/translate

PostgreSQL 11’s Support for SQL Standard GROUPS and EXCLUDE Window Function Clauses

Exciting discovery when playing around with PostgreSQL 11! New SQL standard window function clauses are now supported. If you want to play with this, you can do so very easily using docker:

docker pull postgres:11
docker run --name POSTGRES11 -e POSTGRES_PASSWORD=postgres -d postgres:11
docker run -it --rm --link POSTGRES11:postgres postgres psql -h postgres -U postgres

See also: https://hub.docker.com/r/_/postgres

The frame clause

When working with window functions, in some cases you want to add the optional frame clause. For example, to get a sliding average over your data, you will write:

SELECT 
  payment_date,
  amount,
  avg(amount) OVER (
    ORDER BY payment_date, payment_id
    ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
  )::DECIMAL(10, 2),
  array_agg(amount) OVER (
    ORDER BY payment_date, payment_id
    ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
  )
FROM payment;

As always I will be running queries against the Sakila database. The above query yields:

payment_date        |amount |avg  |array_agg                   |
--------------------|-------|-----|----------------------------|
2005-05-24 22:53:30 |2.99   |3.32 |          {2.99,2.99,3.99}  |
2005-05-24 22:54:33 |2.99   |3.74 |     {2.99,2.99,3.99,4.99}  |
2005-05-24 23:03:39 |3.99   |4.39 |{2.99,2.99,3.99,4.99,6.99}  |
2005-05-24 23:04:41 |4.99   |3.99 |{2.99,3.99,4.99,6.99,0.99}  |
2005-05-24 23:05:21 |6.99   |3.79 |{3.99,4.99,6.99,0.99,1.99}  |
2005-05-24 23:08:07 |0.99   |3.99 |{4.99,6.99,0.99,1.99,4.99}  |
2005-05-24 23:11:53 |1.99   |3.99 |{6.99,0.99,1.99,4.99,4.99}  |
2005-05-24 23:31:46 |4.99   |3.79 |{0.99,1.99,4.99,4.99,5.99}  |

The array_agg function helps display how the sliding average came to be. For each average value, we’re looking 2 rows ahead and 2 rows behind in the ordered window.

In the above query, I’m using the optional frame clause to specify the frame size. It has three “modes” or “units”:

<window frame units> ::=
  ROWS
| RANGE
| GROUPS

Almost all databases that support window functions support the first two unit types. To my knowledge, only PostgreSQL 11 and H2 1.4.198 now also support GROUPS. The difference is rather simple to explain:

  • ROWS counts the exact number of rows in the frame.
  • RANGE performs logical windowing where we don’t count the number of rows, but look for a value offset.
  • GROUPS counts all groups of tied rows within the window.

I think this is best explained by example. Let’s look at payments with payment timestamps truncated to the hour:

WITH hourly_payment AS (
  SELECT 
    payment_id,
    date_trunc('h', payment_date) AS hour,
    amount
  FROM payment
)
SELECT *
FROM hourly_payment
ORDER BY hour;

This gives us:

payment_id |hour                |amount |
-----------|--------------------|-------|
12377      |2005-05-24 22:00:00 |2.99   | \  Tied group
3504       |2005-05-24 22:00:00 |2.99   | /

6440       |2005-05-24 23:00:00 |4.99   | \
11032      |2005-05-24 23:00:00 |3.99   |  |
8987       |2005-05-24 23:00:00 |4.99   |  | Tied group
6003       |2005-05-24 23:00:00 |6.99   |  |
14728      |2005-05-24 23:00:00 |0.99   |  |
7274       |2005-05-24 23:00:00 |1.99   | /

12025      |2005-05-25 00:00:00 |0.99   | \
3831       |2005-05-25 00:00:00 |8.99   |  |
7044       |2005-05-25 00:00:00 |4.99   |  |
8623       |2005-05-25 00:00:00 |9.99   |  | Tied group
3386       |2005-05-25 00:00:00 |4.99   |  |
8554       |2005-05-25 00:00:00 |4.99   |  |
10785      |2005-05-25 00:00:00 |5.99   |  |
9014       |2005-05-25 00:00:00 |6.99   | /

15394      |2005-05-25 01:00:00 |2.99   | \
10499      |2005-05-25 01:00:00 |4.99   |  |
5020       |2005-05-25 01:00:00 |2.99   |  | Tied group
490        |2005-05-25 01:00:00 |0.99   |  |
12305      |2005-05-25 01:00:00 |4.99   | /

11796      |2005-05-25 02:00:00 |4.99   | \
9463       |2005-05-25 02:00:00 |4.99   |  | Tied group
13711      |2005-05-25 02:00:00 |4.99   | /

Now we can see that for each hour, we have several payments. When we order payments by hour, there are some “tied” payments within that hour (or “group”), i.e. the payments on 2005-05-24 22:00:00 are not ordered deterministically among themselves. The payment ids are pretty random.

Now, if we look at the three window frame units again, how do they behave?

ROWS

WITH hourly_payment AS (
  SELECT 
    payment_id,
    date_trunc('h', payment_date) AS hour
  FROM payment
)
SELECT 
  payment_id,
  hour,
  array_agg(payment_id) OVER (
    ORDER BY hour
    ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
  )
FROM hourly_payment
ORDER BY hour;

We can see that the size of the window is always precisely 5 rows (except at the beginning and end of the data set):

payment_id |hour                |array_agg                      |
-----------|--------------------|-------------------------------|
12377      |2005-05-24 22:00:00 |{12377,3504,6440}              |
3504       |2005-05-24 22:00:00 |{12377,3504,6440,11032}        |
6440       |2005-05-24 23:00:00 |{12377,3504,6440,11032,8987}   |
11032      |2005-05-24 23:00:00 |{3504,6440,11032,8987,6003}    |
8987       |2005-05-24 23:00:00 |{6440,11032,8987,6003,14728}   |
6003       |2005-05-24 23:00:00 |{11032,8987,6003,14728,7274}   |
14728      |2005-05-24 23:00:00 |{8987,6003,14728,7274,12025}   |
7274       |2005-05-24 23:00:00 |{6003,14728,7274,12025,3831}   |
12025      |2005-05-25 00:00:00 |{14728,7274,12025,3831,7044}   |
3831       |2005-05-25 00:00:00 |{7274,12025,3831,7044,8623}    |
7044       |2005-05-25 00:00:00 |{12025,3831,7044,8623,3386}    |
8623       |2005-05-25 00:00:00 |{3831,7044,8623,3386,8554}     |
3386       |2005-05-25 00:00:00 |{7044,8623,3386,8554,10785}    |
8554       |2005-05-25 00:00:00 |{8623,3386,8554,10785,9014}    |
10785      |2005-05-25 00:00:00 |{3386,8554,10785,9014,15394}   |
9014       |2005-05-25 00:00:00 |{8554,10785,9014,15394,10499}  |
15394      |2005-05-25 01:00:00 |{10785,9014,15394,10499,5020}  |
10499      |2005-05-25 01:00:00 |{9014,15394,10499,5020,490}    |
5020       |2005-05-25 01:00:00 |{15394,10499,5020,490,12305}   |
490        |2005-05-25 01:00:00 |{10499,5020,490,12305,11796}   |
12305      |2005-05-25 01:00:00 |{5020,490,12305,11796,9463}    |
11796      |2005-05-25 02:00:00 |{490,12305,11796,9463,13711}   |
9463       |2005-05-25 02:00:00 |{12305,11796,9463,13711,8167}  |
13711      |2005-05-25 02:00:00 |{11796,9463,13711,8167,1011}   |

There is no notion of a “group” among the rows in the window. The problem is that we’re getting random PAYMENT_ID values unless we also add PAYMENT_ID to the ORDER BY clause. This isn’t really what we want most of the time, so we use:

RANGE

WITH hourly_payment AS (
  SELECT 
    payment_id,
    date_trunc('h', payment_date) AS hour
  FROM payment
)
SELECT 
  payment_id,
  hour,
  EXTRACT(epoch FROM hour) / 3600,
  array_agg(payment_id) OVER (
    ORDER BY EXTRACT(epoch FROM hour) / 3600
    RANGE BETWEEN 2 PRECEDING AND 2 FOLLOWING
  )
FROM hourly_payment
ORDER BY hour;

I have switched from ROWS to RANGE, and the ORDER BY clause now works on a number derived from the epoch of the hour. What happens now?

This now yields:

payment_id |hour                |?column? |array_agg                                                                                                                                                              
-----------|--------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
12377      |2005-05-24 22:00:00 |310270   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014}
3504       |2005-05-24 22:00:00 |310270   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014}

6440       |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}
11032      |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}
8987       |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}
6003       |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}
14728      |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}
7274       |2005-05-24 23:00:00 |310271   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305}

12025      |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
3831       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
7044       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
8623       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
3386       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
8554       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
10785      |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}
9014       |2005-05-25 00:00:00 |310272   |{12377,3504,  6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711}

15394      |2005-05-25 01:00:00 |310273   |{6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245}
10499      |2005-05-25 01:00:00 |310273   |{6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245}
5020       |2005-05-25 01:00:00 |310273   |{6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245}
490        |2005-05-25 01:00:00 |310273   |{6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245}
12305      |2005-05-25 01:00:00 |310273   |{6440,11032,8987,6003,14728,7274,  12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245}

11796      |2005-05-25 02:00:00 |310274   |{12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245,14396,13055,15984,9975,8188,5596,2388,7347,11598,6186}
9463       |2005-05-25 02:00:00 |310274   |{12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245,14396,13055,15984,9975,8188,5596,2388,7347,11598,6186}
13711      |2005-05-25 02:00:00 |310274   |{12025,3831,7044,8623,3386,8554,10785,9014,  15394,10499,5020,490,12305,  11796,9463,13711,  8167,1011,1203,10019,6245,14396,13055,15984,9975,8188,5596,2388,7347,11598,6186}

I’ve visually separated the rows by their hour and the array aggregation by the “tied” payment_ids, i.e. the payment IDs that have the same hour.

Observations:

  1. We get the same aggregation value for the entire set of tied rows, so if in two rows, HOUR is the same, then ARRAY_AGG is the same as well
  2. The window size is now a logical size, no longer an offset size, so we’re going back 2 hours and ahead 2 hours (instead of 2 rows). This is why I’ve extracted epoch and divided it by hour, so I will get consecutive integer values for consecutive hours

The same result could have been achieved using interval types:

WITH hourly_payment AS (
  SELECT 
    payment_id,
    date_trunc('h', payment_date) AS hour
  FROM payment
)
SELECT 
  payment_id,
  hour,
  EXTRACT(epoch FROM hour) / 3600,
  array_agg(payment_id) OVER (
    ORDER BY hour
    RANGE BETWEEN INTERVAL '2 hours' PRECEDING 
              AND INTERVAL '2 hours' FOLLOWING
  )
FROM hourly_payment
ORDER BY hour;

See also this article for details:
https://blog.jooq.org/2016/10/31/a-little-known-sql-feature-use-logical-windowing-to-aggregate-sliding-ranges/

GROUPS

The third frame unit is quite useful, as we can now frame the window to span a number of groups of tied values. In our case, all payments of the same hour are in the same group. So, in order to get a similar result again, we can write:

WITH hourly_payment AS (
  SELECT 
    payment_id,
    payment_date,
    date_trunc('h', payment_date) AS hour
  FROM payment
)
SELECT 
  payment_id,
  hour,
  array_agg(payment_id) OVER (
    ORDER BY hour
    GROUPS BETWEEN 2 PRECEDING AND 2 FOLLOWING
  )
FROM hourly_payment
ORDER BY hour;

In fact, this is not exactly the same result, because if we have gaps in the hours, GROUPS will simply jump over the gaps, whereas RANGE will not.
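
Here’s a minimal sketch to illustrate the gap behaviour (assuming PostgreSQL 11, where GROUPS is available):

WITH t(v) AS (
  VALUES (1), (2), (4)
)
SELECT
  v,
  array_agg(v) OVER (
    ORDER BY v RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING
  ) AS range_agg,
  array_agg(v) OVER (
    ORDER BY v GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING
  ) AS groups_agg
FROM t;

For the row v = 4, RANGE sees only {4}, because no other value lies in the interval [3, 5], whereas GROUPS jumps over the gap and sees {2,4}.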

Summary of ROWS, RANGE, GROUPS

The above case was a real world use-case. A more contrived example that might be easier to digest visually can be seen here:

WITH t(id, v) AS (
  VALUES (1, 1), (2, 1), (3, 3), (4, 5), (5, 5), (6, 5), (7, 6)
)
SELECT
  id,
  v,
  array_agg(id) OVER rows,
  array_agg(v)  OVER rows,
  array_agg(id) OVER range,
  array_agg(v)  OVER range,
  array_agg(id) OVER groups,
  array_agg(v)  OVER groups
FROM t
WINDOW 
  o AS (ORDER BY v),
  rows AS (o ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING),
  range AS (o RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING),
  groups AS (o GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING)

Notice, I’m using the SQL standard WINDOW clause to be able to name and reuse a repeated window specification. I’ve seen this clause supported in:

  • MySQL 8.0
  • PostgreSQL
  • Sybase SQL Anywhere

The query yields:

id |v |array_agg |array_agg |array_agg |array_agg |array_agg     |array_agg     |
---|--|----------|----------|----------|----------|--------------|--------------|
1  |1 |{1,2}     |{1,1}     |{1,2}     |{1,1}     |{1,2,3}       |{1,1,3}       |
2  |1 |{1,2,3}   |{1,1,3}   |{1,2}     |{1,1}     |{1,2,3}       |{1,1,3}       |
3  |3 |{2,3,4}   |{1,3,5}   |{3}       |{3}       |{1,2,3,4,5,6} |{1,1,3,5,5,5} |
4  |5 |{3,4,5}   |{3,5,5}   |{4,5,6,7} |{5,5,5,6} |{3,4,5,6,7}   |{3,5,5,5,6}   |
5  |5 |{4,5,6}   |{5,5,5}   |{4,5,6,7} |{5,5,5,6} |{3,4,5,6,7}   |{3,5,5,5,6}   |
6  |5 |{5,6,7}   |{5,5,6}   |{4,5,6,7} |{5,5,5,6} |{3,4,5,6,7}   |{3,5,5,5,6}   |
7  |6 |{6,7}     |{5,6}     |{4,5,6,7} |{5,5,5,6} |{4,5,6,7}     |{5,5,5,6}     |

Observations:

  • The ROWS framed window is of size 3 max in this case (1 row preceding, the current row, and 1 row following)
  • The RANGE framed window is a logical window that looks behind a value of 1 and ahead a value of 1
  • The GROUPS framed window is of size 3 groups max in this case (1 group preceding, the current group, and 1 group following)

Neat, huh?

jOOQ 3.12 will add support for this feature: https://github.com/jOOQ/jOOQ/issues/7646

EXCLUDE clause

This is probably a bit less frequently useful than the new GROUPS clause. There is now a new window frame exclusion clause:

<window frame exclusion> ::=
  EXCLUDE CURRENT ROW
| EXCLUDE GROUP
| EXCLUDE TIES
| EXCLUDE NO OTHERS

It can be used to exclude some rows around the current row from being in the window. I have yet to think of a use case for this. Here’s how it works for:

ROWS

WITH t(v) AS (
  VALUES (1), (1), (3), (5), (5), (5), (6)
)
SELECT
  v,
  array_agg(v) OVER (o ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
                       EXCLUDE CURRENT ROW) AS current_row,
  array_agg(v) OVER (o ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE GROUP) AS group,
  array_agg(v) OVER (o ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE TIES) AS ties,
  array_agg(v) OVER (o ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE NO OTHERS) AS no_others
FROM t
WINDOW o AS (ORDER BY v)

Resulting in:

v |current_row |group |ties    |no_others |
--|------------|------|--------|----------|
1 |{1}         |NULL  |{1}     |{1,1}     |
1 |{1,3}       |{3}   |{1,3}   |{1,1,3}   |
3 |{1,5}       |{1,5} |{1,3,5} |{1,3,5}   |
5 |{3,5}       |{3}   |{3,5}   |{3,5,5}   |
5 |{5,5}       |NULL  |{5}     |{5,5,5}   |
5 |{5,6}       |{6}   |{5,6}   |{5,5,6}   |
6 |{5}         |{5}   |{5,6}   |{5,6}     |

As you can see, the window may now be completely empty, which results in NULL being emitted.

  • Excluding the current row seems obvious
  • Excluding the current group (the current row and all its ties) also
  • Excluding ties excludes all the other rows of the current row’s group, but keeps the current row itself
  • Excluding no others is the default, just like when you don’t put this EXCLUDE clause at all

RANGE

The exclusion can be applied to logical windowing as well:

WITH t(v) AS (
  VALUES (1), (1), (3), (5), (5), (5), (6)
)
SELECT
  v,
  array_agg(v) OVER (o RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE CURRENT ROW) AS current_row,
  array_agg(v) OVER (o RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE GROUP) AS group,
  array_agg(v) OVER (o RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE TIES) AS ties,
  array_agg(v) OVER (o RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE NO OTHERS) AS no_others
FROM t
WINDOW o AS (ORDER BY v)

Resulting in:

v |current_row |group   |ties      |no_others |
--|------------|--------|----------|----------|
1 |{1}         |NULL    |{1}       |{1,1}     |
1 |{1}         |NULL    |{1}       |{1,1}     |
3 |NULL        |NULL    |{3}       |{3}       |
5 |{5,5,6}     |{6}     |{5,6}     |{5,5,5,6} |
5 |{5,5,6}     |{6}     |{5,6}     |{5,5,5,6} |
5 |{5,5,6}     |{6}     |{5,6}     |{5,5,5,6} |
6 |{5,5,5}     |{5,5,5} |{5,5,5,6} |{5,5,5,6} |

GROUPS

Same for grouped windows:

WITH t(v) AS (
  VALUES (1), (1), (3), (5), (5), (5), (6)
)
SELECT
  v,
  array_agg(v) OVER (o GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE CURRENT ROW) AS current_row,
  array_agg(v) OVER (o GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE GROUP) AS group,
  array_agg(v) OVER (o GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE TIES) AS ties,
  array_agg(v) OVER (o GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                       EXCLUDE NO OTHERS) AS no_others
FROM t
WINDOW o AS (ORDER BY v)

Resulting in:

v |current_row |group       |ties          |no_others     |
--|------------|------------|--------------|--------------|
1 |{1,3}       |{3}         |{1,3}         |{1,1,3}       |
1 |{1,3}       |{3}         |{1,3}         |{1,1,3}       |
3 |{1,1,5,5,5} |{1,1,5,5,5} |{1,1,3,5,5,5} |{1,1,3,5,5,5} |
5 |{3,5,5,6}   |{3,6}       |{3,5,6}       |{3,5,5,5,6}   |
5 |{3,5,5,6}   |{3,6}       |{3,5,6}       |{3,5,5,5,6}   |
5 |{3,5,5,6}   |{3,6}       |{3,5,6}       |{3,5,5,5,6}   |
6 |{5,5,5}     |{5,5,5}     |{5,5,5,6}     |{5,5,5,6}     |

Needless to say that this clause will be supported in jOOQ 3.12 as well: https://github.com/jOOQ/jOOQ/issues/7647

Bonus points for the reader who can think of a real world use-case for this clause. Please leave a comment!

Selecting all Columns Except One in PostgreSQL

Google’s BigQuery has a very interesting SQL language feature, which I’ve missed many times in other databases:

select:
    SELECT  [{ ALL | DISTINCT }]
        { [ expression. ]* [ EXCEPT ( column_name [, ...] ) ]
            [ REPLACE ( expression [ AS ] column_name [, ...] ) ]
        | expression [ [ AS ] alias ] } [, ...]
    [ FROM from_item  [, ...] ]
    [ WHERE bool_expression ]
    ...

Notice the two keywords EXCEPT and REPLACE that can be used along with an asterisked expression.

An Example

For example, when running a query like this one (which fetches the longest film(s) that each actor in the Sakila database played in):

SELECT *
FROM (
  SELECT 
    a.*, 
    f.*, 
    RANK() OVER (PARTITION BY actor_id ORDER BY length DESC) rk
  FROM film f
  JOIN film_actor fa USING (film_id)
  JOIN actor a USING (actor_id)
) t
WHERE rk = 1
ORDER BY first_name, last_name

This is one way to apply TOP-N per category filtering in SQL, which works with most modern databases, including MySQL 8.0. Essentially, we’re calculating the rank of a film per actor ordered by the film’s length.

The result looks like this:

actor_id |first_name  |last_name    |..|title                  |length|..|rk |
---------|------------|-------------|..|-----------------------|------|--|---|
71       |ADAM        |GRANT        |..|GLADIATOR WESTWARD     |   173|..|1  |
71       |ADAM        |GRANT        |..|BALLROOM MOCKINGBIRD   |   173|..|1  |
132      |ADAM        |HOPPER       |..|TORQUE BOUND           |   179|..|1  |
165      |AL          |GARLAND      |..|JACKET FRISCO          |   181|..|1  |

Let’s assume for a moment that we really need the entire projection of the ACTOR and FILM tables (so, SELECT * is fine), but we certainly don’t need the useless RK column, which is always 1.

Sometimes, having some excess columns is not going to be a problem, but sometimes it is. How to remove it? We can’t reference the ACTOR and FILM tables anymore in the outer query:

SELECT a.*, f.* -- Would be great, but wrong syntax
FROM (
  SELECT 
    a.*, 
    f.*, 
    RANK() OVER (PARTITION BY actor_id ORDER BY length DESC) rk
  FROM film f
  JOIN film_actor fa USING (film_id)
  JOIN actor a USING (actor_id)
) t
WHERE rk = 1
ORDER BY first_name, last_name

The outer query only has one table, and that’s the (derived) table T.

How to Solve This

In BigQuery syntax, we could now simply write

SELECT * EXCEPT (rk)
FROM (...) t
WHERE rk = 1
ORDER BY first_name, last_name

Which is really quite convenient! We want to project everything, except this one column. But none of the more popular SQL databases support this syntax.

Luckily, in PostgreSQL, we can use a workaround: Nested records:

SELECT (a).*, (f).* -- Unnesting the records again
FROM (
  SELECT 
    a, -- Nesting the actor table
    f, -- Nesting the film table
    RANK() OVER (PARTITION BY actor_id ORDER BY length DESC) rk
  FROM film f
  JOIN film_actor fa USING (film_id)
  JOIN actor a USING (actor_id)
) t
WHERE rk = 1
ORDER BY (a).first_name, (a).last_name;

Notice how we’re no longer projecting A.* and F.* inside of the derived table T, but instead, the entire table (record). In the outer query, we have to use some slightly different syntax to unnest the record again (e.g. (A).FIRST_NAME), and we’re done.

How Does This Work?

Informix, Oracle, PostgreSQL, and maybe a few lesser known ones have implemented the SQL standard’s ORDBMS features to various degrees. ORDBMS attempted to combine relational and object oriented features in the SQL language (and in the storage model). For all practical purposes, this essentially just means that we can nest records and collections.

For instance, in PostgreSQL, we can write:

-- Explicit ROW constructor
SELECT 1, ROW(2, ROW(3, 4))

-- Implicit ROW constructor
SELECT 1, (2, (3, 4))

And we’ll get:

x        |row       |
---------|----------|
1        |(2,(3,4)) |

Along with ordinary “scalar” values, we can have nested rows (or nested tuples) constructed very easily. Conveniently, we can also reference a table without its column names in the projection, such as:

SELECT a, f
FROM film f
JOIN film_actor fa USING (film_id)
JOIN actor a USING (actor_id)

Which produces a result like this:

a    |f    |
-----|-----|
(...)|(...)|
(...)|(...)|
(...)|(...)|
...

Similar things are possible in Oracle as well, except that Oracle doesn’t support structural row/tuple types, only nominal ones. We’d have to create some types first, prior to being able to use them:

CREATE TYPE film_t AS OBJECT (...);

Bonus

Of course, if you’re using SQL Server or Oracle, you wouldn’t have this problem, because then you could use the totally underrated WITH TIES clause along with CROSS APPLY:

SQL Server

SELECT *
FROM actor a
CROSS APPLY (
  SELECT TOP 1 WITH TIES f.*
  FROM film f
  JOIN film_actor fa 
    ON f.film_id = fa.film_id
    AND fa.actor_id = a.actor_id
  ORDER BY length DESC
) f
ORDER BY first_name, last_name;

Oracle

(Do check performance on this!)

SELECT *
FROM actor a
CROSS APPLY (
  SELECT f.*
  FROM film f
  JOIN film_actor fa 
    ON f.film_id = fa.film_id
    AND fa.actor_id = a.actor_id
  ORDER BY length DESC
  FETCH FIRST ROW WITH TIES
) f
ORDER BY first_name, last_name;

PostgreSQL and DB2 support the LATERAL keyword, which could be used with FETCH FIRST ROW ONLY semantics (so, no ties are selected).
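
Here’s a sketch of the LATERAL variant in PostgreSQL syntax (picking an arbitrary film among ties, as FETCH FIRST ROW ONLY breaks them non-deterministically):

SELECT *
FROM actor a
CROSS JOIN LATERAL (
  SELECT f.*
  FROM film f
  JOIN film_actor fa 
    ON f.film_id = fa.film_id
    AND fa.actor_id = a.actor_id
  ORDER BY f.length DESC
  FETCH FIRST ROW ONLY
) f
ORDER BY first_name, last_name;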

For more details about TOP N per category queries, see this blog post.

Calculating Tupper’s Self-Referential Formula With SQL

A really geeky way to start a Monday morning is to be nerd-sniped by the cool Fermat’s Library twitter account, reading up on the cool Tupper’s Self-Referential Formula, and thinking “Can This be Done in SQL?™”

As we all know from a previous article, SQL is Turing complete, so the answer must be yes. And in fact, as it turns out, this is actually super easy, compared to some other problems I’ve been solving with SQL on this blog in the past.

The Formula

The formula is really simple. Written in a programmer-y way, it reads:

1/2 < floor(mod(floor(y/17)*2^(-17*floor(x)-mod(floor(y), 17)),2))

Luckily, this syntax also happens to be SQL syntax, so we’re almost done. So, let’s try plotting this formula for the area of x BETWEEN 0 AND 105 and y BETWEEN k AND k + 16, where k is just some random large number, let’s say

96093937991895888497167296212785275471500433966012930665
15055192717028023952664246896428421743507181212671537827
70623355993237280874144307891325963941337723487857735749
82392662971551717371699516523289053822161240323885586618
40132355851360488286933379024914542292886670810961844960
91705183454067827731551705405381627380967602565625016981
48208341878316384911559022561000365235137034387446184837
87372381982248498634650331594100549747005931383392264972
49461751545728366702369745461014655997933798537483143786
841806593422227898388722980000748404719

Unfortunately, most SQL databases cannot handle such large numbers without any additional libraries, except for the awesome PostgreSQL, whose decimal / numeric types can handle up to 131072 digits before the decimal point and up to 16383 digits after the decimal point.
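
A quick sketch to convince ourselves that PostgreSQL’s numeric copes with numbers of this magnitude:

-- A number with 1001 digits: a 1 followed by 1000 zeroes
SELECT length(('1' || repeat('0', 1000))::numeric::text);
-- Yields 1001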

Yet again, unfortunately, even PostgreSQL by default can’t handle such precisions / scales, so we’re using a trick to expand the precision beyond what’s available by default (for a better workaround, see Torsten Grust’s comment in the comments section). Here’s the SQL query:

WITH 
  t1(k, z) AS (
    SELECT 
      ('96093937991895888497167296212785275471500433966012930665'
    || '15055192717028023952664246896428421743507181212671537827'
    || '70623355993237280874144307891325963941337723487857735749'
    || '82392662971551717371699516523289053822161240323885586618'
    || '40132355851360488286933379024914542292886670810961844960'
    || '91705183454067827731551705405381627380967602565625016981'
    || '48208341878316384911559022561000365235137034387446184837'
    || '87372381982248498634650331594100549747005931383392264972'
    || '49461751545728366702369745461014655997933798537483143786'
    || '841806593422227898388722980000748404719')::numeric,
      (repeat('0', 2000) || '.' 
    || repeat('0', 1000) || '1')::numeric
  ),
  tupper(x, y, b) AS (
    SELECT 
      x, y,
      0.5 < floor(mod(floor(y / 17) 
              * 2 ^ (-17 * x - mod(y, 17)), 2))
    FROM 
      t1, 
      LATERAL (
        SELECT z + x AS x 
        FROM generate_series(0, 105) t2(x)) t2,
      LATERAL (
        SELECT z + k + y AS y 
        FROM generate_series(0, 16) t3(y)) t3
  )
SELECT string_agg(
  CASE WHEN b THEN '@@' ELSE '  ' END, '' 
  ORDER BY x DESC)
FROM tupper
GROUP BY y
ORDER BY y ASC;

What’s the result of the above?

string_agg                                                                                                                                                                                                           |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
                @@                                      @@                                @@  @@@@  @@          @@                                @@    @@  @@          @@        @@  @@@@  @@            @@         |
                @@                                      @@  @@            @@              @@    @@  @@          @@                                @@    @@  @@          @@        @@    @@  @@            @@      @@ |
@@@@            @@                                    @@    @@            @@        @@@@  @@    @@  @@  @@  @@  @@  @@@@  @@@@@@@@    @@@@@@  @@@@@@  @@    @@  @@  @@  @@        @@    @@    @@            @@    @@ |
  @@            @@                                    @@    @@    @@  @@  @@              @@  @@    @@    @@    @@        @@  @@  @@  @@  @@  @@  @@  @@    @@  @@  @@  @@        @@  @@      @@            @@    @@ |
  @@            @@                                    @@    @@    @@  @@  @@              @@  @@    @@  @@  @@  @@        @@  @@  @@  @@@@@@  @@@@@@  @@    @@    @@    @@        @@  @@      @@            @@    @@ |
  @@            @@                              @@  @@      @@      @@    @@    @@@@                @@          @@                                    @@    @@  @@      @@    @@              @@      @@@@    @@  @@ |
@@@@@@      @@  @@                              @@  @@      @@    @@      @@  @@    @@              @@          @@                                      @@  @@          @@    @@            @@      @@    @@  @@  @@ |
          @@    @@  @@@@  @@      @@@@      @@@@@@  @@      @@            @@      @@                @@@@@@  @@@@@@                                      @@  @@@@@@  @@@@@@  @@              @@          @@    @@  @@ |
@@@@@@  @@      @@  @@  @@  @@  @@    @@  @@    @@  @@      @@  @@@@@@@@  @@    @@                                                                                                                    @@      @@  @@ |
          @@    @@  @@  @@  @@  @@    @@  @@    @@  @@      @@            @@  @@                                                                                                                    @@        @@  @@ |
@@@@        @@  @@  @@  @@  @@    @@@@      @@@@@@  @@      @@  @@  @@@@  @@  @@@@@@@@                                                                                                              @@@@@@@@  @@  @@ |
    @@          @@                                  @@      @@  @@    @@  @@                                                                                                                    @@            @@  @@ |
  @@            @@                                    @@    @@  @@    @@  @@                                                                                                                    @@          @@    @@ |
@@              @@                                    @@    @@  @@  @@    @@                                                                                                                  @@            @@    @@ |
@@@@@@          @@                                    @@    @@  @@  @@    @@                                                                                                                                @@    @@ |
                @@                                      @@  @@            @@                                                                                                                              @@      @@ |
                @@@@@@                                  @@  @@@@@@    @@@@@@                                                                                                                              @@  @@@@@@ |

It is a formula that, when evaluated over the right range of coordinates, plots its own image on a 106×17 bitmap. Cool, eh?

Play around with this formula yourself:
https://www.tuppers-formula.tk

JOIN Elimination: An Essential Optimiser Feature for Advanced SQL Usage

The SQL language has one great advantage over procedural, object oriented, and “ordinary” functional programming languages. The fact that it is truly declarative (i.e. a 4GL / fourth generation programming language) means that a sophisticated optimiser can easily transform one SQL expression into another, equivalent SQL expression, which might be faster to execute.

How does this work?

Expression Tree Equivalence

Here’s an introduction to equivalent expression trees from my SQL Masterclass: Let’s assume we want to evaluate the following expression:

A + (A - B)

Now in maths, it can be proven trivially that by the laws of associativity, the above expression is really the same as this one:

(A + A) - B

We didn’t really win anything yet, but we can equally trivially turn the above addition into a multiplication that can be proven to be exactly equivalent:

(2 * A) - B

Now, imagine that A is an extremely “expensive” value, e.g. a value that we have to fetch from the disk, or worse, from the network. The cost of accessing A is thus very high and if we can avoid one access to A by the above transformation, then the resulting expression is much faster to evaluate than the original one, even if mathematically speaking, it does not matter at all.

That’s what optimisers do all the time. They transform expressions into equivalent expressions which are faster to execute. And they constantly adapt, so if the DBA chooses to move A from a remote server to the local server, thus reducing the cost of access to A, perhaps, suddenly, the original plan will be better again, because of the cost of multiplication (just an example).

A SQL Example

An equally trivial SQL example from my SQL Masterclass shows that it really doesn’t matter mathematically if we run this query:

SELECT first_name, last_name
FROM customer
WHERE first_name = 'JAMIE'

Or this one:

SELECT *
FROM (
  SELECT first_name, last_name
  FROM customer
)
WHERE first_name = 'JAMIE'

With the SQL language, it may be a bit harder to see that these are exactly equivalent SQL statements, but if we translate the above queries to relational algebra, it may become more visible:

  • Selection before projection (WHERE before SELECT)
  • Projection then selection (SELECT before WHERE)
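
Just for illustration, here’s the equivalence in relational algebra notation (with $\sigma$ for selection and $\pi$ for projection); it holds because the predicate references only projected columns:

$$
\pi_{first\_name,\ last\_name}\big(\sigma_{first\_name = 'JAMIE'}(customer)\big)
= \sigma_{first\_name = 'JAMIE'}\big(\pi_{first\_name,\ last\_name}(customer)\big)
$$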

Don’t be fooled by relational algebra’s term “selection”. It does not correspond to the SELECT clause, but to the WHERE clause!

We can prove (let’s leave the exercise to the reader), that both expressions are exactly equivalent, so optimisers can pick whichever they have a more efficient matching algorithm for:

  • Ordinary row stores will probably apply the selection first, before projecting, as the expensive operation is accessing rows and if we can filter rows early (e.g. through indexes) then the whole query is less expensive)
  • A column store might (just a hypothesis) apply the projection first, as the expensive operation is accessing the columns. We then might have to traverse the entire column anyway to filter out the rows

Let’s Talk About JOINs (and Their Elimination)

JOIN elimination is one of the most basic and yet most powerful SQL transformations, which is implemented by most modern databases in one way or another. It is so powerful, because we’re writing (potentially useless) JOINs all the time, when writing a bit more complex queries. See also our article about JOINs for a general overview.

Now, consider the following simplified schema, taken from the Sakila database:

CREATE TABLE address (
  address_id INT NOT NULL,
  address VARCHAR(50) NOT NULL,
  CONSTRAINT pk_address PRIMARY KEY (address_id)
);

CREATE TABLE customer (
  customer_id INT NOT NULL,
  first_name VARCHAR(45) NOT NULL,
  last_name VARCHAR(45) NOT NULL,
  address_id INT NOT NULL,
  CONSTRAINT pk_customer PRIMARY KEY  (customer_id),
  CONSTRAINT fk_customer_address FOREIGN KEY (address_id) 
    REFERENCES address(address_id)
);

Let’s ignore indexing and other useful features for this example.

INNER JOIN Elimination

The following query shows a common JOIN use-case, joining a parent table (ADDRESS) to a child table (CUSTOMER) in a to-one relationship:

SELECT c.*
FROM customer c
JOIN address a 
ON c.address_id = a.address_id

We intended to fetch all customers and their addresses. But observe: We project only columns from the CUSTOMER table and we don’t have any predicates at all, specifically not predicates using the ADDRESS table. So, we’re completely ignoring any contributions from the ADDRESS table. We never really needed the JOIN in the first place!

And in fact, the optimiser can prove this too, because of the FOREIGN KEY constraint on C.ADDRESS_ID, which guarantees that every CUSTOMER record has exactly one corresponding ADDRESS record. The JOIN does not duplicate nor remove any CUSTOMER rows, so it is unneeded and thus eliminated (by some databases, not all; I will list each database at the end of the article).

So, the database can rewrite the SQL statement to the following, equivalent SQL statement in the presence of said FOREIGN KEY:

SELECT *
FROM customer c

Now, quite obviously, this query will be faster than the previous one, if the entire JOIN can be avoided, and thus the entire access to the ADDRESS table. Neat, huh? Who would have thought that FOREIGN KEYs can be so useful in terms of performance.

The above works if there’s also a NOT NULL constraint on the FOREIGN KEY. If there isn’t, e.g. as in this query:

SELECT title
FROM film f
JOIN language l ON f.original_language_id = l.language_id

The JOIN can still be eliminated, but there needs to be a replacement NOT NULL predicate, as such:

SELECT title
FROM film
WHERE original_language_id IS NOT NULL

OUTER JOIN Elimination

A LEFT [ OUTER ] JOIN will JOIN the right table to the left table but keep rows from the left table if there is no match (again, an explanation of joins can be seen here). When we apply LEFT JOIN to the previous query…

SELECT c.*
FROM customer c
LEFT JOIN address a 
ON c.address_id = a.address_id

… then we’ll fetch all rows from CUSTOMER regardless if that customer has any ADDRESS. This is useful if the FOREIGN KEY is optional (nullable), or completely absent, e.g. through:

ALTER TABLE customer DROP CONSTRAINT fk_customer_address

OUTER JOIN is even easier to eliminate, as it doesn’t require any FOREIGN KEY constraint for the database to prove that it is unneeded. A UNIQUE constraint on the parent table (here: ADDRESS.ADDRESS_ID) is sufficient to show that for every CUSTOMER there can be at most one ADDRESS, so the LEFT JOIN won’t duplicate any CUSTOMER rows (unlike INNER JOIN, OUTER JOIN never removes rows).

Hence, the above query can again be rewritten to the more optimal:

SELECT *
FROM customer c

OUTER JOIN Elimination with DISTINCT

Another interesting case of OUTER JOIN elimination is the following one, which unfortunately didn’t work on Oracle for a customer of ours, recently, in a complex query that ran rogue. Let’s look at some other tables of the Sakila database, which expose a to-many relationship:

CREATE TABLE actor (
  actor_id numeric NOT NULL ,
  first_name VARCHAR(45) NOT NULL,
  last_name VARCHAR(45) NOT NULL,
  CONSTRAINT pk_actor PRIMARY KEY (actor_id)
);

CREATE TABLE film (
  film_id int NOT NULL,
  title VARCHAR(255) NOT NULL,
  CONSTRAINT pk_film PRIMARY KEY (film_id)
);

CREATE TABLE film_actor (
  actor_id INT NOT NULL,
  film_id  INT NOT NULL,
  CONSTRAINT pk_film_actor PRIMARY KEY (actor_id, film_id),
  CONSTRAINT fk_film_actor_actor FOREIGN KEY (actor_id)
    REFERENCES actor (actor_id),
  CONSTRAINT fk_film_actor_film FOREIGN KEY (film_id)
    REFERENCES film (film_id)
);

Now, consider this query:

SELECT DISTINCT first_name, last_name
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id

We’re looking for all actors and their films, but then we project only distinct actors. Again, the JOIN to FILM_ACTOR doesn’t contribute anything to the result, but because we’re joining a to-many relationship here (from parent table ACTOR to child table FILM_ACTOR), the JOIN is producing duplicate rows. Without DISTINCT, we’d get something like:

FIRST_NAME   LAST_NAME
----------------------
...
PENELOPE     GUINESS
PENELOPE     GUINESS
PENELOPE     GUINESS
NICK         WAHLBERG
NICK         WAHLBERG
...

But thanks to the DISTINCT keyword, the result is (provably) no different from the result of this much simpler query:

SELECT DISTINCT first_name, last_name
FROM actor a

(Note, DISTINCT cannot be eliminated, unless we already have a UNIQUE constraint on (FIRST_NAME, LAST_NAME)).
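
For illustration, such a constraint might be declared as follows (UQ_ACTOR_NAME is a hypothetical name, and note that it could not actually be created on the stock Sakila data, where actor names do repeat):

ALTER TABLE actor
ADD CONSTRAINT uq_actor_name UNIQUE (first_name, last_name);

With such a constraint in place, the optimiser could prove the DISTINCT redundant, too.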

Why not Just Refactor the SQL Manually?

Of course, all of this shouldn’t be needed if our SQL were perfect. In the above trivial examples, the SQL can (and should) be re-written manually to improve quality. But note that:

  • Developers make mistakes, and those mistakes may be very subtle when queries get more complex. I’ll show an example below.
  • The presence of this feature actually encourages writing more complex SQL, especially when using reusable views. I’ll show another example below.
  • Finally, I’ve previously advocated avoiding needless, mandatory work, like SELECT *. Such work is mandatory because the optimiser cannot prove its needlessness. In the case of these JOINs, the optimiser can prove the needlessness, so the work is no longer mandatory. It can be eliminated.

Here are some complex examples as promised, where this optimiser feature really shines:

Subtle Mistakes

Let’s consider the following query (in PostgreSQL syntax):

SELECT c.name, count(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
JOIN film f USING (film_id)
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name
ORDER BY count(*) DESC

What does it do? For ACTOR_ID = 1 (Penelope Guiness), we’re looking for all the different film categories she played in, and the number of films per category. This is easier to understand when we look at the result:

NAME         COUNT
Horror       3
Family       2
New          2
Classics     2
Games        2
Music        1
Sci-Fi       1
Animation    1
Sports       1
Children     1
Comedy       1
Documentary  1
Foreign      1

Now, can you spot the unneeded JOINs? In fact, we never needed ACTOR, nor did we need FILM:

SELECT c.name, count(*)
FROM film_actor fa
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name
ORDER BY count(*) DESC

Cool, eh? The JOINs can be eliminated (again, in some databases, see below) and our “mistake” is no longer relevant to the query. The mistake could have also snuck (or sneaked?) in from a previous query version, which may have looked like this, projecting also the actor information and the list of films per category, in which case the additional JOINs are needed:

SELECT 
  c.name, count(*), 
  a.first_name, a.last_name, 
  array_agg(f.title ORDER BY f.title)
FROM actor a
JOIN film_actor fa USING (actor_id)
JOIN film f USING (film_id)
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name, a.first_name, a.last_name
ORDER BY count(*) DESC

The result being:

NAME         COUNT  FIRST_NAME  LAST_NAME  FILMS
Horror       3      PENELOPE    GUINESS    {"ELEPHANT TROJAN","LADY STAGE","RULES HUMAN"}
Family       2      PENELOPE    GUINESS    {"KING EVOLUTION","SPLASH GUMP"}
New          2      PENELOPE    GUINESS    {"ANGELS LIFE","OKLAHOMA JUMANJI"}
Classics     2      PENELOPE    GUINESS    {"COLOR PHILADELPHIA","WESTWARD SEABISCUIT"}
Games        2      PENELOPE    GUINESS    {"BULWORTH COMMANDMENTS","HUMAN GRAFFITI"}
Music        1      PENELOPE    GUINESS    {"WIZARD COLDBLOODED"}
Sci-Fi       1      PENELOPE    GUINESS    {"CHEAPER CLYDE"}
Animation    1      PENELOPE    GUINESS    {"ANACONDA CONFESSIONS"}
Sports       1      PENELOPE    GUINESS    {"GLEAMING JAWBREAKER"}
Children     1      PENELOPE    GUINESS    {"LANGUAGE COWBOY"}
Comedy       1      PENELOPE    GUINESS    {"VERTIGO NORTHWEST"}
Documentary  1      PENELOPE    GUINESS    {"ACADEMY DINOSAUR"}
Foreign      1      PENELOPE    GUINESS    {"MULHOLLAND BEAST"}

As you can see, this optimisation can be very useful on your legacy SQL, because if we maintain a complex query, we might not always be able to see all the JOINs that are really needed.

Reusable Views

Sometimes, we simply add additional JOINs for convenience, when building complex queries from simpler ones, e.g. by using views (which is a completely underrated RDBMS feature! You should all write more views).

Consider this view:

CREATE VIEW v_customer AS
SELECT 
  c.first_name, c.last_name, 
  a.address, ci.city, co.country
FROM customer c
JOIN address a USING (address_id)
JOIN city ci USING (city_id)
JOIN country co USING (country_id)

It’s not unlikely that we will write a view like this, simply because we’re tired of constantly joining all these tables all the time. Every time we do something with customers and addresses, we need the CITY and COUNTRY table as well.

From now on (with this view), we can simply select from the view and “forget” about how it came to be. Now, let’s say we completely forget about the underlying tables, because the view has been so useful all the time. We could think about doing this:

SELECT first_name, last_name
FROM v_customer

What do you think will happen? Exactly. JOIN elimination. A view isn’t really anything special, just a “macro” of some stored SQL (beware of some databases where this isn’t always the case, e.g. MySQL, as Bill Karwin was kind enough to point out). So the above statement will be transformed into:

SELECT first_name, last_name
FROM (
  SELECT 
    c.first_name, c.last_name, 
    a.address, ci.city, co.country
  FROM customer c
  JOIN address a USING (address_id)
  JOIN city ci USING (city_id)
  JOIN country co USING (country_id)
) v_customer

… which can be transformed into this (we don’t need all columns in the nested select):

SELECT first_name, last_name
FROM (
  SELECT 
    c.first_name, c.last_name
  FROM customer c
  JOIN address a USING (address_id)
  JOIN city ci USING (city_id)
  JOIN country co USING (country_id)
) v_customer

… which can be transformed into this (JOINs can be eliminated):

SELECT first_name, last_name
FROM (
  SELECT 
    c.first_name, c.last_name
  FROM customer c
) v_customer

… and finally (the subquery is not useful):

SELECT first_name, last_name
FROM customer

The view is even very useful for this particular query, thanks to JOIN elimination!

Note, the SQL transformations exposed above are simply educational. Actual optimisers may perform transformations in an entirely different way, or in a different order. This is just to show what’s possible, and what kind of stuff is being done.

Cool, So Can My Database Do It?

Perhaps! Let’s look at the three different types of JOIN elimination in the context of these databases:

  • DB2 LUW 10.5
  • MySQL 8.0.2
  • Oracle 12.2.0.1
  • PostgreSQL 9.6
  • SQL Server 2014

INNER JOIN Elimination

Remember, this depends on the presence (and usefulness) of a FOREIGN KEY constraint. The SQL statement we’re using here is:

SELECT first_name, last_name
FROM customer c
JOIN address a ON c.address_id = a.address_id

We’re hoping to get:

SELECT first_name, last_name
FROM customer c

DB2 LUW

The following execution plan (created with Markus Winand’s cool utility) shows that this works in DB2, there’s no access to the ADDRESS table:

Explain Plan                                                           |
-----------------------------------------------------------------------|
ID | Operation                           |                 Rows | Cost |
 1 | RETURN                              |                      |   61 |
 2 |  FETCH CUSTOMER                     | 599 of 599 (100.00%) |   61 |
 3 |   IXSCAN IDX_CUSTOMER_FK_ADDRESS_ID | 599 of 599 (100.00%) |   20 |

MySQL

MySQL 8, apart from finally introducing CTEs and window functions (yay), has a lot of new optimiser features; read Morgan Tocker’s useful optimiser guide for details. Unfortunately, INNER JOIN elimination is not implemented:

ID  TABLE  TYPE    REF                  ROWS   EXTRA
1   c      ALL                          599    
1   a      eq_ref  sakila.c.address_id  1      Using index

Not only is the JOIN executed, but it is executed using a nested loop with 599 index lookups, as MySQL still only supports NESTED LOOP JOINs, not HASH JOINs.

Bummer.

Oracle

No problem at all for Oracle:

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |   599 | 28752 |     5   (0)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| CUSTOMER |   599 | 28752 |     5   (0)| 00:00:01 |
------------------------------------------------------------------------------

… the JOIN is eliminated as expected.

PostgreSQL

Unfortunately, PostgreSQL cannot eliminate INNER JOIN:

Hash Join  (cost=19.57..42.79 rows=599 width=13)
  Hash Cond: (c.address_id = a.address_id)
  ->  Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=15)
  ->  Hash  (cost=12.03..12.03 rows=603 width=4)
        ->  Seq Scan on address a  (cost=0.00..12.03 rows=603 width=4)

Not as bad as in MySQL, though, as PostgreSQL chose to use a HASH JOIN to combine the two tables.

SQL Server

No problemo for SQL Server, the ADDRESS table access is gone!

  |--Table Scan(OBJECT:([sakila].[dbo].[customer] AS [c]))

Notice, however, that SQL Server can only eliminate INNER JOIN on NOT NULL FOREIGN KEYs!

Excellent. What about…

OUTER JOIN Elimination

This one is a bit easier to prove for the database, remember? We don’t rely on any FOREIGN KEY anymore. A UNIQUE key in the parent table is sufficient to eliminate an OUTER JOIN. We can safely expect that if the INNER JOIN could be eliminated (DB2, Oracle, SQL Server), then an OUTER JOIN can be eliminated, too.

Here’s the query:

SELECT first_name, last_name
FROM customer c
LEFT JOIN address a ON c.address_id = a.address_id

And the outcome:

DB2 LUW

Good

Explain Plan                                                           |
-----------------------------------------------------------------------|
ID | Operation                           |                 Rows | Cost |
 1 | RETURN                              |                      |   61 |
 2 |  FETCH CUSTOMER                     | 599 of 599 (100.00%) |   61 |
 3 |   IXSCAN IDX_CUSTOMER_FK_ADDRESS_ID | 599 of 599 (100.00%) |   20 |

MySQL

Still nope:

ID  TABLE  TYPE    REF                  ROWS   EXTRA
1   c      ALL                          599    
1   a      eq_ref  sakila.c.address_id  1      Using index

Oracle

Perfect:

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |   599 | 28752 |     5   (0)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| CUSTOMER |   599 | 28752 |     5   (0)| 00:00:01 |
------------------------------------------------------------------------------

PostgreSQL

Unlike INNER JOIN elimination, this works. Great!

Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=13)

SQL Server

As expected, good:

  |--Table Scan(OBJECT:([sakila].[dbo].[customer] AS [c]))

Finally…

OUTER JOIN Elimination with DISTINCT

Remember, the query here navigates a to-many relationship, producing duplicate records of the parent table, but then removes all those duplicates again by

  • Ignoring contributions from the child table
  • Removing duplicates with DISTINCT

It’s easy to prove that this:

SELECT DISTINCT first_name, last_name
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id

Is equivalent to this:

SELECT DISTINCT first_name, last_name
FROM actor a

Remember also, this only works with OUTER JOIN, not with INNER JOIN, as the latter might remove rows, so we have to execute it to see if it does.
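
To see why, consider the INNER JOIN variant of the query (a sketch): if some actor had no FILM_ACTOR rows at all, the INNER JOIN would remove that actor from the result, so the JOIN cannot be proven unneeded without executing it:

SELECT DISTINCT first_name, last_name
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id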

DB2 LUW

Cool, this actually works!

Explain Plan                                                          |
----------------------------------------------------------------------|
ID | Operation        |                 Rows | Cost                   |
 1 | RETURN           |                      |   20                   |
 2 |  TBSCAN          | 200 of 200 (100.00%) |   20                   |
 3 |   SORT (UNIQUE)  | 200 of 200 (100.00%) |   20                   |
 4 |    TBSCAN ACTOR  | 200 of 200 (100.00%) |   20                   |

There’s no access to the FILM_ACTOR table, nor to its indexes. Very nice.

MySQL

As this is a more sophisticated transformation than the previous ones, we don’t have high hopes here.

ID  TABLE  TYPE    REF                  ROWS   EXTRA
1   a      ALL                          200    Using temporary
1   fa     ref     sakila.a.actor_id    27     Using index; Distinct

This has become a rather expensive query, again because of the lack of HASH JOIN support in MySQL!

Oracle

I’m very surprised to see that Oracle doesn’t support this optimisation; we’re executing the full query:

---------------------------------------------------------------
| Id  | Operation           | Name                    | Rows  |
---------------------------------------------------------------
|   0 | SELECT STATEMENT    |                         |  5462 |
|   1 |  HASH UNIQUE        |                         |  5462 |
|   2 |   NESTED LOOPS OUTER|                         |  5462 |
|   3 |    TABLE ACCESS FULL| ACTOR                   |   200 |
|*  4 |    INDEX RANGE SCAN | IDX_FK_FILM_ACTOR_ACTOR |    27 |
---------------------------------------------------------------

Curiously, Oracle also chose a NESTED LOOP JOIN in this case, even if we could have loaded the entire index on the FILM_ACTOR table into memory and then HASH JOINed it to ACTOR. Note that the cardinality estimate of the resulting query is quite off, too, despite the DISTINCT operation. This can lead to significant effects in an upstream query, which selects from this query (e.g. if stored as a view) – which is what happened to our customer.

PostgreSQL

PostgreSQL also doesn’t support this elimination, but at least gets cardinality estimates much more accurately and chooses a HASH JOIN operation:

HashAggregate  (cost=193.53..194.81 rows=128 width=13)
  Group Key: a.first_name, a.last_name
  ->  Hash Right Join  (cost=6.50..166.22 rows=5462 width=13)
        Hash Cond: (fa.actor_id = a.actor_id)
        ->  Seq Scan on film_actor fa  (cost=0.00..84.62 rows=5462 width=2)
        ->  Hash  (cost=4.00..4.00 rows=200 width=17)
              ->  Seq Scan on actor a  (cost=0.00..4.00 rows=200 width=17)

SQL Server

The pleasant surprise came from SQL Server, which does support this optimisation too:

  |--Sort(DISTINCT ORDER BY:([a].[first_name] ASC, [a].[last_name] ASC))
       |--Table Scan(OBJECT:([sakila].[dbo].[actor] AS [a]))

As you can see, no access to any FILM_ACTOR related objects!

Summary

Here’s a summary of what databases can eliminate:

Database        |INNER JOIN: to-one |INNER JOIN nullable: to-one |OUTER JOIN: to-one |OUTER JOIN DISTINCT: to-many |
----------------|-------------------|----------------------------|-------------------|-----------------------------|
DB2 LUW 10.5    |Yep                |Yep                         |Yep                |Yep                          |
MySQL 8.0.2     |Nope               |Nope                        |Nope               |Nope                         |
Oracle 12.2.0.1 |Yep                |Yep                         |Yep                |Nope                         |
PostgreSQL 9.6  |Nope               |Nope                        |Yep                |Nope                         |
SQL Server 2014 |Yep                |Nope                        |Yep                |Yep                          |

Conclusion

JOIN elimination is a very simple to understand, yet incredibly powerful feature that modern databases support to help developers build and maintain complex SQL queries without worrying too much about performance side effects.

It is possible because SQL is a 4GL (fourth-generation programming language), i.e. a declarative language whose parsed expression trees can “easily” be transformed into equivalent, simpler, and faster to execute expression trees. This work is constantly done by the optimiser behind the scenes, often without you even noticing (see example where unnecessary ACTOR and FILM tables were removed).

Currently, DB2 and SQL Server are the leaders here, with Oracle being a close runner-up, at least in my investigation. There’s hope for PostgreSQL, and a bit less hope for MySQL right now. There are tons of other interesting SQL transformations, which I’ll blog about in future posts, and which may make the difference in other kinds of complex queries.

If you were intrigued by this sort of functionality, do have a look at my most recent SQL talk, which helps developers understand the real value of SQL in the context of the optimiser.