Functional Dependencies in SQL GROUP BY

The SQL standard knows an interesting feature where you can project any functional dependencies of a primary (or unique) key that is listed in the GROUP BY clause without having to add that functional dependency to the GROUP BY clause explicitly.

What does this mean? Consider this simple schema:

CREATE TABLE author (
  id INT NOT NULL PRIMARY KEY,
  name TEXT NOT NULL
);

CREATE TABLE book (
  id INT NOT NULL PRIMARY KEY,
  author_id INT NOT NULL REFERENCES author,
  title TEXT NOT NULL
);

In order to count the number of books by author, we tend to write:

SELECT a.name, count(b.id)
FROM author a
LEFT JOIN book b ON a.id = b.author_id
GROUP BY 
  a.id,  -- Required, because names aren't unique
  a.name -- Required in some dialects, but not in others

We have to group by something unique in this case, because if two authors are called John Doe, we still want them to produce separate groups. So GROUP BY a.id is a given.

We’re used to also GROUP BY a.name, especially in these dialects that require this, since we list a.name in the SELECT clause:

Db2
Derby
Exasol
Firebird
HANA
Informix
Oracle
SQL Server

But is it really required? It isn’t as per the SQL standard, because there is a functional dependency between author.id and author.name. In other words, for each value of author.id, there is exactly one possible value of author.name, or author.name is a function of author.id

This means that it does not matter if we GROUP BY both columns, or only the primary key. The result must be the same in both cases, hence this is possible:

SELECT a.name, count(b.id)
FROM author a
LEFT JOIN book b ON a.id = b.author_id
GROUP BY a.id

Which SQL dialects support this?

At least the following SQL dialects support this language feature:

CockroachDB
H2
HSQLDB
MariaDB
MySQL
PostgreSQL
SQLite
Yugabyte

It’s noteworthy that MySQL used to simply ignore whether a column could be projected unambiguously or not, in the presence of GROUP BY. While the following query was rejected in most dialects, it was not, in MySQL, prior to the introduction of the ONLY_FULL_GROUP_BY mode:

SELECT author_id, title, count(*)
FROM author
GROUP BY author_id

What should we display for author.title, if an author has written more than one book? It doesn’t make sense, yet MySQL still used to allow it, and would just project any arbitrary value from the group.

Today, MySQL only allows for projecting columns with a functional dependency on the GROUP BY clause, as is permitted by the SQL standard.

Pros & Cons

While the shorter syntax that avoids the extra columns might be easier to maintain (easy to project additional columns, if required), there is some risk of queries breaking in production, namely when underlying constraints are disabled, e.g. for a migration. While it is unlikely that a primary key is disabled in a live system, it could still be the case, and without the key, a previously valid query will no longer be valid for the same reason why MySQL’s old interpretation was invalid: There’s no longer a guarantee of functional dependency.

Other syntax

Starting from jOOQ 3.16, and #11834, it will be possible to reference tables directly in the GROUP BY clause, instead of individual columns. For example:

SELECT a.name, count(b.id)
FROM author a
LEFT JOIN book b ON a.id = b.author_id
GROUP BY a

The semantics will be:

If the table has a primary key (composite or not), use that in the GROUP BY clause, instead
If the table doesn’t have a primary key, list all the columns from the table instead.

Since none of the RDBMS supported by jOOQ currently supports this syntax, it is a purely synthetic jOOQ feature.

Which SQL dialects support this?

Pros & Cons

Other syntax

Like this:

Published by lukaseder

Leave a ReplyCancel reply

Discover more from Java, SQL and jOOQ.