The Useful BigQuery * EXCEPT Syntax

One of the coolest things about using and making jOOQ is that we get to discover the best extensions to the standard SQL language by vendors, and add support for those clauses in jOOQ via emulations.

One of these syntaxes is BigQuery’s * EXCEPT syntax. Everyone who ever wrote ad-hoc SQL queries would have liked this syntax to be available, occasionally. Why is it needed? Look at this query:

SELECT * FROM actor

The result being:

|actor_id|first_name|last_name   |last_update            |
|--------|----------|------------|-----------------------|
|1       |PENELOPE  |GUINESS     |2006-02-15 04:34:33.000|
|2       |NICK      |WAHLBERG    |2006-02-15 04:34:33.000|
|3       |ED        |CHASE       |2006-02-15 04:34:33.000|
|4       |JENNIFER  |DAVIS       |2006-02-15 04:34:33.000|
|5       |JOHNNY    |LOLLOBRIGIDA|2006-02-15 04:34:33.000|
|6       |BETTE     |NICHOLSON   |2006-02-15 04:34:33.000|
|...

But that LAST_UPDATE column is mighty annoying, especially if we want to NATURAL JOIN things. E.g. this doesn’t work:

SELECT actor_id, a.first_name, a.last_name, count(fa.film_id)
FROM actor AS a
NATURAL LEFT JOIN film_actor AS fa
GROUP BY actor_id

The result is just actors without films, because accidentally, the LAST_UPDATE column was included in the NATURAL JOIN:

|actor_id|first_name|last_name|count|
|--------|----------|---------|-----|
|58      |CHRISTIAN |AKROYD   |0    |
|8       |MATTHEW   |JOHANSSON|0    |
|116     |DAN       |STREEP   |0    |
|184     |HUMPHREY  |GARLAND  |0    |
|87      |SPENCER   |PECK     |0    |

This is the biggest flaw of NATURAL JOIN, making it almost useless for schemas that aren’t perfectly designed for NATURAL JOIN usage, but this is ad-hoc SQL, and it would have been nice to do that.

We could, if we had * EXCEPT like this:

SELECT 
  a.actor_id, 
  a.first_name, 
  a.last_name, 
  count(fa.film_id)
FROM (
  SELECT * EXCEPT (last_update) FROM actor
) AS a
NATURAL LEFT JOIN (
  SELECT * EXCEPT (last_update) FROM film_actor
) AS fa
GROUP BY 
  a.actor_id, 
  a.first_name, 
  a.last_name

Unfortunately, this doesn’t work in PostgreSQL and other dialects, but jOOQ can emulate it. If you provide the online SQL translator with the sakila database table definitions:

CREATE TABLE actor (
    actor_id integer NOT NULL,
    first_name character varying(45) NOT NULL,
    last_name character varying(45) NOT NULL,
    last_update timestamp without time zone DEFAULT now() NOT NULL
);
CREATE TABLE film_actor (
    actor_id smallint NOT NULL,
    film_id smallint NOT NULL,
    last_update timestamp without time zone DEFAULT now() NOT NULL
);

Then, it can resolve the EXCEPT syntax of the query and produce this:

SELECT
  a.actor_id,
  a.first_name,
  a.last_name,
  count(fa.film_id)
FROM (
  SELECT actor.actor_id, actor.first_name, actor.last_name
  FROM actor
) a
  NATURAL LEFT OUTER JOIN (
    SELECT film_actor.actor_id, film_actor.film_id
    FROM film_actor
  ) fa
GROUP BY
  a.actor_id,
  a.first_name,
  a.last_name

Obviously, we could have just omitted NATURAL JOIN to achieve the same result, but sometimes, it’s just nice to have yet another tool in the tool chain to write a nice query. With jOOQ, the query would read:

Actor a = ACTOR.as("a");
FilmActor fa = FILM_ACTOR.as("fa");

ctx.select(
        a.ACTOR_ID,
        a.FIRST_NAME,
        a.LAST_NAME,
        count(fa.FILM_ID))
   .from(
        select(asterisk().except(a.LAST_UPDATE)).from(a).asTable(a))
   .naturalLeftOuterJoin(
        select(asterisk().except(fa.LAST_UPDATE)).from(fa).asTable(fa))
   .groupBy(a.ACTOR_ID, a.FIRST_NAME, a.LAST_NAME)
   .fetch();

Selecting all Columns Except One in PostgreSQL

Google’s BigQuery has a very interesting SQL language feature, which I’ve missed many times in other databases:
select:
    SELECT  [{ ALL | DISTINCT }]
        { [ expression. ]* [ EXCEPT ( column_name [, ...] ) ]
            [ REPLACE ( expression [ AS ] column_name [, ...] ) ]
        | expression [ [ AS ] alias ] } [, ...]
    [ FROM from_item  [, ...] ]
    [ WHERE bool_expression ]
    ...
Notice the two keywords EXCEPT and REPLACE that can be used along with an asterisked expression.

An Example

For example, when running a query like this one (which fetches the longest film(s) every actor in the Sakila database played in):

SELECT *
FROM (
  SELECT 
    a.*, 
    f.*, 
    RANK() OVER (PARTITION BY actor_id ORDER BY length DESC) rk
  FROM film f
  JOIN film_actor fa USING (film_id)
  JOIN actor a USING (actor_id)
) t
WHERE rk = 1
ORDER BY first_name, last_name

This is one way to apply TOP-N per category filtering in SQL, which works with most modern databases, including MySQL 8.0. Essentially, we’re calculating the rank of a film per actor ordered by the film’s length. The result looks like this:
actor_id |first_name  |last_name    |..|title                  |length|..|rk |
---------|------------|-------------|..|-----------------------|------|--|---|
71       |ADAM        |GRANT        |..|GLADIATOR WESTWARD     |   173|..|1  |
71       |ADAM        |GRANT        |..|BALLROOM MOCKINGBIRD   |   173|..|1  |
132      |ADAM        |HOPPER       |..|TORQUE BOUND           |   179|..|1  |
165      |AL          |GARLAND      |..|JACKET FRISCO          |   181|..|1  |
Let’s assume for a moment that we really need the entire projection of the ACTOR and FILM tables (so, SELECT * is fine), but we certainly don’t need the useless RK column, which is always 1. Sometimes, having some excess columns is not going to be a problem, but sometimes it is. How to remove it? We can’t reference the ACTOR and FILM tables anymore in the outer query:

SELECT a.*, f.* -- Would be great, but wrong syntax
FROM (
  SELECT 
    a.*, 
    f.*, 
    RANK() OVER (PARTITION BY actor_id ORDER BY length DESC) rk
  FROM film f
  JOIN film_actor fa USING (film_id)
  JOIN actor a USING (actor_id)
) t
WHERE rk = 1
ORDER BY first_name, last_name

The outer query only has one table, and that’s the (derived) table T.

How to Solve This

In BigQuery syntax, we could now simply write

SELECT * EXCEPT rk
FROM (...) t
WHERE rk = 1
ORDER BY first_name, last_name

Which is really quite convenient! We want to project everything, except this one column. But none of the more popular SQL databases support this syntax. Luckily, in PostgreSQL, we can use a workaround: Nested records:

SELECT (a).*, (f).* -- Unnesting the records again
FROM (
  SELECT 
    a, -- Nesting the actor table
    f, -- Nesting the film table
    RANK() OVER (PARTITION BY actor_id ORDER BY length DESC) rk
  FROM film f
  JOIN film_actor fa USING (film_id)
  JOIN actor a USING (actor_id)
) t
WHERE rk = 1
ORDER BY (a).first_name, (a).last_name;

Notice how we’re no longer projecting A.* and F.* inside of the derived table T, but instead, the entire table (record). In the outer query, we have to use some slightly different syntax to unnest the record again (e.g. (A).FIRST_NAME), and we’re done.

How Does This Work?

Informix, Oracle, PostgreSQL, and maybe a few lesser known ones, have implemented the SQL standard’s ORDBMS features to various degrees. ORDBMS attempted to combine relational and object oriented features in the SQL language (and in the storage model). For all practical purposes, this essentially just means that we can nest records and collections. For instance, in PostgreSQL, we can write:

-- Explicit ROW constructor
SELECT 1, ROW(2, ROW(3, 4))

-- Implicit ROW constructor
SELECT 1, (2, (3, 4))

And we’ll get:
x        |row       |
---------|----------|
1        |(2,(3,4)) |
Along with ordinary “scalar” values, we can have nested rows (or nested tuples) constructed very easily. Conveniently, we can also reference a table without its column names in the projection, such as:

SELECT a, f
FROM film f
JOIN film_actor fa USING (film_id)
JOIN actor a USING (actor_id)

Which produces the aforementioned result:
a    |f    |
-----|-----|
(...)|(...)|
(...)|(...)|
(...)|(...)|
...
Similar things are possible in Oracle as well, except that Oracle doesn’t support structural row/tuple types, only nominal ones. We’d have to create some types first, prior to being able to use them:

CREATE TABLE film_t AS OBJECT (...);

Bonus

Of course, if you’re using SQL Server or Oracle, you wouldn’t have this problem, because then you could use the totally underrated WITH TIES clause along with CROSS APPLY: SQL Server

SELECT *
FROM actor a
CROSS APPLY (
  SELECT TOP 1 WITH TIES f.*
  FROM film f
  JOIN film_actor fa 
    ON f.film_id = fa.film_id
	AND fa.actor_id = a.actor_id
  ORDER BY length DESC
) f
ORDER BY first_name, last_name;

Oracle (Do check performance on this!)

SELECT *
FROM actor a
CROSS APPLY (
  SELECT f.*
  FROM film f
  JOIN film_actor fa 
    ON f.film_id = fa.film_id
	AND fa.actor_id = a.actor_id
  ORDER BY length DESC
  FETCH FIRST ROW WITH TIES
) f
ORDER BY first_name, last_name;

PostgreSQL and DB2 support the LATERAL keyword, which could be used with FETCH FIRST ROW ONLY semantics (so, no ties are selected). For more details about TOP N per category queries, see this blog post