How to Write Efficient TOP N Queries in SQL

A very common type of SQL query is the TOP-N query, where we need the “TOP N” records ordered by some value, possibly per category. In this blog post, we’re going to look into a variety of different aspects to this problem, as well as how to solve them with standard and non-standard SQL.

These are the different aspects we’ll discuss:

  • Top values
  • Top values with ties
  • Top values per category

Getting the Top Value

When looking at the Sakila database, we might want to find the actor who played in the most films. The simplest solution here would be to use GROUP BY to find the number of films per actor, and then ORDER BY and LIMIT to find the “TOP 1” actor.

Here’s the query in PostgreSQL:

SELECT actor_id, first_name, last_name, count(film_id)
FROM actor
LEFT JOIN film_actor USING (actor_id)
GROUP BY actor_id, first_name, last_name
ORDER BY count(film_id) DESC
LIMIT 1;

Yielding

ACTOR_ID  FIRST_NAME  LAST_NAME  COUNT
--------------------------------------
107       GINA        DEGENERES  42

Other databases have different syntaxes for LIMIT – check out the jOOQ manual for a complete list of emulations of this useful clause, e.g. in Oracle 12c, we would use FETCH:

SELECT actor_id, first_name, last_name, count(film_id)
FROM actor
LEFT JOIN film_actor USING (actor_id)
GROUP BY actor_id, first_name, last_name
ORDER BY count(*) DESC
FETCH FIRST ROW ONLY;

Or, in SQL Server, we could use TOP

SELECT TOP 1 a.actor_id, first_name, last_name, count(film_id)
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
GROUP BY a.actor_id, first_name, last_name
ORDER BY count(*) DESC;

… which kinda makes sense. We want the TOP 1 actor, right?

Alternatives with Subqueries

In my SQL Masterclass, we’re also looking at the performance of various variants of solving this particular problem. For instance, we could calculate the COUNT(*) value in a derived value, prior to joining:

SELECT actor_id, first_name, last_name, COALESCE(c, 0)
FROM actor
LEFT JOIN (
  SELECT actor_id, count(*) AS c
  FROM film_actor
  GROUP BY actor_id
) fa USING (actor_id)
ORDER BY COALESCE(c, 0) DESC
LIMIT 1;

Or, we could calculate the COUNT(*) value in a correlated subquery:

SELECT actor_id, first_name, last_name, (
  SELECT count(*)
  FROM film_actor fa
  WHERE fa.actor_id = a.actor_id
) AS c
FROM actor a
ORDER BY c DESC
LIMIT 1;

These perform vastly differently, depending on the database as can be seen in this tweet:

Anyway, the different techniques do not really influence the TOP-N semantics themselves, as we’re always using the same two clauses to get the TOP 1 actor:

ORDER BY c DESC -- Some means of calculating "c"
LIMIT 1;        -- Some means of limiting results to 1 row

Window function alternative

In the old days when databases like Oracle didn’t really support LIMIT (or when using DB2, SQL Server, and Sybase, which support “LIMIT” but not “OFFSET”), people used to resort to using window functions. In order to get the FETCH FIRST n ROWS ONLY semantics, we can use ROW_NUMBER():

SELECT *
FROM (
  SELECT 
    actor_id, first_name, last_name, count(film_id),
    row_number() OVER (ORDER BY count(film_id) DESC) rn
  FROM actor
  LEFT JOIN film_actor USING (actor_id)
  GROUP BY actor_id, first_name, last_name
) t
WHERE rn <= 5
ORDER BY rn;

The derived table contains that ROW_NUMBER() window function, which produces a unique, consecutive row number for each actor ordered by the number of films the actor played in.

Note that thanks to the aggregation step “happening before” the windowing step, we can use aggregation functions inside of window functions.

We’ll revisit the window function approach again later, when implementing “WITH TIES” semantics using RANK().

Oracle Specific Alternative

In Oracle, we have those cool FIRST and LAST functions, with a bit of an arcane syntax. They help finding the first or last element in a sorted group.

SELECT 
  max(actor_id)   KEEP (DENSE_RANK FIRST ORDER BY c DESC, actor_id),
  max(first_name) KEEP (DENSE_RANK FIRST ORDER BY c DESC, actor_id),
  max(last_name)  KEEP (DENSE_RANK FIRST ORDER BY c DESC, actor_id),
  max(c)          KEEP (DENSE_RANK FIRST ORDER BY c DESC, actor_id)
FROM (
  SELECT actor_id, first_name, last_name, count(film_id) c
  FROM actor
  LEFT JOIN film_actor USING (actor_id)
  GROUP BY actor_id, first_name, last_name
) t;

It’s fair to say that the syntax is a bit verbose.

How does this work? The nested query counts again all the films per actor, producing the entire result set. The outer query, however, aggregates all these actors and their film counts into a single row, displaying in that row only the first value(s) within the group, ordered by some specific ordering criteria. Note that the ACTOR_ID was also added to the ordering criteria, to get a unique ordering (see “with ties” semantics, below).

Why do we need the MAX() aggregate function? Simply because this feature applies only to aggregate functions, and the FIRST value(s) could be more than one value within the group, namely when the ordering would produce a “tie” (again, see below).

I’ve seen this slightly outperform the alternatives in TOP 1 cases (that ignore “TIES”, see below), including this one. A benchmark showing relative execution times shows:

Statement 1 : 1.88945  (FETCH version)
Statement 2 : 1.84442  (window function version)
Statement 3 : 1        (KEEP version)

The benchmark technique is described in this blog, and also again below. Measure yourself though!

Getting the Top Value “With Ties”

In the previous query, we’re getting a single top value, and the result is correct as we can see with this query that returns the TOP 5 values:

SELECT actor_id, first_name, last_name, count(film_id)
FROM actor
LEFT JOIN film_actor USING (actor_id)
GROUP BY actor_id, first_name, last_name
ORDER BY count(film_id) DESC
LIMIT 5;

The result being:

ACTOR_ID  FIRST_NAME  LAST_NAME  COUNT
--------------------------------------
107       GINA        DEGENERES  42
102       WALTER      TORN       41
198       MARY        KEYTEL     40
181       MATTHEW     CARREY     39
23        SANDRA      KILMER     37

But what if WALTER TORN had played in yet one more film? We would have a “tie” between GINA DEGENERES and WALTER TORN, so which one would we pick as the TOP 1 actor? There are several possible answers:

  • Pick a random one by accident: This is what our query does right now. This might be considered fair, as the preference for the winner might be merely technical and might change between releases
  • Pick a random one explicitly: We could add another sort criteria that introduces randomness (e.g. a RANDOM() function call). Let’s just hope that this call is only made when needed, i.e. when we have a tie. Otherwise, that would be rather dangerous. But it would be fair.
  • Specify a unique additional sort criteria: E.g. the ACTOR_ID. That would make the result predictable, but in this case a bit unfair.
  • Fetch both tied actors. Yes, we can have several winners. That’s what sports events do, and that’s what we’ll discuss now.

So, let’s add one more film for WALTER TORN:

INSERT INTO film_actor (actor_id, film_id)
VALUES (102, 1);

The SQL standard specifies how we can fetch the first rows “with their ties”, namely by using … well … the FETCH FIRST ROWS WITH TIES syntax!

This is implemented as such in Oracle 12c, the only database I’ve seen with this standard syntax so far.

SELECT actor_id, first_name, last_name, count(film_id)
FROM actor
LEFT JOIN film_actor USING (actor_id)
GROUP BY actor_id, first_name, last_name
ORDER BY count(film_id) DESC
FETCH FIRST ROWS WITH TIES;

Note that all of these syntaxes are possible, they’re all equivalent. Use the one that resembles the English language the most, in your opinion (in other words: everyone does it differently):

FETCH FIRST 1 ROWS WITH TIES
FETCH FIRST 1 ROW WITH TIES
FETCH FIRST ROWS WITH TIES
FETCH FIRST ROW WITH TIES
FETCH NEXT 1 ROWS WITH TIES
FETCH NEXT 1 ROW WITH TIES
FETCH NEXT ROWS WITH TIES
FETCH NEXT ROW WITH TIES

The only other database I’m aware of that knows this feature (but not the standard syntax) is SQL Server:

SELECT TOP 1 WITH TIES 
  a.actor_id, first_name, last_name, count(*)
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
GROUP BY a.actor_id, first_name, last_name
ORDER BY count(*) DESC;

While SQL Server also supports the standard OFFSET .. FETCH syntax for the ROWS ONLY semantics (i.e. pick a random row among the winners), it supports WITH TIES only with the proprietary TOP syntax.

That’s OK, we can always use TOP, because OFFSET and OFFSET pagination is verboten anyway, right?

Using Window Functions in Other Databases

As promised, we’re now revisiting a window function based solution to make the “WITH TIES” semantics work in other databases as well. We can make use of ranking functions, namely RANK() (or in more rare cases DENSE_RANK()).

Here’s how to find the TOP 1 actor WITH TIES in PostgreSQL, for instance:

SELECT *
FROM (
  SELECT 
    actor_id, first_name, last_name, count(film_id),
    rank() OVER (ORDER BY count(film_id) DESC) rk
  FROM actor
  LEFT JOIN film_actor USING (actor_id)
  GROUP BY actor_id, first_name, last_name
) t
WHERE rk = 1;

We’re now nesting the original query in a derived table, adding an additional column RK, which contains the RANK() of the row given the desired ordering. The result is:

ACTOR_ID  FIRST_NAME  LAST_NAME  COUNT  RK
------------------------------------------
107       GINA        DEGENERES  42     1
102       WALTER      TORN       42     1

If we wanted to find the TOP 5 again, we simply search for ranks less or equal to 5. And don’t forget explicit ordering again. Even if the ordering from the subquery will probably be stable (namely the one from the window function, which was the last one applied to the subquery), we must never rely on a non-explicit ordering. NEVER.

SELECT *
FROM (
  SELECT 
    actor_id, first_name, last_name, count(film_id),
    rank() OVER (ORDER BY count(film_id) DESC) rk
  FROM actor
  LEFT JOIN film_actor USING (actor_id)
  GROUP BY actor_id, first_name, last_name
) t
WHERE rk <= 5
ORDER BY rk;
ACTOR_ID  FIRST_NAME  LAST_NAME  COUNT  RK
------------------------------------------
107       GINA        DEGENERES  42     1
102       WALTER      TORN       42     1
198       MARY        KEYTEL     40     3
181       MATTHEW     CARREY     39     4
23        SANDRA      KILMER     37     5

Now, if we had another actor with 37 films, that other actor would also appear in this list and the list would contain 6 actors, even if we searched for the TOP 5 (WITH TIES). Cool, eh?

Oracle Specific Solution

Remember we were discussing the Oracle-specific FIRST and LAST functions? Unfortunately, this cannot be used to get the WITH TIES semantics because the aggregation can produce only a single row, not multiple tied rows.

As a workaround, I’ve tried combining LISTAGG() with KEEP with no avail:

SELECT 
  LISTAGG (first_name, ', ')
    WITHIN GROUP (ORDER BY count(film_id) DESC)
    KEEP (DENSE_RANK FIRST ORDER BY count(film_id) DESC),
...

While this could have produced all the ties in a comma separated list of values, this syntax is simply not allowed.

Top Values Per Category

Now, this is the coolest! So far, we’ve run single queries getting the single TOP N values over our entire data set. I.e. the actor with the most films.

Imagine, however, we’d like to get the TOP N something per actor. For instance, the TOP 3 most successful films an actor played in. That would be quite a query. To keep it simple on the Sakila database, let’s simply find the TOP 3 films with the longest titles per actor.

This will again use LIMIT, but this time, we need to do it in a subquery. The two tools we’ve seen so far won’t really work well:

  • Derived tables (subqueries in the FROM clause) cannot implement the per actor semantics easily, at least not with LIMIT. It’s possible with window functions as we’ll see later.
  • Correlated subqueries (subqueries in the SELECT or WHERE clauses) can only return one row and one column. Not quite useful when we want to return, e.g. the TOP 3 rows.

Luckily, the SQL standard specifies LATERAL (implemented by Oracle 12c, PostgreSQL, DB2) and SQL Server has always had APPLY (also available in Oracle 12c). Both syntaxes are exactly equivalent. Let’s look at APPLY first, as I much prefer this syntax:

SELECT actor_id, first_name, last_name, title
FROM actor a
OUTER APPLY (
  SELECT title
  FROM film f
  JOIN film_actor fa USING (film_id)
  WHERE fa.actor_id = a.actor_id
  ORDER BY length(title) DESC
  FETCH FIRST 3 ROWS ONLY
) t
ORDER BY actor_id, length(title) DESC;

This will produce:

ACTOR_ID  FIRST_NAME  LAST_NAME  TITLE
------------------------------------------------------
1         PENELOPE    GUINESS    BULWORTH COMMANDMENTS
1         PENELOPE    GUINESS    ANACONDA CONFESSIONS
1         PENELOPE    GUINESS    GLEAMING JAWBREAKER
2         NICK        WAHLBERG   GOODFELLAS SALUTE
2         NICK        WAHLBERG   DESTINY SATURDAY
2         NICK        WAHLBERG   ADAPTATION HOLES
3         ED          CHASE      ARTIST COLDBLOODED
3         ED          CHASE      NECKLACE OUTBREAK
3         ED          CHASE      BOONDOCK BALLROOM
...

Cool. How does it work?

Think of the derived table as a function. A function with a single argument:

SELECT actor_id, first_name, last_name, title
FROM actor a
OUTER APPLY (
  SELECT title
  FROM film f
  JOIN film_actor fa USING (film_id)
  WHERE fa.actor_id = a.actor_id -- this is the function argument
  ORDER BY length(title) DESC
  FETCH FIRST 3 ROWS ONLY
) t
ORDER BY actor_id, length(title) DESC;

Or, if we would be storing this as an actual table valued function, e.g. in PostgreSQL syntax:

CREATE FUNCTION top_3_films_per_actor(p_actor_id IN INTEGER)
RETURNS TABLE (
  title VARCHAR(50)
)
AS $$
  SELECT title
  FROM film f
  JOIN film_actor fa USING (film_id)
  WHERE fa.actor_id = p_actor_id
  ORDER BY length(title) DESC
  LIMIT 3
$$ LANGUAGE sql;

Table valued functions are pretty cool. They work like parameterised views in case your database can inline it. DB2 and SQL Server can, Oracle 12c and PostgreSQL 9.6 cannot. It’s not a surprise in Oracle, which doesn’t support SQL functions, only PL/SQL functions.

The function can now be used as such:

SELECT actor_id, first_name, last_name, title
FROM actor a
LEFT JOIN LATERAL top_3_films_per_actor(a.actor_id) t 
ON 1 = 1 -- Silly predicate is required because of LEFT JOIN
ORDER BY actor_id, length(title) DESC;

Note that the LATERAL syntax is exactly the same as APPLY with a bit more syntactic ceremony, especially when used with OUTER JOIN.

Now, normally, we aren’t allowed to do this. We aren’t allowed to access the column A.ACTOR_ID from within a derived table or function expression. Inside the derived table, the table A is not defined, and the outer scope is not accessible either.

APPLY (and LATERAL) changes this. With these tools, we can now access all the tables and their columns to the left of the APPLY (or LATERAL) operator. This means our subquery now works like a correlated subquery, but it is allowed to return:

  • More than one row
  • More than one column

That’s really cool! If you’re working with Java 8 streams, think of it as the equivalent of the Stream.flatMap() operation. If we wanted to write this query in Java with streams, we could write something like:

actors
  .stream()
  .flatMap(a -> films
    .stream()
    .filter(f -> f.hasActor(a)) // Assuming some API
    .sorted(comparing(f -> f.title.length()).reversed())
    .limit(3)
    .map(f -> tuple(a, f)));

This would be more or less the same, with a few simplifications. E.g. the FILM_ACTOR JOIN has been cheated away with an auxiliary method Film.hasActor(Actor).

So, following the flatMap() “metaphor”, APPLY applies a table function (e.g. a subquery) to another table (in our case: ACTOR). That function produces a table for each record of the ACTOR table. In our case, we chose OUTER APPLY (instead of CROSS APPLY) because like a LEFT JOIN that will keep the actors without films in the result set – something that isn’t as easy to do with flatMap(), which corresponds to CROSS APPLY.

For more info about LATERAL and APPLY read these interesting posts:

WITH TIES Semantics

What if we wanted to list the TOP 3 films by their longest title per actor WITH TIES? I.e. if there are several films of equal length, we might get 4 or 5 or more films for any given actor?

In databases that support the WITH TIES syntax, this is really simple. In Oracle, just replace ROWS ONLY by ROWS WITH TIES:

SELECT actor_id, first_name, last_name, title
FROM actor a
OUTER APPLY (
  SELECT title
  FROM film f
  JOIN film_actor fa USING (film_id)
  WHERE fa.actor_id = a.actor_id
  ORDER BY length(title) DESC
  FETCH FIRST 3 ROWS WITH TIES
) t
ORDER BY actor_id, length(title) DESC;

In SQL Server, add the clause to the TOP clause:

SELECT actor_id, first_name, last_name, title
FROM actor a
OUTER APPLY (
  SELECT TOP 3 WITH TIES title
  FROM film f
  JOIN film_actor fa ON f.film_id = fa.film_id
  WHERE fa.actor_id = a.actor_id
  ORDER BY len(title) DESC
) t
ORDER BY actor_id, len(title) DESC;

In both cases, we’re now getting:

ACTOR_ID  FIRST_NAME  LAST_NAME  TITLE
------------------------------------------------------
1         PENELOPE    GUINESS    BULWORTH COMMANDMENTS
1         PENELOPE    GUINESS    ANACONDA CONFESSIONS
1         PENELOPE    GUINESS    GLEAMING JAWBREAKER
1         PENELOPE    GUINESS    WESTWARD SEABISCUIT
2         NICK        WAHLBERG   GOODFELLAS SALUTE
2         NICK        WAHLBERG   ADAPTATION HOLES
2         NICK        WAHLBERG   WARDROBE PHANTOM
2         NICK        WAHLBERG   RUSHMORE MERMAID
2         NICK        WAHLBERG   HAPPINESS UNITED
2         NICK        WAHLBERG   FIGHT JAWBREAKER
2         NICK        WAHLBERG   DESTINY SATURDAY
3         ED          CHASE      ARTIST COLDBLOODED
3         ED          CHASE      NECKLACE OUTBREAK
3         ED          CHASE      BOONDOCK BALLROOM
...

Quite a different result!

If we don’t have native support for WITH TIES, then we can use RANK() again, and this time, we don’t need APPLY or LATERAL:

SELECT actor_id, first_name, last_name, title
FROM actor a
LEFT JOIN (
  SELECT 
    actor_id,
    title, 
    rank() OVER (
      PARTITION BY actor_id
      ORDER BY length(title) DESC
    ) rk
  FROM film f
  JOIN film_actor fa USING (film_id)
) t USING (actor_id)
WHERE rk <= 3
ORDER BY actor_id, rk;

What are we doing here, compared to the APPLY version of the query?

  • First off, we’re no longer filtering anything in the subquery. The original WHERE clause that accessed the outer A.ACTOR_ID column is gone
  • … instead, we’re using that ACTOR_ID column in an ordinary JOIN predicate between the ACTOR table and our derived table that produces the films
  • We calculate the RANK() of films per actor. Unlike before, where we calculated the RANK() over the entire data set (the default PARTITION in window functions is always the entire data set), we now partition our data set by ACTOR_ID. So, for each actor, we’re getting the TOP 3 (and 4 and 5 and 6, …) films pre-calculated
  • … only then, in the outer query, we can filter again by this pre-calculated rank, hoping that the database will be smart enough and somehow push down the predicate into the ranking function.

Which is the faster solution? NEVER GUESS! ALWAYS MEASURE! 🙂

Benchmarking time

As a matter of fact, FETCH is just syntax sugar for filtering on window functions in Oracle. So the two queries should really perform equally well. I’m using the benchmarking technique from this article here.

Oracle 12.2.0.1.0 first:

Here’s the full code for Oracle:

SET SERVEROUTPUT ON
CREATE TABLE results (
  run     NUMBER(2),
  stmt    NUMBER(2),
  elapsed NUMBER
);

DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 100;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT actor_id, first_name, last_name, title
        FROM actor a
        OUTER APPLY (
          SELECT title
          FROM film f
          JOIN film_actor fa USING (film_id)
          WHERE fa.actor_id = a.actor_id
          ORDER BY length(title) DESC
          FETCH FIRST 3 ROWS WITH TIES
        ) t
        ORDER BY actor_id, length(title) DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
  
    INSERT INTO results VALUES (r, 1, 
	  SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT actor_id, first_name, last_name, title
        FROM actor a
        LEFT JOIN (
          SELECT 
            actor_id,
            title, 
            rank() OVER (
              PARTITION BY actor_id
              ORDER BY length(title) DESC
            ) rk
          FROM film f
          JOIN film_actor fa USING (film_id)
        ) t USING (actor_id)
        WHERE rk <= 3
        ORDER BY actor_id, rk
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    INSERT INTO results VALUES (r, 2, 
	  SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
  END LOOP;
  
  FOR rec IN (
    SELECT 
      run, stmt, 
      CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio 
    FROM results
  )
  LOOP
    dbms_output.put_line('Run ' || rec.run || 
      ', Statement ' || rec.stmt || 
      ' : ' || rec.ratio);
  END LOOP;
END;
/

DROP TABLE results;

Yielding:

Run 1, Statement 1 :  9.19943
Run 1, Statement 2 :  1.07469
Run 2, Statement 1 : 12.86258
Run 2, Statement 2 :  1.03741
Run 3, Statement 1 : 13.80538
Run 3, Statement 2 :  1.09832
Run 4, Statement 1 : 13.91985
Run 4, Statement 2 :  1.16206
Run 5, Statement 1 :  9.37335
Run 5, Statement 2 :  1

Results are relative to each other (according to my understanding, I’m doing this in order to comply with the Oracle license regarding the publication of benchmark results – no actual times are published). Statement 2 (using explicit ranking) is repeatedly at least 9x faster than statement 1 (using FETCH) on Oracle 12.2.0.1.0! Bummer!

What about SQL Server 2014?

Benchmarking logic:

DECLARE @ts DATETIME;
DECLARE @repeat INT = 100;
DECLARE @r INT;
DECLARE @i INT;
DECLARE @dummy1 INT;
DECLARE @dummy2 VARCHAR;
DECLARE @dummy3 VARCHAR;
DECLARE @dummy4 VARCHAR;

DECLARE @s1 CURSOR;
DECLARE @s2 CURSOR;

DECLARE @results TABLE (
  run     INT,
  stmt    INT,
  elapsed DECIMAL
);

SET @r = 0;
WHILE @r < 5
BEGIN
  SET @r = @r + 1

  SET @s1 = CURSOR FOR 
    SELECT actor_id, first_name, last_name, title
	FROM actor a
	OUTER APPLY (
	  SELECT TOP 3 WITH TIES title
	  FROM film f
	  JOIN film_actor fa ON f.film_id = fa.film_id
	  WHERE fa.actor_id = a.actor_id
	  ORDER BY len(title) DESC
	) t
	ORDER BY actor_id, len(title) DESC;

  SET @s2 = CURSOR FOR 
    SELECT a.actor_id, first_name, last_name, title
    FROM actor a
    LEFT JOIN (
      SELECT 
        actor_id,
        title, 
        rank() OVER (
          PARTITION BY actor_id
          ORDER BY len(title) DESC
        ) rk
      FROM film f
      JOIN film_actor fa ON f.film_id = fa.film_id
	) t ON a.actor_id = t.actor_id
    WHERE rk <= 3
    ORDER BY a.actor_id, rk

  SET @ts = current_timestamp;
  SET @i = 0;
  WHILE @i < @repeat
  BEGIN
    SET @i = @i + 1

    OPEN @s1;
    FETCH NEXT FROM @s1 INTO @dummy1, @dummy2, @dummy3, @dummy4;
    WHILE @@FETCH_STATUS = 0
    BEGIN
      FETCH NEXT FROM @s1 INTO @dummy1, @dummy2, @dummy3, @dummy4;
    END;

    CLOSE @s1;
  END;

  DEALLOCATE @s1;
  INSERT INTO @results 
  VALUES (@r, 1, DATEDIFF(ms, @ts, current_timestamp));

  SET @ts = current_timestamp;
  SET @i = 0;
  WHILE @i < @repeat
  BEGIN
    SET @i = @i + 1

    OPEN @s2;
    FETCH NEXT FROM @s2 INTO @dummy1, @dummy2, @dummy3, @dummy4;
    WHILE @@FETCH_STATUS = 0
    BEGIN
      FETCH NEXT FROM @s2 INTO @dummy1, @dummy2, @dummy3, @dummy4;
    END;

    CLOSE @s2;
  END;

  DEALLOCATE @s2;
  INSERT INTO @results 
  VALUES (@r, 2, DATEDIFF(ms, @ts, current_timestamp));
END;

SELECT 'Run ' + CAST(run AS VARCHAR) + ', Statement ' + 
  CAST(stmt AS VARCHAR) + ': ' + 
  CAST(CAST(elapsed / MIN(elapsed) OVER() AS DECIMAL(10, 5)) AS VARCHAR)
FROM @results;

Surprisingly, also SQL Server has worse performance with the APPLY approach than with window functions, although the difference is smaller (I’m using SQL Server 2014):

Run 1, Statement 1: 5.07019
Run 1, Statement 2: 1.11988
Run 2, Statement 1: 5.48820
Run 2, Statement 2: 1.20683
Run 3, Statement 1: 5.08882
Run 3, Statement 2: 1.31429
Run 4, Statement 1: 5.31863
Run 4, Statement 2: 1.00000
Run 5, Statement 1: 5.07453
Run 5, Statement 2: 1.21491

I would have expected SQL Server to be much better here, given that this syntax has been available for a long time in SQL Server. Challenge to the reader: Why do both databases perform so poorly with CROSS APPLY? I’ll explain Oracle’s problem in a bit.

Let’s look at PostgreSQL 9.6

The “WITH TIES” semantics can only be implemented with window functions in PostgreSQL, so let’s stick to the “ROWS” semantics of the LIMIT clause.

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 100;
  rec RECORD;
  run INT[];
  stmt INT[];
  elapsed DECIMAL[];
  max_elapsed DECIMAL;
  i INT := 1;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT actor_id, first_name, last_name, title
        FROM actor a
        LEFT JOIN LATERAL (
          SELECT title
          FROM film f
          JOIN film_actor fa USING (film_id)
          WHERE fa.actor_id = a.actor_id
          ORDER BY length(title) DESC
          LIMIT 3
        ) t ON true
        ORDER BY actor_id, length(title) DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    run[i] := r;
    stmt[i] := 1;
    elapsed[i] := 
       (EXTRACT(EPOCH FROM CAST(clock_timestamp() AS TIMESTAMP)) 
      - EXTRACT(EPOCH FROM v_ts));
    i := i + 1;
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT actor_id, first_name, last_name, title
        FROM actor a
        LEFT JOIN (
          SELECT 
            actor_id,
            title, 
            row_number() OVER (
              PARTITION BY actor_id
              ORDER BY length(title) DESC
            ) rn
          FROM film f
          JOIN film_actor fa USING (film_id)
        ) t USING (actor_id)
        WHERE rn <= 3
        ORDER BY actor_id, rn
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    run[i] := r;
    stmt[i] := 2;
    elapsed[i] := 
       (EXTRACT(EPOCH FROM CAST(clock_timestamp() AS TIMESTAMP)) 
      - EXTRACT(EPOCH FROM v_ts));
    i := i + 1;
  END LOOP;

  SELECT max(t.elapsed)
  INTO max_elapsed
  FROM unnest(elapsed) AS t(elapsed);

  FOR i IN 1..array_length(run, 1) LOOP
    RAISE INFO 'RUN %, Statement %: %', run[i], stmt[i], 
      CAST(elapsed[i] / max_elapsed AS DECIMAL(10, 5));
  END LOOP;
END$$;

This time, the results are more even. It doesn’t really seem to matter which approach you pick:

00000: RUN 1, Statement 1: 0.75904
00000: RUN 1, Statement 2: 0.99784
00000: RUN 2, Statement 1: 0.75907
00000: RUN 2, Statement 2: 0.95206
00000: RUN 3, Statement 1: 0.82102
00000: RUN 3, Statement 2: 0.94841
00000: RUN 4, Statement 1: 0.96208
00000: RUN 4, Statement 2: 0.96218
00000: RUN 5, Statement 1: 0.83398
00000: RUN 5, Statement 2: 1.00000

This can have two reasons:

  • Optimistic: LATERAL is better optimised in PostgreSQL
  • Pessimistic: window functions are less well optimised in PostgreSQL

From experience with tuning PostgreSQL queries, I’ll go for the pessimistic guess, because indeed, there is no hint in the execution plan about any optimisation being done on the window function predicate:

Sort  (cost=911.26..915.81 rows=1821 width=40)
  Sort Key: a.actor_id, t.rn
  -> Hash Join  (cost=596.44..812.65 rows=1821 width=40)
     Hash Cond: (t.actor_id = a.actor_id)
     -> Subquery Scan on t  (cost=589.94..781.11 rows=1821 width=25)
        Filter: (t.rn <= 3)
        -> WindowAgg  (cost=589.94..712.83 rows=5462 width=29)
           -> Sort  (cost=589.94..603.59 rows=5462 width=21)
              Sort Key: fa.actor_id, (length((f.title)::text)) DESC
              -> Hash Join  (cost=77.50..250.88 rows=5462 width=21)
                 Hash Cond: (fa.film_id = f.film_id)
                 -> Seq Scan on film_actor fa (cost=0.0..84.62 rows=5462 width=4)
                 -> Hash  (cost=65.00..65.00 rows=1000 width=19)
                    ->  Seq Scan on film f  (cost=0.00..65.00 rows=1000 width=19)
     -> Hash  (cost=4.00..4.00 rows=200 width=17)
        -> Seq Scan on actor a  (cost=0.00..4.00 rows=200 width=17)

What about DB2 10.5?

BEGIN
  DECLARE CONTINUE HANDLER FOR SQLSTATE '42710' BEGIN END;
  EXECUTE IMMEDIATE 'CREATE TABLE print_relative (run INTEGER, stmt INTEGER, elapsed DECIMAL(20, 4))';
END 

BEGIN
  DECLARE v_ts TIMESTAMP;
  DECLARE v_repeat INTEGER DEFAULT 100;
  DECLARE v_i INTEGER;
  DECLARE v_j INTEGER;

  -- Repeat benchmark several times to avoid warmup penalty
  SET v_i = 1;
  
  DELETE FROM print_relative;
  
  REPEAT
    SET v_j = 1;
    SET v_ts = CURRENT_TIMESTAMP;
    
    REPEAT
      FOR rec AS cur CURSOR FOR
        SELECT actor_id, first_name, last_name, title
        FROM actor a
        LEFT JOIN LATERAL (
          SELECT title
          FROM film f
          JOIN film_actor fa ON f.film_id = fa.film_id
          WHERE fa.actor_id = a.actor_id
          ORDER BY length(title) DESC
          FETCH FIRST 3 ROWS ONLY
        ) t ON 1 = 1
        ORDER BY actor_id, length(title) DESC
      DO
        BEGIN END;
      END FOR;
    
      SET v_j = v_j + 1;
      UNTIL v_j = v_repeat
    END REPEAT;
      
    INSERT INTO print_relative 
    VALUES (v_i, 1, (CURRENT_TIMESTAMP - v_ts));
    
    SET v_j = 1;
    SET v_ts = CURRENT_TIMESTAMP;
    
    REPEAT
      FOR rec AS cur CURSOR FOR
        SELECT a.actor_id, first_name, last_name, title
        FROM actor a
        LEFT JOIN (
          SELECT 
            actor_id,
            title, 
            row_number() OVER (
              PARTITION BY actor_id
              ORDER BY length(title) DESC
            ) rn
          FROM film f
          JOIN film_actor fa ON f.film_id = fa.film_id
        ) t ON t.actor_id = a.actor_id
        WHERE rn <= 3
        ORDER BY a.actor_id, rn
      DO
        BEGIN END;
      END FOR;
      
      SET v_j = v_j + 1;
      UNTIL v_j = v_repeat
    END REPEAT;

    INSERT INTO print_relative 
    VALUES (v_i, 2, (CURRENT_TIMESTAMP - v_ts));
    
    SET v_i = v_i + 1;
    UNTIL v_i = 5
  END REPEAT;
END

SELECT
  run, 
  stmt,
  CAST(elapsed / MIN(elapsed) OVER() AS DECIMAL(20, 4)) ratio
FROM print_relative;

DROP TABLE print_relative;

Result, again, window functions outperform LATERAL joining:

1	1	6.0353
1	2	1.0783
2	1	6.2549
2	2	1.0093
3	1	6.1067
3	2	1.0000
4	1	6.1987
4	2	1.0128

Explanation for benchmarks

For this explanation, I’m looking at Oracle execution plans only – showing what kind of rationale could be derived from their plans. I’m assuming that similar problems may appear in the other 3 databases.

Here’s the plan for the window function query:

The following is the execution plan obtained through

SELECT * FROM TABLE (
  dbms_xplan.display_cursor(format => 'ALLSTATS LAST')
)

It displays the actual execution plan with gathered statistics (using the /*+GATHER_PLAN_STATISTICS*/ hint), not the estimated plan.

----------------------------------------------------------------------
| Id  | Operation                  | Name          | Starts | A-Rows |
----------------------------------------------------------------------
|   0 | SELECT STATEMENT           |               |      1 |    767 |
|   1 |  SORT ORDER BY             |               |      1 |    767 |
|*  2 |   HASH JOIN                |               |      1 |    767 |
|   3 |    TABLE ACCESS FULL       | ACTOR         |      1 |    200 |
|*  4 |    VIEW                    |               |      1 |    767 |
|*  5 |     WINDOW SORT PUSHED RANK|               |      1 |   1079 |
|*  6 |      HASH JOIN             |               |      1 |   5463 |
|   7 |       TABLE ACCESS FULL    | FILM          |      1 |   1000 |
|   8 |       INDEX FAST FULL SCAN | PK_FILM_ACTOR |      1 |   5463 |
----------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("A"."ACTOR_ID"="T"."ACTOR_ID")
   4 - filter("T"."RK"<=3)
   5 - filter(RANK() OVER ( PARTITION BY "FA"."ACTOR_ID" 
         ORDER BY LENGTH("F"."TITLE") DESC )<=3)
   6 - access("F"."FILM_ID"="FA"."FILM_ID")

As can be seen, there’s some initial work of calculating the rank for most FILM_ACTOR rows from the nested query. I say most because the PUSHED RANK in the WINDOW SORT PUSHED RANK operation seems to indicate that the rank is calculated only as long as it is needed, i.e. until the ranking predicate rk <= 3 is no longer true.

In particular, this also means that we do not necessarily execute the entire sort, but we can keep a buffer of the current TOP 3, which can be done in O(N) rather than the full sort’s O(N log N). Whether this is really done in Oracle, I don’t know.

In any case, the result of the ranking operation is then efficiently hash joined with the ACTOR table.

What’s most interesting, though, is the “Starts” column, which indicates how many times an operation has been started. Each operation is started only once!

What about APPLY?

The plan is much worse:

---------------------------------------------------------------------------
| Id  | Operation                     | Name            | Starts | A-Rows |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                 |      1 |    767 |
|   1 |  SORT ORDER BY                |                 |      1 |    767 |
|   2 |   MERGE JOIN OUTER            |                 |      1 |    767 |
|   3 |    TABLE ACCESS FULL          | ACTOR           |      1 |    200 |
|   4 |    BUFFER SORT                |                 |    200 |    767 |
|   5 |     VIEW                      | VW_LAT_14BC7596 |    200 |    767 |
|   6 |      VIEW                     | VW_LAT_A18161FF |    200 |    767 |
|*  7 |       VIEW                    |                 |    200 |    767 |
|*  8 |        WINDOW SORT PUSHED RANK|                 |    200 |   1079 |
|   9 |         NESTED LOOPS          |                 |    200 |   5463 |
|  10 |          TABLE ACCESS FULL    | FILM            |    200 |    200K|
|* 11 |          INDEX UNIQUE SCAN    | PK_FILM_ACTOR   |    200K|   5463 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   7 - filter("from$_subquery$_008"."rowlimit_$$_rank"<=3)
   8 - filter(RANK() OVER ( ORDER BY LENGTH("F"."TITLE") DESC )<=3)
  11 - access("FA"."ACTOR_ID"="A"."ACTOR_ID" AND "F"."FILM_ID"="FA"."FILM_ID")

Observe the “Starts” column. On Id 4, we’re experiencing 200 executions of the BUFFER SORT operation. That’s because there are exactly 200 rows in the ACTOR table (as can be seen on Id 3). So, indeed, there’s no optimisation happening that would calculate the “laterally correlated subquery” for all actors in one go.

Furthermore, quite unfortunately, the JOIN order between FILM and FILM_ACTOR seems quite wrong. For each of the 200 actors, we’re loading the entire 1000 films, which produces 200 * 1000 = 200K rows to scan for those belonging to the single actor from the outer query (Id 11). That’s unreasonably repetitive.

I’d expect Oracle to inverse the table accesses. We can do this with a hint:

SELECT actor_id, first_name, last_name, title
FROM actor a
OUTER APPLY (
  SELECT /*+LEADING(fa f) USE_NL(fa f)*/ title
  FROM film f
  JOIN film_actor fa USING (film_id)
  WHERE actor_id = a.actor_id
  ORDER BY length(title) DESC
  FETCH FIRST 3 ROWS WITH TIES
) t
ORDER BY actor_id, length(title) DESC;

Observe the two hints:

  • LEADING to indicate that the FILM_ACTOR table should be the driving row source of the join
  • USE_NL to enforce a NESTED LOOP JOIN (without this, and with LEADING, Oracle would prefer a HASH JOIN)

With the hints, the benchmark results look a lot better:

Statement 1 : 1          (window functions)
Statement 2 : 9.17483    (APPLY without hints)
Statement 3 : 4.88774    (APPLY with LEADING hint)
Statement 4 : 1.65269    (APPLY with LEADING and USE_NL hints)

Also, the execution plan now no longer has those excessive “Starts” and “A-Rows” values:

--------------------------------------------------------------------------------
| Id  | Operation                           | Name           | Starts | A-Rows |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                |      1 |     50 |
|   1 |  SORT ORDER BY                      |                |      1 |     50 |
|   2 |   MERGE JOIN OUTER                  |                |      1 |    767 |
|   3 |    TABLE ACCESS FULL                | ACTOR          |      1 |    200 |
|   4 |    BUFFER SORT                      |                |    200 |    767 |
|   5 |     VIEW                            | VW_LAT_2880E7C5|    200 |    767 |
|   6 |      VIEW                           | VW_LAT_A18161FF|    200 |    767 |
|*  7 |       VIEW                          |                |    200 |    767 |
|*  8 |        WINDOW SORT PUSHED RANK      |                |    200 |   1079 |
|   9 |         NESTED LOOPS                |                |    200 |   5463 |
|  10 |          NESTED LOOPS               |                |    200 |   5463 |
|* 11 |           INDEX RANGE SCAN          | PK_FILM_ACTOR  |    200 |   5463 |
|* 12 |           INDEX UNIQUE SCAN         | PK_FILM        |   5463 |   5463 |
|  13 |          TABLE ACCESS BY INDEX ROWID| FILM           |   5463 |   5463 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   7 - filter("from$_subquery$_006"."rowlimit_$$_rank"<=3)
   8 - filter(RANK() OVER ( ORDER BY LENGTH("F"."TITLE") DESC )<=3)
  11 - access("FA"."ACTOR_ID"="A"."ACTOR_ID")
  12 - access("F"."FILM_ID"="FA"."FILM_ID")

The highest number of “A-Rows” is easily explained. There are exactly 5463 rows in FILM_ACTOR. Still, I’d wish that the 200 TOP-N lookups per actor could be optimised somehow.

Interestingly, this plan is associated with a much higher cost than the original one, even if in this case, it is much better.

Conclusion

TOP N queries are very common, we need them all the time. The simplest way is to use an ORDER BY clause and a LIMIT clause.

Sometimes, we need WITH TIES semantics, and Oracle 12c as well as SQL Server can provide this with standard or vendor specific syntax, out of the box. Other databases can emulate WITH TIES using window functions rather easily.

When we need TOP N per category, an interesting and cool syntax comes in handy: APPLY or LATERAL. However, very unfortunately, simple benchmarks have shown that these can be much slower than their equivalent window function counterparts. Of course, this will heavily depend on the size of the data set. Do not underestimate the cost of O(N log N) sorts that occur in window functions. As always: Do measure, never guess.

JOIN Elimination: An Essential Optimiser Feature for Advanced SQL Usage

The SQL language has one great advantage over procedural, object oriented, and “ordinary” functional programming languages. The fact that it is truly declarative (i.e. a 4GL / fourth generation programming language) means that a sophisticated optimiser can easily transform one SQL expression into another, equivalent SQL expression, which might be faster to execute.

How does this work?

Expression Tree Equivalence

Here’s an introduction to equivalent expression trees from my SQL Masterclass: Let’s assume we want to evaluate the following expression:

A + (A - B)

Now in maths, it can be proven trivially that by the laws of associativity, the above expression is really the same as this one:

(A + A) - B

We didn’t really win anything yet, but we can equally trivially turn the above addition into a multiplication that can be proven to be exactly equivalent:

(2 * A) - B

Now, imagine that A is an extremely “expensive” value, e.g. a value that we have to fetch from the disk, or worse, from the network. The cost of accessing A is thus very high and if we can avoid one access to A by the above transformation, then the resulting expression is much faster to evaluate than the original one, even if mathematically speaking, it does not matter at all.

That’s what optimisers do all the time. They transform expressions into equivalent expressions which are faster to execute. And they constantly adapt, so if the DBA chooses to move A from a remote server to the local server, thus reducing the cost of access to A, perhaps, suddenly, the original plan will be better again, because of the cost of multiplication (just an example).

A SQL Example

An equally trivial SQL example from my SQL Masterclass shows that it really doesn’t matter mathematically if we run this query:

SELECT first_name, last_name
FROM customer
WHERE first_name = 'JAMIE'

Or this one:

SELECT *
FROM (
  SELECT first_name, last_name
  FROM customer
)
WHERE first_name = 'JAMIE'

With the SQL language, it may be a bit harder to see that these are exactly equivalent SQL statements, but if we translate the above queries to relational algebra, it may become more visible:

Selection before projection

… or WHERE before SELECT:

Projection then selection

… or SELECT before WHERE:

Don’t be fooled by relational algebra‘s term “selection”. It does not correspond to the SELECT clause, but to the WHERE clause!

We can prove (let’s leave the exercise to the reader), that both expressions are exactly equivalent, so optimisers can pick whichever they have a more efficient matching algorithm for:

  • Ordinary row stores will probably apply the selection first, before projecting, as the expensive operation is accessing rows and if we can filter rows early (e.g. through indexes) then the whole query is less expensive)
  • A column store might (just a hypothesis) apply the projection first, as the expensive operation is accessing the columns. We then might have to traverse the entire column anyway to filter out the rows

Let’s Talk About JOINs (and Their Elimination)

JOIN elimination is one of the most basic and yet most powerful SQL transformations, which is implemented by most modern databases in one way or another. It is so powerful, because we’re writing (potentially useless) JOINs all the time, when writing a bit more complex queries. See also our article about JOINs for a general overview.

Now, consider the following simplified schema, taken from the Sakila database:

CREATE TABLE address (
  address_id INT NOT NULL,
  address VARCHAR(50) NOT NULL,
  CONSTRAINT pk_address PRIMARY KEY (address_id)
);

CREATE TABLE customer (
  customer_id INT NOT NULL,
  first_name VARCHAR(45) NOT NULL,
  last_name VARCHAR(45) NOT NULL,
  address_id INT NOT NULL,
  CONSTRAINT pk_customer PRIMARY KEY  (customer_id),
  CONSTRAINT fk_customer_address FOREIGN KEY (address_id) 
    REFERENCES address(address_id)
);

Let’s ignore indexing and other useful features for this example.

INNER JOIN Elimination

The following query shows a common JOIN use-case, joining a parent table (ADDRESS) to a child table (CUSTOMER) in a to-one relationship:

SELECT c.*
FROM customer c
JOIN address a 
ON c.address_id = a.address_id

We intended to fetch all customers and their addresses. But observe: We project only columns from the CUSTOMER table and we don’t have any predicates at all, specifically not predicates using the ADDRESS table. So, we’re completely ignoring any contributions from the ADDRESS table. We never really needed the JOIN in the first place!

And in fact, the optimiser can prove this too, because of the FOREIGN KEY constraint on C.ADDRESS_ID, which guarantees that every CUSTOMER record has exactly one corresponding ADDRESS record. The JOIN does not duplicate, nor remove any CUSTOMER rows, so it is unneeded and thus eliminated (by some, not all databases, will list each database at the end of the article).

So, the database can rewrite the SQL statement to the following, equivalent SQL statement in the presence of said FOREIGN KEY:

SELECT *
FROM customer c

Now, quite obviously, this query will be faster than the previous one, if the entire JOIN can be avoided, and thus the entire access to the ADDRESS table. Neat, huh? Who would have thought that FOREIGN KEYs can be so useful in terms of performance.

OUTER JOIN Elimination

A LEFT [ OUTER ] JOIN will JOIN the right table to the left table but keep rows from the left table if there is no match (again, an explanation of joins can be seen here). When we apply LEFT JOIN to the previous query…

SELECT c.*
FROM customer c
LEFT JOIN address a 
ON c.address_id = a.address_id

… then we’ll fetch all rows from CUSTOMER regardless if that customer has any ADDRESS. This is useful if the FOREIGN KEY is optional (nullable), or completely absent, e.g. through:

ALTER TABLE customer DROP CONSTRAINT fk_customer_address

OUTER JOIN is even easier to eliminate, as it doesn’t require any FOREIGN KEY constraint for the database to prove that it is unneeded. A UNIQUE constraint on the parent table (here: ADDRESS.ADDRESS_ID) is sufficient to show that for every CUSTOMER there can be at most one ADDRESS, so the LEFT JOIN won’t duplicate any CUSTOMER rows (Unlike INNER JOIN, OUTER JOIN never remove rows).

Hence, the above query can again be rewritten to the more optimal:

SELECT *
FROM customer c

OUTER JOIN Elimination with DISTINCT

Another interesting case of OUTER JOIN elimination is the following one, which unfortunately didn’t work on Oracle for a customer of ours, recently, in a complex query that ran rogue. Let’s look at some other tables of the Sakila database, which expose a to-many relationship:

CREATE TABLE actor (
  actor_id numeric NOT NULL ,
  first_name VARCHAR(45) NOT NULL,
  last_name VARCHAR(45) NOT NULL,
  CONSTRAINT pk_actor PRIMARY KEY (actor_id)
);

CREATE TABLE film (
  film_id int NOT NULL,
  title VARCHAR(255) NOT NULL,
  CONSTRAINT pk_film PRIMARY KEY (film_id)
);

CREATE TABLE film_actor (
  actor_id INT NOT NULL,
  film_id  INT NOT NULL,
  CONSTRAINT pk_film_actor PRIMARY KEY (actor_id, film_id),
  CONSTRAINT fk_film_actor_actor FOREIGN KEY (actor_id)
    REFERENCES actor (actor_id),
  CONSTRAINT fk_film_actor_film FOREIGN KEY (film_id)
    REFERENCES film (film_id)
);

Now, consider this query:

SELECT DISTINCT first_name, last_name
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id

We’re looking for all actors and their films, but then we project only distinct actors. Again, the JOIN to FILM_ACTOR doesn’t contribute anything to the result, but because we’re joining a to-many relationship here (from parent table ACTOR to child table FILM_ACTOR), the JOIN is producing duplicate rows. Without DISTINCT, we’d get something like:

FIRST_NAME   LAST_NAME
----------------------
...
PENELOPE     GUINESS
PENELOPE     GUINESS
PENELOPE     GUINESS
NICK         WAHLBERG
NICK         WAHLBERG
...

But thanks to the DISTINCT keyword, the result is (provably) no different from the result of this much simpler query:

SELECT DISTINCT first_name, last_name
FROM actor a

(Note, DISTINCT cannot be eliminated, unless we already have a UNIQUE constraint on (FIRST_NAME, LAST_NAME)).

Why not Just Refactor the SQL Manually?

Of course, all of this shouldn’t be needed if our SQL were perfect. In the above trivial examples, the SQL can (and should) be re-written manually to improve quality. But note that:

  • Developers make mistakes, and those mistakes may be very subtle when queries get more complex. I’ll show an example below.
  • The presence of this feature actually encourages writing more complex SQL, especially when using reusable views. I’ll show another example below.
  • Finally, I’ve previously advocated avoiding needless, mandatory work, like SELECT *. Such work is mandatory because the optimiser cannot prove its needlessness. In the case of these JOINs, the optimiser can prove the needlessness, so the work is no longer mandatory. It can be eliminated.

Here are some complex examples as promised, where this optimiser feature really shines:

Subtle Mistakes

Let’s consider the following query (in PostgreSQL syntax):

SELECT c.name, count(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
JOIN film f USING (film_id)
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name
ORDER BY count(*) DESC

What does it do? For ACTOR_ID = 1 (Penelope Guiness), we’re looking for all the different film categories she played in, and the number of films per category. This is easier to understand when we look at the result:

NAME         COUNT
Horror       3
Family       2
New          2
Classics     2
Games        2
Music        1
Sci-Fi       1
Animation    1
Sports       1
Children     1
Comedy       1
Documentary  1
Foreign      1

Now, can you spot the unneeded JOINs? In fact, we never needed ACTOR, nor did we need FILM

SELECT c.name, count(*)
FROM film_actor fa
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name
ORDER BY count(*) DESC

Cool, eh? The JOINs can be eliminated (again, in some databases, see below) and our “mistake” is no longer relevant to the query. The mistake could have also snuck (or sneaked?) in from a previous query version, which may have looked like this, projecting also the actor information and the list of films per category, in case of which the additional JOIN are needed:

SELECT 
  c.name, count(*), 
  a.first_name, a.last_name, 
  array_agg(f.title ORDER BY f.title)
FROM actor a
JOIN film_actor fa USING (actor_id)
JOIN film f USING (film_id)
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name, a.first_name, a.last_name
ORDER BY count(*) DESC

The result being:

NAME         COUNT  FIRST_NAME  LAST_NAME  FILMS
Horror       3      PENELOPE    GUINESS    {"ELEPHANT TROJAN","LADY STAGE","RULES HUMAN"}
Family       2      PENELOPE    GUINESS    {"KING EVOLUTION","SPLASH GUMP"}
New          2      PENELOPE    GUINESS    {"ANGELS LIFE","OKLAHOMA JUMANJI"}
Classics     2      PENELOPE    GUINESS    {"COLOR PHILADELPHIA","WESTWARD SEABISCUIT"}
Games        2      PENELOPE    GUINESS    {"BULWORTH COMMANDMENTS","HUMAN GRAFFITI"}
Music        1      PENELOPE    GUINESS    {"WIZARD COLDBLOODED"}
Sci-Fi       1      PENELOPE    GUINESS    {"CHEAPER CLYDE"}
Animation    1      PENELOPE    GUINESS    {"ANACONDA CONFESSIONS"}
Sports       1      PENELOPE    GUINESS    {"GLEAMING JAWBREAKER"}
Children     1      PENELOPE    GUINESS    {"LANGUAGE COWBOY"}
Comedy       1      PENELOPE    GUINESS    {"VERTIGO NORTHWEST"}
Documentary  1      PENELOPE    GUINESS    {"ACADEMY DINOSAUR"}
Foreign      1      PENELOPE    GUINESS    {"MULHOLLAND BEAST"}

As you can see, this optimisation can be very useful on your legacy SQL, because if we maintain a complex query, we might not always be able to see all the JOINs that are really needed.

Reusable Views

Sometimes, we simply add additional JOINs for convenience, when building complex queries from simpler ones, e.g. by using views (which is a completely underrated RDBMS feature! You should all write more views).

Consider this view:

CREATE VIEW v_customer AS
SELECT 
  c.first_name, c.last_name, 
  a.address, ci.city, co.country
FROM customer c
JOIN address a USING (address_id)
JOIN city ci USING (city_id)
JOIN country co USING (country_id)

It’s not unlikely that we will write a view like this, simply because we’re incredibly bored to constantly join all these tables all the time. Every time we do something with customers and addresses, we need the CITY and COUNTRY table as well.

From now on (with this view), we can simply select from the view and “forget” about how it came to be. Now, let’s consider we completely forget about the underlying table, because the view was so useful all the time. We could think about doing this:

SELECT first_name, last_name
FROM v_customer

What do you think will happen? Exactly. JOIN elimination. A view isn’t really anything special, just a “macro” of some stored SQL (beware of some databases, where this isn’t always the case, e.g. MySQL, which Bill Karwin was kind enough to hint me at). So the above statement will be transformed into:

SELECT first_name, last_name
FROM (
  SELECT 
    c.first_name, c.last_name, 
    a.address, ci.city, co.country
  FROM customer c
  JOIN address a USING (address_id)
  JOIN city ci USING (city_id)
  JOIN country co USING (country_id)
) v_customer

… which can be transformed into this (we don’t need all columns in the nested select):

SELECT first_name, last_name
FROM (
  SELECT 
    c.first_name, c.last_name
  FROM customer c
  JOIN address a USING (address_id)
  JOIN city ci USING (city_id)
  JOIN country co USING (country_id)
) v_customer

… which can be transformed into this (JOINs can be eliminated):

SELECT first_name, last_name
FROM (
  SELECT 
    c.first_name, c.last_name
  FROM customer c
) v_customer

… and finally (the subquery is not useful):

SELECT first_name, last_name
FROM customer

The view is even very useful for this particular query, thanks to JOIN elimination!

Note, the SQL transformations exposed above are simply educational. Actual optimisers may perform transformations in an entirely differently, or in a different order. This is just to show what’s possible, and what kind of stuff is being done.

Cool, So Can My Database Do It?

Perhaps! Let’s look at the three different types of JOIN elimination in the context of these databases:

  • DB2 LUW 10.5
  • MySQL 8.0.2
  • Oracle 12.2.0.1
  • PostgreSQL 9.6
  • SQL Server 2014

INNER JOIN Elimination

Remember, this depends on the presence (and usefulness) of a FOREIGN KEY constraint. The SQL statement we’re using here is:

SELECT first_name, last_name
FROM customer c
JOIN address a ON c.address_id = a.address_id

We’re hoping to get:

SELECT first_name, last_name
FROM customer c

DB2 LUW

The following execution plan (created with Markus Winand’s cool utility) shows that this works in DB2, there’s no access to the ADDRESS table:

Explain Plan                                                           |
-----------------------------------------------------------------------|
ID | Operation                           |                 Rows | Cost |
 1 | RETURN                              |                      |   61 |
 2 |  FETCH CUSTOMER                     | 599 of 599 (100.00%) |   61 |
 3 |   IXSCAN IDX_CUSTOMER_FK_ADDRESS_ID | 599 of 599 (100.00%) |   20 |

MySQL

MySQL 8, apart from finally introducing CTE and window functions (yay), has a lot of new optimiser features, read Morgan Tocker’s useful optimiser guide for details. Unfortunately, INNER JOIN elimination is not implemented:

ID  TABLE  TYPE    REF                  ROWS   EXTRA
1   c      ALL                          599    
1   a      eq_ref  sakila.c.address_id  1      Using index

Not only is the JOIN executed, but it is executed using a nested loop with 599 index lookups, as MySQL still only supports NESTED LOOP JOINs, not HASH JOINs.

Bummer.

Oracle

No problem at all for Oracle:

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |   599 | 28752 |     5   (0)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| CUSTOMER |   599 | 28752 |     5   (0)| 00:00:01 |
------------------------------------------------------------------------------

… the JOIN is eliminated as expected.

PostgreSQL

Unfortunately, PostgreSQL cannot eliminate INNER JOIN:

Hash Join  (cost=19.57..42.79 rows=599 width=13)
  Hash Cond: (c.address_id = a.address_id)
  ->  Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=15)
  ->  Hash  (cost=12.03..12.03 rows=603 width=4)
        ->  Seq Scan on address a  (cost=0.00..12.03 rows=603 width=4)

Not as bad as in MySQL, though, as PostgreSQL chose to use a HASH JOIN to combine the two tables.

SQL Server

No problemo for SQL Server, the ADDRESS table access is gone!

  |--Table Scan(OBJECT:([sakila].[dbo].[customer] AS [c]))

Excellent. What about…

OUTER JOIN Elimination

This one is a bit easier to prove for the database, remember? We don’t rely on any FOREIGN KEY anymore. A UNIQUE key in the parent table is sufficient to eliminate an OUTER JOIN. We can safely expect that if the INNER JOIN could be eliminated (DB2, Oracle, SQL Server), then an OUTER JOIN can be eliminated, too.

Here’s the query:

SELECT first_name, last_name
FROM customer c
LEFT JOIN address a ON c.address_id = a.address_id

And the outcome:

DB2 LUW

Good

Explain Plan                                                           |
-----------------------------------------------------------------------|
ID | Operation                           |                 Rows | Cost |
 1 | RETURN                              |                      |   61 |
 2 |  FETCH CUSTOMER                     | 599 of 599 (100.00%) |   61 |
 3 |   IXSCAN IDX_CUSTOMER_FK_ADDRESS_ID | 599 of 599 (100.00%) |   20 |

MySQL

Still nope:

ID  TABLE  TYPE    REF                  ROWS   EXTRA
1   c      ALL                          599    
1   a      eq_ref  sakila.c.address_id  1      Using index

Oracle

Perfect:

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |   599 | 28752 |     5   (0)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| CUSTOMER |   599 | 28752 |     5   (0)| 00:00:01 |
------------------------------------------------------------------------------

PostgreSQL

Unlike INNER JOIN elimination, this works. Great!

Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=13)

SQL Server

As expected, good:

  |--Table Scan(OBJECT:([sakila].[dbo].[customer] AS [c]))

Finally…

OUTER JOIN Elimination with DISTINCT

Remember, the query here navigates a to-many relationship, producing duplicate records of the parent table, but then removes all those duplicates again by

  • Ignoring contributions from the child table
  • Removing duplicates with DISTINCT

It’s easy to prove that this:

SELECT DISTINCT first_name, last_name
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id

Is equivalent to this:

SELECT DISTINCT first_name, last_name
FROM actor a

Remember also, this only works with OUTER JOIN, not with INNER JOIN, as the latter might remove rows, so we have to execute it to see if it does.

DB2 LUW

Cool, this actually works!

Explain Plan                                                          |
----------------------------------------------------------------------|
ID | Operation        |                 Rows | Cost                   |
 1 | RETURN           |                      |   20                   |
 2 |  TBSCAN          | 200 of 200 (100.00%) |   20                   |
 3 |   SORT (UNIQUE)  | 200 of 200 (100.00%) |   20                   |
 4 |    TBSCAN ACTOR  | 200 of 200 (100.00%) |   20                   |

There’s no access to the FILM_ACTOR table, nor to its indexes. Very nice.

MySQL

As this is a more sophisticated transformation than the previous ones, we don’t have high hopes here.

ID  TABLE  TYPE    REF                  ROWS   EXTRA
1   a      ALL                          200    Using temporary
1   fa     ref     sakila.a.actor_id    27     Using index; Distinct

This has become a rather expensive query, again because of the lack of HASH JOIN support in MySQL!

Oracle

I’m very surprised to see that Oracle doesn’t support this optimisation, we’re executing the full query:

---------------------------------------------------------------
| Id  | Operation           | Name                    | Rows  |
---------------------------------------------------------------
|   0 | SELECT STATEMENT    |                         |  5462 |
|   1 |  HASH UNIQUE        |                         |  5462 |
|   2 |   NESTED LOOPS OUTER|                         |  5462 |
|   3 |    TABLE ACCESS FULL| ACTOR                   |   200 |
|*  4 |    INDEX RANGE SCAN | IDX_FK_FILM_ACTOR_ACTOR |    27 |
---------------------------------------------------------------

Curiously, Oracle also chose a NESTED LOOP JOIN in this case, even if we could have loaded the entire index on the FILM_ACTOR table into memory and then HASH JOINed it to ACTOR. Note that the cardinality estimate of the resulting query is quite off, too, despite the DISTINCT operation. This can lead to significant effects in an upstream query, which selects from this query (e.g. if stored as a view) – which is what happened to our customer.

PostgreSQL

PostgreSQL also doesn’t support this elimination, but at least gets cardinality estimates much more accurately and chooses a HASH JOIN operation:

HashAggregate  (cost=193.53..194.81 rows=128 width=13)
  Group Key: a.first_name, a.last_name
  ->  Hash Right Join  (cost=6.50..166.22 rows=5462 width=13)
        Hash Cond: (fa.actor_id = a.actor_id)
        ->  Seq Scan on film_actor fa  (cost=0.00..84.62 rows=5462 width=2)
        ->  Hash  (cost=4.00..4.00 rows=200 width=17)
              ->  Seq Scan on actor a  (cost=0.00..4.00 rows=200 width=17)

SQL Server

The pleasant surprise came from SQL Server, which does support this optimisation too:

  |--Sort(DISTINCT ORDER BY:([a].[first_name] ASC, [a].[last_name] ASC))
       |--Table Scan(OBJECT:([sakila].[dbo].[actor] AS [a]))

As you can see, no access to any FILM_ACTOR related objects!

Summary

Here’s a summary of what databases can eliminate:

Database INNER JOIN:
to-one
OUTER JOIN:
to-one
OUTER JOIN DISTINCT:
to-many
DB2 LUW 10.5 Yep Yep Yep
MySQL 8.0.2 Nope Nope Nope
Oracle 12.2.0.1 Yep Yep Nope
PostgreSQL 9.6 Nope Yep Nope
SQL Server 2014 Yep Yep Yep

Conclusion

JOIN elimination is a very simple to understand, yet incredibly powerful feature that modern databases support to help developers build and maintain complex SQL queries without worrying too much about performance side effects.

It is possible because SQL is a 4GL (fourth-generation programming language), i.e. a declarative language whose parsed expression trees can “easily” be transformed into equivalent, simpler, and faster to execute expression trees. This work is constantly done by the optimiser behind the scenes, often without you even noticing (see example where unnecessary ACTOR and FILM tables were removed).

Currenly, DB2 and SQL Server are the leaders here with Oracle being a close runner-up, at least in my investigation. There’s hope for PostgreSQL, and a bit less hope for MySQL right now. There are tons of other interesting SQL transformations, which I’ll blog about in future blog posts, which may make the difference in other kinds of complex queries.

If you were intrigued by this sort of functionality, do have a look at my most recent SQL talk, which helps developers understand the real value of SQL in the context of the optimiser:

jOOQ 3.10 Supports JPA AttributeConverter

One of the cooler hidden features in jOOQ is the JPADatabase, which allows for reverse engineering a pre-existing set of JPA-annotated entities to generate jOOQ code.

For instance, you could write these entities here:

@Entity
public class Actor {

    @Id
    @GeneratedValue(strategy = IDENTITY)
    public Integer actorId;

    @Column
    public String firstName;

    @Column
    public String lastName;

    @ManyToMany(fetch = LAZY, mappedBy = "actors", 
        cascade = CascadeType.ALL)
    public Set<Film> films = new HashSet<>();

    public Actor(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }
}

@Entity
public class Film {

    @Id
    @GeneratedValue(strategy = IDENTITY)
    public Integer filmId;

    @Column
    public String title;

    @Column(name = "RELEASE_YEAR")
    @Convert(converter = YearConverter.class)
    public Year releaseYear;

    @ManyToMany(fetch = LAZY, cascade = CascadeType.ALL)
    public Set<Actor> actors = new HashSet<>();

    public Film(String title, Year releaseYear) {
        this.title = title;
        this.releaseYear = releaseYear;
    }
}

// Imagine also a Language entity here...

(Just a simple example. Let’s not discuss the caveats of @ManyToMany mapping).

For more info, the full example can be found on Github:

Now observe the fact that we’ve gone through all the trouble of mapping the database type INT for the RELEASE_YEAR column to the cool JSR-310 java.time.Year type for convenience. This has been done using a JPA 2.1 AttributeConverter, which simply looks like this:

public class YearConverter 
implements AttributeConverter<Year, Integer> {

    @Override
    public Integer convertToDatabaseColumn(Year attribute) {
        return attribute == null ? null : attribute.getValue();
    }

    @Override
    public Year convertToEntityAttribute(Integer dbData) {
        return dbData == null ? null : Year.of(dbData);
    }
}

Using jOOQ’s JPADatabase

Now, the JPADatabase in jOOQ allows you to simply configure the input entities (e.g. their package names) and generate jOOQ code from it. This works behind the scenes with this algorithm:

  • Spring is used to discover all the annotated entities on the classpath
  • Hibernate is used to generate an in-memory H2 database from those entities
  • jOOQ is used to reverse-engineer this H2 database again to generate jOOQ code

This works pretty well for most use-cases as the JPA annotated entities are already very vendor-agnostic and do not provide access to many vendor-specific features. We can thus perfectly easily write the following kind of query with jOOQ:

ctx.select(
        ACTOR.FIRSTNAME,
        ACTOR.LASTNAME,
        count().as("Total"),
        count().filterWhere(LANGUAGE.NAME.eq("English"))
          .as("English"),
        count().filterWhere(LANGUAGE.NAME.eq("German"))
          .as("German"),
        min(FILM.RELEASE_YEAR),
        max(FILM.RELEASE_YEAR))
   .from(ACTOR)
   .join(FILM_ACTOR)
     .on(ACTOR.ACTORID.eq(FILM_ACTOR.ACTORS_ACTORID))
   .join(FILM)
     .on(FILM.FILMID.eq(FILM_ACTOR.FILMS_FILMID))
   .join(LANGUAGE)
     .on(FILM.LANGUAGE_LANGUAGEID.eq(LANGUAGE.LANGUAGEID))
   .groupBy(
        ACTOR.ACTORID,
        ACTOR.FIRSTNAME,
        ACTOR.LASTNAME)
   .orderBy(ACTOR.FIRSTNAME, ACTOR.LASTNAME, ACTOR.ACTORID)
   .fetch()

(more info about the awesome FILTER clause here)

In this example, we’re also using the LANGUAGE table, which we omitted in the article. The output of the above query is something along the lines of:

+---------+---------+-----+-------+------+----+----+
|FIRSTNAME|LASTNAME |Total|English|German|min |max |
+---------+---------+-----+-------+------+----+----+
|Daryl    |Hannah   |    1|      1|     0|2015|2015|
|David    |Carradine|    1|      1|     0|2015|2015|
|Michael  |Angarano |    1|      0|     1|2017|2017|
|Reece    |Thompson |    1|      0|     1|2017|2017|
|Uma      |Thurman  |    2|      1|     1|2015|2017|
+---------+---------+-----+-------+------+----+----+

As we can see, this is a very suitable combination of jOOQ and JPA. JPA was used to insert the data through JPA’s useful object graph persistence capabilities, whereas jOOQ is used for reporting on the same tables.

Now, since we already wrote this nice AttributeConverter, we certainly want to apply it also to the jOOQ query and get the java.time.Year data type also in jOOQ, without any additional effort.

jOOQ 3.10 auto conversion

In jOOQ 3.10, we don’t have to do anything anymore. The existing JPA converter will automatically mapped to a jOOQ converter as the generated jOOQ code reads:

// Don't worry about this generated code
public final TableField<FilmRecord, Year> RELEASE_YEAR = 
    createField("RELEASE_YEAR", org.jooq.impl.SQLDataType.INTEGER, 
        this, "", new JPAConverter(YearConverter.class));

… which leads to the previous jOOQ query now returning a type:

Record7<String, String, Integer, Integer, Integer, Year, Year>

Luckily, this was rather easy to implement as the Hibernate meta model allows for navigating the binding between entities and tables very conveniently as described in this article here:

How to get the entity mapping to database table binding metadata from Hibernate

More similar features are coming up in jOOQ 3.11, e.g. when we look into reverse engineering JPA @Embedded types as well. See https://github.com/jOOQ/jOOQ/issues/6518

If you want to run this example, do check out our jOOQ/JPA example on GitHub:

Finding all Palindromes Contained in Strings with SQL

SQL is a really cool language. I can write really complex business logic with this logic programming language. I was again thrilled about SQL recently, at a customer site:

But whenever I tweet something like the above, the inevitable happened. I was nerd sniped. Oleg Šelajev from ZeroTurnaround challenged me to prove that SQL is so awesome:

Given a string, find all substrings from that string, which are palindromes. Challenge accepted! (For the moment, let’s forget about algorithmic complexity.)

Here’s the Full Query

Spoiler first.

Bear with me, I’ll explain it step by step afterwards. Here’s with PostgreSQL syntax:

WITH RECURSIVE
  words (word) AS (
    VALUES
      ('pneumonoultramicroscopicsilicovolcanoconiosis'),
      ('pseudopseudohypoparathyroidism'),
      ('floccinaucinihilipilification'),
      ('antidisestablishmentarianism'),
      ('supercalifragilisticexpialidocious'),
      ('incomprehensibilities'),
      ('honorificabilitudinitatibus'),
      ('tattarrattat')
  ),
  starts (word, start) AS (
    SELECT word, 1 FROM words
    UNION ALL
    SELECT word, start + 1 FROM starts WHERE start < length(word)
  ),
  palindromes (word, palindrome, start, length) AS (
    SELECT word, substring(word, start, x), start, x
    FROM starts CROSS JOIN (VALUES(0), (1)) t(x)
    UNION ALL
    SELECT word, palindrome, start, length + 2
    FROM (
      SELECT 
        word, 
        substring(word, start - length / 2, length) AS palindrome,
        start, length
      FROM palindromes
    ) AS p
    WHERE start - length / 2 > 0 
    AND start + (length - 1) / 2 <= length(word) 
    AND substring(palindrome, 1, 1) = 
        substring(palindrome, length(palindrome), 1)
  )
SELECT DISTINCT 
  word, 
  trim(replace(word, palindrome, ' ' || upper(palindrome) || ' '))
    AS palindromes
FROM palindromes
WHERE length(palindrome) > 1
ORDER BY 2

(You can run it yourself on SQLFiddle)

The result being:

word                                          |palindromes                                    
----------------------------------------------|-----------------------------------------------
antidisestablishmentarianism                  |ant  IDI  sestablishmentarianism               
antidisestablishmentarianism                  |antidi  SES  tablishmentarianism               
floccinaucinihilipilification                 |flo  CC  inaucinihilipilification              
floccinaucinihilipilification                 |floccinauc  INI  hilipilification              
floccinaucinihilipilification                 |floccinaucin  IHI  lipilification              
floccinaucinihilipilification                 |floccinaucinih  ILI  p  ILI  fication          
floccinaucinihilipilification                 |floccinaucinih  ILIPILI  fication              
floccinaucinihilipilification                 |floccinaucinihi  LIPIL  ification              
floccinaucinihilipilification                 |floccinaucinihil  IPI  lification              
floccinaucinihilipilification                 |floccinaucinihilipil  IFI  cation              
honorificabilitudinitatibus                   |h  ONO  rificabilitudinitatibus                
honorificabilitudinitatibus                   |honor  IFI  cabilitudinitatibus                
honorificabilitudinitatibus                   |honorificab  ILI  tudinitatibus                
honorificabilitudinitatibus                   |honorificabilitud  INI  tatibus                
honorificabilitudinitatibus                   |honorificabilitudin  ITATI  bus                
honorificabilitudinitatibus                   |honorificabilitudini  TAT  ibus                
incomprehensibilities                         |incompr  EHE  nsibilities                      
incomprehensibilities                         |incomprehens  IBI  lities                      
incomprehensibilities                         |incomprehensib  ILI  ties                      
incomprehensibilities                         |incomprehensibil  ITI  es                      
pneumonoultramicroscopicsilicovolcanoconiosis |pneum  ONO  ultramicroscopicsilicovolcanoconios
pneumonoultramicroscopicsilicovolcanoconiosis |pneumonoultramicroscopics  ILI  covolcanoconios
pneumonoultramicroscopicsilicovolcanoconiosis |pneumonoultramicroscopicsilic  OVO  lcanoconios
pneumonoultramicroscopicsilicovolcanoconiosis |pneumonoultramicroscopicsilicovolca  NOCON  ios
pneumonoultramicroscopicsilicovolcanoconiosis |pneumonoultramicroscopicsilicovolcan  OCO  nios
pneumonoultramicroscopicsilicovolcanoconiosis |pneumonoultramicroscopicsilicovolcanoconio  SIS
pseudopseudohypoparathyroidism                |pseudopseudohy  POP  arathyroidism             
pseudopseudohypoparathyroidism                |pseudopseudohypop  ARA  thyroidism             
pseudopseudohypoparathyroidism                |pseudopseudohypoparathyro  IDI  sm             
supercalifragilisticexpialidocious            |supercalifrag  ILI  sticexpialidocious         
tattarrattat                                  |t  ATTA  rr  ATTA  t                           
tattarrattat                                  |t  ATTARRATTA  t                               
tattarrattat                                  |ta  TT  arra  TT  at                           
tattarrattat                                  |ta  TTARRATT  at                               
tattarrattat                                  |tat  TARRAT  tat                               
tattarrattat                                  |TAT  tarrat  TAT                               
tattarrattat                                  |tatt  ARRA  ttat                               
tattarrattat                                  |tatta  RR  attat                               
tattarrattat                                  |TATTARRATTAT                                    

This query uses a couple of nice features:

Common Table Expressions

They’re the only way to declare variables in SQL – i.e. you can “store” a query in such a table expression, assign a name to it, and reuse it several times. This is done with the WITH clause. The nice thing about common table expressions is that they are allowed to be RECURSIVE (depending on the database, the RECURSIVE keyword may be required / optional / not available).

VALUES() clause

The VALUES() clause is a very handy tool to create ad-hoc data in the form of tables. We did this to create a table called WORDS, which contains a couple of words inside of which we’d like to look for palindromes. In some databases (including DB2 and PostgreSQL), it’s totally possible to just use the VALUES() clause as a standalone clause instead of SELECT:

VALUES
  ('pneumonoultramicroscopicsilicovolcanoconiosis'),
  ('pseudopseudohypoparathyroidism'),
  ('floccinaucinihilipilification'),
  ('antidisestablishmentarianism'),
  ('supercalifragilisticexpialidocious'),
  ('incomprehensibilities'),
  ('honorificabilitudinitatibus'),
  ('tattarrattat')

Recursive Generation of Integers

In order to look for palindromes, the algorithm used here lists, for each word, the character index of each individual character in the word:

WITH RECURSIVE
  ...
  starts (word, start) AS (
    SELECT word, 1 FROM words
    UNION ALL
    SELECT word, start + 1 FROM starts WHERE start < length(word)
  ),
  ...

For example, if we used the word “word”…

WITH RECURSIVE
  words (word) AS (VALUES('word')),
  starts (word, start) AS (
    SELECT word, 1 FROM words
    UNION ALL
    SELECT word, start + 1 FROM starts WHERE start < length(word)
  )
SELECT *
FROM starts

We’d get the following result:

word |start 
-----|------
word |1     
word |2     
word |3     
word |4     

The idea of the algorithm is that we start from a given character and “fan out” in both directions of a character, again recursively. The characters to the left and to the right of such a character must be the same and so on.

If you’re interested in the syntax, stay tuned. There will be a blog post coming up soon.

Using CROSS JOIN to run the Algorithm Twice

There are two types of palindromes:

  • Those with an even amount of characters: “”, “aa”, “abba”, “babbab”
  • Those with an odd amount of characters: “b”, “aba”, “aabaa”

In our algorithm, we’re treating them in almost the same way by running the algorithm twice. Whenever you think “hey, I need to run this twice in SQL”, CROSS JOIN could be a good candidate, as we’re creating a cartesian product between:

  • The previous “starts” table, giving the starting character index
  • A table containing values 0 (palindromes with even amounts of letters) and 1 (palindromes with odd amounts of letters)

For more information about CROSS JOIN, read our article about JOINs.

Run this query for illustration:

WITH RECURSIVE
  words (word) AS (VALUES('word')),
  starts (word, start) AS (
    SELECT word, 1 FROM words
    UNION ALL
    SELECT word, start + 1 FROM starts WHERE start < length(word)
  )
SELECT *
FROM starts CROSS JOIN (VALUES(0),(1)) AS t(x)

Output:

word |start |x 
-----|------|--
word |1     |0  <-- Even length palindromes centering around position 1 length 0
word |1     |1  <-- Odd length palindromes centering around position 1 length 1
word |2     |0 
word |2     |1 
word |3     |0 
word |3     |1 
word |4     |0 
word |4     |1 

Trivially, a palindrome of length 0 (empty string) or length 1 (“w”, “o”, “r”, “d”) is acceptable in principle, but boring. We’ll filter them out later, but keep them in the algorithm as this simplifies the algorithm. If you want to tune the query, you could prevent generating them in the first place.

The Palindrome Algorithm

Thus far, we’ve just prepared utility data to calculate palindromes:

  • The words inside of which to search for palindromes
  • The individual character indexes to start fanning out from, in each individual word
  • The minimum palindrome length 0 (even) or 1 (odd)

Now comes the interesting part. Recursively fanning out from a starting character to find more palindromes, stopping the fanning out as soon as the new candidate is no longer a plaindrome:

WITH RECURSIVE
  ...
  palindromes (word, palindrome, start, length) AS (

    -- This part just creates the even/odd semantics
    SELECT word, substring(word, start, x), start, x
    FROM starts CROSS JOIN (VALUES(0), (1)) t(x)
    UNION ALL

    -- This part recurses by "fanning out"
    SELECT word, palindrome, start, length + 2
    FROM (
      SELECT 
        word, 
        substring(word, start - length / 2, length) AS palindrome,
        start, length
      FROM palindromes
    ) AS p
    WHERE start - length / 2 > 0 
    AND start + (length - 1) / 2 <= length(word) 
    AND substring(palindrome, 1, 1) = 
        substring(palindrome, length(palindrome), 1)
  )
  ...

It isn’t really so hard in fact. The recursion part selects recursively from the PALINDROMES table (the previously calculated palindromes). That table has 4 columns:

  • WORD: The word we’re looking for palindromes in. This is always the same per recursion
  • PALINDROME: The palindrome, i.e. the substring inside of a word. This changes per recursion
  • START: The start character index from which we started fanning out. This is always the same per recursion
  • LENGTH: The palindrome length. This increases by 2 per recursion

Let’s look at the result for “floccinaucinihilipilification”:

flo  CC  inaucinihilipilification              
floccinauc  INI  hilipilification              
floccinaucin  IHI  lipilification              
floccinaucinih  ILI  p  ILI  fication          
floccinaucinih  ILIPILI  fication              
floccinaucinihi  LIPIL  ification              
floccinaucinihil  IPI  lification              
floccinaucinihilipil  IFI  cation              

There are a total of 8 distinct palindromes contained in this word (and believe it or not, the word exists, too. Pronunciation is another thing).

Let’s “debug” the algorithm to get to this list (remember, SQL indexes are 1 based):

Start 1-4: No palindromes
Start 5: Even palindrome [4:5]
  flo  CC  inaucinihilipilification              

Start 6-11: No palindromes (I wont' repeat this further down)
Start 12: Odd palindrome [11:13]
  floccinauc  INI  hilipilification              

Start 14: Odd palindrome [13:15]
  floccinaucin  IHI  lipilification              

Start 16: Odd palindrome [15:17]
  floccinaucinih  ILI  p  ILI  fication          

Start 18: Odd palindrome [17:19], [16:20], [15:21] (Fanning out 3 times)
  floccinaucinihil  IPI  lification              
  floccinaucinihi  LIPIL  ification              
  floccinaucinih  ILIPILI  fication              

Start 20: Odd palindrome [19:17] (already found this)
  floccinaucinih  ILI  p  ILI  fication          

Start 22: Odd palindrome [21:23]
floccinaucinihilipil  IFI  cation              

The IPI, LIPIL, ILIPILI chain is the most interesting. We’ve succeeded to fan out 3 times adding new characters from WORD on both sides of the initial character.

When do we stop fanning out? Whenever one of these conditions hold true:

    WHERE start - length / 2 > 0 
    AND start + (length - 1) / 2 <= length(word) 
    AND substring(palindrome, 1, 1) = 
        substring(palindrome, length(palindrome), 1)

I.e.

  • When we’ve reached the beginning of WORD (no more characters to the left)
  • When we’ve reached the end of WORD (no more characters to the right)
  • When the letter to the left of the palindrome of the previous recursion doesn’t match the letter to the right of the palindrome

That last predicate could also simply read:

    AND palindrome = reverse(palindrome)

But that might be a bit slower as it compares something that we’ve already proven to be true in the previous recursion.

Finally, formatting the result

The final part isn’t too interesting anymore:

SELECT DISTINCT 
  word, 
  trim(replace(word, palindrome, ' ' || upper(palindrome) || ' '))
    AS palindromes
FROM palindromes
WHERE length(palindrome) > 1
ORDER BY 2

We’ll simply:

  • Select DISTINCT results only, as palindromes might appear several times in a word
  • Replace the palindrome substring by its upper case version and add some whitespace to better visualise it. That’s totally optional, of course
  • Remove the trivial palindromes of length 0 and 1

And we’re done! Again, feel free to play around with this on SQLFiddle, or, much better, provide a cool palindrome algorithm implementation in your favourite language in the comments section, or on Twitter:

More beautiful SQL in these articles here:

jOOQ Tuesdays: Oliver Gierke Talks About Spring Data

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

I’m very excited to feature today Oliver Gierke, the Spring Data Project Lead at Pivotal with strong opinions on DDD and REST.

Hi Oliver – Since 2011, you have mostly been working for Pivotal (previously SpringSource) on Spring Data as the project lead. What is it that fascinates you most about working with data?

To be completely honest it’s not data in itself that got me into Spring Data, or even the predecessor of the JPA module which I was working on even before I joined SpringSource, but the interest in managing complexity in software. When you talk about that, there’s no way you can avoid talking about Domain-Driven Design and its building blocks like value objects, entities, aggregates and the concept of a repository which then again gets you into the realms of data access.

So we as the Spring Data team always have to trade different driving forces against each other: first, the level of abstraction and the programming model that you use in your application code and how well and easy it actually allows you to implement domain logic that solves your business problem at hand. Second, the different tradeoffs that different data stores have already made and in how we actually allow our users to leverage them and at the same time expose some commonalities in the programming model so that developers can transfer knowledge between projects that might use different stores for certain reasons easily. Spring Data is trying to bridge that gap, provide a low entry barrier in modelling aggregates and repositories but at the same time give users the tools to fall back to very efficient means of data access that require a lot of developer control for cases where that’s the top priority.

Spring Data has an impressive number of officially and unofficially supported modules that reach far beyond relational data models. What are the biggest challenges in working with so many models and technologies in a single API?

Definitely the diversity in approaches and tradeoffs by the underlying persistence technologies. Actually that’s one of the reasons, Spring Data does not try to provide a singular unifying API. It’s not even trying to do that for stores of a certain category, for example document databases. We’ve rather taken a general Spring ecosystem philosophy and implemented it in the repositories and data access space: let’s have a consistent programming model with repeatable patterns so that it’s easy to understand the purpose of the abstraction, but then let the abstraction expose store specific specialties so that we’re not abstracting away those but rather allow developers to leverage them when needed.

The general concept of an interface based repository abstraction is not the most revolutionary thing on earth, I admit. But we now look back at almost a decade of experience in designing the parts of the programming model that are indeed API and I think we’ve learned a lot from mistakes we made. Java 8 will be the baseline for the upcoming second generation of Spring Data which enriches our options in terms of APIs. Reactive programming is a hot topic at them moment, too. So there are a lot of balls to juggle in that space.

What’s your favourite module, and why?

That’s of course hard to say as it’s been awesome to see what the individual store modules have turned into over time. However, I’ve grown a bit of a special relationship for the Commons module which is the foundation for all of the store ones as it basically contains the heart and soul of Spring Data: the object mapping facilities, the repository proxy implementation etc. And it’s great to see how we often times can add some functionality there and that functionality is immediately available for repositories of all stores.

On the other end of the spectrum there’s Spring Data REST, a module that exposes RESTful resources based on your aggregate and repository definitions. I like that very much as well, as it works across all the Spring Data modules exposing a repository API and is a great showcase of what you can achieve on top of such an abstraction. Also, it has really helped us to make developers aware of a couple of often overseen aspects of REST, but I guess we’re gonna get to that in a second.

Maintaining a big and widely used API is hard, balancing tons of user requests, integrating third party functionality, maintaining backwards compatibility. What are some maintainer battle stories you’d like to share?

It certainly is. Especially with so many concurrent — sometimes contradicting — forces in play. That starts with the question of versioning the modules: what do we actually version here? User facing API. But which part of the API is that? That totally depends on how much the user is customizing behavior. Is a developer building a Spring Data module for some new data store a user, too? Of course, but a very different kind of user. We usually try to be very conservative with changes that could affect application developers but a bit more demanding when it comes to the implementors of a store module.

That’s all stuff we sort of had and have to deal with on a day to day basis. Interestingly, we’ve been the first ones in the broader Spring engineering team that have picked up the notion of a release train — we group together releases of all modules and name them after famous computer scientists —, that had been popularized by the Eclipse team. That approach worked well for us and has now been adopted by Spring Cloud, Reactor and other teams as well.

Oliver, I have to ask, why does everyone misunderstand REST?

I’m kind of surprised this question comes up in this context, but I guess I have build up some reputation on the internet to complain about people being from unspecific to — in my opinion — outright wrong about this topic :D.

I guess the fundamental problem that REST has is that some parts of it are moderately easy to understand and implement. These days everyone agrees that URIs are a cool thing and that using the right HTTP verb for a given task is a good idea. But even with the latter you’ll easily find people that don’t understand why it’s a good idea to prefer a PUT request over a POST one. Which already brings us to the second part.

Then there are parts that are harder to grasp and a bit harder to implement. The hypermedia aspect comes to mind. Unfortunately theses aspects are the ones that heavily influence whether what you build delivers on the promises that REST makes: being an architectural style that gives you e.g. scalability and evolvability. So people basically start ignoring these aspects, sometimes even outright arguing they don’t need them but then turn around and criticize REST for not delivering on its promises.

In my opinion that’s a way to common pattern observable in the wild, but I guess the only way to improve the situation here is to work on making it easier to implement those aspects and good examples of the benefits you get when you follow those advanced constraints.

5 Things You May Not Have Known About jOOQ

jOOQ has been around for a while now (since 2009!) and by now we can say we’ve seen quite a bit of things about the SQL and Java languages. Some of our design decisions are particular in the way jOOQ thinks about programming with SQL. These include:

  • Nullability (let’s stop fighting it)
  • Value types (let’s stop pretending SQL has identities)
  • Everything is a table (this really helps get the most out of SQL)
  • Queries are side-effect free functions

jOOQ incorporates all of these ideas. Here are 5 Things You May Not Have Known About jOOQ:

1. Every Column Type is Nullable

SQL NULL is a subtly different beast from Java null, even if programmers often use it for the same thing: Something that is “uninitialised”, some value that we don’t “care about” (yet), or some value that we “don’t need”. A good example would be a middle name:

CREATE TABLE person (
  first_name  VARCHAR(50) NOT NULL,
  middle_name VARCHAR(50),
  last_name   VARCHAR(50) NOT NULL,
  ..
);

Of course, a sufficiently pessimistic programmer will immediately see tons of flaws with the above design. Go read this article about falsehoods programmers believe about names for details.

But anyway, the important thing to understand about NOT NULL constraints in SQL is the fact that they’re… constaints. Just like UNIQUE constraints, FOREIGN KEY constraints, or CHECK constraints.

In fact, they are CHECK constraints. We could rewrite the above table as such:

CREATE TABLE person (
  first_name  VARCHAR(50) CHECK (first_name IS NOT NULL),
  middle_name VARCHAR(50),
  last_name   VARCHAR(50) CHECK (last_name IS NOT NULL),
  ..
);

… and the table would be semantically equivalent. This constraint is just so common that it has a special syntax for it (which is also sometimes better optimised than the equivalent check constraint).

Sidenote: An even more sophisticated constraint type is the SQL standard assertion, which unfortunately hasn’t been implemented in any database I’m aware of yet. There are discussions of adding it to a future Oracle version, though. Assertions are like CHECK constraints, but they work on the entire table / schema / whatever scope. For instance, we could assert that every department of a company must have at least one manager. Currently, we can do this sort of thing only through triggers.

The important message here is that a constraint is a validation on the entire data set (or on a subset, down to an individual row). It is not a type modifier, because even if the NOT NULL constraint may have direct optimisation implications on the column type it is attached to, it is a separate construct that can even be deferred. While languages don’t have to be this way (e.g. Ceylon models constraints directly on types), SQL works like this.

Two examples:

  1. DEFAULT columns: When you have an identity column or some sort of GENERATED BY DEFAULT AS... clause on your column, then the value in the column may be generated by default (duh), which may include – depending on the RDBMS vendor – the generation of a value when it is NULL.
  2. DEFERRED constraints: Some databases (e.g. PostgreSQL) support deferred constraints, i.e. constraints that are validated only when the transaction is committed. This can be specified on individual constraints, or on the session. Which means that the value NULL is a totally acceptable value for a NOT NULL column for a certain amount of time.

Both of the above imply that we must not take NOT NULL as a type modifier, the way some languages have started doing it, like Ceylon or Kotlin:

val a: String? = null;
val b: String = a; // Error

In such languages, String? and String are distinct types, specifically in Ceylon where String? is just syntax sugar for the union type String|Null.

But not in SQL. If a Java API wants to properly reflect the SQL language the way jOOQ does, then all types must be nullable. It is a mistake to:

  • Use primitive types
  • Use Option(al) (there are other caveats with these related to generic type erasure)
  • Use non-null types in languages that have them
  • Use validation annotations (we made that mistake, unfortunately)
  • Use JSR-305 or JSR-308 annotations

Sidenote: If this constraint information should be annotated in Java classes, then JPA @Column(nullable=true) annotations are acceptable, because they simply map to the constraint without any implications on the type. The implications are applied on the persistence behaviour, which is reasonable.

Besides, even if at first, encoding nullability through e.g. Option(al) seems reasonable, it breaks as soon as you outer join anything, e.g.:

SELECT p.*
FROM dual
LEFT JOIN person p
ON p.first_name = 'Ooops, no one by that name'

The above query will produce a single person record with only NULL values in its columns. DESPITE the NOT NULL constraints. Ooops. We’ll get null in non-optional types.

Similar things can happen with unions, grouping sets, functions, and a few other operations.

Takeaway

In SQL, all types are always nullable. We simply have to deal with this. Every clever type safety is contrary to SQL logic. If your API does this, you may get some minor convenience in 80% of the use-cases for the price of a major annoyance in 20% of the use-cases. That’s not a reasonable tradeoff given that in Java, every non-primitive type is nullable as well, so we got a perfect and intuitive match.

2. SQL is a Set-Based, Values-Only Language

Values or Objects? That’s a tricky question for people who work with Java, a language that claims to be mainly object-oriented. Java has value support as well. There are 8 different value types as of Java 8:

  • byte
  • short
  • int
  • long
  • float
  • double
  • boolean
  • char

Values have a couple of nice properties:

  • They are immutable. It may be possible to mutate a variable holding such a value, but we cannot mutate the value itself. 42 will always stay 42
  • Two values that are equal are undistinguishable. 42 == 42 really means that they’re the exact same thing. Reusing == for value equality and identity equality has been a bit of an unfortunate choice in Java, because technically, a String is also a value, yet we cannot compare it with == because there’s a possibility of two identical strings having different identity. (True) values don’t have identity.

Java 8 introduced the notion of a “ValueBased” class, which is really a weird thing, because a “ValueBased” wrapper like Optional can reference a non-value based type, say, a java.sql.Connection. Not a good idea, but certainly possible.

A future Java might have more complex value types, for instance:

// Hypothetical syntax
value Point(int x, int y) {}
value Box(Point a, Point b) {
  int area() {
    return Math.abs(a.x - b.x * a.y - b.y);
  }
}

This will certainly be helpful (as soon as we’ll figure out how to model nullability in such scenarios).

In SQL, all records are values. They do not have a true identity (although most databases choose to provide implementation specific identities like ROWIDs). Do not confuse primary keys with identity descriptors. A primary key is a special value that is guaranteed to be unique within a table. It happens to be used as a logical identity (at least when using surrogate keys). But as NOT NULL constraints, PRIMARY KEY constraints are constraints, and they’re deferrable in some databases.

And there are many ways how we can produce results where primary keys are no longer meaningful, e.g. this:

SELECT * FROM person
UNION ALL
SELECT * FROM person

SQL, unlike relational algebra, doesn’t operate on sets but on bags (or multisets), i.e. data structures that allow for duplicate values. Multisets make analytics much more powerful, while making OLTP quite harder. As always, with useful things, they come at a price.

jOOQ, by consequence, also works in the value-oriented multi set paradigm. This is completely contrary to what Hibernate / JPA does, as Hibernate emulates entity identity through the primary key, which it has to do, being an object-graph persistence API. It doesn’t have to do this because of working with sets rather than multisets, although having identities does make things easier in that paradigm. If you want to read an interesting and entertaining discussion on the subject, check out these tweets between Gavin King and myself:

The importance here is to understand: Neither approach is absolutely better. Both have their advantages. If a RDBMS vendor had implemented a database following a set-based approach instead of SQL’s multiset-based approach, a lot of persistence problems would have been much easier to implement on that RDBMS. On the other hand, a lot of reporting and analytics would have been harder, because with sets being sets, we’d have to constantly prevent “duplicates” from being removed early by keeping primary keys around in queries until the final aggregation.

Now even if we could re-start this interesting discussion, fact is, that we have SQL and it is multiset-based. The default is SELECT "ALL", not SELECT DISTINCT (the ALL keyword being specified in the standard, but not available in most implementations).

When using jOOQ, a value-based record-centric programming approach is recommended, where result sets from jOOQ queries are really “just” streams of records, which will be further transformed without ever thinking about persisting any elements from those streams again. Sure there can be write operations as well, but a jOOQ (or SQL) write operation is also a multiset-based streaming of records (values) back into the database. That’s important to know, because all of

  • INSERT
  • UPDATE
  • DELETE
  • MERGE

statements are multiset-based, i.e. they take a set of values, not just a single row. For instance, INSERT:

-- Not all databases support this standard syntax:
INSERT INTO t (a, b, c)
VALUES (1, 2, 3),
       (4, 5, 6),
       (7, 8, 9);

-- But all databases support this one:
INSERT INTO t1 (a, b, c)
SELECT a, b, c
FROM t2;

Notice how this has absolutely nothing to do with identity-based object-graph persistence. In SQL, we’re always streaming a set of values from one place to another, possibly targeting a table where we store that set. The approach is really beautiful, try to think this way and it’ll open up a whole new world to the SQL-oriented programmer.

In a future article, I’ll even go a step further and claim that SQL is an (almost) completely side-effect free language (and this includes statements like INSERT – stay tuned).

Takeaway

In SQL, everything is a value. There is no identity. It is not needed, because SQL is a multiset-based language, where we’re always operating on the entire data set, not on individual records, even if CRUD operations may make you think otherwise. jOOQ encourages this way of thinking by putting the table and the “value-based” record into the center of the programming model.

3. ResultQuery is an Iterable

I’ve blogged about this before, and some users may have discovered it by accident, intrigued. A jOOQ ResultQuery is an Iterable, meaning that you can “foreach it”:

ResultQuery<?> query =
DSL.using(configuration)
   .select(PERSON.FIRST_NAME, PERSON.LAST_NAME)
   .from(PERSON);

// Java 5 style
for (Record record : query)
  System.out.println(record);

// Java 8 style
query.forEach(System.out::println);

It makes a lot of sense. A SQL query is a description of a set of tuples. SQL is a functional programming language, and if you forget about some concurrency aspects, it is, in principle, side-effect free. This means that the query really IS the set of tuples (another nice way to think about SQL!). With that thought in mind, we can simply iterate it.

To the procedural mind of many Java developers, this might be a bit funky and surprising, but give this a little thought and it might “click”. Consider also this previous article, claiming that streams, for comprehensions, and SQL are all the same:

Or also this fun tweet:

Takeaway

We’re not there yet in Java, we still explicitly iterate, but when we do, and the data source is a SQL query, make it a jOOQ query because that helps you forget about the difference between the query and the data, which are really the same thing in SQL.

4. Ordering is Nice When It’s Cheap. Let’s Retain It

You should avoid ORDER BY in SQL if you don’t really need it. Why? Because unless you can profit from an index that has already pre-ordered your result sets, sorting is a super expensive operation in all programming languages, including SQL. It’s essentially O(n log n).

But let’s assume you do have to sort your results, well, we better want to make sure that this ordering stays the same for as long as possible.

By default, jOOQ returns a Result type, or List types, but there are many utility methods like the ResultQuery.fetchMap() method, which can return something like this:

Map<Integer, String> people =
DSL.using(configuration)
   .select(PERSON.ID, PERSON.FIRST_NAME)
   .from(PERSON)
   .orderBy(PERSON.ID)
   .fetchMap(PERSON.ID, PERSON.FIRST_NAME);

Internally, jOOQ collects all data into a LinkedHashMap, which is a slightly more resource intensive map than the similar HashMap. In case you haven’t used this very often, it’s a map that preserves the insertion order when iterating the map using Map.entrySet() and all the other methods. Quite useful when displaying the map, too. After all, if you do specify the ordering, then you wanted that order to appear in the results, right?

In a similar way, when using Collections.sort() in Java, the sort algorithm guarantees that sorting is stable. If you sort a list twice, then the original ordering will be retained for elements that are not re-ordered. I.e. when sorting by first name, and then by last name, the first name ordering will be retained for equal last names.

Takeaway

ORDER BY is expensive, so if you go through the trouble of actually doing it, you want to retain that order.

5. Dynamic SQL is the Default

In the old days, people mostly wrote static SQL, e.g. using stored procedures in languages like PL/SQL. When you write an implicit cursor loop in PL/SQL:

FOR rec IN (SELECT * FROM person)
LOOP
  dbms_output.put_line(rec.first_name || ' ' || rec.last_name);
END LOOP;

… then, this SQL statement is compiled along with the surrounding procedural code and it will never be changed again. That’s useful for batch processing, reporting, etc. (Strictly speaking it isn’t really “static”, because the SQL statement will still be parsed by the SQL engine like any other query, but the PL/SQL programming model allows for hiding this from you).

In modern days, we require dynamic SQL very often, because the SQL code is often generated from user input. Mostly, because:

  • Users can add predicates through the UI
  • Users can specify aggregations through the UI
  • Users can specify ordering through the UI

In some more remote use-cases, users might also influence the JOIN tree and other parts of a dynamically created query.

From a JDBC perspective, all queries are dynamic, even if you’re doing something like this:

try (ResultSet rs = stmt.executeQuery(
  "SELECT * FROM person"
)) {
  while (rs.next())
    out.println(rs.getString(1) + " " + rs.getString(2));
}

Clearly, the SQL string seems “static” in the way that the Java compiler will compile it once and then never touch it again. The above program will always send the exact same SQL string to the server. Yet from a JDBC API perspective, the string is just an argument to the executeQuery() method, just as if we wrote it like this:

try (ResultSet rs = stmt.executeQuery(
  "SELECT * FROM person" + 
  (active ? " WHERE active = 1" : "")
)) {
  while (rs.next())
    out.println(rs.getString(1) + " " + rs.getString(2));
}

Yuck! String concatenation to build SQL strings. There’s a substantial risk of:

Of course, the above example is SQL injection “safe”, because the SQL string is entirely constructed from constants, not user input. But how quickly could the code be refactored to this?

try (ResultSet rs = stmt.executeQuery(
  "SELECT * FROM person" + 
  (active ? (" WHERE active = " + active) : "")
)) {
  while (rs.next())
    out.println(rs.getString(1) + " " + rs.getString(2));
}

SQL builders like jOOQ help prevent SQL injection, even for dynamic SQL queries. The above query will be written as follows in jOOQ:

for (PersonRecord rec : DSL.using(configuration)
        .selectFrom(person)
		.where(active
		    ? PERSON.ACTIVE.eq(active)
			: trueCondition()))
  out.println(rec.getFirstName() + " " + rec.getLastName());

The active flag check that is added to the SQL query dynamically will default to creating a bind variable, and even if it is inlined, it will be escaped, depending on its type.

The interesting bit here, however, is that the jOOQ query is always a dynamic SQL query. The above approach used an inline expression to decide whether a certain predicate needs to be added to the statement. If that predicate gets more complex, we can extract the construction of the predicate to a local variable, or a function.

Local variable

Condition condition = trueCondition();

if (active)
  condition = PERSON.ACTIVE.eq(active);
	
if (searchForFirstName)
  condition = condition.and(PERSON.FIRST_NAME.like(pattern));

for (PersonRecord rec : DSL.using(configuration)
        .selectFrom(person)
		.where(condition))
  out.println(rec.getFirstName() + " " + rec.getLastName());

This is quite neat.

Functions

Or, if things get even more complex, we might like to factor out the logic to a method, or a function. Some people have started calling such an approach “functional relational mapping”:

Condition personCondition(boolean active, String pattern) {
  Condition condition = trueCondition();

  if (active)
    condition = PERSON.ACTIVE.eq(active);
	
  if (pattern != null)
    condition = condition.and(PERSON.FIRST_NAME.like(pattern));
	
  return condition;
}

// And then:
for (PersonRecord rec : DSL.using(configuration)
        .selectFrom(person)
		.where(personCondition(active, pattern)))
  out.println(rec.getFirstName() + " " + rec.getLastName());

Or even:

BiFunction<Boolean, String, Condition> personCondition() {
  return (active, pattern) -> {
    Condition condition = trueCondition();

    if (active)
      condition = PERSON.ACTIVE.eq(active);
	
    if (pattern != null)
      condition = condition.and(PERSON.FIRST_NAME.like(pattern));
	
    return condition;
  };
}

// And then:
for (PersonRecord rec : DSL.using(configuration)
        .selectFrom(person)
		.where(personCondition.apply(active, pattern)))
  out.println(rec.getFirstName() + " " + rec.getLastName());

Not only is this approach to writing dynamic SQL extremely useful for client code that relies on dynamic SQL, the expression tree that is built behind the scenes is also available at runtime for more complex transformations, such as applying row level security to certain queries, or more simply to apply something like schema-based multi-tenancy. While the Java code stays exactly the same, the generated SQL string may be transformed by your own library code, behind the scenes.

Static SQL

Of course, jOOQ doesn’t imply that you have to write dynamic SQL. You can store jOOQ-generated SQL strings in caches, or you can use stored procedures with jOOQ. In fact, jOOQ encourages you to use stored procedures!

Takeaway

Dynamic SQL is really useful. jOOQ defaults to writing dynamic SQL in a way that you don’t even notice. A SQL query is a function just as much as it is a collection description. jOOQ helps you think about SQL this way.

Conclusion

SQL is a beautiful language with an interesting syntax. If we look at the concepts that are the foundation of the SQL language, we see that SQL queries are functional / declarative collection descriptions. With this paradigm in mind, we can write really powerful SQL statements, and jOOQ encourages this as this paradigm is at the core of the jOOQ API design.

Enjoy writing functional-relational mapping code.

jOOQ 3.10 Supports Exciting MySQL 8.0 Features

In recent months, there had been some really exciting news from the MySQL team:

These two SQL standard language features are among the most powerful SQL features that are available from most other databases. I frequently include them in conference talks about SQL (see my article about 10 SQL Tricks That You Didn’t Think Were Possible), and as well in the Data Geekery SQL Masterclass. With MySQL 8.0 now supporting these exciting features, the masterclass will be including MySQL as well (along with Oracle, SQL Server, PostgreSQL, and DB2). And, of course, these features are now supported in the upcoming jOOQ 3.10 as well.

Want to try it out yourself? Just run:

docker pull mysql:8.0.2
docker run --name MYSQL802 --net=host -p 3306:3306 -e MYSQL_ROOT_PASSWORD=test -d mysql:8.0.2

Then, connect to this instance and run this nice little query in it:

WITH RECURSIVE t(a, b) AS (
  SELECT 1, CAST('a' AS CHAR(15))
  UNION ALL
  SELECT t.a + 1, CONCAT(t.b, 'a')
  FROM t
  WHERE t.a < 10
)
SELECT a, SUM(a) OVER (ORDER BY a) AS ∑, b
FROM t

And get this result:

a       ∑       b
--------------------------
1       1       a
2       3       aa
3       6       aaa
4       10      aaaa
5       15      aaaaa
6       21      aaaaaa
7       28      aaaaaaa
8       36      aaaaaaaa
9       45      aaaaaaaaa
10      55      aaaaaaaaaa

Would you believe this is MySQL?

Bonus

A nice “hidden” feature is the support of new pessimistic locking clauses, in particular FOR UPDATE SKIP LOCKED. This has been available in Oracle for ages and since recently in PostgreSQL as well, and now in MySQL. A very useful feature when implementing simple message queues or reservation systems. More details in this article here:

http://mysqlserverteam.com/mysql-8-0-1-using-skip-locked-and-nowait-to-handle-hot-rows/

Of course, SKIP LOCKED (and NOWAIT) will be supported in jOOQ 3.10 as well.