How to Write Multiset Conditions With Oracle VARRAY Types

Oracle is one of the few databases that implements the SQL standard ORDBMS extensions, which essentially allow for nested collections. Other databases that have these features to some extent are CUBRID, Informix, and PostgreSQL.

Oracle has two types of nested collections:

-- Nested tables
CREATE TYPE t1 AS TABLE OF VARCHAR2(10);
/

-- Varrays
CREATE TYPE t2 AS VARRAY(10) OF VARCHAR2(10);
/

The main difference at first is that a nested table can be of arbitrary size, whereas a varray has a fixed maximum size. Other than that, they behave in similar ways.

When storing a nested collection in a table, there is another difference. Varrays can be inlined into the table just like any other data type, whereas nested tables have to be accompanied by an additional storage clause:

CREATE TABLE t (
  id NUMBER(10),
  t1 t1,
  t2 t2
)
NESTED TABLE t1 STORE AS t1_nt;

This is a minor hassle in terms of DDL. The runtime implications are more significant.

Multiset Conditions

The most important difference is the fact that all the useful multiset conditions are not available with varrays. For instance, consider running these statements:

INSERT INTO t VALUES (1, NULL, NULL);
INSERT INTO t VALUES (2, t1(), t2());
INSERT INTO t VALUES (
  3, 
  t1('abc', 'xyz', 'zzz'), 
  t2('abc', 'xyz', 'zzz')
);
INSERT INTO t VALUES (
  4, 
  t1('dup', 'dup', 'dup'), 
  t2('dup', 'dup', 'dup')
);

SELECT * FROM t WHERE 'abc' MEMBER OF t1;
SELECT * FROM t WHERE 'abc' MEMBER OF t2;

The result of these queries is:

ID  T1                        T2
-----------------------------------------------------
3   T1('abc', 'xyz', 'zzz')   T2('abc', 'xyz', 'zzz')

ORA-00932: inconsistent datatypes: expected UDT got TEST.T2

Bummer. The documentation is a bit unclear about this. It reads:

The return value is TRUE if expr is equal to a member of the specified nested table or varray. The return value is NULL if expr is null or if the nested table is empty.

There is some explicit mention of varrays supporting these operations, but in most of the documentation, varrays are not mentioned. So, how can we write such operations with varrays? Here’s a list of translations of each nested table operator to an equivalent SQL expression for use with varrays.

These are the multiset conditions:

IS A SET condition

In SQL, everything is a (partially ordered) multiset by default. Sometimes, however, we want to work with sets, i.e. a special type of multiset that has no duplicate values. We can easily check whether nested tables are sets (or whether they aren’t):

-- Nested table version
SELECT * FROM t WHERE t1 IS A SET;

-- Varray version
SELECT * 
FROM t 
WHERE t2 IS NOT NULL
AND (SELECT count(*) FROM TABLE(t2)) 
  = (SELECT count(DISTINCT column_value) FROM TABLE(t2));

The IS A SET operation yields UNKNOWN if the nested table is NULL, so we have to take that into account as well. If it isn’t NULL, we can count the total values in the varray and compare that with the total distinct values in the varray.

The result is:

ID  T1                        T2
-----------------------------------------------------
2   T1()                      T2()
3   T1('abc', 'xyz', 'zzz')   T2('abc', 'xyz', 'zzz')
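By the way, the counting logic behind this emulation is easy to sanity-check outside of SQL. Here’s a minimal Python sketch of the same “no duplicates” test (the function name is made up for illustration):

```python
def is_a_set(varray):
    """Emulates Oracle's IS A SET: NULL collections yield UNKNOWN
    (here: None), otherwise check that no value occurs twice."""
    if varray is None:
        return None  # IS A SET is UNKNOWN for NULL collections
    return len(varray) == len(set(varray))

# The four rows inserted above:
print(is_a_set(None))                   # None  (row 1 is filtered out)
print(is_a_set([]))                     # True  (row 2)
print(is_a_set(['abc', 'xyz', 'zzz']))  # True  (row 3)
print(is_a_set(['dup', 'dup', 'dup']))  # False (row 4)
```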

IS EMPTY condition

This predicate needs no explanation. It can be written as such:

-- Nested table version
SELECT * FROM t WHERE t1 IS EMPTY;

-- Varray version
SELECT * 
FROM t 
WHERE t2 IS NOT NULL
AND NOT EXISTS (
  SELECT * FROM TABLE (t2)
);

The result being:

ID  T1                 T2
---------------------------------------
2   T1()               T2()

MEMBER condition

This handy predicate can help check if a specific value is contained in a nested collection. It can be written as such:

-- Nested table version
SELECT * FROM t WHERE 'abc' MEMBER OF t1;

-- Varray version
SELECT *
FROM t
WHERE t2 IS NOT NULL
AND EXISTS (
  SELECT 1 FROM TABLE(t2) WHERE column_value = 'abc'
);

Yielding:

ID  T1                        T2
-----------------------------------------------------
3   T1('abc', 'xyz', 'zzz')   T2('abc', 'xyz', 'zzz')

SUBMULTISET condition

Just like the previous MEMBER condition, this predicate can help check if specific values (more than one) are contained in a nested collection. This is a bit more tricky than the previous emulations. The MEMBER condition works the same way for sets and multisets, as we’re checking if exactly one element is contained in the (multi)set.

When working with multisets, duplicates are allowed, and in the case of the SUBMULTISET operation, the following can be observed:

-- Equal multisets
t1() SUBMULTISET OF t1();
t1('a', 'a') SUBMULTISET OF t1('a', 'a');

-- Subsets
t1('a') SUBMULTISET OF t1('a', 'a');

-- But this is not true
t1('a', 'a') SUBMULTISET OF t1('a');

When we ignore the fact that nested collections can be multisets and pretend we’re working with sets only, the emulation of the SUBMULTISET operator is relatively easy:

-- Nested table version
SELECT * FROM t WHERE t1('abc', 'xyz') SUBMULTISET OF t1;

-- Varray version
SELECT *
FROM t
WHERE t2 IS NOT NULL
AND EXISTS (
  SELECT 1 FROM TABLE(t2) 
  WHERE column_value = 'abc'
  INTERSECT
  SELECT 1 FROM TABLE(t2) 
  WHERE column_value = 'xyz'
);

Yielding, once more:

ID  T1                        T2
-----------------------------------------------------
3   T1('abc', 'xyz', 'zzz')   T2('abc', 'xyz', 'zzz')

If we’re really working with multisets, things are a bit more tricky:

-- Nested table version
SELECT * FROM t WHERE t1('dup', 'dup') SUBMULTISET OF t1;

-- Varray version
SELECT *
FROM t
WHERE t2 IS NOT NULL
AND NOT EXISTS (
  SELECT column_value, count(*)
  FROM TABLE (t2('dup', 'dup')) x
  GROUP BY column_value
  HAVING count(*) > (
    SELECT count(*)
    FROM TABLE (t2) y
    WHERE y.column_value = x.column_value
  )
);

Yielding:

ID  T1                        T2
-----------------------------------------------------
4   T1('dup', 'dup', 'dup')   T2('dup', 'dup', 'dup')

How does it work? In the NOT EXISTS correlated subquery, we’re counting the number of duplicate values in the potential SUBMULTISET, effectively turning that SUBMULTISET into a SET using the GROUP BY operation.

We’re then comparing that count value from the left operand with the corresponding count value from the right operand. If there is no value in the left operand whose number of occurrences is bigger than the number of occurrences of that value in the right operand, then the whole left operand is a SUBMULTISET of the right operand.

Cool, eh? We’ll talk about performance another time :-)
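To make the counting argument concrete, here’s the same containment check sketched in Python, using collections.Counter as the multiset (the function name is made up):

```python
from collections import Counter

def submultiset_of(left, right):
    """left SUBMULTISET OF right: every value in left must occur
    at least as often in right."""
    right_counts = Counter(right)
    return all(n <= right_counts[v] for v, n in Counter(left).items())

print(submultiset_of(['a'], ['a', 'a']))                      # True
print(submultiset_of(['a', 'a'], ['a']))                      # False
print(submultiset_of(['dup', 'dup'], ['dup', 'dup', 'dup']))  # True
```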

MULTISET operators

Also very interesting, the multiset operators:

  • MULTISET EXCEPT [ ALL | DISTINCT ]
  • MULTISET INTERSECT [ ALL | DISTINCT ]
  • MULTISET UNION [ ALL | DISTINCT ]

Notice how there are some differences from the ordinary set operators that can be used in SELECT statements. In particular:

  • EXCEPT is used as defined in the standard, not MINUS
  • ALL is supported on all three operators, not just on UNION
  • ALL is the default, not DISTINCT

How can we work with these operators? Consider these queries:

SELECT id, t1 MULTISET EXCEPT t1('aaa', 'abc', 'dup', 'dup') r 
FROM t;

SELECT id, t1 MULTISET EXCEPT ALL t1('aaa', 'abc', 'dup', 'dup') r 
FROM t;

Both yielding:

ID   R
---------------------
1    (null)
2    T1()
3    T1('xyz', 'zzz')
4    T1('dup')

With this operator, we’re removing each element of the right operand once from the left operand:

  • 'aaa' does not appear in the left operand, so nothing happens
  • 'abc' appears on row with ID = 3 and we remove it
  • 'dup' appears on row with ID = 4, 3 times, and we remove it twice, leaving one value

Conversely, when adding DISTINCT, we’ll get:

SELECT id, t1 MULTISET EXCEPT DISTINCT t1('aaa', 'abc', 'dup') r 
FROM t;

Yielding:

ID   R
---------------------
1    (null)
2    T1()
3    T1('xyz', 'zzz')
4    T1()

The only difference is on row with ID = 4, where all 'dup' values were removed, regardless how many there were on either side of the MULTISET EXCEPT DISTINCT operator.
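Both removal semantics are easy to pin down in a few lines of Python, which may help when reasoning about the emulations that follow (the function names are made up):

```python
def multiset_except_all(left, right):
    """MULTISET EXCEPT ALL: remove each element of the right
    operand once from the left operand."""
    result = list(left)
    for v in right:
        if v in result:
            result.remove(v)  # removes a single occurrence
    return result

def multiset_except_distinct(left, right):
    """MULTISET EXCEPT DISTINCT: remove all occurrences of every
    value present in the right operand."""
    removals = set(right)
    return [v for v in left if v not in removals]

# Row with ID = 4 from above:
print(multiset_except_all(['dup', 'dup', 'dup'],
                          ['aaa', 'abc', 'dup', 'dup']))  # ['dup']
print(multiset_except_distinct(['dup', 'dup', 'dup'],
                               ['aaa', 'abc', 'dup']))    # []
```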

How to emulate this for varrays?

DISTINCT version

This is a bit easier, because we can now use MINUS:

-- Nested table version
SELECT t1 MULTISET EXCEPT DISTINCT t1('aaa', 'abc', 'dup', 'dup') 
FROM t;

-- Varray version
SELECT 
  id,
  CASE 
    WHEN t2 IS NULL THEN NULL 
    ELSE 
      CAST(MULTISET(
        SELECT column_value
        FROM TABLE (t2)
        MINUS
        SELECT column_value
        FROM TABLE (t2('aaa', 'abc', 'dup', 'dup'))
      ) AS t2)
  END r
FROM t;

Luckily, we can still cast a structural MULTISET type that we can obtain using the MULTISET() operator to a varray type. This greatly simplifies the task.

ALL version

If we want the MULTISET EXCEPT or MULTISET EXCEPT ALL semantics, things are trickier. Here’s a solution that resorts to using window functions, in order to turn a MULTISET back into a SET:

-- Nested table version
SELECT t1 MULTISET EXCEPT ALL t1('aaa', 'abc', 'dup', 'dup') 
FROM t;

-- Varray version
SELECT 
  id,
  CASE 
    WHEN t2 IS NULL THEN NULL 
    ELSE 
      CAST(MULTISET(
        SELECT column_value
        FROM (
          SELECT 
            column_value,
            row_number() OVER (
              PARTITION BY column_value 
              ORDER BY column_value) rn
          FROM TABLE (t2)
          MINUS
          SELECT 
            column_value, 
            row_number() OVER (
              PARTITION BY column_value 
              ORDER BY column_value) rn
          FROM TABLE (t2('aaa', 'abc', 'dup', 'dup'))
        )
      ) AS t2)
  END r
FROM t;

How does this work? Let’s look at what this ROW_NUMBER() evaluates to on each row. For this, we use OUTER APPLY:

SELECT id, t2, column_value, rn
FROM t
OUTER APPLY (
  SELECT 
    column_value,
    row_number() OVER (
      PARTITION BY column_value
      ORDER BY column_value) rn
  FROM TABLE (t2)
);

The result is:

ID      T2                       COLUMN_VALUE  RN
-----------------------------------------------------
1       (null)                   (null)        (null)
2       T2()                     (null)        (null)
3       T2('abc', 'xyz', 'zzz')  abc           1
3       T2('abc', 'xyz', 'zzz')  xyz           1
3       T2('abc', 'xyz', 'zzz')  zzz           1
4       T2('dup', 'dup', 'dup')  dup           1
4       T2('dup', 'dup', 'dup')  dup           2
4       T2('dup', 'dup', 'dup')  dup           3

As can be seen, each duplicate value gets assigned a unique row number due to the nature of how ROW_NUMBER() works (this property can be very useful for solving the gaps-and-islands problem. See trick #4).

Now that we turned our (COLUMN_VALUE) multiset into a (COLUMN_VALUE, RN) set (without duplicates), we can use MINUS again.
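The effect of that trick can be sketched in Python: numbering each occurrence turns the multiset into a set of (value, occurrence) pairs, on which an ordinary set difference implements the ALL semantics (the helper name is made up):

```python
from collections import defaultdict

def number_duplicates(values):
    """Pair each value with its occurrence count so far, like
    ROW_NUMBER() OVER (PARTITION BY column_value)."""
    seen = defaultdict(int)
    numbered = []
    for v in values:
        seen[v] += 1
        numbered.append((v, seen[v]))
    return numbered

left = set(number_duplicates(['dup', 'dup', 'dup']))
right = set(number_duplicates(['aaa', 'abc', 'dup', 'dup']))

# A plain set difference on the numbered pairs now behaves like
# MULTISET EXCEPT ALL:
print(sorted(v for v, n in left - right))  # ['dup']
```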

MULTISET INTERSECT and MULTISET UNION

MULTISET INTERSECT works exactly the same way as MULTISET EXCEPT, with the same window function based emulation in the MULTISET INTERSECT ALL case. MULTISET UNION is simpler, because Oracle knows UNION ALL, so we do not need to resort to such trickery.

Conclusion

Nested collections are a very powerful tool in Oracle SQL. Oracle knows two types of nested collections:

  • Nested tables
  • Varrays

Nested tables are trickier to maintain as you have to think of their storage more explicitly. Varrays can just be embedded into ordinary tables like any other column. But there’s a price to pay for using varrays. Oracle regrettably doesn’t support all of the above very useful multiset conditions and multiset operators.

Luckily, when you encounter a situation where you have varrays and cannot change that, you can still emulate each of the operators using more traditional SQL.

How to Write Efficient TOP N Queries in SQL

A very common type of SQL query is the TOP-N query, where we need the “TOP N” records ordered by some value, possibly per category. In this blog post, we’re going to look into a variety of aspects of this problem, and how to solve them with standard and non-standard SQL.

These are the different aspects we’ll discuss:

  • Top values
  • Top values with ties
  • Top values per category

Getting the Top Value

When looking at the Sakila database, we might want to find the actor who played in the most films. The simplest solution here would be to use GROUP BY to find the number of films per actor, and then ORDER BY and LIMIT to find the “TOP 1” actor.

Here’s the query in PostgreSQL:

SELECT actor_id, first_name, last_name, count(film_id)
FROM actor
LEFT JOIN film_actor USING (actor_id)
GROUP BY actor_id, first_name, last_name
ORDER BY count(film_id) DESC
LIMIT 1;

Yielding

ACTOR_ID  FIRST_NAME  LAST_NAME  COUNT
--------------------------------------
107       GINA        DEGENERES  42

Other databases have different syntaxes for LIMIT – check out the jOOQ manual for a complete list of emulations of this useful clause, e.g. in Oracle 12c, we would use FETCH:

SELECT actor_id, first_name, last_name, count(film_id)
FROM actor
LEFT JOIN film_actor USING (actor_id)
GROUP BY actor_id, first_name, last_name
ORDER BY count(*) DESC
FETCH FIRST ROW ONLY;

Or, in SQL Server, we could use TOP

SELECT TOP 1 a.actor_id, first_name, last_name, count(film_id)
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
GROUP BY a.actor_id, first_name, last_name
ORDER BY count(*) DESC;

… which kinda makes sense. We want the TOP 1 actor, right?

Alternatives with Subqueries

In my SQL Masterclass, we’re also looking at the performance of various variants of solving this particular problem. For instance, we could calculate the COUNT(*) value in a derived table, prior to joining:

SELECT actor_id, first_name, last_name, COALESCE(c, 0)
FROM actor
LEFT JOIN (
  SELECT actor_id, count(*) AS c
  FROM film_actor
  GROUP BY actor_id
) fa USING (actor_id)
ORDER BY COALESCE(c, 0) DESC
LIMIT 1;

Or, we could calculate the COUNT(*) value in a correlated subquery:

SELECT actor_id, first_name, last_name, (
  SELECT count(*)
  FROM film_actor fa
  WHERE fa.actor_id = a.actor_id
) AS c
FROM actor a
ORDER BY c DESC
LIMIT 1;

These variants can perform vastly differently, depending on the database.

Anyway, the different techniques do not really influence the TOP-N semantics themselves, as we’re always using the same two clauses to get the TOP 1 actor:

ORDER BY c DESC -- Some means of calculating "c"
LIMIT 1;        -- Some means of limiting results to 1 row

Window function alternative

In the old days when databases like Oracle didn’t really support LIMIT (or when using DB2, SQL Server, and Sybase, which support “LIMIT” but not “OFFSET”), people used to resort to using window functions. In order to get the FETCH FIRST n ROWS ONLY semantics, we can use ROW_NUMBER():

SELECT *
FROM (
  SELECT 
    actor_id, first_name, last_name, count(film_id),
    row_number() OVER (ORDER BY count(film_id) DESC) rn
  FROM actor
  LEFT JOIN film_actor USING (actor_id)
  GROUP BY actor_id, first_name, last_name
) t
WHERE rn <= 5
ORDER BY rn;

The derived table contains that ROW_NUMBER() window function, which produces a unique, consecutive row number for each actor ordered by the number of films the actor played in.

Note that thanks to the aggregation step “happening before” the windowing step, we can use aggregation functions inside of window functions.

We’ll revisit the window function approach again later, when implementing “WITH TIES” semantics using RANK().

Oracle Specific Alternative

In Oracle, we have those cool FIRST and LAST functions, with a bit of an arcane syntax. They help find the first or last element in a sorted group.

SELECT 
  max(actor_id)   KEEP (DENSE_RANK FIRST ORDER BY c DESC, actor_id),
  max(first_name) KEEP (DENSE_RANK FIRST ORDER BY c DESC, actor_id),
  max(last_name)  KEEP (DENSE_RANK FIRST ORDER BY c DESC, actor_id),
  max(c)          KEEP (DENSE_RANK FIRST ORDER BY c DESC, actor_id)
FROM (
  SELECT actor_id, first_name, last_name, count(film_id) c
  FROM actor
  LEFT JOIN film_actor USING (actor_id)
  GROUP BY actor_id, first_name, last_name
) t;

It’s fair to say that the syntax is a bit verbose.

How does this work? The nested query counts again all the films per actor, producing the entire result set. The outer query, however, aggregates all these actors and their film counts into a single row, displaying in that row only the first value(s) within the group, ordered by some specific ordering criteria. Note that the ACTOR_ID was also added to the ordering criteria, to get a unique ordering (see “with ties” semantics, below).

Why do we need the MAX() aggregate function? Simply because this feature applies only to aggregate functions, and the FIRST value(s) could be more than one value within the group, namely when the ordering would produce a “tie” (again, see below).

I’ve seen this slightly outperform the alternatives in TOP 1 cases that ignore “ties” (see below). A benchmark of relative execution times shows:

Statement 1 : 1.88945  (FETCH version)
Statement 2 : 1.84442  (window function version)
Statement 3 : 1        (KEEP version)

The benchmark technique is described again below. Measure yourself, though!

Getting the Top Value “With Ties”

In the previous query, we’re getting a single top value, and the result is correct as we can see with this query that returns the TOP 5 values:

SELECT actor_id, first_name, last_name, count(film_id)
FROM actor
LEFT JOIN film_actor USING (actor_id)
GROUP BY actor_id, first_name, last_name
ORDER BY count(film_id) DESC
LIMIT 5;

The result being:

ACTOR_ID  FIRST_NAME  LAST_NAME  COUNT
--------------------------------------
107       GINA        DEGENERES  42
102       WALTER      TORN       41
198       MARY        KEITEL     40
181       MATTHEW     CARREY     39
23        SANDRA      KILMER     37

But what if WALTER TORN had played in yet one more film? We would have a “tie” between GINA DEGENERES and WALTER TORN, so which one would we pick as the TOP 1 actor? There are several possible answers:

  • Pick a random one by accident: This is what our query does right now. This might be considered fair, as the preference for the winner might be merely technical and might change between releases
  • Pick a random one explicitly: We could add another sort criteria that introduces randomness (e.g. a RANDOM() function call). Let’s just hope that this call is only made when needed, i.e. when we have a tie. Otherwise, that would be rather dangerous. But it would be fair.
  • Specify a unique additional sort criteria: E.g. the ACTOR_ID. That would make the result predictable, but in this case a bit unfair.
  • Fetch both tied actors. Yes, we can have several winners. That’s what sports events do, and that’s what we’ll discuss now.
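The last option, fetching all tied winners, boils down to a simple rule: keep the first n rows, plus any further rows whose sort key equals that of the n-th row. A Python sketch, with hypothetical tied counts (the data and function name are made up for illustration):

```python
def top_n_with_ties(rows, key, n):
    """FETCH FIRST n ROWS WITH TIES: the top n rows by key
    (descending), plus all rows tied with the n-th row."""
    ordered = sorted(rows, key=key, reverse=True)
    if len(ordered) <= n:
        return ordered
    cutoff = key(ordered[n - 1])
    return [r for r in ordered if key(r) >= cutoff]

films_per_actor = [('GINA', 42), ('WALTER', 42), ('MARY', 40)]
print(top_n_with_ties(films_per_actor, key=lambda r: r[1], n=1))
# [('GINA', 42), ('WALTER', 42)]
```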

So, let’s add one more film for WALTER TORN:

INSERT INTO film_actor (actor_id, film_id)
VALUES (102, 1);

The SQL standard specifies how we can fetch the first rows “with their ties”, namely by using … well … the FETCH FIRST ROWS WITH TIES syntax!

This is implemented as such in Oracle 12c, the only database I’ve seen with this standard syntax so far.

SELECT actor_id, first_name, last_name, count(film_id)
FROM actor
LEFT JOIN film_actor USING (actor_id)
GROUP BY actor_id, first_name, last_name
ORDER BY count(film_id) DESC
FETCH FIRST ROWS WITH TIES;

Note that all of these syntaxes are possible; they’re all equivalent. Use the one that resembles the English language the most, in your opinion (in other words: everyone does it differently):

FETCH FIRST 1 ROWS WITH TIES
FETCH FIRST 1 ROW WITH TIES
FETCH FIRST ROWS WITH TIES
FETCH FIRST ROW WITH TIES
FETCH NEXT 1 ROWS WITH TIES
FETCH NEXT 1 ROW WITH TIES
FETCH NEXT ROWS WITH TIES
FETCH NEXT ROW WITH TIES

The only other database I’m aware of that knows this feature (but not the standard syntax) is SQL Server:

SELECT TOP 1 WITH TIES 
  a.actor_id, first_name, last_name, count(*)
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
GROUP BY a.actor_id, first_name, last_name
ORDER BY count(*) DESC;

While SQL Server also supports the standard OFFSET .. FETCH syntax for the ROWS ONLY semantics (i.e. pick a random row among the winners), it supports WITH TIES only with the proprietary TOP syntax.

That’s OK, we can always use TOP, because OFFSET pagination is verboten anyway, right?

Using Window Functions in Other Databases

As promised, we’re now revisiting a window function based solution to make the “WITH TIES” semantics work in other databases as well. We can make use of ranking functions, namely RANK() (or in more rare cases DENSE_RANK()).

Here’s how to find the TOP 1 actor WITH TIES in PostgreSQL, for instance:

SELECT *
FROM (
  SELECT 
    actor_id, first_name, last_name, count(film_id),
    rank() OVER (ORDER BY count(film_id) DESC) rk
  FROM actor
  LEFT JOIN film_actor USING (actor_id)
  GROUP BY actor_id, first_name, last_name
) t
WHERE rk = 1;

We’re now nesting the original query in a derived table, adding an additional column RK, which contains the RANK() of the row given the desired ordering. The result is:

ACTOR_ID  FIRST_NAME  LAST_NAME  COUNT  RK
------------------------------------------
107       GINA        DEGENERES  42     1
102       WALTER      TORN       42     1

If we wanted to find the TOP 5 again, we simply search for ranks less or equal to 5. And don’t forget explicit ordering again. Even if the ordering from the subquery will probably be stable (namely the one from the window function, which was the last one applied to the subquery), we must never rely on a non-explicit ordering. NEVER.

SELECT *
FROM (
  SELECT 
    actor_id, first_name, last_name, count(film_id),
    rank() OVER (ORDER BY count(film_id) DESC) rk
  FROM actor
  LEFT JOIN film_actor USING (actor_id)
  GROUP BY actor_id, first_name, last_name
) t
WHERE rk <= 5
ORDER BY rk;

The result being:

ACTOR_ID  FIRST_NAME  LAST_NAME  COUNT  RK
------------------------------------------
107       GINA        DEGENERES  42     1
102       WALTER      TORN       42     1
198       MARY        KEITEL     40     3
181       MATTHEW     CARREY     39     4
23        SANDRA      KILMER     37     5

Now, if we had another actor with 37 films, that other actor would also appear in this list and the list would contain 6 actors, even if we searched for the TOP 5 (WITH TIES). Cool, eh?

Oracle Specific Solution

Remember we were discussing the Oracle-specific FIRST and LAST functions? Unfortunately, this cannot be used to get the WITH TIES semantics because the aggregation can produce only a single row, not multiple tied rows.

As a workaround, I’ve tried combining LISTAGG() with KEEP, to no avail:

SELECT 
  LISTAGG (first_name, ', ')
    WITHIN GROUP (ORDER BY count(film_id) DESC)
    KEEP (DENSE_RANK FIRST ORDER BY count(film_id) DESC),
...

While this could have produced all the ties in a comma separated list of values, this syntax is simply not allowed.

Top Values Per Category

Now, this is the coolest! So far, we’ve run single queries getting the single TOP N values over our entire data set. I.e. the actor with the most films.

Imagine, however, we’d like to get the TOP N something per actor. For instance, the TOP 3 most successful films an actor played in. That would be quite a query. To keep it simple on the Sakila database, let’s simply find the TOP 3 films with the longest titles per actor.

This will again use LIMIT, but this time, we need to do it in a subquery. The two tools we’ve seen so far won’t really work well:

  • Derived tables (subqueries in the FROM clause) cannot implement the per actor semantics easily, at least not with LIMIT. It’s possible with window functions as we’ll see later.
  • Correlated subqueries (subqueries in the SELECT or WHERE clauses) can only return one row and one column. Not quite useful when we want to return, e.g. the TOP 3 rows.

Luckily, the SQL standard specifies LATERAL (implemented by Oracle 12c, PostgreSQL, DB2) and SQL Server has always had APPLY (also available in Oracle 12c). Both syntaxes are exactly equivalent. Let’s look at APPLY first, as I much prefer this syntax:

SELECT actor_id, first_name, last_name, title
FROM actor a
OUTER APPLY (
  SELECT title
  FROM film f
  JOIN film_actor fa USING (film_id)
  WHERE fa.actor_id = a.actor_id
  ORDER BY length(title) DESC
  FETCH FIRST 3 ROWS ONLY
) t
ORDER BY actor_id, length(title) DESC;

This will produce:

ACTOR_ID  FIRST_NAME  LAST_NAME  TITLE
------------------------------------------------------
1         PENELOPE    GUINESS    BULWORTH COMMANDMENTS
1         PENELOPE    GUINESS    ANACONDA CONFESSIONS
1         PENELOPE    GUINESS    GLEAMING JAWBREAKER
2         NICK        WAHLBERG   GOODFELLAS SALUTE
2         NICK        WAHLBERG   DESTINY SATURDAY
2         NICK        WAHLBERG   ADAPTATION HOLES
3         ED          CHASE      ARTIST COLDBLOODED
3         ED          CHASE      NECKLACE OUTBREAK
3         ED          CHASE      BOONDOCK BALLROOM
...

Cool. How does it work?

Think of the derived table as a function. A function with a single argument:

SELECT actor_id, first_name, last_name, title
FROM actor a
OUTER APPLY (
  SELECT title
  FROM film f
  JOIN film_actor fa USING (film_id)
  WHERE fa.actor_id = a.actor_id -- this is the function argument
  ORDER BY length(title) DESC
  FETCH FIRST 3 ROWS ONLY
) t
ORDER BY actor_id, length(title) DESC;

Or, if we would be storing this as an actual table valued function, e.g. in PostgreSQL syntax:

CREATE FUNCTION top_3_films_per_actor(p_actor_id IN INTEGER)
RETURNS TABLE (
  title VARCHAR(50)
)
AS $$
  SELECT title
  FROM film f
  JOIN film_actor fa USING (film_id)
  WHERE fa.actor_id = p_actor_id
  ORDER BY length(title) DESC
  LIMIT 3
$$ LANGUAGE sql;

Table valued functions are pretty cool. They work like parameterised views in case your database can inline it. DB2 and SQL Server can, Oracle 12c and PostgreSQL 9.6 cannot. It’s not a surprise in Oracle, which doesn’t support SQL functions, only PL/SQL functions.

The function can now be used as such:

SELECT actor_id, first_name, last_name, title
FROM actor a
LEFT JOIN LATERAL top_3_films_per_actor(a.actor_id) t 
ON 1 = 1 -- Silly predicate is required because of LEFT JOIN
ORDER BY actor_id, length(title) DESC;

Note that the LATERAL syntax is exactly the same as APPLY with a bit more syntactic ceremony, especially when used with OUTER JOIN.

Now, normally, we aren’t allowed to do this. We aren’t allowed to access the column A.ACTOR_ID from within a derived table or function expression. Inside the derived table, the table A is not defined, and the outer scope is not accessible either.

APPLY (and LATERAL) changes this. With these tools, we can now access all the tables and their columns to the left of the APPLY (or LATERAL) operator. This means our subquery now works like a correlated subquery, but it is allowed to return:

  • More than one row
  • More than one column

That’s really cool! If you’re working with Java 8 streams, think of it as the equivalent of the Stream.flatMap() operation. If we wanted to write this query in Java with streams, we could write something like:

actors
  .stream()
  .flatMap(a -> films
    .stream()
    .filter(f -> f.hasActor(a)) // Assuming some API
    .sorted(comparing(f -> f.title.length()).reversed())
    .limit(3)
    .map(f -> tuple(a, f)));

This would be more or less the same, with a few simplifications. E.g. the FILM_ACTOR JOIN has been cheated away with an auxiliary method Film.hasActor(Actor).

So, following the flatMap() “metaphor”, APPLY applies a table function (e.g. a subquery) to another table (in our case: ACTOR). That function produces a table for each record of the ACTOR table. In our case, we chose OUTER APPLY (instead of CROSS APPLY) because, like a LEFT JOIN, it will keep the actors without films in the result set – something that isn’t as easy to do with flatMap(), which corresponds to CROSS APPLY.


WITH TIES Semantics

What if we wanted to list the TOP 3 films by their longest title per actor WITH TIES? I.e. if there are several films of equal length, we might get 4 or 5 or more films for any given actor?

In databases that support the WITH TIES syntax, this is really simple. In Oracle, just replace ROWS ONLY by ROWS WITH TIES:

SELECT actor_id, first_name, last_name, title
FROM actor a
OUTER APPLY (
  SELECT title
  FROM film f
  JOIN film_actor fa USING (film_id)
  WHERE fa.actor_id = a.actor_id
  ORDER BY length(title) DESC
  FETCH FIRST 3 ROWS WITH TIES
) t
ORDER BY actor_id, length(title) DESC;

In SQL Server, add the clause to the TOP clause:

SELECT actor_id, first_name, last_name, title
FROM actor a
OUTER APPLY (
  SELECT TOP 3 WITH TIES title
  FROM film f
  JOIN film_actor fa ON f.film_id = fa.film_id
  WHERE fa.actor_id = a.actor_id
  ORDER BY len(title) DESC
) t
ORDER BY actor_id, len(title) DESC;

In both cases, we’re now getting:

ACTOR_ID  FIRST_NAME  LAST_NAME  TITLE
------------------------------------------------------
1         PENELOPE    GUINESS    BULWORTH COMMANDMENTS
1         PENELOPE    GUINESS    ANACONDA CONFESSIONS
1         PENELOPE    GUINESS    GLEAMING JAWBREAKER
1         PENELOPE    GUINESS    WESTWARD SEABISCUIT
2         NICK        WAHLBERG   GOODFELLAS SALUTE
2         NICK        WAHLBERG   ADAPTATION HOLES
2         NICK        WAHLBERG   WARDROBE PHANTOM
2         NICK        WAHLBERG   RUSHMORE MERMAID
2         NICK        WAHLBERG   HAPPINESS UNITED
2         NICK        WAHLBERG   FIGHT JAWBREAKER
2         NICK        WAHLBERG   DESTINY SATURDAY
3         ED          CHASE      ARTIST COLDBLOODED
3         ED          CHASE      NECKLACE OUTBREAK
3         ED          CHASE      BOONDOCK BALLROOM
...

Quite a different result!

If we don’t have native support for WITH TIES, then we can use RANK() again, and this time, we don’t need APPLY or LATERAL:

SELECT actor_id, first_name, last_name, title
FROM actor a
LEFT JOIN (
  SELECT 
    actor_id,
    title, 
    rank() OVER (
      PARTITION BY actor_id
      ORDER BY length(title) DESC
    ) rk
  FROM film f
  JOIN film_actor fa USING (film_id)
) t USING (actor_id)
WHERE rk <= 3
ORDER BY actor_id, rk;

What are we doing here, compared to the APPLY version of the query?

  • First off, we’re no longer filtering anything in the subquery. The original WHERE clause that accessed the outer A.ACTOR_ID column is gone
  • … instead, we’re using that ACTOR_ID column in an ordinary JOIN predicate between the ACTOR table and our derived table that produces the films
  • We calculate the RANK() of films per actor. Unlike before, where we calculated the RANK() over the entire data set (the default PARTITION in window functions is always the entire data set), we now partition our data set by ACTOR_ID. So, for each actor, we’re getting the TOP 3 (and 4 and 5 and 6, …) films pre-calculated
  • … only then, in the outer query, we can filter again by this pre-calculated rank, hoping that the database will be smart enough and somehow push down the predicate into the ranking function.
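The per-category ranking logic itself is nothing exotic. Here’s the same idea in Python, computing a RANK()-style rank (ties share a rank) per group and filtering on it (the data and function names are illustrative):

```python
from itertools import groupby

def top_n_per_group(rows, group_key, order_key, n):
    """Per group, keep the rows whose RANK() (ties share a rank,
    gaps follow) is at most n."""
    kept = []
    for _, group in groupby(sorted(rows, key=group_key), key=group_key):
        ordered = sorted(group, key=order_key, reverse=True)
        rank, prev = 0, object()
        for i, row in enumerate(ordered, start=1):
            k = order_key(row)
            if k != prev:
                rank, prev = i, k  # new key value: rank jumps to i
            if rank <= n:
                kept.append(row)
    return kept

films = [(1, 'BULWORTH COMMANDMENTS'), (1, 'ANACONDA CONFESSIONS'),
         (1, 'ACE GOLDFINGER'), (2, 'GOODFELLAS SALUTE')]
print(top_n_per_group(films, group_key=lambda r: r[0],
                      order_key=lambda r: len(r[1]), n=2))
# [(1, 'BULWORTH COMMANDMENTS'), (1, 'ANACONDA CONFESSIONS'), (2, 'GOODFELLAS SALUTE')]
```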

Which is the faster solution? NEVER GUESS! ALWAYS MEASURE! :)

Benchmarking time

As a matter of fact, FETCH is just syntax sugar for filtering on window functions in Oracle. So the two queries should really perform equally well. I’m using the benchmarking technique from this article here.

Here’s the full benchmark code for Oracle 12.2.0.1.0:

SET SERVEROUTPUT ON
CREATE TABLE results (
  run     NUMBER(2),
  stmt    NUMBER(2),
  elapsed NUMBER
);

DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 100;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT actor_id, first_name, last_name, title
        FROM actor a
        OUTER APPLY (
          SELECT title
          FROM film f
          JOIN film_actor fa USING (film_id)
          WHERE fa.actor_id = a.actor_id
          ORDER BY length(title) DESC
          FETCH FIRST 3 ROWS WITH TIES
        ) t
        ORDER BY actor_id, length(title) DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
  
    INSERT INTO results VALUES (r, 1, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT actor_id, first_name, last_name, title
        FROM actor a
        LEFT JOIN (
          SELECT 
            actor_id,
            title, 
            rank() OVER (
              PARTITION BY actor_id
              ORDER BY length(title) DESC
            ) rk
          FROM film f
          JOIN film_actor fa USING (film_id)
        ) t USING (actor_id)
        WHERE rk <= 3
        ORDER BY actor_id, rk
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    INSERT INTO results VALUES (r, 2, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
  END LOOP;
  
  FOR rec IN (
    SELECT 
      run, stmt, 
      CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio 
    FROM results
  )
  LOOP
    dbms_output.put_line('Run ' || rec.run || 
      ', Statement ' || rec.stmt || 
      ' : ' || rec.ratio);
  END LOOP;
END;
/

DROP TABLE results;

Yielding:

Run 1, Statement 1 :  9.19943
Run 1, Statement 2 :  1.07469
Run 2, Statement 1 : 12.86258
Run 2, Statement 2 :  1.03741
Run 3, Statement 1 : 13.80538
Run 3, Statement 2 :  1.09832
Run 4, Statement 1 : 13.91985
Run 4, Statement 2 :  1.16206
Run 5, Statement 1 :  9.37335
Run 5, Statement 2 :  1

Results are relative to each other, not absolute (as I understand it, this is necessary in order to comply with the Oracle license regarding the publication of benchmark results – no actual times are published). Statement 2 (using explicit ranking) is consistently at least 9x faster than statement 1 (using FETCH) on Oracle 12.2.0.1.0! Bummer!

What about SQL Server 2014?

Benchmarking logic:

DECLARE @ts DATETIME;
DECLARE @repeat INT = 100;
DECLARE @r INT;
DECLARE @i INT;
DECLARE @dummy1 INT;
DECLARE @dummy2 VARCHAR(255); -- VARCHAR without a length would default
DECLARE @dummy3 VARCHAR(255); -- to 1 character here, silently truncating
DECLARE @dummy4 VARCHAR(255); -- the fetched values

DECLARE @s1 CURSOR;
DECLARE @s2 CURSOR;

DECLARE @results TABLE (
  run     INT,
  stmt    INT,
  elapsed DECIMAL
);

SET @r = 0;
WHILE @r < 5
BEGIN
  SET @r = @r + 1

  SET @s1 = CURSOR FOR
    SELECT actor_id, first_name, last_name, title
    FROM actor a
    OUTER APPLY (
      SELECT TOP 3 WITH TIES title
      FROM film f
      JOIN film_actor fa ON f.film_id = fa.film_id
      WHERE fa.actor_id = a.actor_id
      ORDER BY len(title) DESC
    ) t
    ORDER BY actor_id, len(title) DESC;

  SET @s2 = CURSOR FOR 
    SELECT a.actor_id, first_name, last_name, title
    FROM actor a
    LEFT JOIN (
      SELECT 
        actor_id,
        title, 
        rank() OVER (
          PARTITION BY actor_id
          ORDER BY len(title) DESC
        ) rk
      FROM film f
      JOIN film_actor fa ON f.film_id = fa.film_id
    ) t ON a.actor_id = t.actor_id
    WHERE rk <= 3
    ORDER BY a.actor_id, rk

  SET @ts = current_timestamp;
  SET @i = 0;
  WHILE @i < @repeat
  BEGIN
    SET @i = @i + 1

    OPEN @s1;
    FETCH NEXT FROM @s1 INTO @dummy1, @dummy2, @dummy3, @dummy4;
    WHILE @@FETCH_STATUS = 0
    BEGIN
      FETCH NEXT FROM @s1 INTO @dummy1, @dummy2, @dummy3, @dummy4;
    END;

    CLOSE @s1;
  END;

  DEALLOCATE @s1;
  INSERT INTO @results 
  VALUES (@r, 1, DATEDIFF(ms, @ts, current_timestamp));

  SET @ts = current_timestamp;
  SET @i = 0;
  WHILE @i < @repeat
  BEGIN
    SET @i = @i + 1

    OPEN @s2;
    FETCH NEXT FROM @s2 INTO @dummy1, @dummy2, @dummy3, @dummy4;
    WHILE @@FETCH_STATUS = 0
    BEGIN
      FETCH NEXT FROM @s2 INTO @dummy1, @dummy2, @dummy3, @dummy4;
    END;

    CLOSE @s2;
  END;

  DEALLOCATE @s2;
  INSERT INTO @results 
  VALUES (@r, 2, DATEDIFF(ms, @ts, current_timestamp));
END;

SELECT 'Run ' + CAST(run AS VARCHAR) + ', Statement ' + 
  CAST(stmt AS VARCHAR) + ': ' + 
  CAST(CAST(elapsed / MIN(elapsed) OVER() AS DECIMAL(10, 5)) AS VARCHAR)
FROM @results;

Surprisingly, SQL Server also shows worse performance with the APPLY approach than with window functions, although the difference is smaller (I’m using SQL Server 2014):

Run 1, Statement 1: 5.07019
Run 1, Statement 2: 1.11988
Run 2, Statement 1: 5.48820
Run 2, Statement 2: 1.20683
Run 3, Statement 1: 5.08882
Run 3, Statement 2: 1.31429
Run 4, Statement 1: 5.31863
Run 4, Statement 2: 1.00000
Run 5, Statement 1: 5.07453
Run 5, Statement 2: 1.21491

I would have expected SQL Server to be much better here, given that this syntax has been available for a long time in SQL Server. Challenge to the reader: Why do both databases perform so poorly with CROSS APPLY? I’ll explain Oracle’s problem in a bit.

Let’s look at PostgreSQL 9.6

The “WITH TIES” semantics can only be implemented with window functions in PostgreSQL, so let’s stick to the “ROWS” semantics of the LIMIT clause.
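For completeness, if WITH TIES were needed per actor in PostgreSQL 9.6, a sketch would be to replace LIMIT by a RANK() filter inside the lateral subquery, since RANK() retains all rows that tie on the sort key (this variant is not part of the benchmark below):

```sql
-- WITH TIES emulation: filter on RANK() instead of using LIMIT 3
SELECT actor_id, first_name, last_name, title
FROM actor a
LEFT JOIN LATERAL (
  SELECT title,
         rank() OVER (ORDER BY length(title) DESC) rk
  FROM film f
  JOIN film_actor fa USING (film_id)
  WHERE fa.actor_id = a.actor_id
) t ON t.rk <= 3
ORDER BY actor_id, length(title) DESC;
```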

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 100;
  rec RECORD;
  run INT[];
  stmt INT[];
  elapsed DECIMAL[];
  max_elapsed DECIMAL;
  i INT := 1;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT actor_id, first_name, last_name, title
        FROM actor a
        LEFT JOIN LATERAL (
          SELECT title
          FROM film f
          JOIN film_actor fa USING (film_id)
          WHERE fa.actor_id = a.actor_id
          ORDER BY length(title) DESC
          LIMIT 3
        ) t ON true
        ORDER BY actor_id, length(title) DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    run[i] := r;
    stmt[i] := 1;
    elapsed[i] := 
       (EXTRACT(EPOCH FROM CAST(clock_timestamp() AS TIMESTAMP)) 
      - EXTRACT(EPOCH FROM v_ts));
    i := i + 1;
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT actor_id, first_name, last_name, title
        FROM actor a
        LEFT JOIN (
          SELECT 
            actor_id,
            title, 
            row_number() OVER (
              PARTITION BY actor_id
              ORDER BY length(title) DESC
            ) rn
          FROM film f
          JOIN film_actor fa USING (film_id)
        ) t USING (actor_id)
        WHERE rn <= 3
        ORDER BY actor_id, rn
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    run[i] := r;
    stmt[i] := 2;
    elapsed[i] := 
       (EXTRACT(EPOCH FROM CAST(clock_timestamp() AS TIMESTAMP)) 
      - EXTRACT(EPOCH FROM v_ts));
    i := i + 1;
  END LOOP;

  SELECT max(t.elapsed)
  INTO max_elapsed
  FROM unnest(elapsed) AS t(elapsed);

  FOR i IN 1..array_length(run, 1) LOOP
    RAISE INFO 'RUN %, Statement %: %', run[i], stmt[i], 
      CAST(elapsed[i] / max_elapsed AS DECIMAL(10, 5));
  END LOOP;
END$$;

This time, the results are more even. It doesn’t really seem to matter which approach you pick:

00000: RUN 1, Statement 1: 0.75904
00000: RUN 1, Statement 2: 0.99784
00000: RUN 2, Statement 1: 0.75907
00000: RUN 2, Statement 2: 0.95206
00000: RUN 3, Statement 1: 0.82102
00000: RUN 3, Statement 2: 0.94841
00000: RUN 4, Statement 1: 0.96208
00000: RUN 4, Statement 2: 0.96218
00000: RUN 5, Statement 1: 0.83398
00000: RUN 5, Statement 2: 1.00000

This can have two reasons:

  • Optimistic: LATERAL is better optimised in PostgreSQL
  • Pessimistic: window functions are less well optimised in PostgreSQL

From experience with tuning PostgreSQL queries, I’ll go for the pessimistic guess, because indeed, there is no hint in the execution plan about any optimisation being done on the window function predicate:

Sort  (cost=911.26..915.81 rows=1821 width=40)
  Sort Key: a.actor_id, t.rn
  -> Hash Join  (cost=596.44..812.65 rows=1821 width=40)
     Hash Cond: (t.actor_id = a.actor_id)
     -> Subquery Scan on t  (cost=589.94..781.11 rows=1821 width=25)
        Filter: (t.rn <= 3)
        -> WindowAgg  (cost=589.94..712.83 rows=5462 width=29)
           -> Sort  (cost=589.94..603.59 rows=5462 width=21)
              Sort Key: fa.actor_id, (length((f.title)::text)) DESC
              -> Hash Join  (cost=77.50..250.88 rows=5462 width=21)
                 Hash Cond: (fa.film_id = f.film_id)
                 -> Seq Scan on film_actor fa (cost=0.0..84.62 rows=5462 width=4)
                 -> Hash  (cost=65.00..65.00 rows=1000 width=19)
                    ->  Seq Scan on film f  (cost=0.00..65.00 rows=1000 width=19)
     -> Hash  (cost=4.00..4.00 rows=200 width=17)
        -> Seq Scan on actor a  (cost=0.00..4.00 rows=200 width=17)

What about DB2 10.5?

BEGIN
  DECLARE CONTINUE HANDLER FOR SQLSTATE '42710' BEGIN END;
  EXECUTE IMMEDIATE 'CREATE TABLE print_relative (run INTEGER, stmt INTEGER, elapsed DECIMAL(20, 4))';
END 

BEGIN
  DECLARE v_ts TIMESTAMP;
  DECLARE v_repeat INTEGER DEFAULT 100;
  DECLARE v_i INTEGER;
  DECLARE v_j INTEGER;

  -- Repeat benchmark several times to avoid warmup penalty
  SET v_i = 1;
  
  DELETE FROM print_relative;
  
  REPEAT
    SET v_j = 1;
    SET v_ts = CURRENT_TIMESTAMP;
    
    REPEAT
      FOR rec AS cur CURSOR FOR
        SELECT actor_id, first_name, last_name, title
        FROM actor a
        LEFT JOIN LATERAL (
          SELECT title
          FROM film f
          JOIN film_actor fa ON f.film_id = fa.film_id
          WHERE fa.actor_id = a.actor_id
          ORDER BY length(title) DESC
          FETCH FIRST 3 ROWS ONLY
        ) t ON 1 = 1
        ORDER BY actor_id, length(title) DESC
      DO
        BEGIN END;
      END FOR;
    
      SET v_j = v_j + 1;
      UNTIL v_j = v_repeat
    END REPEAT;
      
    INSERT INTO print_relative 
    VALUES (v_i, 1, (CURRENT_TIMESTAMP - v_ts));
    
    SET v_j = 1;
    SET v_ts = CURRENT_TIMESTAMP;
    
    REPEAT
      FOR rec AS cur CURSOR FOR
        SELECT a.actor_id, first_name, last_name, title
        FROM actor a
        LEFT JOIN (
          SELECT 
            actor_id,
            title, 
            row_number() OVER (
              PARTITION BY actor_id
              ORDER BY length(title) DESC
            ) rn
          FROM film f
          JOIN film_actor fa ON f.film_id = fa.film_id
        ) t ON t.actor_id = a.actor_id
        WHERE rn <= 3
        ORDER BY a.actor_id, rn
      DO
        BEGIN END;
      END FOR;
      
      SET v_j = v_j + 1;
      UNTIL v_j = v_repeat
    END REPEAT;

    INSERT INTO print_relative 
    VALUES (v_i, 2, (CURRENT_TIMESTAMP - v_ts));
    
    SET v_i = v_i + 1;
    UNTIL v_i = 5
  END REPEAT;
END

SELECT
  run, 
  stmt,
  CAST(elapsed / MIN(elapsed) OVER() AS DECIMAL(20, 4)) ratio
FROM print_relative;

DROP TABLE print_relative;

The result: again, window functions outperform the LATERAL join:

RUN  STMT  RATIO
---  ----  ------
1    1     6.0353
1    2     1.0783
2    1     6.2549
2    2     1.0093
3    1     6.1067
3    2     1.0000
4    1     6.1987
4    2     1.0128

Explanation for benchmarks

For this explanation, I’m looking at Oracle execution plans only – showing what kind of rationale could be derived from their plans. I’m assuming that similar problems may appear in the other 3 databases.

Here’s the plan for the window function query:

The following is the execution plan obtained through

SELECT * FROM TABLE (
  dbms_xplan.display_cursor(format => 'ALLSTATS LAST')
)

It displays the actual execution plan with gathered statistics (using the /*+GATHER_PLAN_STATISTICS*/ hint), not the estimated plan.
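For reference, a minimal sketch of how such a plan is produced (both statements must run in the same session):

```sql
-- 1. Execute the statement with runtime statistics gathering enabled:
SELECT /*+GATHER_PLAN_STATISTICS*/ actor_id, first_name, last_name
FROM actor;

-- 2. Display the actual plan of the most recently executed cursor:
SELECT * FROM TABLE (
  dbms_xplan.display_cursor(format => 'ALLSTATS LAST')
);
```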

----------------------------------------------------------------------
| Id  | Operation                  | Name          | Starts | A-Rows |
----------------------------------------------------------------------
|   0 | SELECT STATEMENT           |               |      1 |    767 |
|   1 |  SORT ORDER BY             |               |      1 |    767 |
|*  2 |   HASH JOIN                |               |      1 |    767 |
|   3 |    TABLE ACCESS FULL       | ACTOR         |      1 |    200 |
|*  4 |    VIEW                    |               |      1 |    767 |
|*  5 |     WINDOW SORT PUSHED RANK|               |      1 |   1079 |
|*  6 |      HASH JOIN             |               |      1 |   5463 |
|   7 |       TABLE ACCESS FULL    | FILM          |      1 |   1000 |
|   8 |       INDEX FAST FULL SCAN | PK_FILM_ACTOR |      1 |   5463 |
----------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("A"."ACTOR_ID"="T"."ACTOR_ID")
   4 - filter("T"."RK"<=3)
   5 - filter(RANK() OVER ( PARTITION BY "FA"."ACTOR_ID" 
         ORDER BY LENGTH("F"."TITLE") DESC )<=3)
   6 - access("F"."FILM_ID"="FA"."FILM_ID")

As can be seen, there’s some initial work of calculating the rank for most FILM_ACTOR rows from the nested query. I say most because the PUSHED RANK in the WINDOW SORT PUSHED RANK operation seems to indicate that the rank is calculated only as long as it is needed, i.e. until the ranking predicate rk <= 3 is no longer true.

In particular, this also means that we do not necessarily execute the entire sort, but can keep a bounded buffer of the current TOP 3, which can be maintained in O(N) for a constant buffer size, rather than the full sort’s O(N log N). Whether Oracle really does this, I don’t know.

In any case, the result of the ranking operation is then efficiently hash joined with the ACTOR table.

What’s most interesting, though, is the “Starts” column, which indicates how many times an operation has been started. Each operation is started only once!

What about APPLY?

The plan is much worse:

---------------------------------------------------------------------------
| Id  | Operation                     | Name            | Starts | A-Rows |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                 |      1 |    767 |
|   1 |  SORT ORDER BY                |                 |      1 |    767 |
|   2 |   MERGE JOIN OUTER            |                 |      1 |    767 |
|   3 |    TABLE ACCESS FULL          | ACTOR           |      1 |    200 |
|   4 |    BUFFER SORT                |                 |    200 |    767 |
|   5 |     VIEW                      | VW_LAT_14BC7596 |    200 |    767 |
|   6 |      VIEW                     | VW_LAT_A18161FF |    200 |    767 |
|*  7 |       VIEW                    |                 |    200 |    767 |
|*  8 |        WINDOW SORT PUSHED RANK|                 |    200 |   1079 |
|   9 |         NESTED LOOPS          |                 |    200 |   5463 |
|  10 |          TABLE ACCESS FULL    | FILM            |    200 |    200K|
|* 11 |          INDEX UNIQUE SCAN    | PK_FILM_ACTOR   |    200K|   5463 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   7 - filter("from$_subquery$_008"."rowlimit_$$_rank"<=3)
   8 - filter(RANK() OVER ( ORDER BY LENGTH("F"."TITLE") DESC )<=3)
  11 - access("FA"."ACTOR_ID"="A"."ACTOR_ID" AND "F"."FILM_ID"="FA"."FILM_ID")

Observe the “Starts” column. On Id 4, we’re experiencing 200 executions of the BUFFER SORT operation. That’s because there are exactly 200 rows in the ACTOR table (as can be seen on Id 3). So, indeed, there’s no optimisation happening that would calculate the “laterally correlated subquery” for all actors in one go.

Furthermore, quite unfortunately, the JOIN order between FILM and FILM_ACTOR seems quite wrong. For each of the 200 actors, we’re loading the entire 1000 films, which produces 200 * 1000 = 200K rows to scan for those belonging to the single actor from the outer query (Id 11). That’s unreasonably repetitive.

I’d expect Oracle to invert the table accesses. We can do this with hints:

SELECT actor_id, first_name, last_name, title
FROM actor a
OUTER APPLY (
  SELECT /*+LEADING(fa f) USE_NL(fa f)*/ title
  FROM film f
  JOIN film_actor fa USING (film_id)
  WHERE actor_id = a.actor_id
  ORDER BY length(title) DESC
  FETCH FIRST 3 ROWS WITH TIES
) t
ORDER BY actor_id, length(title) DESC;

Observe the two hints:

  • LEADING to indicate that the FILM_ACTOR table should be the driving row source of the join
  • USE_NL to enforce a NESTED LOOP JOIN (without this, and with LEADING, Oracle would prefer a HASH JOIN)

With the hints, the benchmark results look a lot better:

Statement 1 : 1          (window functions)
Statement 2 : 9.17483    (APPLY without hints)
Statement 3 : 4.88774    (APPLY with LEADING hint)
Statement 4 : 1.65269    (APPLY with LEADING and USE_NL hints)

Also, the execution plan now no longer has those excessive “Starts” and “A-Rows” values:

--------------------------------------------------------------------------------
| Id  | Operation                           | Name           | Starts | A-Rows |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                |      1 |     50 |
|   1 |  SORT ORDER BY                      |                |      1 |     50 |
|   2 |   MERGE JOIN OUTER                  |                |      1 |    767 |
|   3 |    TABLE ACCESS FULL                | ACTOR          |      1 |    200 |
|   4 |    BUFFER SORT                      |                |    200 |    767 |
|   5 |     VIEW                            | VW_LAT_2880E7C5|    200 |    767 |
|   6 |      VIEW                           | VW_LAT_A18161FF|    200 |    767 |
|*  7 |       VIEW                          |                |    200 |    767 |
|*  8 |        WINDOW SORT PUSHED RANK      |                |    200 |   1079 |
|   9 |         NESTED LOOPS                |                |    200 |   5463 |
|  10 |          NESTED LOOPS               |                |    200 |   5463 |
|* 11 |           INDEX RANGE SCAN          | PK_FILM_ACTOR  |    200 |   5463 |
|* 12 |           INDEX UNIQUE SCAN         | PK_FILM        |   5463 |   5463 |
|  13 |          TABLE ACCESS BY INDEX ROWID| FILM           |   5463 |   5463 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   7 - filter("from$_subquery$_006"."rowlimit_$$_rank"<=3)
   8 - filter(RANK() OVER ( ORDER BY LENGTH("F"."TITLE") DESC )<=3)
  11 - access("FA"."ACTOR_ID"="A"."ACTOR_ID")
  12 - access("F"."FILM_ID"="FA"."FILM_ID")

The highest number of “A-Rows” is easily explained. There are exactly 5463 rows in FILM_ACTOR. Still, I’d wish that the 200 TOP-N lookups per actor could be optimised somehow.

Interestingly, this plan is associated with a much higher cost estimate than the original one, even though in this case, it is much better.

Conclusion

TOP N queries are very common; we need them all the time. The simplest way to write one is with an ORDER BY clause and a LIMIT clause.

Sometimes, we need WITH TIES semantics, and Oracle 12c as well as SQL Server can provide this with standard or vendor specific syntax, out of the box. Other databases can emulate WITH TIES using window functions rather easily.

When we need TOP N per category, an interesting and cool syntax comes in handy: APPLY or LATERAL. However, very unfortunately, simple benchmarks have shown that these can be much slower than their equivalent window function counterparts. Of course, this will heavily depend on the size of the data set. Do not underestimate the cost of O(N log N) sorts that occur in window functions. As always: Do measure, never guess.

How to Generate at Least One Row in SQL

There are some situations where you would like to have at least one (empty) row in your result set in SQL.

Imagine the following situation. We’re querying the Sakila database for actors and their films:

SELECT first_name, last_name, title
FROM actor
JOIN film_actor USING (actor_id)
JOIN film USING (film_id)
ORDER BY 1, 2, 3

yielding something like:

+------------+-----------+---------------------+
| FIRST_NAME | LAST_NAME | TITLE               |
+------------+-----------+---------------------+
| ...        | ...       | ...                 |
| ADAM       | GRANT     | SEABISCUIT PUNK     |
| ADAM       | GRANT     | SPLENDOR PATTON     |
| ADAM       | GRANT     | TADPOLE PARK        |
| ADAM       | GRANT     | TWISTED PIRATES     |
| ADAM       | GRANT     | WANDA CHAMBER       |
| ADAM       | HOPPER    | BLINDNESS GUN       |
| ADAM       | HOPPER    | BLOOD ARGONAUTS     |
| ADAM       | HOPPER    | CHAMBER ITALIAN     |
| ...        | ...       | ...                 |
+------------+-----------+---------------------+

Now, let’s find actors called SUSAN, and in fact, let’s not care if they played in any films (I’ve added them to the Sakila database for the sake of the example):

SELECT actor_id, first_name, last_name, title
FROM actor
LEFT JOIN film_actor USING (actor_id)
LEFT JOIN film USING (film_id)
WHERE first_name = 'SUSAN'
ORDER BY 1, 2, 3

Interesting, there are now two actors without any films:

+----------+------------+-----------+-----------------+
| ACTOR_ID | FIRST_NAME | LAST_NAME | TITLE           |
+----------+------------+-----------+-----------------+
| 110      | SUSAN      | DAVIS     | TROJAN TOMORROW |
| 110      | SUSAN      | DAVIS     | WASH HEAVENLY   |
| 110      | SUSAN      | DAVIS     | WORDS HUNTER    |
| 201      | SUSAN      | DAVIS     |                 |
| 202      | SUSAN      | SMITH     |                 |
+----------+------------+-----------+-----------------+

This worked, because I have changed the JOIN type from INNER JOIN to LEFT JOIN. That’s neat. But what if we hadn’t found any actor called SUSAN? What if we were looking for SUSANNE instead?

SELECT actor_id, first_name, last_name, title
FROM actor
LEFT JOIN film_actor USING (actor_id)
LEFT JOIN film USING (film_id)
WHERE first_name = 'SUSANNE'
ORDER BY 1, 2, 3

Empty. Void. Nothing:

+----------+------------+-----------+-----------------+
| ACTOR_ID | FIRST_NAME | LAST_NAME | TITLE           |
+----------+------------+-----------+-----------------+

That’s fine in most cases, because the way we wrote this query, we’re expecting:

All actors called Susanne and their films, if any

But what if we wanted to have the same behaviour as we got for Films through LEFT JOIN also with the actors? I.e. if we wanted this, instead (i.e. a collection with 1..N cardinality):

+----------+------------+-----------+-----------------+
| ACTOR_ID | FIRST_NAME | LAST_NAME | TITLE           |
|          |            |           |                 |
+----------+------------+-----------+-----------------+

What would you need this for? Well, sometimes, we can simply not handle the depressing sadness of emptiness.

How to do this? We need another LEFT JOIN prepended to the ACTOR table, but not just to the ACTOR table itself, we need to prepend it to everything. E.g. like this:

SELECT actor_id, first_name, last_name, title

-- This dummy table will always generate exactly one row
FROM (
  SELECT 1 a
) a

-- This is the original query, including the WHERE clause
LEFT JOIN (
  SELECT *
  FROM actor
  LEFT JOIN film_actor USING (actor_id)
  LEFT JOIN film USING (film_id)
  WHERE first_name = 'SUSANNE'
) b ON 1 = 1
ORDER BY 1, 2, 3

The above query is guaranteed to produce at least one row, because the left side of the LEFT JOIN always produces exactly one row, which is joined to every row on the right side, if there is any row on the right side.

Caveats:

  • The WHERE clause (and potentially other clauses, like GROUP BY) must now go inside of the new derived table B. Otherwise, we’d be removing that single row from A again using WHERE. (This is because of the order of SQL operations. We must ensure WHERE “happens-before” LEFT JOIN)
  • The LEFT JOIN between A and B needs an ON clause for syntactic reasons, even if we don’t really need that here. Just put something that is always true (like TRUE in PostgreSQL).
  • Our result now has an additional, useless column A, which might bother us, e.g. when using SELECT *

Alternative: OUTER APPLY

If you’re using SQL Server or Oracle 12c, there’s an even more elegant solution using OUTER APPLY:

SQL Server

SELECT actor_id, first_name, last_name, title
FROM (SELECT 1 a) a
OUTER APPLY (
  SELECT a.actor_id, a.first_name, a.last_name, f.title
  FROM actor a
  LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
  LEFT JOIN film f ON fa.film_id = f.film_id
  WHERE first_name = 'SUSANNE'
) b
ORDER BY 1, 2, 3

Oracle 12c

SELECT actor_id, first_name, last_name, title
FROM (SELECT 1 a FROM dual) a
OUTER APPLY (
  SELECT *
  FROM actor
  LEFT JOIN film_actor USING (actor_id)
  LEFT JOIN film USING (film_id)
  WHERE first_name = 'SUSANNE'
) b
ORDER BY 1, 2, 3

While we don’t actually need APPLY’s correlation feature here, it allows us to omit the ON clause and still get LEFT OUTER semantics. Neat, eh?

Geek bonus

And if you really want to geek out on this functionality, consider using the dee table from the dum/dee PostgreSQL example. Remember, the dee table is a table with exactly one row and no columns! This means we can use SELECT * without getting that useless dummy column!
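In case you don’t have the dee table around, here’s a sketch of how it can be created in PostgreSQL, which allows tables with zero columns (nothing here enforces that the table keeps exactly one row; that’s by convention):

```sql
-- A table with no columns at all:
CREATE TABLE dee ();

-- ... containing a single, entirely empty row:
INSERT INTO dee DEFAULT VALUES;
```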

SELECT *

-- This dummy table will always generate exactly one row
FROM dee

-- This is the original query, including the WHERE clause
LEFT JOIN (
  SELECT *
  FROM actor
  LEFT JOIN film_actor USING (actor_id)
  LEFT JOIN film USING (film_id)
  WHERE first_name = 'SUSANNE'
) b ON 1 = 1
ORDER BY 1, 2, 3

Ahh. Beautiful SQL!

jOOQ Newsletter: December 30, 2013. Happy New Year!

subscribe to the newsletter here

Tweet of the Day

We would like to dedicate this new section of the newsletter to our followers, users, and customers. Here’s Andy Van Den Heuvel, who appreciates jOOQ’s and MyBatis’ (both being “post-JPA” frameworks) return to SQL.

https://twitter.com/andyvdh7/status/416685921288093696

2013 from jOOQ’s perspective

2013 was a very exciting year for jOOQ and jOOQ customers. Not only did we create a company to offer commercial licensing and support, but we have also released the awesome jOOQ 3.0 major release, which brought row-level typesafety to SQL in Java. No competitor product on the market currently lifts SQL to be a first-class citizen in Java as we do. In other words, 2013 was a great year for Java and SQL in general.

After 3.0, great minor releases followed suit:

  • jOOQ 3.1 with support for MariaDB, SQL Server 2012, Oracle 12c, and a new sophisticated SPI to interact with POJO mapping
  • jOOQ 3.2 with new SPIs to interact with record and query rendering lifecycles for advanced SQL transformation. We also introduced a new matcher strategy to allow for fine-grained control over jOOQ’s code generation.

And 2014 won’t be less exciting, as we’re adding support for another popular commercial database: MS Access, besides being the first and only SQL abstraction framework to implement keyset paging in such an intuitive syntax.

In that sense, let us wish you a Happy New Year and a great start to 2014!

Convince Your Manager

It is easy for technical folks like software developers and architects to understand why jOOQ adds value to your projects. You’re the workers and as such, you need the right tools. While there are free alternatives (like Gimp), you’re a database professional who prefers to use the market leader (like Photoshop).

But the market leader costs money, and your manager needs to be convinced that the money is spent wisely. No problem, we told ourselves, and we created this simple presentation to help you convince your managers:

The above presentation compares jOOQ with the industry standard means of accessing RDBMS:

  • JDBC (low-level)
  • JPA (high-level)

This comparison will then be used to explain the return on investment (ROI) that your team will get by using jOOQ.

Upcoming Events

After another welcoming and interesting talk at the JUGS in Berne, we are happy to announce that our talk will now also be hosted in Hamburg, in January 2014:

Here is an overview of our upcoming events

Stay informed about 2014 events on www.jooq.org/news.

SQL Zone – LATERAL joins

Few people may be aware of the SQL:1999 standard LATERAL keyword, which can be applied to table expressions in the FROM clause, in order to allow them to cross-reference other tables within the same FROM clause. This is extremely useful when joining non-scalar table-valued functions to each record of physical tables.

T-SQL (SQL Server and Sybase) users have known this feature for a long time under the vendor-specific CROSS APPLY and OUTER APPLY join keywords. With the recent releases of PostgreSQL 9.3 and Oracle 12c, other databases can now finally also profit from this SQL goodie.

Read more about LATERAL joins and how this will be supported in jOOQ 3.3 in our blog post.

Add LATERAL Joins or CROSS APPLY to Your SQL Tool Chain

The T-SQL dialect has known the powerful CROSS APPLY and OUTER APPLY JOIN syntaxes for ages. The SQL:1999 standard introduced the almost equivalent “lateral derived tables”, which are finally supported in PostgreSQL 9.3 and in Oracle 12c, the latter adopting both the SQL standard LATERAL syntax and the T-SQL vendor-specific CROSS APPLY and OUTER APPLY syntaxes.

But what are we even talking about?

SQL features have a unique trait that few other languages have. They are obscure to those who don’t know them, as every language feature introduces a new syntax with new keywords. In this case: APPLY and LATERAL. But it really isn’t so hard to understand. All you do with a CROSS APPLY is a CROSS JOIN between two tables where the right-hand side of the join expression can reference columns from the left-hand side of the join expression. Consider the following example by Martin Smith on Stack Overflow:

Reusing column aliases

SELECT number,
       doubled_number,
       doubled_number_plus_one
FROM master..spt_values
CROSS APPLY (
  SELECT 2 * CAST(number AS BIGINT)
) CA1(doubled_number)
CROSS APPLY (
  SELECT doubled_number + 1
) CA2(doubled_number_plus_one)

See a SQLFiddle of the above example

In this example, we’re selecting numbers from a system table and cross-applying a scalar subselect that multiplies each number by two. Then, to the whole table product, we cross-apply another scalar subselect, adding one to the doubled number.

This particular example could also be implemented using subqueries in the SELECT clause. But as you can see in the above example, doubled_number_plus_one can be calculated from a previously calculated column in one go. That wouldn’t be so “simple” with subqueries.
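For comparison, here’s a sketch of that subquery-style alternative: each derived expression must repeat the expressions it builds on, because one SELECT expression cannot reference another one’s alias:

```sql
-- The doubling expression has to be spelled out twice:
SELECT number,
       2 * CAST(number AS BIGINT)     AS doubled_number,
       2 * CAST(number AS BIGINT) + 1 AS doubled_number_plus_one
FROM master..spt_values
```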

Applying table-valued functions to each record

SELECT *
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle)

This example may be even more useful when you want to join a table-valued function to each record of another table.

PostgreSQL’s LATERAL derived tables

In PostgreSQL, this can be done somewhat magically by putting table-valued functions in the SELECT clause:

SELECT x, GENERATE_SERIES(0, x)
FROM (VALUES(0), (1), (2)) t(x)

See a SQLFiddle of the above example

The above yields

| X | GENERATE_SERIES |
|---|-----------------|
| 0 |               0 |
| 1 |               0 |
| 1 |               1 |
| 2 |               0 |
| 2 |               1 |
| 2 |               2 |

Alternatively, since PostgreSQL 9.3, you can use an explicit lateral derived table as such:

SELECT x, y
FROM (VALUES(0), (1), (2)) t(x),
LATERAL GENERATE_SERIES(0, t.x) u(y)

See a SQLFiddle of the above example

Yielding again

| X | Y |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 1 | 1 |
| 2 | 0 |
| 2 | 1 |
| 2 | 2 |

CROSS APPLY and OUTER APPLY in jOOQ 3.3

The above clauses will also be supported in the upcoming editions of jOOQ 3.3 where you can write queries like this one here:

DSL.using(configuration)
   .select()
   .from(AUTHOR)
   .crossApply(
        select(count().as("c"))
       .from(BOOK)
       .where(BOOK.AUTHOR_ID.eq(AUTHOR.ID)))
   .fetch();

Or lateral joins:

DSL.using(configuration)
   .select()
   .from(
        values(row(0), row(1), row(2))
            .as("t", "x"),
        lateral(generateSeries(0,
                fieldByName("t", "x"))
            .as("u", "y")))
   .fetch();

No matter if you’re using jOOQ or native SQL, lateral derived tables or CROSS APPLY should definitely be part of your awesome SQL tool chain!

Oracle 12c Goodies: { CROSS | OUTER } APPLY

I can’t believe my eyes. Has this been openly communicated by Oracle? I haven’t seen too many blog posts on the matter. Apart from introducing the awesome SQL standard OFFSET .. FETCH clause (just like SQL Server 2012), Oracle 12c now also seems to support { CROSS | OUTER } APPLY! Check out the documentation:

http://docs.oracle.com/cd/E16655_01/server.121/e17209/statements_10002.htm#BABJHDDA

It’s about time jOOQ hops on the 12c train and supports all of these goodies! Expect to see APPLY support in jOOQ, soon, along with emulation thereof (if that’s possible and easy…)