How to Generate at Least One Row in SQL

There are some situations where you would like to have at least one (empty) row in your result set in SQL.

Imagine the following situation. We’re querying the Sakila database for actors and their films:

SELECT first_name, last_name, title
FROM actor
JOIN film_actor USING (actor_id)
JOIN film USING (film_id)
ORDER BY 1, 2, 3

yielding something like:

+------------+-----------+---------------------+
| FIRST_NAME | LAST_NAME | TITLE               |
+------------+-----------+---------------------+
| ...        | ...       | ...                 |
| ADAM       | GRANT     | SEABISCUIT PUNK     |
| ADAM       | GRANT     | SPLENDOR PATTON     |
| ADAM       | GRANT     | TADPOLE PARK        |
| ADAM       | GRANT     | TWISTED PIRATES     |
| ADAM       | GRANT     | WANDA CHAMBER       |
| ADAM       | HOPPER    | BLINDNESS GUN       |
| ADAM       | HOPPER    | BLOOD ARGONAUTS     |
| ADAM       | HOPPER    | CHAMBER ITALIAN     |
| ...        | ...       | ...                 |
+------------+-----------+---------------------+

Now, let’s find actors called SUSAN, and in fact, let’s not care if they played in any films (I’ve added them to the Sakila database for the sake of the example):

SELECT actor_id, first_name, last_name, title
FROM actor
LEFT JOIN film_actor USING (actor_id)
LEFT JOIN film USING (film_id)
WHERE first_name = 'SUSAN'
ORDER BY 1, 2, 3

Interesting, there are now two actors without any films:

+----------+------------+-----------+-----------------+
| ACTOR_ID | FIRST_NAME | LAST_NAME | TITLE           |
+----------+------------+-----------+-----------------+
| 110      | SUSAN      | DAVIS     | TROJAN TOMORROW |
| 110      | SUSAN      | DAVIS     | WASH HEAVENLY   |
| 110      | SUSAN      | DAVIS     | WORDS HUNTER    |
| 201      | SUSAN      | DAVIS     |                 |
| 202      | SUSAN      | SMITH     |                 |
+----------+------------+-----------+-----------------+

This worked, because I have changed the JOIN type from INNER JOIN to LEFT JOIN. That’s neat. But what if we hadn’t found any actor called SUSAN? What if we were looking for SUSANNE instead?

SELECT actor_id, first_name, last_name, title
FROM actor
LEFT JOIN film_actor USING (actor_id)
LEFT JOIN film USING (film_id)
WHERE first_name = 'SUSANNE'
ORDER BY 1, 2, 3

Empty. Void. Nothing:

+----------+------------+-----------+-----------------+
| ACTOR_ID | FIRST_NAME | LAST_NAME | TITLE           |
+----------+------------+-----------+-----------------+

That’s fine in most cases, because the way we wrote this query, we’re expecting:

All actors called Susanne and their films, if any

But what if we wanted the same behaviour for actors that the LEFT JOIN gave us for films? I.e. what if we wanted this result instead (a collection with 1..N cardinality):

+----------+------------+-----------+-----------------+
| ACTOR_ID | FIRST_NAME | LAST_NAME | TITLE           |
|          |            |           |                 |
+----------+------------+-----------+-----------------+

What would you need this for? Well, sometimes, we can simply not handle the depressing sadness of emptiness.

How to do this? We need another LEFT JOIN prepended not just to the ACTOR table, but to the entire query. E.g. like this:

SELECT actor_id, first_name, last_name, title

-- This dummy table will always generate exactly one row
FROM (
  SELECT 1 a
) a

-- This is the original query, including the WHERE clause
LEFT JOIN (
  SELECT *
  FROM actor
  LEFT JOIN film_actor USING (actor_id)
  LEFT JOIN film USING (film_id)
  WHERE first_name = 'SUSANNE'
) b ON 1 = 1
ORDER BY 1, 2, 3

The above query is guaranteed to produce at least one row, because the left side of the LEFT JOIN always produces exactly one row, which is joined to every row on the right side, if there is any row on the right side.
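
To see the mechanics in isolation, here's a minimal sketch (PostgreSQL syntax; the WHERE 1 = 0 simply stands in for an empty right side):

SELECT *
FROM (SELECT 1 AS a) a
LEFT JOIN (SELECT 1 AS b WHERE 1 = 0) b ON TRUE
-- yields exactly one row: a = 1, b = NULL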

Caveats:

  • The WHERE clause (and potentially other clauses, like GROUP BY) must now go inside of the new derived table B. Otherwise, we’d be removing that single row from A again using WHERE. (This is because of the order of SQL operations. We must ensure WHERE “happens-before” LEFT JOIN)
  • The LEFT JOIN between A and B needs an ON clause for syntactic reasons, even if we don’t really need that here. Just put something that is always true (like TRUE in PostgreSQL).
  • Our result now has an additional, useless column A, which might bother us, e.g. when using SELECT *

Alternative: OUTER APPLY

If you’re using SQL Server or Oracle 12c, there’s an even more elegant solution using OUTER APPLY:

SQL Server

SELECT actor_id, first_name, last_name, title
FROM (SELECT 1 a) a
OUTER APPLY (
  SELECT a.actor_id, a.first_name, a.last_name, f.title
  FROM actor a
  LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
  LEFT JOIN film f ON fa.film_id = f.film_id
  WHERE first_name = 'SUSANNE'
) b
ORDER BY 1, 2, 3

Oracle 12c

SELECT actor_id, first_name, last_name, title
FROM (SELECT 1 a FROM dual) a
OUTER APPLY (
  SELECT *
  FROM actor
  LEFT JOIN film_actor USING (actor_id)
  LEFT JOIN film USING (film_id)
  WHERE first_name = 'SUSANNE'
) b
ORDER BY 1, 2, 3

While we don't actually need the full power of APPLY here, it allows us to omit the ON clause and still get the LEFT OUTER semantics. Neat, eh?

Geek bonus

And if you really want to geek out on this functionality, consider using the dee table from the dum/dee PostgreSQL example. Remember, the dee table is a table with exactly one row and no columns! This means we can use SELECT * without getting that useless dummy column!
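
In case you don't have dee around yet, here's a quick sketch of how it might be created, using PostgreSQL's support for zero-column tables:

CREATE TABLE dee();

-- A single row with all-default (i.e. no) values:
INSERT INTO dee DEFAULT VALUES;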

SELECT *

-- This dummy table will always generate exactly one row
FROM dee

-- This is the original query, including the WHERE clause
LEFT JOIN (
  SELECT *
  FROM actor
  LEFT JOIN film_actor USING (actor_id)
  LEFT JOIN film USING (film_id)
  WHERE first_name = 'SUSANNE'
) b ON 1 = 1
ORDER BY 1, 2, 3

Ahh. Beautiful SQL!

10 Nice Examples of Writing SQL in Kotlin With jOOQ

Kotlin is the next big thing. With Google announcing official support for Kotlin on Android, we’ll see a lot more traction for this lovely language.

We’ve already blogged about the Kotlin language recently: 10 Features I Wish Java Would Steal From the Kotlin Language.

Should you migrate your application to Kotlin? Perhaps – there are many questions. The most important question is (of course): What are you going to do with your awesome SQL code?

The answer is: Use jOOQ! Need convincing? Here are 10 nice examples of writing SQL in Kotlin with jOOQ:

(Note: All the code is available on GitHub)

1. Using Kotlin’s Map Access Literals With jOOQ Records

This is a free feature that you’re getting from Kotlin thanks to their smart move of supporting certain operators on custom types by convention. Check out how easy it is to access a value from a record using a column reference:

val a = AUTHOR
val b = BOOK

ctx.select(a.FIRST_NAME, a.LAST_NAME, b.TITLE)
   .from(a)
   .join(b).on(a.ID.eq(b.AUTHOR_ID))
   .orderBy(1, 2, 3)
   .forEach {
       println("${it[b.TITLE]} by "
             + "${it[a.FIRST_NAME]} ${it[a.LAST_NAME]}")
   }

The output from our sample code is this:

1984 by George Orwell
Animal Farm by George Orwell
Brida by Paulo Coelho
O Alquimista by Paulo Coelho

We’re already using a couple of useful language features here. First off:

val for local variable type inference

We’re renaming AUTHOR to a and BOOK to b in our code to shorten table names a little bit. Using val we don’t even need to explicitly refer to the type of those tables.

As it looks right now, we’re going to get that feature in the Java language as well, so stay tuned.

(More about val in item #3)

More info about this feature here.

String interpolation

As you can see, inside of the String literal, we’re using the ${dollar.notation} to access variables from the context.

More info about this feature here.

(More about string interpolation in item #5)

Implicit lambda parameter “it”

When writing single-parameter lambda expressions, which is typical for loops, mapping, and many other standard library features, we don’t need to actually name that parameter in Kotlin. The default name is “it”. Groovy developers will rejoice, as they can profit from the same feature.

More info about this feature here.

Finally: Operator overloading

Kotlin supports a nice and pragmatic approach to operator overloading. We’re making use of the fact that the following two things mean exactly the same in Kotlin:

val x1 = anything[key];
val x2 = anything.get(key);

anything[key] = value;
anything.set(key, value);

Yeah, why not? In jOOQ, we saw this coming, and in version 3.9 we've thus added get() and set() methods on our ubiquitous Record type, so you can access record values by any of the following (as sketched below):

  • Field reference (with type safety)
  • Column name (no type safety)
  • Index (no type safety)
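
For illustration, here's a minimal sketch of all three access styles (assuming the a table alias from above; the untyped variants return Any?):

val r = ctx.fetchOne(a, a.ID.eq(1))!!

val byField: String = r[a.FIRST_NAME] // Field reference: type safe
val byName = r["FIRST_NAME"]          // Column name: no type safety
val byIndex = r[1]                    // Index: no type safety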

Note that operator overloading can be used as well with a couple of arithmetic operators:

ctx.select(TRANSACTION.AMOUNT + 10) // Translates to Field.plus(10)
   .from(TRANSACTION)
   .fetch();

More info about this feature here.

2. Loop Over Record Contents

Another really nice feature of the language is destructuring of “tuples” into several local variables. Quite a few languages support this now, and so does Kotlin, even without first-class language support for real tuples. But often, we don't actually need tuples, we just need their syntax. This can be done with any Map, for instance. Consider this code:

for (r in ctx.fetch(b))
    for ((k, v) in r.intoMap())
        println("${r[b.ID]}: ${k.padEnd(20)} = $v")

The output from our sample code is this:

1: ID                   = 1
1: AUTHOR_ID            = 1
1: TITLE                = 1984
1: PUBLISHED_IN         = 1948
2: ID                   = 2
2: AUTHOR_ID            = 1
2: TITLE                = Animal Farm
2: PUBLISHED_IN         = 1945
3: ID                   = 3
3: AUTHOR_ID            = 2
3: TITLE                = O Alquimista
3: PUBLISHED_IN         = 1988
4: ID                   = 4
4: AUTHOR_ID            = 2
4: TITLE                = Brida
4: PUBLISHED_IN         = 1990

This is really nice! Every jOOQ record can be represented as a Map using Record.intoMap(), which can be destructured automatically in loops as can be seen above.

More info about this feature here.

3. Local Variable Type Inference

An API that makes use of generics as heavily as jOOQ does can occasionally put quite some burden on the user. Type safety is great for the DSL; e.g. in jOOQ, the Java / Kotlin compilers can type-check things like:

// Type checks: Both sides are of type string
AUTHOR.FIRST_NAME.in("Alice", "Bob");

// Type checks: 
// - Both sides are of type string
// - Both sides are of degree one
AUTHOR.FIRST_NAME.in(
  select(CUSTOMER.FIRST_NAME).from(CUSTOMER)
);

// Doesn't type check: Wrong degree
AUTHOR.FIRST_NAME.in(
  select(CUSTOMER.FIRST_NAME, CUSTOMER.LAST_NAME).from(CUSTOMER)
)

This works through fancy types like:

Select<Record2<String, String>> subselect =
  select(CUSTOMER.FIRST_NAME, CUSTOMER.LAST_NAME).from(CUSTOMER);

These types can get quite hairy, as they go up to degree 22 (higher degrees are supported, but without type safety). So, in Kotlin, we can simply write:

val subselect =
  select(CUSTOMER.FIRST_NAME, CUSTOMER.LAST_NAME).from(CUSTOMER);

// Still doesn't type check:
AUTHOR.FIRST_NAME.in(subselect)

// This works:
row(AUTHOR.FIRST_NAME, AUTHOR.LAST_NAME).in(subselect)

More info about this feature here.

4. Inline Functions for New jOOQ Features

Are you missing a feature in jOOQ, for instance the PostgreSQL ilike operator? Well, it is supported through the Field.likeIgnoreCase() method, but you might prefer the more native looking ilike operator. No problem with Kotlin. Just write an extension function (declared inline here, although that's optional):

inline fun <F: Field<String>> F.ilike(field: String): Condition {
    return condition("{0} ilike {1}", this, field)
}

These extension functions make use of jOOQ's plain SQL API, which is heavily used by Java developers as well; the inline keyword merely asks the compiler to inline the function body at each call site. We can now write a query like this:

println("${ctx.select(b.TITLE)
              .from(b)
              .where(b.TITLE.ilike("%animal%"))
              .fetchOne(b.TITLE)}")

The result being:

Animal Farm

Notice how we can simply write b.TITLE.ilike("..") even if the jOOQ API doesn’t have such a method!

More info about this feature here.

5. String Interpolation and Multiline Strings

How many times have we wished for multiline strings in Java? So many other languages have them, even SQL and PL/SQL. When working with string-based embedded SQL, this is just a killer feature. Good for Kotlin! Not only does Kotlin support multiline strings, but also string interpolation.

Here’s how to use the jOOQ plain SQL API with Kotlin:

ctx.resultQuery("""
    SELECT *
    FROM (
      VALUES (1, 'a'),
             (2, 'b')
    ) AS t
    """)
    .fetch()
    .forEach {
        println("${it.intoMap()}")
    }

Just copy-paste any arbitrary SQL statement right from your SQL console (in this case: H2 syntax) and you’re done. The output of the above is:

{C1=1, C2=a}
{C1=2, C2=b}

Bonus feature if you’re using the upcoming jOOQ 3.10 parser: You can standardise on SQL features that are not available from your database:

val colX = field("x")
val colY = field("y")
ctx.parser()
   .parseResultQuery("""
    SELECT *
    FROM (
      VALUES (1, 'a'),
             (2, 'b')
      -- This feature (derived column lists) 
      -- isn't really available in H2 yet it works!
    ) AS t(${colX.name}, ${colY.name})
    """)
   .fetch()
   .forEach {
       println("${it[colX]}, ${it[colY]}")
   }

The above statement makes use of derived column lists (renaming tables and columns in one go). Unfortunately, H2 doesn't support this feature, but jOOQ can emulate it transparently. Note that we're again using string interpolation, this time in order to reuse column names.
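
Conceptually, the emulation could be as simple as renaming H2's default column names (C1, C2) in a wrapping projection; a sketch of equivalent SQL (the SQL actually generated by jOOQ may well differ):

SELECT t.C1 AS x, t.C2 AS y
FROM (
  VALUES (1, 'a'),
         (2, 'b')
) AS t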

More info about the feature here.

6. Null Safe Dereferencing

Null is the mother of all bikesheds. Of course, Kotlin is opinionated about null as well.

You're not sure if your query will produce a result? No problem with Kotlin's various null-related operators:

println("${ctx.fetchOne(b, b.ID.eq(5))?.title ?: "not found"}")

The above will print:

not found

Why? Because fetchOne() won't find a record in our database, so it returns null. We cannot dereference ?.title from null, but the ?. operator won't throw an exception, it will just return null again. Finally, the Elvis operator ?: provides a default value if the expression on its left is null. Perfect!
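
Spelled out step by step, the same logic might look like this (a sketch; BookRecord is the assumed name of the generated record type):

val record = ctx.fetchOne(b, b.ID.eq(5)) // BookRecord?: null, there's no book 5
val title = record?.title                // String?: null again, but no exception
println(title ?: "not found")            // The Elvis operator supplies a default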

More info about the feature here.

7. Getters and setters feel like properties

Remember the last 20 years when we emulated properties in Java using the obnoxious JavaBeans convention? Tired of writing getters and setters? Well, at least, with Kotlin, it pays off. We can syntactically pretend that we have a property where there are only getters and setters. Check this out:

val author1 = ctx.selectFrom(a).where(a.ID.eq(1)).fetchOne();
println("${author1.firstName} ${author1.lastName}")

val author2 = ctx.newRecord(a);
author2.firstName = "Alice"
author2.lastName = "Doe"
author2.store()
println("${author2.firstName} ${author2.lastName}")

Isn't that awesome? We can just dereference author1.firstName, which simply translates to author1.getFirstName(). Likewise, we can assign author2.lastName = "Doe", which simply translates to author2.setLastName("Doe").

It seems so obvious, no?

More info about the feature here.

8. With to Abbreviate Setter Calls

We can take this feature one step further and use the nice with() inline function (not a keyword!). In the above example, when storing a new author, we have to repeat the author reference all the time. This is no longer true with… with!

val author3 = ctx.newRecord(a);
with (author3) {
    firstName = "Bob"
    lastName = "Doe"
    store()
}

All the code that is inside of the block supplied to the with() function is going to operate on the with() function’s first argument. We already know this interesting feature from JavaScript, but in Kotlin, it’s typesafe!

And notice, the store() call is also affected!

More info about this feature here.

9. With to Locally Import Tables

We can use with() with queries, too! In jOOQ, columns are instance members inside of tables, and as such, must be dereferenced from a concrete table. That can be tedious (SQL doesn't have this requirement), especially when selecting only from a single table. You can turn on static member generation in jOOQ's code generator, but then table aliasing becomes much harder.

Or you use Kotlin again! The with() function can help:

with (AUTHOR) {
    ctx.select(FIRST_NAME, LAST_NAME)
       .from(AUTHOR)
       .where(ID.lt(5))
       .orderBy(ID)
       .fetch {
           println("${it[FIRST_NAME]} ${it[LAST_NAME]}")
       }
}

Resulting in

George Orwell
Paulo Coelho

Isn’t that nice? Now, all the FIRST_NAME, LAST_NAME, ID column references are dereferenced from AUTHOR in a much more lenient way.

More info about this feature here.

10. Destructuring jOOQ records into sets of local variables

This is so great, we’re going to add more native support for it in jOOQ 3.10 right away.

Kotlin's destructuring feature works by convention: any type that has methods named componentN can be destructured. jOOQ records currently don't have those methods, but they can be added really easily as operator extension functions:

operator fun <T, R : Record3<T, *, *>> R.component1() : T {
    return this.value1();
}

operator fun <T, R : Record3<*, T, *>> R.component2() : T {
    return this.value2();
}

operator fun <T, R : Record3<*, *, T>> R.component3() : T {
    return this.value3();
}

These extension functions are added to the R type, which is any subtype of Record3, and then they can be used as such:

for ((first, last, title) in ctx
   .select(a.FIRST_NAME, a.LAST_NAME, b.TITLE)
   .from(a)
   .join(b).on(a.ID.eq(b.AUTHOR_ID))
   .orderBy(1, 2, 3))
       println("$title by $first $last")

The result is again:

1984 by George Orwell
Animal Farm by George Orwell
Brida by Paulo Coelho
O Alquimista by Paulo Coelho

As can be seen, our query returns the type Record3<String, String, String>. Our inline functions will add component1(), component2(), and component3() methods to the type, which are used by the Kotlin language to destructure the record into three local loop variables: (first, last, title).

This is incredibly useful!

(Note also that we don't have to execute the query explicitly in the loop, because any jOOQ ResultQuery is Iterable; see the example below.)
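
For instance, this runs a query directly in the loop header (a minimal sketch):

// ResultQuery is Iterable, so the query executes when iteration starts
for (r in ctx.select(a.FIRST_NAME).from(a))
    println(r[a.FIRST_NAME])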

More info about this feature here.

Conclusion

Kotlin is getting increasingly popular. It contains a lot of smart yet pragmatic features that make daily work with programming on the JVM a lot easier. Kotlin’s high focus on Java interoperability makes it really really powerful, and working with Kotlin and jOOQ is a really productive way of writing SQL on the JVM.

Do give it a shot. The sources are here:

https://github.com/jOOQ/jOOQ/tree/master/jOOQ-examples/jOOQ-kotlin-example

Don’t Use the String Concatenation “Trick” in SQL Predicates

In SQL, quite often, we want to compare several values with each other. For instance, when we’re looking for a specific user by their first and last names, we’ll write a query like this one:

SELECT *
FROM customer
WHERE first_name = 'SUSAN'
AND last_name = 'WILSON';

We’re getting:

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          8   SUSAN        WILSON

Surely, everyone agrees that this is correct and perfectly fine as we probably have an index on these two columns (or on at least one of them) to speed up such queries:

CREATE INDEX idx_customer_name ON customer (last_name, first_name);

The execution plan is thus optimal, e.g. with Oracle:

-------------------------------------------------------------------------
| Id  | Operation                           | Name              | Rows  |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                   |       |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER          |     1 |
|*  2 |   INDEX RANGE SCAN                  | IDX_CUSTOMER_NAME |     1 |
-------------------------------------------------------------------------

But sometimes, we cannot use AND to connect two predicates. In particular, that’s not possible with an IN predicate, so people sometimes resort to using string concatenation, because that seems to work and make sense.

For instance, let's find all customers whose first and last names match those of an actor (as always, using the Sakila database):

SELECT *
FROM customer
WHERE first_name || last_name IN (
  SELECT first_name || last_name
  FROM actor
)

And yes indeed, what we’re getting here is the correct answer:

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          6   JENNIFER     DAVIS

But that answer is only accidentally correct!

Because we weren’t looking for customers called

first_name = 'JENNIFER' AND last_name = 'DAVIS'

We were looking for customers called

first_name || last_name = 'JENNIFERDAVIS'

Want proof? Let’s add a new customer:

INSERT INTO customer (customer_id, first_name, last_name )
VALUES               (600        , 'JENNI'   , 'FERDAVIS');

Yeah right? No one is called FERDAVIS. Or are they? As good programmers, we closely observe Murphy’s Law (i.e. always look both left and right when crossing a street).

In any case, let’s run our query again:

SELECT *
FROM customer
WHERE first_name || last_name IN (
  SELECT first_name || last_name
  FROM actor
)

And observe the result!

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          6   JENNIFER     DAVIS
        600   JENNI        FERDAVIS

Of course, because our predicate was really looking for customers called

first_name || last_name = 'JENNIFERDAVIS'

Which matches in both cases:

-- What we expected
first_name || last_name = 'JENNIFER' || 'DAVIS'

-- What we got
first_name || last_name = 'JENNI' || 'FERDAVIS'

Notice that I only added this customer to the customer table, not to the actor table. There’s no actor by the name FERDAVIS, so the result is clearly wrong.

AHA! Let’s use an “impossible” separator

So, we might proceed to fixing this as such:

SELECT *
FROM customer
WHERE first_name || '###' || last_name IN (
  SELECT first_name || '###' || last_name
  FROM actor
)

And now, the result is again correct. We get only JENNIFER DAVIS because we were looking for:

first_name || '###' || last_name = 'JENNIFER###DAVIS'

This works quite well for a while, as the separator is quite “impossible” (i.e. improbable) to be encountered in actual data. But we shouldn't trust our judgement, because… Murphy's Law. So you might think: better use a rarer separator, e.g. (if your database supports proper character sets):

SELECT *
FROM customer
WHERE first_name || '🙈🙉🙊' || last_name IN (
  SELECT first_name || '🙈🙉🙊' || last_name
  FROM actor
)

The use of emojis should indicate what my opinion of this approach is.

Too bad for performance, though

Remember that index we’ve created? Fact is, we also have such an index on the ACTOR table:

CREATE INDEX idx_actor_name ON actor (last_name, first_name);

And now, let’s assume our query is a bit different. We’ll be looking only for customers whose address_id is 10:

SELECT *
FROM customer
WHERE address_id = 10
AND first_name || '🙈🙉🙊' || last_name IN (
  SELECT first_name || '🙈🙉🙊' || last_name
  FROM actor
)

Now, our querymoji is indeed using the index, but for an INDEX FULL SCAN, so it's only slightly faster than scanning the entire actor table:

-----------------------------------------------------------------------------------
| Id  | Operation                            | Name                       | Rows  |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                            |       |
|*  1 |  HASH JOIN SEMI                      |                            |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER                   |     1 |
|*  3 |    INDEX RANGE SCAN                  | IDX_CUSTOMER_FK_ADDRESS_ID |     1 |
|   4 |   INDEX FULL SCAN                    | IDX_ACTOR_NAME             |     2 |
-----------------------------------------------------------------------------------

And what’s worse, even if all the cardinality estimates correctly indicate only 1-2 rows, we’ll perform a HASH JOIN and load the full index for it! We should be running a NESTED LOOP instead.

Is there a better way? Yes! Use row constructors to compare several values at once:

SELECT *
FROM customer
WHERE address_id = 10
AND (first_name, last_name) IN (
  SELECT first_name, last_name
  FROM actor
);

Or, if your database doesn't support this syntax (luckily, Oracle and PostgreSQL do, for instance), you can resort to an equivalent EXISTS predicate:

SELECT *
FROM customer c
WHERE address_id = 10
AND EXISTS (
  SELECT 1
  FROM actor a
  WHERE c.first_name = a.first_name
  AND c.last_name = a.last_name
);

Both of these queries are exactly equivalent and result in a nested loop semi join, rather than the previous hash join, which is perfectly reasonable for these small tables. We can now use the IDX_ACTOR_NAME for a quick INDEX RANGE SCAN operation:

-----------------------------------------------------------------------------------
| Id  | Operation                            | Name                       | Rows  |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                            |       |
|   1 |  NESTED LOOPS SEMI                   |                            |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER                   |     1 |
|*  3 |    INDEX RANGE SCAN                  | IDX_CUSTOMER_FK_ADDRESS_ID |     1 |
|*  4 |   INDEX RANGE SCAN                   | IDX_ACTOR_NAME             |     1 |
-----------------------------------------------------------------------------------

But let's not trust the estimated plans. Let's benchmark (more info about benchmarking SQL here):

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 2500;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name
        FROM customer
        WHERE address_id = 10
        AND first_name || '###' || last_name IN (
          SELECT first_name || '###' || last_name
          FROM actor
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 1 : ' 
      || (SYSTIMESTAMP - v_ts));
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name
        FROM customer
        WHERE address_id = 10
        AND (first_name, last_name) IN (
          SELECT first_name, last_name
          FROM actor
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 2 : '
      || (SYSTIMESTAMP - v_ts));
  END LOOP;
END;
/

As can be seen here, the benchmark shows that the query using the row value constructor is drastically faster, as it can use the index the way it should:

Run 1, Statement 1 : +000000000 00:00:00.374471000
Run 1, Statement 2 : +000000000 00:00:00.062830000
Run 2, Statement 1 : +000000000 00:00:00.364168000
Run 2, Statement 2 : +000000000 00:00:00.066252000
Run 3, Statement 1 : +000000000 00:00:00.359559000
Run 3, Statement 2 : +000000000 00:00:00.063898000
Run 4, Statement 1 : +000000000 00:00:00.344775000
Run 4, Statement 2 : +000000000 00:00:00.086060000
Run 5, Statement 1 : +000000000 00:00:00.394163000
Run 5, Statement 2 : +000000000 00:00:00.063176000

Now, imagine running this against a much more impressive data set than the Sakila database…

Conclusion

If you’re ever thinking about concatenating two fields for a comparison, try again. There are two major caveats that should indicate you’re about to do something silly:

  • There’s a major risk of your query being subtly wrong (accidental matches between JENNIFER DAVIS and JENNI FERDAVIS)
  • There’s a major risk of your query being quite slow

So, as a rule of thumb, don’t use concatenation in predicates. There’s (almost) always a better way.

Read also: Why You Should (Sometimes) Avoid Expressions in SQL Predicates

How to Calculate Multiple Aggregate Functions in a Single Query

At a customer site, I’ve recently encountered a report where a programmer needed to count quite a bit of stuff from a single table. The counts all differed in the way they used specific predicates. The report looked roughly like this (as always, I’m using the Sakila database for illustration):

-- Total number of films
SELECT count(*)
FROM film

-- Number of films with a given length
SELECT count(*)
FROM film
WHERE length BETWEEN 120 AND 150

-- Number of films with a given language
SELECT count(*)
FROM film
WHERE language_id = 1

-- Number of films for a given rating
SELECT count(*)
FROM film
WHERE rating = 'PG'

And then, unsurprisingly, combinations of these predicates were needed as well, i.e.

-- Number of films with a given length / language_id
SELECT count(*)
FROM film
WHERE length BETWEEN 120 AND 150
AND language_id = 1

-- Number of films with a given length / rating
SELECT count(*)
FROM film
WHERE length BETWEEN 120 AND 150
AND rating = 'PG'

-- Number of films with a given language_id / rating
SELECT count(*)
FROM film
WHERE language_id = 1
AND rating = 'PG'

-- Number of films with a given length / language_id / rating
SELECT count(*)
FROM film
WHERE length BETWEEN 120 AND 150
AND language_id = 1
AND rating = 'PG'

In the end, there were 32 queries in total (or 8 in my example) with all the possible combinations of predicates. Needless to say that running them all took quite a while, because the table had around 200M records and only one predicate could profit from an index.

But in fact, the improvement is really easy. There are several options to calculate all these counts in a single query.

Simplest solution works in all databases: Filtered aggregate functions (or manual pivot)

This solution allows for calculating all results in a single query by using 8 different, explicit, filtered aggregate functions and no GROUP BY clause (none in this example, at least; more complex cases where GROUP BY persists are still imaginable, see the sketch in the FILTER section below).

This is how it works on all databases:

SELECT 
  count(*),
  count(length),
  count(language_id),
  count(rating),
  count(length + language_id),
  count(length + rating),
  count(language_id + rating),
  count(length + language_id + rating)
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
    CASE WHEN language_id = 1            THEN 1 END language_id,
    CASE WHEN rating = 'PG'              THEN 1 END rating
  FROM film
) film

Which yields:

col1  col2  col3  col4  col5  col6  col7  col8
1000  224   1000  194   224   43    194   43

How to read the above query?

Instead of evaluating the three different predicates in a WHERE clause, we pre-calculate them in a derived table (a subquery in the FROM clause) and translate each predicate into some arbitrary value (e.g. 1) if TRUE, and NULL if FALSE. Note that I omitted the ELSE clause from the CASE expressions, which means that we get NULLs by default. Running the nested select on its own…

SELECT
  CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
  CASE WHEN language_id = 1            THEN 1 END language_id,
  CASE WHEN rating = 'PG'              THEN 1 END rating
FROM film

… yields something along the lines of:

length  language_id  rating
---------------------------
NULL    1            1
NULL    1            NULL
NULL    1            NULL
NULL    1            NULL
1       1            NULL
NULL    1            1
NULL    1            NULL
...

(Note, of course, we could have used actual BOOLEAN types, e.g. in PostgreSQL, but that wouldn’t work on all databases)

Now, in the outer query, we're using COUNT(*) once, which simply counts all the rows regardless of any predicates in the CASE expressions. The other COUNT(expr) aggregate functions do something that surprisingly few people are aware of (yet a lot of people use this form “by accident”): they count only the number of rows for which expr is non-NULL. For instance:

SELECT 
  ...
  count(length),
  ...
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
    ...
  FROM film
) film

Or also:

SELECT 
  count(CASE WHEN length BETWEEN 120 AND 150 THEN 1 END)
FROM
  film

These queries will count those films whose length is BETWEEN 120 AND 150 (because those rows produce the value 1, which is non-NULL, and thus counted), whereas all the other films are not being counted.

Finally, I used a little trick that combines several nullable values such that the result is non-NULL only if all of them are non-NULL:

SELECT 
  ...
  count(length + language_id)
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
    CASE WHEN language_id = 1            THEN 1 END language_id,
    ...
  FROM film
) film

This counts those rows whose length BETWEEN 120 AND 150 and whose language_id = 1, because if either predicate was FALSE, the number would be NULL and thus the sum is NULL as well.
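
This relies on plain NULL propagation in arithmetic; a quick illustration (PostgreSQL / H2 syntax, add a FROM dual on Oracle):

SELECT 1 + 1    AS both_non_null, -- 2: non-NULL, counted by COUNT()
       1 + NULL AS one_null       -- NULL: ignored by COUNT()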

PostgreSQL and HSQLDB variant: FILTER

In PostgreSQL and HSQLDB (and in the SQL standard), there’s a special syntax for this. We can use the FILTER clause instead of encoding values in NULL / non-NULL like this:

SELECT 
  count(*),
  count(*) FILTER (WHERE length IS NOT NULL),
  count(*) FILTER (WHERE language_id IS NOT NULL),
  count(*) FILTER (WHERE rating IS NOT NULL),
  count(*) FILTER (WHERE length + language_id IS NOT NULL),
  count(*) FILTER (WHERE length + rating IS NOT NULL),
  count(*) FILTER (WHERE language_id + rating IS NOT NULL),
  count(*) FILTER (
    WHERE length + language_id + rating IS NOT NULL)
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
    CASE WHEN language_id = 1            THEN 1 END language_id,
    CASE WHEN rating = 'PG'              THEN 1 END rating
  FROM film
) film

Or even, writing out the entire predicates again:

SELECT 
  count(*),
  count(*) FILTER (WHERE length BETWEEN 120 AND 150),
  count(*) FILTER (WHERE language_id = 1),
  count(*) FILTER (WHERE rating = 'PG'),
  count(*) FILTER (
    WHERE length BETWEEN 120 AND 150 AND language_id = 1),
  count(*) FILTER (
    WHERE length BETWEEN 120 AND 150 AND rating = 'PG'),
  count(*) FILTER (
    WHERE language_id = 1 AND rating = 'PG'),
  count(*) FILTER (
    WHERE length BETWEEN 120 AND 150 
    AND language_id = 1 AND rating = 'PG')
FROM film

Usually, the FILTER clause is more convenient, but both approaches are equivalent, and we’re running only a single query!
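
And if the report does need an actual GROUP BY (the more complex case hinted at earlier), filtered aggregates combine with it naturally; a sketch, grouping by rating:

SELECT
  rating,
  count(*) AS total,
  count(*) FILTER (WHERE length BETWEEN 120 AND 150) AS long_films
FROM film
GROUP BY rating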

I also call this “manual PIVOT”, because it really works like a PIVOT table. And the good news is… There is a PIVOT syntax!

A more fancy solution: PIVOT

This solution is vendor-specific and works only in Oracle and, with fewer features, in SQL Server. Here's the Oracle version:

SELECT 
  a + b + c + d + e + f + g + h,
                  e + f + g + h,
          c + d         + g + h,
      b     + d     + f     + h,
                          g + h,
                      f     + h,
              d             + h,
                              h
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 
         THEN 1 ELSE 0 END length,
    CASE WHEN language_id = 1            
         THEN 1 ELSE 0 END language_id,
    CASE WHEN rating = 'PG'              
         THEN 1 ELSE 0 END rating
  FROM film
) film
PIVOT (
  count(*) FOR (length, language_id, rating) IN (
    (0, 0, 0) AS a,
    (0, 0, 1) AS b,
    (0, 1, 0) AS c,
    (0, 1, 1) AS d,
    (1, 0, 0) AS e,
    (1, 0, 1) AS f,
    (1, 1, 0) AS g,
    (1, 1, 1) AS h
  )
)

How to read this solution? There are 3 steps:

Step 1: The derived table

As in the previous example, we're translating the desired predicates for our report into three columns that produce the values 1 and 0. That's understood, so I won't repeat the explanation.

Step 2: The PIVOT clause

The PIVOT clause can be applied to a table expression to “pivot” it in a similar way as we know from Microsoft Excel's powerful pivot tables. It takes three parts:

  • A list of aggregate functions
  • An expression (FOR clause)
  • A list of expected values (IN clause)

The resulting table expression groups the PIVOT's input table by all the remaining columns (i.e. all the columns that are not part of the FOR clause; in our example, there are none), and evaluates all the aggregate functions (in our case, only one) for each value combination in the IN list.

If we SELECT * from this PIVOT table:

SELECT *
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 
         THEN 1 ELSE 0 END length,
    CASE WHEN language_id = 1            
         THEN 1 ELSE 0 END language_id,
    CASE WHEN rating = 'PG'              
         THEN 1 ELSE 0 END rating
  FROM film
) film
PIVOT (
  count(*) FOR (length, language_id, rating) IN (
    (0, 0, 0) AS a,
    (0, 0, 1) AS b,
    (0, 1, 0) AS c,
    (0, 1, 1) AS d,
    (1, 0, 0) AS e,
    (1, 0, 1) AS f,
    (1, 1, 0) AS g,
    (1, 1, 1) AS h
  )
)

… we’ll get these values:

a    b    c    d    e    f    g    h
------------------------------------
0    0  625  151    0    0  181   43

As you can see, the column names are generated from the IN list of expected values, and the values contained in these columns are aggregations for the different predicates. These aggregations are not exactly the ones we wanted. For instance, column G counts all the films whose length is BETWEEN 120 AND 150, whose language_id = 1, and whose rating != 'PG'.

Step 3: Summing the count values

So, in order to get the expected results, we have to sum up the partial counts as follows:

SELECT 
  a + b + c + d + e + f + g + h,
                  e + f + g + h,
          c + d         + g + h,
      b     + d     + f     + h,
                          g + h,
                      f     + h,
              d             + h,
                              h
FROM 
  ...

The result is now the same.

A more fancy solution: GROUPING SETS

GROUPING SETS are a SQL standard feature, and they're supported in at least these databases:

  • DB2
  • HANA
  • Oracle
  • PostgreSQL
  • SQL Server
  • Sybase SQL Anywhere

Simply put, GROUPING SETS allow for grouping a table several times and creating a UNION of all the results. For example, the following two queries are the same, conceptually, although the GROUPING SETS one is usually faster:

-- Grouping once by language_id, then by rating
SELECT language_id, rating, count(*)
FROM film
GROUP BY GROUPING SETS (
  (language_id),
  (rating)
)

-- Grouping first by language_id
SELECT language_id, NULL, count(*)
FROM film
GROUP BY language_id
UNION ALL
SELECT NULL, rating, count(*)
FROM film
GROUP BY rating

Both queries yield:

language_id   rating   count
          1             1000 -- First grouping set / union subquery
              G          178 \
              PG         194  |
              PG-13      223  | Second grouping set / union subquery
              R          195  |
              NC-17      210 /

Clearly, the GROUPING SETS variant is more concise. Let's imagine we'd like to add more combinations of grouping columns, e.g.:

SELECT language_id, rating, count(*)
FROM film
GROUP BY GROUPING SETS (
  (),
  (language_id),
  (rating),
  (language_id, rating)
)

Now, we’re grouping by all the combinations of columns, and the result is:

language_id   rating   count
                        1000 -- First grouping set: ()
          1             1000 -- Second grouping set: (language_id)
              G          178 \
              PG         194  |
              PG-13      223  | Third grouping set: (rating)
              R          195  |
              NC-17      210 /
          1   G          178 \
          1   PG         194  |
          1   PG-13      223  | Fourth grouping set: (language_id, rating)
          1   R          195  |
          1   NC-17      210 /

Of course, this would all be more impressive if we had more than one language in the system…

So, how do we solve the original problem with GROUPING SETS? Here’s how:

SELECT 
  GROUPING_ID (length, language_id, rating),
  length,
  language_id,
  rating,
  count(*)
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 
         THEN 1 ELSE 0 END length,
    CASE WHEN language_id = 1            
         THEN 1 ELSE 0 END language_id,
    CASE WHEN rating = 'PG'              
         THEN 1 ELSE 0 END rating
  FROM film
) film
GROUP BY GROUPING SETS (
  (),
  (length),
  (language_id),
  (rating),
  (length, language_id),
  (length, rating),
  (rating, language_id),
  (length, language_id, rating)
)
HAVING COALESCE (length, 1) != 0 
AND COALESCE (language_id, 1) != 0 
AND COALESCE (rating, 1) != 0
ORDER BY GROUPING_ID (length, language_id, rating) DESC

Wow. How to read this? In 4 steps:

Step 1: Again, the derived table

This time, we’ll encode FALSE as 0, not NULL, because NULL already has a different meaning in GROUPING SETS. It means that for a given GROUPING SET, we didn’t group by that column. We’ll see that in step 3.

Step 2: The GROUPING SETS

In this section, we’re just listing all the possible combinations of GROUP BY columns that we want to use, which produces 8 distinct GROUPING SETS. I’ve already explained this in the previous introduction to GROUPING SETS, so this is no different.

Step 3: Filter out unwanted groupings

Just like in the PIVOT example, we’re also getting results for which the predicates are FALSE, but we don’t want those in the result. So we’re filtering them out in the HAVING clause:

SELECT 
  ...
HAVING COALESCE (length, 1) != 0 
AND COALESCE (language_id, 1) != 0 
AND COALESCE (rating, 1) != 0
...

How to read this? E.g. LENGTH can be any of:

  • 1: The length predicate was TRUE
  • 0: The length predicate was FALSE
  • NULL: The length column is not considered for a given GROUPING SET, e.g. () or (rating, language_id)

So, using COALESCE, we’re making sure that we include only 1 and NULL lengths, not 0 lengths.
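
Spelled out, the HAVING filter evaluates as follows for the three possible LENGTH values:

COALESCE(1, 1)    != 0  -- TRUE:  predicate was TRUE, keep the row
COALESCE(NULL, 1) != 0  -- TRUE:  LENGTH not grouped in this set, keep the row
COALESCE(0, 1)    != 0  -- FALSE: predicate was FALSE, discard the row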

Step 4: Ordering the results

This is optional, but in order to get the same output order as before, we can use the special GROUPING_ID() (or GROUPING() depending on the DB) function which returns an ID for each GROUPING SET. The output is:

grouping   length   language_id   rating   count
------------------------------------------------
       7     NULL          NULL     NULL    1000
       6     NULL          NULL        1     194
       5     NULL             1     NULL    1000
       4     NULL             1        1     194
       3        1          NULL     NULL     224
       2        1          NULL        1      43
       1        1             1     NULL     224
       0        1             1        1      43

Excellent! And hey, there's even syntax sugar for “special” GROUPING SETS configurations like ours, where we list all the possible column combinations. In this case, we can use CUBE()!

SELECT 
  GROUPING_ID (length, language_id, rating),
  length,
  language_id,
  rating,
  count(*)
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 
         THEN 1 ELSE 0 END length,
    CASE WHEN language_id = 1            
         THEN 1 ELSE 0 END language_id,
    CASE WHEN rating = 'PG'              
         THEN 1 ELSE 0 END rating
  FROM film
) film
GROUP BY CUBE (length, language_id, rating)
HAVING COALESCE(length, 1) != 0 
AND COALESCE(language_id, 1) != 0 
AND COALESCE(rating, 1) != 0
ORDER BY GROUPING_ID (length, language_id, rating) DESC

Performance

Such a comparison blog post wouldn't be complete if we didn't benchmark for performance. This time, I'll be benchmarking only on Oracle, as PostgreSQL doesn't support PIVOT, and SQL Server's PIVOT is more limited than Oracle's.

Here’s the complete benchmark:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 2000;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP

    -- Individual statements
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT count(*) FROM film
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE length BETWEEN 120 AND 150
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE language_id = 1
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE rating = 'PG'
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE length BETWEEN 120 AND 150
        AND language_id = 1
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE length BETWEEN 120 AND 150
        AND rating = 'PG'
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE language_id = 1
        AND rating = 'PG'
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE length BETWEEN 120 AND 150
        AND language_id = 1
        AND rating = 'PG'
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 1 : ' || (SYSTIMESTAMP - v_ts));

    -- Manual PIVOT
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT
          count(*),
          count(length),
          count(language_id),
          count(rating),
          count(length + language_id),
          count(length + rating),
          count(language_id + rating),
          count(length + language_id + rating)
        FROM (
          SELECT
            CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
            CASE WHEN language_id = 1            THEN 1 END language_id,
            CASE WHEN rating = 'PG'              THEN 1 END rating
          FROM film
        ) film
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 2 : ' || (SYSTIMESTAMP - v_ts));
    
    -- PIVOT
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT 
          a + b + c + d + e + f + g + h,
                          e + f + g + h,
                  c + d         + g + h,
              b     + d     + f     + h,
                                  g + h,
                              f     + h,
                      d             + h,
                                      h
        FROM (
          SELECT
            CASE WHEN length BETWEEN 120 AND 150 THEN 1 ELSE 0 END length,
            CASE WHEN language_id = 1            THEN 1 ELSE 0 END language_id,
            CASE WHEN rating = 'PG'              THEN 1 ELSE 0 END rating
          FROM film
        ) film
        PIVOT (
          count(*) FOR (length, language_id, rating) IN (
            (0, 0, 0) AS a,
            (0, 0, 1) AS b,
            (0, 1, 0) AS c,
            (0, 1, 1) AS d,
            (1, 0, 0) AS e,
            (1, 0, 1) AS f,
            (1, 1, 0) AS g,
            (1, 1, 1) AS h
          )
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 3 : ' || (SYSTIMESTAMP - v_ts));

    -- GROUPING SETS
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT 
          GROUPING_ID (length, language_id, rating),
          length,
          language_id,
          rating,
          count(*)
        FROM (
          SELECT
            CASE WHEN length BETWEEN 120 AND 150 THEN 1 ELSE 0 END length,
            CASE WHEN language_id = 1            THEN 1 ELSE 0 END language_id,
            CASE WHEN rating = 'PG'              THEN 1 ELSE 0 END rating
          FROM film
        ) film
        GROUP BY CUBE (length, language_id, rating)
        HAVING COALESCE (length, 1) != 0 
        AND COALESCE (language_id, 1) != 0 
        AND COALESCE (rating, 1) != 0
        ORDER BY GROUPING_ID (length, language_id, rating) DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 4 : ' || (SYSTIMESTAMP - v_ts));
  END LOOP;
END;
/

And the results:

Run 1, Statement 1 : +000000000 00:00:01.928497000
Run 1, Statement 2 : +000000000 00:00:01.136341000
Run 1, Statement 3 : +000000000 00:00:02.751679000
Run 1, Statement 4 : +000000000 00:00:00.797529000

Run 2, Statement 1 : +000000000 00:00:01.695543000
Run 2, Statement 2 : +000000000 00:00:01.004073000
Run 2, Statement 3 : +000000000 00:00:02.490895000
Run 2, Statement 4 : +000000000 00:00:00.838979000

Run 3, Statement 1 : +000000000 00:00:01.634047000
Run 3, Statement 2 : +000000000 00:00:01.016266000
Run 3, Statement 3 : +000000000 00:00:02.566895000
Run 3, Statement 4 : +000000000 00:00:00.790159000

Run 4, Statement 1 : +000000000 00:00:01.669844000
Run 4, Statement 2 : +000000000 00:00:01.015502000
Run 4, Statement 3 : +000000000 00:00:02.574646000
Run 4, Statement 4 : +000000000 00:00:00.807804000

Run 5, Statement 1 : +000000000 00:00:01.653498000
Run 5, Statement 2 : +000000000 00:00:00.980375000
Run 5, Statement 3 : +000000000 00:00:02.556186000
Run 5, Statement 4 : +000000000 00:00:00.890283000

Very disappointingly, the PIVOT solution is the slowest every time. I’m assuming there’s some substantial temporary object overhead which wouldn’t be as severe if the table were much larger, but clearly, the manual PIVOT solution (COUNT(CASE ...)) and the GROUPING SETS solution heavily outperform the initial attempt, where we calculate 8 counts individually.

To get back to the original report where 32 counts were calculated: the report was roughly 20x as fast with the manual PIVOT approach on 200M rows. And imagine if you also need to JOIN! You definitely want to avoid those 32 individual queries and calculate everything in one go.

Cheers!

How to Fetch Multiple Oracle Execution Plans in One Nice Query

When looking at execution plans in Oracle, we’ll have to do several steps to be able to call the DBMS_XPLAN package functions. In fact, we have to find out the SQL_ID for a given statement first, and only then we can get its plan. I’ve blogged about this previously, here.
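
For reference, the traditional two-step approach looks roughly like this (a sketch; substitute your own matching pattern, and in step 2 the SQL_ID and CHILD_NUMBER found in step 1):

-- Step 1: find the SQL_ID in the cursor cache
SELECT sql_id, child_number
FROM v$sql
WHERE sql_text LIKE '%some distinctive text%';

-- Step 2: fetch the plan for that SQL_ID
SELECT *
FROM TABLE (dbms_xplan.display_cursor('<sql_id>', 0, 'ALLSTATS LAST'));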

However, thanks to lateral unnesting, we can do the two steps in one go. More than that, we can even fetch several plans in one go this way! Check this out…

Let’s run the queries from the previous benchmarking blog post, and let’s add the /*+GATHER_PLAN_STATISTICS*/ hint to get some actual execution values, not just estimates:

SELECT /*+GATHER_PLAN_STATISTICS*/ 
  first_name, last_name, count(fa.actor_id) AS c
FROM actor a
LEFT JOIN film_actor fa
ON a.actor_id = fa.actor_id
WHERE last_name LIKE 'A%'
GROUP BY a.actor_id, first_name, last_name
ORDER BY c DESC
;

SELECT /*+GATHER_PLAN_STATISTICS*/
  first_name, last_name, (
    SELECT count(*)
    FROM film_actor fa
    WHERE a.actor_id =
    fa.actor_id
  ) AS c
FROM actor a
WHERE last_name LIKE 'A%' 
ORDER BY c DESC
;

Both queries do the same thing. They try to find those actors whose last name starts with the letter A and count their corresponding films. Now, what about the execution plans? Run the following query, and you don't have to know any SQL_ID in advance:

SELECT s.sql_id, p.*
FROM v$sql s, TABLE (
  dbms_xplan.display_cursor (
    s.sql_id, s.child_number, 'ALLSTATS LAST'
  )
) p
WHERE s.sql_text LIKE '%/*+GATHER_PLAN_STATISTICS*/%';

As you can see, with “LATERAL unnesting”, we can unnest a nested table (as returned by DBMS_XPLAN.DISPLAY_CURSOR) into the calling query, and we can pass column values from the V$SQL table to each function call. In other words, the above query reads as:

  • Get all SQL statements from the cursor cache V$SQL
  • Keep only those that contain our GATHER_PLAN_STATISTICS hint (replace this with your own query matching pattern)
  • Cross-join the unnested table from DBMS_XPLAN.DISPLAY_CURSOR, which yields the plan for each SQL_ID and CHILD_NUMBER

This implicit “LATERAL unnesting” is a bit obscure in my opinion (but it's brief). More formally correct would be to use the actual LATERAL keyword, or, better, the SQL Server style CROSS APPLY:

-- LATERAL: A bit verbose
SELECT s.sql_id, p.*
FROM v$sql s CROSS JOIN LATERAL (SELECT * FROM TABLE (
  dbms_xplan.display_cursor (
    s.sql_id, s.child_number, 'ALLSTATS LAST'
  )
)) p
WHERE s.sql_text LIKE '%/*+GATHER_PLAN_STATISTICS*/%';

-- CROSS APPLY: Very neat!
SELECT s.sql_id, p.*
FROM v$sql s CROSS APPLY TABLE (
  dbms_xplan.display_cursor (
    s.sql_id, s.child_number, 'ALLSTATS LAST'
  )
) p
WHERE s.sql_text LIKE '%/*+GATHER_PLAN_STATISTICS*/%';

In any case, the result is the same (I’ve removed some columns for brevity):

SQL_ID          PLAN_TABLE_OUTPUT
3gv1fd3dcj3b0	SQL_ID  3gv1fd3dcj3b0, child number 0
3gv1fd3dcj3b0	-------------------------------------
3gv1fd3dcj3b0	SELECT /*+GATHER_PLAN_STATISTICS*/ first_name, last_name, 
3gv1fd3dcj3b0	count(fa.actor_id) AS c FROM actor a LEFT JOIN film_actor fa ON 
3gv1fd3dcj3b0	a.actor_id = fa.actor_id WHERE last_name LIKE 'A%' GROUP BY a.actor_id, 
3gv1fd3dcj3b0	first_name, last_name ORDER BY c DESC
3gv1fd3dcj3b0	 
3gv1fd3dcj3b0	Plan hash value: 3014447605
3gv1fd3dcj3b0	 
3gv1fd3dcj3b0	-----------------------------------------------------------------------------------------------------
3gv1fd3dcj3b0	| Id  | Operation                              | Name                    | Starts | E-Rows | A-Rows |
3gv1fd3dcj3b0	-----------------------------------------------------------------------------------------------------
3gv1fd3dcj3b0	|   0 | SELECT STATEMENT                       |                         |      1 |        |      7 |
3gv1fd3dcj3b0	|   1 |  SORT ORDER BY                         |                         |      1 |      7 |      7 |
3gv1fd3dcj3b0	|   2 |   HASH GROUP BY                        |                         |      1 |      7 |      7 |
3gv1fd3dcj3b0	|*  3 |    HASH JOIN OUTER                     |                         |      1 |    154 |    196 |
3gv1fd3dcj3b0	|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| ACTOR                   |      1 |      6 |      7 |
3gv1fd3dcj3b0	|*  5 |      INDEX RANGE SCAN                  | IDX_ACTOR_LAST_NAME     |      1 |      6 |      7 |
3gv1fd3dcj3b0	|   6 |     INDEX FAST FULL SCAN               | IDX_FK_FILM_ACTOR_ACTOR |      1 |   5462 |   5462 |
3gv1fd3dcj3b0	-----------------------------------------------------------------------------------------------------
3gv1fd3dcj3b0	 
3gv1fd3dcj3b0	Predicate Information (identified by operation id):
3gv1fd3dcj3b0	---------------------------------------------------
3gv1fd3dcj3b0	 
3gv1fd3dcj3b0	   3 - access("A"."ACTOR_ID"="FA"."ACTOR_ID")
3gv1fd3dcj3b0	   5 - access("A"."LAST_NAME" LIKE 'A%')
3gv1fd3dcj3b0	       filter("A"."LAST_NAME" LIKE 'A%')
3gv1fd3dcj3b0	 
3gv1fd3dcj3b0	Note
3gv1fd3dcj3b0	-----
3gv1fd3dcj3b0	   - dynamic statistics used: dynamic sampling (level=2)
3gv1fd3dcj3b0	   - this is an adaptive plan
3gv1fd3dcj3b0	   - 1 Sql Plan Directive used for this statement
3gv1fd3dcj3b0	 
6a3nrpcw22avr	SQL_ID  6a3nrpcw22avr, child number 0
6a3nrpcw22avr	-------------------------------------
6a3nrpcw22avr	SELECT /*+GATHER_PLAN_STATISTICS*/ first_name, last_name, (   SELECT 
6a3nrpcw22avr	count(*)   FROM film_actor fa   WHERE a.actor_id =   fa.actor_id ) AS c 
6a3nrpcw22avr	FROM actor a WHERE last_name LIKE 'A%'  ORDER BY c DESC
6a3nrpcw22avr	 
6a3nrpcw22avr	Plan hash value: 3873085786
6a3nrpcw22avr	 
6a3nrpcw22avr	---------------------------------------------------------------------------------------------------
6a3nrpcw22avr	| Id  | Operation                            | Name                    | Starts | E-Rows | A-Rows |
6a3nrpcw22avr	---------------------------------------------------------------------------------------------------
6a3nrpcw22avr	|   0 | SELECT STATEMENT                     |                         |      1 |        |      7 |
6a3nrpcw22avr	|   1 |  SORT AGGREGATE                      |                         |      7 |      1 |      7 |
6a3nrpcw22avr	|*  2 |   INDEX RANGE SCAN                   | IDX_FK_FILM_ACTOR_ACTOR |      7 |     27 |    196 |
6a3nrpcw22avr	|   3 |  SORT ORDER BY                       |                         |      1 |      6 |      7 |
6a3nrpcw22avr	|   4 |   TABLE ACCESS BY INDEX ROWID BATCHED| ACTOR                   |      1 |      6 |      7 |
6a3nrpcw22avr	|*  5 |    INDEX RANGE SCAN                  | IDX_ACTOR_LAST_NAME     |      1 |      6 |      7 |
6a3nrpcw22avr	---------------------------------------------------------------------------------------------------
6a3nrpcw22avr	 
6a3nrpcw22avr	Predicate Information (identified by operation id):
6a3nrpcw22avr	---------------------------------------------------
6a3nrpcw22avr	 
6a3nrpcw22avr	   2 - access("FA"."ACTOR_ID"=:B1)
6a3nrpcw22avr	   5 - access("LAST_NAME" LIKE 'A%')
6a3nrpcw22avr	       filter("LAST_NAME" LIKE 'A%')
6a3nrpcw22avr	 

This is really neat!

SQL IN Predicate: With IN List or With Array? Which is Faster?

Hah! Got nerd-sniped again:

http://stackoverflow.com/questions/43099226/how-to-make-jooq-to-use-arrays-in-the-in-clause/43102102

A jOOQ user was wondering why jOOQ would generate an IN list for a predicate like this:

Java

COLUMN.in(1, 2, 3, 4)

SQL

COLUMN in (?, ?, ?, ?)

… when in fact the following predicate could have been generated instead:

COLUMN = any(?::int[])

In the second case, there would have been only a single bind variable instead of 4, and the SQL generation and parsing work would have been “much” less (maybe not for an IN list of size 4, but imagine a list of 50 values).
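To make the difference concrete, here are the two shapes with literal values filled in (PostgreSQL syntax; these illustration queries are mine, not from the original question):

-- One bind variable per list element:
SELECT * FROM film WHERE film_id IN (1, 2, 4, 8);

-- A single bind variable containing the entire array:
SELECT * FROM film WHERE film_id = ANY(ARRAY[1, 2, 4, 8]::int[]);

Both predicates are semantically equivalent; they differ only in how the values reach the database.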

A disclaimer

First off, a disclaimer: In databases that have a cursor cache / plan cache (e.g. Oracle or SQL Server), you should be careful with long IN lists, because they will probably trigger a hard parse every time you run them, as by the time you run the exact same predicate (with 371 elements in the list) again, the execution plan will have been purged from the cache. So, you cannot really profit from the cache.
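To illustrate why: the following two statements differ only in the length of their IN lists, yet the cursor cache treats them as two entirely distinct SQL strings, each requiring its own hard parse and its own cached plan (a sketch; the table t is hypothetical):

SELECT * FROM t WHERE id IN (?, ?)
SELECT * FROM t WHERE id IN (?, ?, ?)

With the array variant, both cases share the single SQL string SELECT * FROM t WHERE id = ANY(?), and thus a single plan.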

I’m aware of this problem, and it will be the topic of another blog post, soon. For now, let’s stick to PostgreSQL, whose “plan cache” isn’t really that sophisticated.

Measure, don’t guess

The question was about improving the speed of parsing a SQL statement. Parsers are really fast, so parsing shouldn’t be a problem. Generating an execution plan certainly does cost more time, but again, since PostgreSQL’s plan cache isn’t very sophisticated, this won’t play into the issue here. So the question is really:

Is an IN list really that bad in PostgreSQL?

Would an array bind variable be much better?

Since our recent post about benchmarking, we now know that we shall never guess, but always measure. I’m again using the Sakila database to run these two queries:

-- IN list
SELECT * 
FROM film 
JOIN film_actor USING (film_id) 
JOIN actor USING (actor_id) 
WHERE film_id IN (?, ?, ?, ?)

-- Array
SELECT * 
FROM film 
JOIN film_actor USING (film_id) 
JOIN actor USING (actor_id) 
WHERE film_id = ANY(?)

Let’s try lists of length 4, first. The benchmark is here:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 1000;
  rec RECORD;
  v_e1 INT := 1;
  v_e2 INT := 2;
  v_e3 INT := 4;
  v_e4 INT := 8;
  v_any_arr INT[] := ARRAY[v_e1, v_e2, v_e3, v_e4];
BEGIN
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT * 
        FROM film 
        JOIN film_actor USING (film_id) 
        JOIN actor USING (actor_id) 
        WHERE film_id IN (v_e1, v_e2, v_e3, v_e4)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', 
      r, (clock_timestamp() - v_ts); 
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT * 
        FROM film 
        JOIN film_actor USING (film_id) 
        JOIN actor USING (actor_id) 
        WHERE film_id = ANY(v_any_arr)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', 
      r, (clock_timestamp() - v_ts); 
  END LOOP;
END$$;

The result being:

INFO:  Run 1, Statement 1: 00:00:00.112195
INFO:  Run 1, Statement 2: 00:00:00.450461
INFO:  Run 2, Statement 1: 00:00:00.109792
INFO:  Run 2, Statement 2: 00:00:00.446518
INFO:  Run 3, Statement 1: 00:00:00.105413
INFO:  Run 3, Statement 2: 00:00:00.44298
INFO:  Run 4, Statement 1: 00:00:00.108249
INFO:  Run 4, Statement 2: 00:00:00.476527
INFO:  Run 5, Statement 1: 00:00:00.120229
INFO:  Run 5, Statement 2: 00:00:00.448214

Interesting. So, the IN list outperforms the array bind variable every time, by a factor of 4 (which happens to be the size of the array / list!). So, let’s try 8 values, then. Here are the values and the adapted query 1:

-- values
  v_e1 INT := 1;
  v_e2 INT := 2;
  v_e3 INT := 4;
  v_e4 INT := 8;
  v_e5 INT := 16;
  v_e6 INT := 32;
  v_e7 INT := 64;
  v_e8 INT := 128;
  v_any_arr INT[] := ARRAY[v_e1, v_e2, v_e3, v_e4, v_e5, v_e6, v_e7, v_e8];

-- adapted query 1 ...
        WHERE film_id IN (v_e1, v_e2, v_e3, v_e4, v_e5, v_e6, v_e7, v_e8)
-- ...

The result is still impressive:

INFO:  Run 1, Statement 1: 00:00:00.182646
INFO:  Run 1, Statement 2: 00:00:00.63624
INFO:  Run 2, Statement 1: 00:00:00.184814
INFO:  Run 2, Statement 2: 00:00:00.685976
INFO:  Run 3, Statement 1: 00:00:00.188108
INFO:  Run 3, Statement 2: 00:00:00.634903
INFO:  Run 4, Statement 1: 00:00:00.184933
INFO:  Run 4, Statement 2: 00:00:00.626616
INFO:  Run 5, Statement 1: 00:00:00.185879
INFO:  Run 5, Statement 2: 00:00:00.636723

The IN list query now takes almost 2x as long (but not quite 2x), whereas the array query now takes around 1.5x as long. It looks as though arrays become the better choice when their size increases. So, let’s do this! With 32 bind variables in the IN list, or 32 array elements respectively:

INFO:  Run 1, Statement 1: 00:00:00.905064
INFO:  Run 1, Statement 2: 00:00:00.752819
INFO:  Run 2, Statement 1: 00:00:00.760475
INFO:  Run 2, Statement 2: 00:00:00.758247
INFO:  Run 3, Statement 1: 00:00:00.777667
INFO:  Run 3, Statement 2: 00:00:00.895875
INFO:  Run 4, Statement 1: 00:00:01.308167
INFO:  Run 4, Statement 2: 00:00:00.789537
INFO:  Run 5, Statement 1: 00:00:00.788606
INFO:  Run 5, Statement 2: 00:00:00.776159

Both are about equally fast. 64 bind values!

INFO:  Run 1, Statement 1: 00:00:00.915069
INFO:  Run 1, Statement 2: 00:00:01.058966
INFO:  Run 2, Statement 1: 00:00:00.951488
INFO:  Run 2, Statement 2: 00:00:00.906285
INFO:  Run 3, Statement 1: 00:00:00.907489
INFO:  Run 3, Statement 2: 00:00:00.892393
INFO:  Run 4, Statement 1: 00:00:00.900424
INFO:  Run 4, Statement 2: 00:00:00.903447
INFO:  Run 5, Statement 1: 00:00:00.961805
INFO:  Run 5, Statement 2: 00:00:00.951697

Still about the same. OK… INTERN! Get over here. I need you to “generate” 128 bind values on this query.
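(If no intern is available, PostgreSQL can generate the array contents for you, though notably the IN list itself still has to be spelled out element by element, which is rather the point. A sketch, using sequential values instead of the powers of two from above, to be placed at the top of the block’s body:)

-- Fill the array variable with 128 values in one go:
SELECT array_agg(s) INTO v_any_arr FROM generate_series(1, 128) s;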

Yep, as expected. Finally, arrays start to outperform IN lists:

INFO:  Run 1, Statement 1: 00:00:01.122866
INFO:  Run 1, Statement 2: 00:00:01.083816
INFO:  Run 2, Statement 1: 00:00:01.416469
INFO:  Run 2, Statement 2: 00:00:01.134882
INFO:  Run 3, Statement 1: 00:00:01.122723
INFO:  Run 3, Statement 2: 00:00:01.087755
INFO:  Run 4, Statement 1: 00:00:01.143148
INFO:  Run 4, Statement 2: 00:00:01.124902
INFO:  Run 5, Statement 1: 00:00:01.236722
INFO:  Run 5, Statement 2: 00:00:01.113741

Using Oracle

Oracle also has array types (although you have to declare them as nominal types first, but that’s not a problem here).
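For instance (the type name and DDL below are my own sketch, not shown in this benchmark):

CREATE TYPE numbers AS TABLE OF NUMBER;

Once such a nominal type exists, the whole collection can be passed as a single bind variable:

SELECT *
FROM film
JOIN film_actor USING (film_id)
JOIN actor USING (actor_id)
WHERE film_id IN (SELECT column_value FROM TABLE(?))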

Here are some benchmark results (as always, these are not actual benchmark results, but anonymised units of measurement, i.e. these aren’t seconds but… Larrys):

4 bind values

Run 1, Statement 1 : 01.911000000
Run 1, Statement 2 : 02.852000000
Run 2, Statement 1 : 01.659000000
Run 2, Statement 2 : 02.680000000
Run 3, Statement 1 : 01.628000000
Run 3, Statement 2 : 02.664000000
Run 4, Statement 1 : 01.629000000
Run 4, Statement 2 : 02.657000000
Run 5, Statement 1 : 01.636000000
Run 5, Statement 2 : 02.688000000

128 bind values

Run 1, Statement 1 : 04.010000000
Run 1, Statement 2 : 06.275000000
Run 2, Statement 1 : 03.749000000
Run 2, Statement 2 : 05.440000000
Run 3, Statement 1 : 03.985000000
Run 3, Statement 2 : 05.387000000
Run 4, Statement 1 : 03.807000000
Run 4, Statement 2 : 05.688000000
Run 5, Statement 1 : 03.782000000
Run 5, Statement 2 : 05.803000000

The number of bind values doesn’t really seem to matter here. There’s a constant overhead of using the array bind variable compared to the IN list, but that might as well be a benchmarking artefact. For instance, when I added the /*+GATHER_PLAN_STATISTICS*/ hint to both queries, interestingly, the one with the array got significantly faster, whereas the IN list one was not affected… Weird, right?

Conclusion

This article doesn’t go into why there’s such a big difference for small lists when the benefit is only apparent for quite large lists.

But it has once again shown that we must not optimise prematurely in SQL, but measure, measure, measure. IN lists in dynamic SQL queries can be a big issue in production when they lead to cursor cache / plan cache saturation and a lot of “hard parsing”. So, the benefit of using the array becomes much more drastic when the list is big, as we can recycle execution plans much more often than with IN lists.

But chances are that IN lists are faster for single, ad-hoc executions.

In any case: Choose carefully when following advice that you find somewhere on the Internet, including this advice. I ran the benchmark on PostgreSQL 9.5 and Oracle 11gR2 XE, neither of which is the latest version of its respective database. Measure things again on your side, to be sure that your “improvement” is really an actual improvement! And if in doubt, don’t optimise until you’re sure you actually have a problem.

How to Benchmark Alternative SQL Queries to Find the Fastest Query

Tuning SQL isn’t always easy, and it takes a lot of practice to recognise how any given query can be optimised. One of the most important slides of my SQL training is the one summarising “how to be fast”:

(Slide: “How to be fast with SQL”, from the Data Geekery SQL training)

Some of these bullets were already covered on this blog. For instance, avoiding needless, mandatory work: this happens when client code runs queries, or parts of queries, that aren’t really necessary (e.g. selecting too many columns, which is the “needless” part), but the database cannot prove they’re unnecessary, which makes the work “mandatory” for the database to execute.
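A quick Sakila-flavoured illustration of that point (my example, not from the slide):

-- “Needless, mandatory work”: the client only ever reads the title, yet
-- the database must produce and transfer every single column:
SELECT * FROM film;

-- Asking only for what is actually consumed removes the needless work:
SELECT title FROM film;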

But as with many other performance-related topics, one key message is not to guess, but to measure! Or, in other words, not to optimise prematurely, but to optimise actual problems.

SQL is full of myths

SQL is a 4GL (fourth-generation programming language) and as such has always been a cool, convenient way to express data related constraints and queries. But the declarative nature of the language also means that programmers are often really looking into a crystal ball. A lot of people have blogged about a lot of half-true discoveries that might have been correct in some context and at some point in time (this blog is no exception).

For instance:

  • Are correlated subqueries slower than their LEFT JOIN equivalents?
  • Are derived tables faster than views or common table expressions?
  • Is COUNT(*) faster than COUNT(1)?

Tons of myths!

Measure your queries

To bust a myth, if you have good reasons to think that a differently written, but semantically equivalent query might be faster (on your database), you should measure. Don’t even trust any execution plan, because ultimately, what really counts is the wall clock time in your production system.

If you can measure your queries in production, that’s perfect. But often, you cannot – but you don’t always have to. One way to compare two queries with each other is to benchmark them by executing each query hundreds or even thousands of times in a row.

As any technique, benchmarking has pros and cons. Here is a non-exhaustive list:

Pros

  • Easy to do (see examples below)
  • Easy to reproduce, also on different environments
  • Easy to quickly get an idea in terms of orders of magnitude difference

Cons

  • Not actually measuring productive situations (no one runs the same query thousands of times in a row, without any other queries in parallel)
  • Queries may profit from unrealistic caching due to heavy repetition
  • “Real query” might be dynamic, so the “same query” might really manifest itself in dozens of different productive queries

But if you’re fine with the cons above, the pros might outweigh them, for instance if you want to find out whether a correlated subquery is slower than its LEFT JOIN equivalent for a given query. Note the emphasis here: even if you find out it’s slower for that given query, it might be faster for other queries. Never jump to generalised rules before measuring again!

For instance, consider these two equivalent queries that run on the Sakila database. Both versions find those actors whose last name starts with the letter A and count their corresponding films:

LEFT JOIN

SELECT first_name, last_name, count(fa.actor_id) AS c
FROM actor a
LEFT JOIN film_actor fa
ON a.actor_id = fa.actor_id
WHERE last_name LIKE 'A%'
GROUP BY a.actor_id, first_name, last_name
ORDER BY c DESC

Correlated subquery

SELECT first_name, last_name, (
  SELECT count(*)
  FROM film_actor fa
  WHERE a.actor_id =
  fa.actor_id
) AS c
FROM actor a
WHERE last_name LIKE 'A%' 
ORDER BY c DESC

The result is always the same for both queries, as they are semantically equivalent. However, the queries have different execution plans on PostgreSQL, Oracle, and SQL Server, as can be seen below:

PostgreSQL LEFT JOIN

(Plan looks “better”)

PostgreSQL correlated subquery

(Plan looks “worse”)

Oracle LEFT JOIN

(Plan looks “more complicated”)

Oracle correlated subquery

(Plan looks “simpler”)

SQL Server LEFT JOIN

(Plan looks “reasonable”)

SQL Server correlated subquery

(Plan looks… geez, where’s my correlated subquery? It’s been transformed to a LEFT JOIN!)

Huh, as you can see, in SQL Server, both queries produce the exact same plan (as they should, because the queries are really equivalent). But not all databases recognise this and/or optimise this. At least, that’s what the estimated plans suggest.
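If you want to reproduce these estimated plans yourself, the standard tooling of each database will do (nothing below is specific to this article):

-- PostgreSQL: estimated plan only
EXPLAIN
SELECT ...;  -- the query under test

-- Oracle: estimated plan into PLAN_TABLE, then display it
EXPLAIN PLAN FOR
SELECT ...;  -- the query under test
SELECT * FROM TABLE(dbms_xplan.display);

-- SQL Server: return estimated plans as text (run in its own batch)
SET SHOWPLAN_TEXT ON;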

Also, don’t jump to the conclusion that if the cost of one plan is lower, then it’s a better plan than an alternative. Costs can only really be compared between alternative plans for the same query. For example, in the Oracle case above, we had both a HASH JOIN and a NESTED LOOP JOIN in a single plan, because Oracle 12c may collect runtime statistics and switch plans in flight, thanks to the Oracle 12c Adaptive Query Optimization features.

But let’s ignore all of this and look at actual execution times, instead:

Benchmarking the alternatives

As always, disclaimer: Some commercial databases do not allow for publishing benchmark results without prior written consent. As I never ask for permission, but always ask for forgiveness, I do not have consent, and I’m thus not publishing actual benchmark results.

I have anonymised the benchmark results by introducing hypothetical, non-comparable units of measurement, so you cannot see that PostgreSQL is totally slower than Oracle and/or SQL Server. And you cannot see that SQL Server’s procedural language is totally uglier than PostgreSQL’s and/or Oracle’s.

Legal people.

Solving problems we wouldn’t have without legal people in the first place

Enough ranting. Some important considerations:

  • Ideally, you’ll run benchmarks directly in the database using a procedural language, rather than, e.g., over JDBC, to avoid the network latency incurred by JDBC calls and other undesired side-effects
  • Repeat the benchmarks several times to prevent warmup side-effects and other random issues, as your OS / file system may be busy with accidental Scala compilation, or Slack UI refreshes
  • Be sure to actually consume the entire result set of each query in a loop, rather than just executing the query. Some databases may optimise for lazy cursor consumption (and possibly for aborting the cursor early). It would be unfair not to consume the entire result set

PostgreSQL

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 10000;
  rec RECORD;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name, count(fa.actor_id) AS c
        FROM actor a
        LEFT JOIN film_actor fa
        ON a.actor_id = fa.actor_id
        WHERE last_name LIKE 'A%'
        GROUP BY a.actor_id, first_name, last_name
        ORDER BY c DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', r, (clock_timestamp() - v_ts);
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name, (
          SELECT count(*)
          FROM film_actor fa
          WHERE a.actor_id =
          fa.actor_id
        ) AS c
        FROM actor a
        WHERE last_name LIKE 'A%' 
        ORDER BY c DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', r, (clock_timestamp() - v_ts);
  END LOOP;
END$$;

The result is:

INFO:  Run 1, Statement 1: 00:00:01.708257
INFO:  Run 1, Statement 2: 00:00:01.252012
INFO:  Run 2, Statement 1: 00:00:02.33151  -- Slack message received here
INFO:  Run 2, Statement 2: 00:00:01.064007
INFO:  Run 3, Statement 1: 00:00:01.638518
INFO:  Run 3, Statement 2: 00:00:01.149005
INFO:  Run 4, Statement 1: 00:00:01.670045
INFO:  Run 4, Statement 2: 00:00:01.230755
INFO:  Run 5, Statement 1: 00:00:01.81718
INFO:  Run 5, Statement 2: 00:00:01.166089

As you can see, in all 5 benchmark executions, the version with the correlated subquery outperformed the version with the LEFT JOIN, in this case by roughly 50-60% (the LEFT JOIN version took around 1.5x as long). As this is PostgreSQL and open source, benchmark results are in actual seconds for 10000 query executions. Neat. Let’s move on to…

Oracle

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 10000;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name, count(fa.actor_id) AS c
        FROM actor a
        LEFT JOIN film_actor fa
        ON a.actor_id = fa.actor_id
        WHERE last_name LIKE 'A%'
        GROUP BY a.actor_id, first_name, last_name
        ORDER BY c DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r || ', Statement 1 : ' || (SYSTIMESTAMP - v_ts));
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name, (
          SELECT count(*)
          FROM film_actor fa
          WHERE a.actor_id =
          fa.actor_id
        ) AS c
        FROM actor a
        WHERE last_name LIKE 'A%' 
        ORDER BY c DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r || ', Statement 2 : ' || (SYSTIMESTAMP - v_ts));
  END LOOP;
END;
/

Gee, check out the difference now (and remember, these are totally not seconds, but a hypothetical unit of measurement. Let’s call them Newtons. Or Larrys. Let’s call them Larrys (great idea, Axel)):

Run 1, Statement 1 : 07.721731000
Run 1, Statement 2 : 00.622992000
Run 2, Statement 1 : 08.077535000
Run 2, Statement 2 : 00.666481000
Run 3, Statement 1 : 07.756182000
Run 3, Statement 2 : 00.640541000
Run 4, Statement 1 : 07.495021000
Run 4, Statement 2 : 00.731321000
Run 5, Statement 1 : 07.809564000
Run 5, Statement 2 : 00.632615000

Wow, the correlated subquery totally outperformed the LEFT JOIN query by an order of magnitude. This is totally insane. Now, check out…

SQL Server

… beautiful procedural language in SQL Server: Transact-SQL. With nice features like:

  • Needing to cast INT values to VARCHAR when concatenating them.
  • No indexed FOR loop, only WHILE loops
  • No implicit cursor loops (instead: DEALLOCATE!)

Oh well. It’s just for a benchmark. So here goes:

DECLARE @ts DATETIME;
DECLARE @repeat INT = 10000;
DECLARE @r INT;
DECLARE @i INT;
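-- Dummy variables that merely consume (and discard) the fetched columns.
-- Explicit lengths are given below, as a bare VARCHAR defaults to VARCHAR(1).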
DECLARE @dummy1 VARCHAR(50);
DECLARE @dummy2 VARCHAR(50);
DECLARE @dummy3 INT;

DECLARE @s1 CURSOR;
DECLARE @s2 CURSOR;

SET @r = 0;
WHILE @r < 5
BEGIN
  SET @r = @r + 1

  SET @s1 = CURSOR FOR 
    SELECT first_name, last_name, count(fa.actor_id) AS c
    FROM actor a
    LEFT JOIN film_actor fa
    ON a.actor_id = fa.actor_id
    WHERE last_name LIKE 'A%'
    GROUP BY a.actor_id, first_name, last_name
    ORDER BY c DESC

  SET @s2 = CURSOR FOR 
    SELECT first_name, last_name, (
      SELECT count(*)
      FROM film_actor fa
      WHERE a.actor_id =
      fa.actor_id
    ) AS c
    FROM actor a
    WHERE last_name LIKE 'A%' 
    ORDER BY c DESC

  SET @ts = current_timestamp;
  SET @i = 0;
  WHILE @i < @repeat
  BEGIN
    SET @i = @i + 1

    OPEN @s1;
    FETCH NEXT FROM @s1 INTO @dummy1, @dummy2, @dummy3;
    WHILE @@FETCH_STATUS = 0
    BEGIN
      FETCH NEXT FROM @s1 INTO @dummy1, @dummy2, @dummy3;
    END;

    CLOSE @s1;
  END;

  DEALLOCATE @s1;
  PRINT 'Run ' + CAST(@r AS VARCHAR) + ', Statement 1: ' + CAST(DATEDIFF(ms, @ts, current_timestamp) AS VARCHAR) + 'ms';

  SET @ts = current_timestamp;
  SET @i = 0;
  WHILE @i < @repeat
  BEGIN
    SET @i = @i + 1

    OPEN @s2;
    FETCH NEXT FROM @s2 INTO @dummy1, @dummy2, @dummy3;
    WHILE @@FETCH_STATUS = 0
    BEGIN
      FETCH NEXT FROM @s2 INTO @dummy1, @dummy2, @dummy3;
    END;

    CLOSE @s2;
  END;

  DEALLOCATE @s2;
  PRINT 'Run ' + CAST(@r AS VARCHAR) + ', Statement 2: ' + CAST(DATEDIFF(ms, @ts, current_timestamp) AS VARCHAR) + 'ms';
END;

And again, remember, these aren’t seconds. Really. They’re… kilowatts. Yeah, let’s settle on kilowatts.

Run 1, Statement 1:  2626
Run 1, Statement 2: 20340
Run 2, Statement 1:  2450
Run 2, Statement 2: 17910
Run 3, Statement 1:  2706
Run 3, Statement 2: 18396
Run 4, Statement 1:  2696
Run 4, Statement 2: 19103
Run 5, Statement 1:  2716
Run 5, Statement 2: 20453

Oh my… Wait a second. Now suddenly, the correlated subquery is a factor of 7… more energy consuming (remember: kilowatts). Who would have thought?

Conclusion

This article won’t explain the differences in execution time between the different databases. There are a lot of reasons why a given execution plan will outperform another. There are also a lot of reasons why the same plan (or at least what looks like the same plan) really isn’t the same, because a plan is only a description of an algorithm: each plan operation can still contain other operations that may differ.

In summary, we can say that in this case (I can’t stress this enough: this isn’t a general rule, it only explains what happens in this case, so don’t create the next SQL myth!):

  • On PostgreSQL, the correlated subquery and the LEFT JOIN performed in the same order of magnitude (the subquery being a bit faster)
  • On Oracle, the correlated subquery drastically outperformed the LEFT JOIN
  • On SQL Server, the LEFT JOIN drastically outperformed the correlated subquery (despite the plans having been the same!)

This means:

  • Don’t trust your initial judgment
  • Don’t trust any historic blog posts saying A) is faster than B)
  • Don’t trust execution plans
  • Don’t trust this blog post here, because it is using incomparable units of measurement (seconds vs. Larrys vs. kilowatts)
  • Don’t fully trust your own benchmarks, because you’re not measuring things as they happen in production

And sadly:

  • Even for such a simple query, there’s no optimal query for all databases

(and I haven’t even included MySQL in the benchmarks)

BUT

by measuring two alternative, equivalent queries, you may just get an idea of what might perform better on your system, in case you do have a slow query somewhere. Perhaps this helps.

And now that you’re all hot on the subject, go book our 2 day SQL training, where we have tons of other interesting, myth busting content!