How to Benchmark Alternative SQL Queries to Find the Fastest Query

Tuning SQL isn’t always easy, and it takes a lot of practice to recognise how any given query can be optimised. One of the most important slides of my SQL training is the one summarising “how to be fast”:

(Slide: “How to be fast with SQL”, from the Data Geekery SQL Training)

Some of these bullets have already been covered on this blog. For instance, avoiding needless, mandatory work: client code runs queries, or parts of queries, that aren’t really necessary (e.g. selecting too many columns: “needless”), but the database cannot prove they’re needless, so they’re “mandatory” for the database to execute.
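To illustrate with the Sakila database’s customer table (just an example schema): if the client only ever displays the customer’s name, then the first of these two queries does needless work, yet the database must still perform it:

-- Needless, mandatory work: every column is fetched and transferred,
-- even though the client only reads the name
SELECT *
FROM customer
WHERE customer_id = 1;

-- Better: project only what is actually needed
SELECT first_name, last_name
FROM customer
WHERE customer_id = 1;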

But as with many other performance related topics, one key message is not to guess, but to measure! Or, in other words, not to optimise prematurely, but to optimise actual problems.

SQL is full of myths

SQL is a 4GL (fourth-generation programming language) and, as such, has always been a cool, convenient way to express data-related constraints and queries. But the declarative nature of the language also often means that programmers are really looking into a crystal ball. A lot of people have blogged about a lot of half-true discoveries that might have been correct in some context and at some point in time (this blog is no exception).

For instance:

  • Are correlated subqueries slower than their LEFT JOIN equivalents?
  • Are derived tables faster than views or common table expressions?
  • Is COUNT(*) faster than COUNT(1)?

Tons of myths!

Measure your queries

To bust a myth, if you have good reasons to think that a differently written, but semantically equivalent query might be faster (on your database), you should measure. Don’t even trust any execution plan, because ultimately, what really counts is the wall clock time in your production system.

If you can measure your queries in production, that’s perfect. But often, you cannot, and you don’t always have to. One way to compare two queries with each other is to benchmark them by executing each query hundreds or even thousands of times in a row.

As with any technique, benchmarking has its pros and cons. Here is a non-exhaustive list:

Pros

  • Easy to do (see examples below)
  • Easy to reproduce, also on different environments
  • Easy to quickly get an idea in terms of orders of magnitude difference

Cons

  • Not actually measuring production situations (no one runs the same query thousands of times in a row, without any other queries running in parallel)
  • Queries may profit from unrealistic caching due to heavy repetition
  • The “real query” might be dynamic, so the “same query” might really manifest itself in dozens of different production queries

But if you’re fine with the cons above, the pros might outweigh them, for instance, if you want to find out whether a correlated subquery is slower than its LEFT JOIN equivalent for a given query. Note my use of italics here, because even if you find out it’s slower for that given query, it might be faster for other queries. Never jump to generalised rules before measuring again!

For instance, consider these two equivalent queries that run on the Sakila database. Both versions try to find those actors whose last name starts with the letter A and count their corresponding films:

LEFT JOIN

SELECT first_name, last_name, count(fa.actor_id) AS c
FROM actor a
LEFT JOIN film_actor fa
ON a.actor_id = fa.actor_id
WHERE last_name LIKE 'A%'
GROUP BY a.actor_id, first_name, last_name
ORDER BY c DESC

Correlated subquery

SELECT first_name, last_name, (
  SELECT count(*)
  FROM film_actor fa
  WHERE a.actor_id =
  fa.actor_id
) AS c
FROM actor a
WHERE last_name LIKE 'A%' 
ORDER BY c DESC

Both queries always produce the same result.

The queries have different execution plans on PostgreSQL, Oracle, and SQL Server, as can be seen below:

PostgreSQL LEFT JOIN

(Plan looks “better”)

PostgreSQL correlated subquery

(Plan looks “worse”)

Oracle LEFT JOIN

(Plan looks “more complicated”)

Oracle correlated subquery

(Plan looks “simpler”)

SQL Server LEFT JOIN

(Plan looks “reasonable”)

SQL Server correlated subquery

(Plan looks… geez, where’s my correlated subquery? It’s been transformed to a LEFT JOIN!)

Huh, as you can see, in SQL Server, both queries produce the exact same plan (as they should, because the queries are really equivalent). But not all databases recognise this and/or optimise this. At least, that’s what the estimated plans suggest.

Also, don’t jump to the conclusion that if the cost of one plan is lower, then it’s a better plan than the alternative. Costs can only really be compared between alternative plans for the same query. E.g. in the Oracle example, we had both a HASH JOIN and a NESTED LOOP JOIN in a single plan, because Oracle 12c may collect runtime statistics and switch plans in flight, thanks to its Adaptive Query Optimization features.

But let’s ignore all of this and look at actual execution times, instead:

Benchmarking the alternatives

As always, disclaimer: Some commercial databases do not allow for publishing benchmark results without prior written consent. As I never ask for permission, but always ask for forgiveness, I do not have consent, and I’m thus not publishing actual benchmark results.

I have anonymized the benchmark results by introducing hypothetical, non-comparable units of measurement, so you cannot see that PostgreSQL is totally slower than Oracle and/or SQL Server. And you cannot see that SQL Server’s procedural language is totally uglier than PostgreSQL’s and/or Oracle’s.

Legal people.

Solving problems we wouldn’t have without legal people, in the first place

Enough ranting. Some important considerations:

  • Ideally, you’ll run benchmarks directly in the database using a procedural language, rather than, e.g., over JDBC, to avoid the network latency incurred by JDBC calls and other undesired side effects.
  • Repeat the benchmarks several times to prevent warmup side-effects and other random issues, as your OS / file system may be busy with accidental Scala compilation, or Slack UI refreshes
  • Be sure to actually consume the entire result set of each query in a loop, rather than just executing the query. Some databases may optimise for lazy cursor consumption (and possibly early abortion of the fetch), and it would be unfair not to consume the entire result set.

PostgreSQL

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 10000;
  rec RECORD;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR i IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name, count(fa.actor_id) AS c
        FROM actor a
        LEFT JOIN film_actor fa
        ON a.actor_id = fa.actor_id
        WHERE last_name LIKE 'A%'
        GROUP BY a.actor_id, first_name, last_name
        ORDER BY c DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', i, (clock_timestamp() - v_ts); 
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name, (
          SELECT count(*)
          FROM film_actor fa
          WHERE a.actor_id =
          fa.actor_id
        ) AS c
        FROM actor a
        WHERE last_name LIKE 'A%' 
        ORDER BY c DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', i, (clock_timestamp() - v_ts); 
  END LOOP;
END$$;

The result is:

INFO:  Run 1, Statement 1: 00:00:01.708257
INFO:  Run 1, Statement 2: 00:00:01.252012
INFO:  Run 2, Statement 1: 00:00:02.33151  -- Slack message received here
INFO:  Run 2, Statement 2: 00:00:01.064007
INFO:  Run 3, Statement 1: 00:00:01.638518
INFO:  Run 3, Statement 2: 00:00:01.149005
INFO:  Run 4, Statement 1: 00:00:01.670045
INFO:  Run 4, Statement 2: 00:00:01.230755
INFO:  Run 5, Statement 1: 00:00:01.81718
INFO:  Run 5, Statement 2: 00:00:01.166089

As you can see, in all 5 benchmark executions, the version with the correlated subquery seems to have outperformed the version with the LEFT JOIN in this case: the LEFT JOIN took roughly 50-60% longer. As this is PostgreSQL, and open source, benchmark results are in actual seconds for 10000 query executions. Neat. Let’s move on to…

Oracle

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 10000;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name, count(fa.actor_id) AS c
        FROM actor a
        LEFT JOIN film_actor fa
        ON a.actor_id = fa.actor_id
        WHERE last_name LIKE 'A%'
        GROUP BY a.actor_id, first_name, last_name
        ORDER BY c DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r || ', Statement 1 : ' || (SYSTIMESTAMP - v_ts));
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name, (
          SELECT count(*)
          FROM film_actor fa
          WHERE a.actor_id =
          fa.actor_id
        ) AS c
        FROM actor a
        WHERE last_name LIKE 'A%' 
        ORDER BY c DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r || ', Statement 2 : ' || (SYSTIMESTAMP - v_ts));
  END LOOP;
END;
/

Gee, check out the difference now (and remember, these are totally not seconds, but a hypothetical unit of measurement, let’s call them Newtons. Or Larrys. Let’s call them Larrys (great idea, Axel)):

Run 1, Statement 1 : 07.721731000
Run 1, Statement 2 : 00.622992000
Run 2, Statement 1 : 08.077535000
Run 2, Statement 2 : 00.666481000
Run 3, Statement 1 : 07.756182000
Run 3, Statement 2 : 00.640541000
Run 4, Statement 1 : 07.495021000
Run 4, Statement 2 : 00.731321000
Run 5, Statement 1 : 07.809564000
Run 5, Statement 2 : 00.632615000

Wow, the correlated subquery totally outperformed the LEFT JOIN query by an order of magnitude. This is totally insane. Now, check out…

SQL Server

… beautiful procedural language in SQL Server: Transact-SQL. With nice features like:

  • Needing to cast INT values to VARCHAR when concatenating them.
  • No indexed FOR loop, only WHILE loops
  • No implicit cursor loops (instead: DEALLOCATE!)

Oh well. It’s just for a benchmark. So here goes:

DECLARE @ts DATETIME;
DECLARE @repeat INT = 10000;
DECLARE @r INT;
DECLARE @i INT;
DECLARE @dummy1 VARCHAR;
DECLARE @dummy2 VARCHAR;
DECLARE @dummy3 INT;

DECLARE @s1 CURSOR;
DECLARE @s2 CURSOR;

SET @r = 0;
WHILE @r < 5
BEGIN
  SET @r = @r + 1

  SET @s1 = CURSOR FOR 
    SELECT first_name, last_name, count(fa.actor_id) AS c
    FROM actor a
    LEFT JOIN film_actor fa
    ON a.actor_id = fa.actor_id
    WHERE last_name LIKE 'A%'
    GROUP BY a.actor_id, first_name, last_name
    ORDER BY c DESC

  SET @s2 = CURSOR FOR 
    SELECT first_name, last_name, (
      SELECT count(*)
      FROM film_actor fa
      WHERE a.actor_id =
      fa.actor_id
    ) AS c
    FROM actor a
    WHERE last_name LIKE 'A%' 
    ORDER BY c DESC

  SET @ts = current_timestamp;
  SET @i = 0;
  WHILE @i < @repeat
  BEGIN
    SET @i = @i + 1

    OPEN @s1;
    FETCH NEXT FROM @s1 INTO @dummy1, @dummy2, @dummy3;
    WHILE @@FETCH_STATUS = 0
    BEGIN
      FETCH NEXT FROM @s1 INTO @dummy1, @dummy2, @dummy3;
    END;

    CLOSE @s1;
  END;

  DEALLOCATE @s1;
  PRINT 'Run ' + CAST(@r AS VARCHAR) + ', Statement 1: ' + CAST(DATEDIFF(ms, @ts, current_timestamp) AS VARCHAR) + 'ms';

  SET @ts = current_timestamp;
  SET @i = 0;
  WHILE @i < @repeat
  BEGIN
    SET @i = @i + 1

    OPEN @s2;
    FETCH NEXT FROM @s2 INTO @dummy1, @dummy2, @dummy3;
    WHILE @@FETCH_STATUS = 0
    BEGIN
      FETCH NEXT FROM @s2 INTO @dummy1, @dummy2, @dummy3;
    END;

    CLOSE @s2;
  END;

  DEALLOCATE @s2;
  PRINT 'Run ' + CAST(@r AS VARCHAR) + ', Statement 2: ' + CAST(DATEDIFF(ms, @ts, current_timestamp) AS VARCHAR) + 'ms';
END;

And again, remember, these aren’t seconds. Really. They’re … Kilowatts. Yeah, let’s settle with kilowatts.

Run 1, Statement 1:  2626
Run 1, Statement 2: 20340
Run 2, Statement 1:  2450
Run 2, Statement 2: 17910
Run 3, Statement 1:  2706
Run 3, Statement 2: 18396
Run 4, Statement 1:  2696
Run 4, Statement 2: 19103
Run 5, Statement 1:  2716
Run 5, Statement 2: 20453

Oh my… Wait a second. Now suddenly, the correlated subquery is roughly a factor of 7… more energy consuming (remember: kilowatts). Who would have thought?

Conclusion

This article won’t explain the differences in execution time between the different databases. There are a lot of reasons why a given execution plan will outperform another. There are also a lot of reasons why the same plan (at least what looks like the same plan) really isn’t because a plan is only a description of an algorithm. Each plan operation can still contain other operations that might still be different.

In summary, we can say that in this case (I can’t stress this enough: this isn’t a general rule, it only explains what happens in this case. Don’t create the next SQL myth!), the correlated subquery and the LEFT JOIN performed in the same order of magnitude on PostgreSQL (the subquery being a bit faster), the correlated subquery drastically outperformed the LEFT JOIN in Oracle, whereas the LEFT JOIN drastically outperformed the correlated subquery in SQL Server (despite the plans having been the same!).

This means:

  • Don’t trust your initial judgement
  • Don’t trust any historic blog posts claiming that A is faster than B
  • Don’t trust execution plans
  • Don’t trust this blog post here, because it is using non-comparable units of measurement (seconds vs. Larrys vs. kilowatts)
  • Don’t fully trust your own benchmarks, because you’re not measuring things as they happen in production

And sadly:

  • Even for such a simple query, there’s no optimal query for all databases

(and I haven’t even included MySQL in the benchmarks)

BUT

by measuring two alternative, equivalent queries, you may just get an idea of what might perform better for your system, in case you do have a slow query somewhere. Perhaps this helps.

And now that you’re all hot on the subject, go book our 2 day SQL training, where we have tons of other interesting, myth busting content!

How to Execute SQL Batches With JDBC and jOOQ

Some databases (in particular MySQL and T-SQL databases like SQL Server and Sybase) support a very nice feature: They allow for running a “batch” of statements in a single statement. For instance, in SQL Server, you can do something like this:

-- Statement #1
DECLARE @table AS TABLE (id INT);

-- Statement #2
SELECT * FROM @table;

-- Statement #3
INSERT INTO @table VALUES (1),(2),(3);

-- Statement #4
SELECT * FROM @table;

This is a batch of 4 statements, and it can be executed as a single statement both with JDBC and with jOOQ. Let’s see how:

Executing a batch with JDBC

Unfortunately, the term “batch” has several meanings, and in this case, I don’t mean the JDBC Statement.addBatch() method, which is actually a bit clumsy as it doesn’t allow for fetching mixed update counts and result sets.

Instead, what I’ll be doing is this:

String sql =
    "\n  -- Statement #1                              "
  + "\n  DECLARE @table AS TABLE (id INT);            "
  + "\n                                               "
  + "\n  -- Statement #2                              "
  + "\n  SELECT * FROM @table;                        "
  + "\n                                               "
  + "\n  -- Statement #3                              "
  + "\n  INSERT INTO @table VALUES (1),(2),(3);       "
  + "\n                                               "
  + "\n  -- Statement #4                              "
  + "\n  SELECT * FROM @table;                        ";

try (PreparedStatement s = c.prepareStatement(sql)) {
    fetchLoop:
    for (int i = 0, updateCount = 0;; i++) {
        boolean result = (i == 0)
            ? s.execute()
            : s.getMoreResults();

        if (result)
            try (ResultSet rs = s.getResultSet()) {
                System.out.println("\nResult:");

                while (rs.next())
                    System.out.println("  " + rs.getInt(1));
            }
        else if ((updateCount = s.getUpdateCount()) != -1)
            System.out.println("\nUpdate Count: " + updateCount);
        else
            break fetchLoop;
    }
}

The output of the above program being:

Result:

Update Count: 3

Result:
  1
  2
  3

The above is a somewhat “hidden”, or at least not an everyday, usage of the JDBC API. Mostly, you’ll be using Statement.executeQuery() when you’re expecting a ResultSet, or Statement.executeUpdate() otherwise.

But in our case, we don’t really know what’s happening. We’re going to discover the result types on the fly, when executing the statement. Here are the main JDBC API features that we’re using, along with an explanation:

  • Statement.execute(): This method should be used if we don’t know the result type. The method returns a boolean, which is true when the first statement in the batch produced a ResultSet and false otherwise.
  • Statement.getMoreResults(): This method returns the same kind of boolean value as the previous Statement.execute() method, but it does so for the next statement in the batch (i.e. for every statement except the first).
  • If the current result is a ResultSet (the boolean was true), then we’ll obtain that ResultSet through Statement.getResultSet() (we can obviously no longer call the usual Statement.executeQuery() to obtain the ResultSet).
  • If the current result is not a ResultSet (the boolean was false), then we’ll check the update count value through Statement.getUpdateCount().
  • If the update count is -1, then we’ve reached the end of the batch.

What a nice state machine!

The nice thing about this is that a batch may be completely nondeterministic. E.g. there may be triggers, T-SQL blocks (e.g. an IF statement), stored procedures, and many other things that contribute result sets and/or update counts. In some cases, we simply don’t know what we’ll get.
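For illustration, here’s a hypothetical T-SQL batch (the audit_log table is made up) whose outcome we simply cannot know up front, because depending on the day of the week, it produces either a result set or an update count:

IF DATEPART(weekday, GETDATE()) = 1
  SELECT * FROM audit_log;
ELSE
  DELETE FROM audit_log
  WHERE created_at < DATEADD(day, -30, GETDATE());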

Executing a batch with jOOQ

It’s great that the JDBC API designers have thought of this exotic API usage on a rather low level. This makes JDBC extremely powerful. But who remembers the exact algorithm all the time? After all, the above minimalistic version required around 20 lines of code for something as simple as that.

Compare this to the following jOOQ API usage:

System.out.println(
    DSL.using(c).fetchMany(sql)
);

The result being:

Result set:
+----+
|  id|
+----+
Update count: 3
Result set:
+----+
|  id|
+----+
|   1|
|   2|
|   3|
+----+

Huh! Couldn’t get much simpler than that! Let’s walk through what happens:

The DSLContext.fetchMany() method is intended for use when users know there will be many result sets and/or update counts. Unlike JDBC, which reuses the ordinary statement API for this case, jOOQ offers a distinct method here to clearly distinguish between behaviours. The method then eagerly fetches all the results and update counts in one go (lazy fetching is on the roadmap with issue #4503).

The resulting type is org.jooq.Results, a type that extends List<Result>, which allows for iterating over the results only, for convenience. If a mix of results and update counts needs to be consumed, the Results.resultsOrRows() method can be used.

How to Emulate Partial Indexes in Oracle

A very interesting feature of the SQL Server and PostgreSQL databases (and some others, including SQLite) is the partial index (sometimes also called “filtered index”). That’s an index that contains only “parts” of the table data. For instance, we can write the following index in SQL Server and PostgreSQL:

CREATE INDEX i ON message (deleted) WHERE deleted = 1;

Let’s imagine you have a housekeeping job that periodically removes deleted messages. Now, let’s assume you have discovered that only 0.1% of all messages are really deleted, so an index on the DELETED column is very selective if you’re looking for deleted messages.

On the other hand, it’s not selective at all if you’re looking for non-deleted messages, as such a query would return 99.9% of all messages, in case of which a full table scan is more efficient.

So, since the index is never useful for non-deleted messages, why index those messages at all? If we can avoid indexing non-deleted messages, then we can:

  • Save a lot of disk space, as the index will be much smaller
  • Save a lot of time when inserting into the message table, since we don’t have to update the index all the time
  • Save a lot of time scanning the index, since it will contain a lot less blocks

Unfortunately, Oracle doesn’t support this feature

… but “luckily”, Oracle has another controversial “feature”. In Oracle, all indexes are partial indexes, because they don’t contain NULL values. This is probably due to an ancient optimisation (remember: partial indexes are smaller), which occasionally gets in your way, performance-wise, if you do want to query for NULL values.
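As a quick illustration of how that can get in your way (the table and column here are hypothetical, just for the sake of the example): a plain single-column index contains no entry at all for rows whose indexed column is NULL, so an IS NULL predicate typically cannot use it and falls back to a full table scan:

CREATE INDEX i_sender ON message (sender_id);

-- This predicate typically cannot use the index above in Oracle,
-- because entirely-NULL keys are simply not present in the index:
SELECT *
FROM message
WHERE sender_id IS NULL;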

But in this case, it’s useful. Check this out:

CREATE TABLE message(deleted number(1));

CREATE INDEX i ON message (
  CASE WHEN deleted > 0 THEN deleted END
);

The above index is a function-based index, i.e. an index that contains not the value of the deleted column itself, but an expression based on it. Concretely, it contains only deleted values that are strictly greater than zero, because if the value is zero, then it is turned to NULL by the CASE expression, and Oracle doesn’t index NULL values. Check this out:

INSERT INTO message
SELECT DECODE(LEVEL, 1, 1, 0)
FROM dual
CONNECT BY LEVEL < 100000;

The above query is inserting a single row containing a deleted value of 1, and almost 100k rows containing a value of 0. The insert is very quick, because only one row has to be added to the index. The other almost 100k rows are skipped:

EXEC dbms_stats.gather_table_stats('SANDBOX', 'MESSAGE');

SELECT NUM_ROWS, BLOCKS
FROM SYS.ALL_TAB_STATISTICS
WHERE TABLE_NAME = 'MESSAGE';

SELECT NUM_ROWS, LEAF_BLOCKS
FROM SYS.ALL_IND_STATISTICS
WHERE TABLE_NAME = 'MESSAGE';

The result is:

NUM_ROWS       BLOCKS
---------------------
   99999          152 <-- table size

NUM_ROWS  LEAF_BLOCKS
---------------------
       1            1 <-- index size

The “trouble” with this kind of emulation is: It’s a function-based index. We can use this index only if we really reproduce the same “function” (or in this case, expression) as in the index itself. So, in order to fetch all the deleted messages, we must not write the following query:

SELECT *
FROM message
WHERE deleted = 1;

But this one, instead:

SELECT *
FROM message
WHERE CASE WHEN deleted > 0 THEN deleted END = 1;

Check out execution plans:

EXPLAIN PLAN FOR
SELECT *
FROM message
WHERE deleted = 1;

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR
SELECT *
FROM message
WHERE CASE WHEN deleted > 0 THEN deleted END = 1;

SELECT * FROM TABLE(dbms_xplan.display);

The output being:

------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)|
------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         | 50000 |   146K|    44   (3)|
|*  1 |  TABLE ACCESS FULL| MESSAGE | 50000 |   146K|    44   (3)|
------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("DELETED"=1)

And

----------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |     1 |     3 |     2   (0)|
|   1 |  TABLE ACCESS BY INDEX ROWID| MESSAGE |     1 |     3 |     2   (0)|
|*  2 |   INDEX RANGE SCAN          | I       |     1 |       |     1   (0)|
----------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access(CASE  WHEN "DELETED">0 THEN "DELETED" END =1)

As you can see, the first query runs a full table scan, estimating to retrieve 50% of all the rows, when the actual result is only 1 row as can be seen in the second execution plan!

Insertion speed

What’s even more impressive is the difference in insertion speed. Consider the following code block, which measures the time it takes to insert one million rows containing 0, and then one million rows containing 1:

SET SERVEROUTPUT ON
DECLARE
  ts TIMESTAMP;
BEGIN
  ts := SYSTIMESTAMP;
  INSERT INTO message 
  SELECT 0 FROM dual CONNECT BY level <= 1000000;
  dbms_output.put_line(SYSTIMESTAMP - ts);
  
  ts := SYSTIMESTAMP;
  INSERT INTO message 
  SELECT 1 FROM dual CONNECT BY level <= 1000000;
  dbms_output.put_line(SYSTIMESTAMP - ts);
END;
/

The result being:

+000000000 00:00:01.501000000
+000000000 00:00:08.162000000

The insertion is much faster if we don’t have to modify the index!

Conclusion

Partial indexes are a very neat trick in cases where your data is highly skewed and some values in a column are extremely rare and very frequently queried. They may drastically reduce the index size, which greatly improves performance in some situations, including inserting into the table, and querying the index.

In Oracle, they can be emulated using function-based indexes, which means that you also have to use the exact expression from the index in your queries, in order to profit from it. But it may well be worth the trouble!

jOOQ 3.10 will Support SQL Server’s Table Valued Parameters

SQL Server has this nice feature called table-valued parameters (TVP), where users can pass table variables to a stored procedure for bulk data processing. This is particularly nice when the stored procedure is an inline table valued function, i.e. a function that returns a table as well. For instance:

CREATE TYPE numbers AS TABLE (i INTEGER);

CREATE FUNCTION cross_multiply (
  @numbers numbers READONLY
)
RETURNS @result TABLE (
  i1 INTEGER,
  i2 INTEGER,
  product INTEGER
)
AS
BEGIN
  INSERT INTO @result
  SELECT n1.i, n2.i, n1.i * n2.i
  FROM @numbers n1
  CROSS JOIN @numbers n2

  RETURN
END

The above function creates a cross product of a table with itself, and multiplies each possible combination. So, when calling this with the following table argument:

DECLARE @arg NUMBERS;
INSERT INTO @arg VALUES (1),(2),(3),(4);
SELECT * FROM cross_multiply(@arg);

We’re getting the following, nice result:

i1	i2	product
-----------------------
1	1	1
2	1	2
3	1	3
4	1	4
1	2	2
2	2	4
3	2	6
4	2	8
1	3	3
2	3	6
3	3	9
4	3	12
1	4	4
2	4	8
3	4	12
4	4	16

Easy, eh?

Call the above from Java with JDBC

The SQL Server JDBC driver has (relatively recently) started to support TVPs, if you’re ready to use vendor-specific API. If you want to run this T-SQL batch:

DECLARE @arg NUMBERS;
INSERT INTO @arg VALUES (1),(2),(3),(4);
SELECT * FROM cross_multiply(@arg);

In Java, you’d write something along the lines of this:

SQLServerDataTable table = new SQLServerDataTable();
table.addColumnMetadata("i", java.sql.Types.INTEGER);
table.addRow(1);
table.addRow(2);
table.addRow(3);
table.addRow(4);

try (SQLServerPreparedStatement stmt =
    (SQLServerPreparedStatement) connection.prepareStatement(
        "SELECT * FROM cross_multiply(?)")) {

    // Magic here:
    stmt.setStructured(1, "dbo.numbers", table);  

    try (ResultSet rs = stmt.executeQuery()) {
        ...
    }
}

This is a bit tedious as you have to work through all this API and remember:

  • type names
  • column names
  • column positions

But it works.

Now, call the above from Java, with jOOQ

No problem with jOOQ 3.10. Don’t worry about the boring JDBC data type binding details, as the jOOQ code generator has you covered. As always, all routines are generated classes / methods, and this time, the TABLE type is also a generated type. Let the code speak for itself. Instead of this SQL statement:

DECLARE @arg NUMBERS;
INSERT INTO @arg VALUES (1),(2),(3),(4);
SELECT * FROM cross_multiply(@arg);

You can write the following with jOOQ:

Numbers numbers = new NumbersRecord(
    new NumbersElementTypeRecord(1),
    new NumbersElementTypeRecord(2),
    new NumbersElementTypeRecord(3),
    new NumbersElementTypeRecord(4)
);

// Standalone function call:
Result<CrossMultiplyRecord> r1 = 
    crossMultiply(configuration, numbers);

// Embedded table-valued function call, with predicate
Result<CrossMultiplyRecord> r2 = 
DSL.using(configuration)
   .selectFrom(crossMultiply(numbers))
   .where(F_CROSS_MULTIPLY.PRODUCT.gt(5))
   .fetch();

System.out.println(r1);
System.out.println(r2);

And the nice printed output will be:

+----+----+-------+
|  i1|  i2|product|
+----+----+-------+
|   1|   1|      1|
|   2|   1|      2|
|   3|   1|      3|
|   4|   1|      4|
|   1|   2|      2|
|   2|   2|      4|
|   3|   2|      6|
|   4|   2|      8|
|   1|   3|      3|
|   2|   3|      6|
|   3|   3|      9|
|   4|   3|     12|
|   1|   4|      4|
|   2|   4|      8|
|   3|   4|     12|
|   4|   4|     16|
+----+----+-------+

+----+----+-------+
|  i1|  i2|product|
+----+----+-------+
|   3|   2|      6|
|   4|   2|      8|
|   2|   3|      6|
|   3|   3|      9|
|   4|   3|     12|
|   2|   4|      8|
|   3|   4|     12|
|   4|   4|     16|
+----+----+-------+

Not only does jOOQ 3.10 understand table-valued parameters; since jOOQ 3.5, we have also supported table-valued functions, which can be used like any ordinary table:

Result<CrossMultiplyRecord> r2 = 
DSL.using(configuration)
   .selectFrom(crossMultiply(numbers))
   .where(F_CROSS_MULTIPLY.PRODUCT.gt(5))
   .fetch();

As you can see, the function call can be embedded in the FROM clause. It even returns type-safe CrossMultiplyRecord elements (if you’re not using any projection), you can form predicates on the table columns (i.e. the function’s return values), you can join the table, etc.

Excellent! Let’s start using table-valued parameters!

Impress Your Coworkers by Using SQL UNPIVOT!

I’ve recently encountered a very interesting question on Stack Overflow by an unnamed user. The question was about generating a table of the following form in Oracle, using a table valued function:

Description   COUNT
-------------------
TEST1         10 
TEST2         15
TEST3         25
TEST4         50

The logic that should be implemented for the COUNT column is the following:

  • TEST1: count of employees whose sal < 10000
  • TEST2: count of employees whose dept > 10
  • TEST3: count of employees whose hiredate > (SYSDATE-60)
  • TEST4: count of employees whose grade = 1

Challenge accepted!

For this exercise, let’s assume the following table:

CREATE TABLE employees (
  id NUMBER(18)     NOT NULL PRIMARY KEY,
  sal NUMBER(18, 2) NOT NULL,
  dept NUMBER(18)   NOT NULL,
  hiredate DATE     NOT NULL,
  grade NUMBER(18)  NOT NULL
);

INSERT INTO employees 
      VALUES (1, 10000,  1, SYSDATE     , 1);
INSERT INTO employees 
      VALUES (2,  9000,  5, SYSDATE - 10, 1);
INSERT INTO employees 
      VALUES (3, 11000, 13, SYSDATE - 30, 2);
INSERT INTO employees 
      VALUES (4, 10000, 12, SYSDATE - 80, 2);
INSERT INTO employees 
      VALUES (5,  8000,  7, SYSDATE - 90, 1);

How to calculate the COUNT values

In a first step, we’re going to look into how to best calculate the COUNT values. The simplest way is to calculate the values in individual columns, not rows. SQL newbies will probably resort to a canonical solution using nested SELECTs, which is very bad for performance reasons:

SELECT
  (SELECT COUNT(*) FROM employees 
      WHERE sal < 10000) AS test1,
  (SELECT COUNT(*) FROM employees 
      WHERE dept > 10) AS test2,
  (SELECT COUNT(*) FROM employees 
      WHERE hiredate > (SYSDATE - 60)) AS test3,
  (SELECT COUNT(*) FROM employees 
      WHERE grade = 1) AS test4
FROM dual;

Why is the query not optimal? There are four table accesses to find all the data:

(Execution plan: four separate accesses to the EMPLOYEES table, one per subquery)

If you add an index to each individual column being filtered, chances are you can at least optimise the individual subqueries; but for these kinds of reports, the occasional full table scan is perfectly fine, especially if you aggregate a lot of data.

Even if not optimal in speed, the above yields the correct result:

TEST1   TEST2   TEST3   TEST4
-----------------------------
2	2	3	3

How to improve the query, then?

Few people are aware of the fact that aggregate functions only aggregate non-NULL values. This has no effect when you write COUNT(*), but when you pass an expression to the COUNT(expr) function, things become much more interesting!

The idea here is that you use a CASE expression that transforms each predicate’s TRUE evaluation into a non-NULL value, and the FALSE (or NULL) evaluation into NULL. The following query illustrates this approach:

SELECT
  COUNT(CASE WHEN sal < 10000 THEN 1 END)
      AS test1,
  COUNT(CASE WHEN dept > 10 THEN 1 END)
      AS test2,
  COUNT(CASE WHEN hiredate > (SYSDATE-60) THEN 1 END)
      AS test3,
  COUNT(CASE WHEN grade = 1 THEN 1 END)
      AS test4
FROM employees;

… and yields again the correct result:

TEST1   TEST2   TEST3   TEST4
-----------------------------
2	2	3	3

Using FILTER() instead of CASE

The SQL standard and the awesome PostgreSQL database offer an even more convenient syntax for the above functionality: the little-known FILTER() clause on aggregate functions.

In PostgreSQL, you’d write instead:

SELECT
  COUNT(*) FILTER (WHERE sal < 10000)
      AS test1,
  COUNT(*) FILTER (WHERE dept > 10)
      AS test2,
  COUNT(*) FILTER (WHERE hiredate > (CURRENT_DATE - 60))
      AS test3,
  COUNT(*) FILTER (WHERE grade = 1)
      AS test4
FROM employees;

This is useful when you want to cleanly separate the FILTER() criteria from any other expression that you want to use for aggregating, e.g. when calculating a SUM().
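For example, reusing the employees table from above, a sketch of how FILTER() separates the aggregated expression from the filtering criteria:

SELECT
  SUM(sal) FILTER (WHERE dept > 10) AS sal_dept_gt_10,
  SUM(sal) FILTER (WHERE grade = 1) AS sal_grade_1
FROM employees;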

In any case, the query now has to hit the table only once. The aggregation can then be performed entirely in memory.

(Execution plan: a single table access, with the aggregation performed in memory)

This is always better than the previous approach, unless you have an index for every aggregation!

OK. Now how to get the results in rows?

The question on Stack Overflow wanted a result with TESTn values being put in individual rows, not columns.

Description   COUNT
-------------------
TEST1         2
TEST2         2
TEST3         3
TEST4         3

Again, there’s a canonical, not so performant approach to do this with UNION ALL:

SELECT 
  'TEST1' AS Description, 
  COUNT(*) AS COUNT 
FROM employees WHERE sal < 10000
UNION ALL
SELECT 
  'TEST2', 
  COUNT(*)
FROM employees WHERE dept > 10
UNION ALL
SELECT 
  'TEST3', 
  COUNT(*) 
FROM employees WHERE hiredate > (SYSDATE - 60)
UNION ALL
SELECT 
  'TEST4', 
  COUNT(*) 
FROM employees WHERE grade = 1

This approach is more or less equivalent to the nested selects approach, except for the column / row transposition (“unpivoting”). And the plan is also very similar:

(Execution plan: UNION ALL with, again, four separate table accesses)

Transposition = (un)pivoting

Notice how I used the term “transpose”. That’s what we did, and it has a name: (un)pivoting. Not only does it have a name, but this feature is also supported out of the box in Oracle and SQL Server via the PIVOT and UNPIVOT keywords that can be placed after table references.

  • PIVOT transposes rows into columns
  • UNPIVOT transposes columns back into rows

So, we’ll take the original, optimal solution, and transpose it with UNPIVOT:

SELECT *
FROM (
  SELECT
    COUNT(CASE WHEN sal < 10000 THEN 1 END)
      AS test1,
    COUNT(CASE WHEN dept > 10 THEN 1 END)
      AS test2,
    COUNT(CASE WHEN hiredate > (SYSDATE-60) THEN 1 END)
      AS test3,
    COUNT(CASE WHEN grade = 1 THEN 1 END)
      AS test4
  FROM employees
) t
UNPIVOT (
  count FOR description IN (
    "TEST1", "TEST2", "TEST3", "TEST4"
  )
)

All we need to do is wrap the original query in a derived table t (i.e. an inline SELECT in the FROM clause), and then “UNPIVOT” that table t, generating the count and description columns. The result is, again:

Description   COUNT
-------------------
TEST1         2
TEST2         2
TEST3         3
TEST4         3

The execution plan is still optimal. All the action is happening in memory.

(Execution plan: a single table access, with the UNPIVOT operation performed in memory)

Conclusion

PIVOT and UNPIVOT are very useful tools for reporting and reorganising data. There are many use-cases like the above, where you want to reorganise some aggregations. Other use-cases include settings or properties tables that implement an entity-attribute-value model, where you want to transform attributes from rows to columns (PIVOT), or from columns to rows (UNPIVOT).
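For instance, given a hypothetical EAV-style properties table with (entity_id, attribute, value) columns, a PIVOT query in Oracle might look like this sketch:

SELECT *
FROM (
  SELECT entity_id, attribute, value
  FROM properties
)
PIVOT (
  MAX(value) FOR attribute IN (
    'colour' AS colour,
    'size'   AS size
  )
);

Each entity_id then appears on a single row, with one column per pivoted attribute.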

Intrigued? Read on about PIVOT here.

The 10 Most Popular DB Engines (SQL and NoSQL) in 2015

About two years ago, we published this post about the 10 most popular DB engines, in which we analysed the data published by Solid IT on their DB-Engines ranking website.

In the meantime, the Solid IT measurement system has been found to be a credible source, such that the website has also been cited by Gartner, InfoWorld, and many other sources. For more details about how this is measured, check out the relevant page on db-engines.com:
http://db-engines.com/en/ranking_definition

Comparing the top 10 lists, we can see that the players have shifted, heavily:

(Top 10 ranking charts for 2013 and 2015, reproduced with permission of DB-Engines.com)

The top 3 elephants are still Oracle, MySQL and Microsoft SQL Server, but the runners-up have changed. While PostgreSQL is still gaining traction, it has lost ground compared to MongoDB.

Also, Cassandra and Redis have pushed out Sybase and Teradata from the top 10!

Where two years ago there was only a single non-RDBMS in this top 10 list, there are now three, all of which are “schema-less”, and they’re gaining additional momentum.

Clearly, vendors of RDBMS will need to move quickly to accommodate the needs of document storage and key-value storage the way their new competitors do.

For us on the jOOQ blog, a much more interesting perspective is to see where our supported databases currently are in this ranking:

(Ranking charts of the RDBMS supported by jOOQ, reproduced with permission of DB-Engines.com)

With the recent release of jOOQ 3.7, we’ve added another three databases to our list of now 21 supported RDBMS. Compared to last year’s ranking, almost all of these RDBMS are gaining traction as well, apart from SQL Server.

One thing is certain: We still have very exciting times ahead. Stay tuned for more updates, and check out the current ranking here:

http://db-engines.com/en/ranking

10 SQL Articles Everyone Must Read

We’ve been blogging about Java and SQL for a while now, on the jOOQ blog. Over the years, while researching interesting blog topics, we’ve discovered a lot of SQL gems in the blogosphere that have inspired our work and our passion for SQL.

Today, we’re presenting to you a list of 10 articles that we think you should absolutely read. At the end of the list, you will agree that either:

  • SQL is awesome
  • SQL is crazy

… or probably both. Here goes, in no particular order:

1. Joe Celko: “Divided We Stand: The SQL of Relational Division”

Relational division is a very powerful concept in relational algebra. It answers questions like:

Give me all the students that have completed a given set of courses

Unfortunately, division doesn’t have any direct equivalent in SQL. We wish there were a clause like

TABLE_A
  DIVIDE BY TABLE_B 
  ON [ some predicate ]

Nonetheless, you can express a division in SQL in various ways.
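One classic emulation is the double NOT EXISTS (“there is no required course that this student has not completed”). A sketch, with hypothetical student, required_course and completion tables:

SELECT s.id, s.name
FROM student s
WHERE NOT EXISTS (
  SELECT 1
  FROM required_course rc
  WHERE NOT EXISTS (
    SELECT 1
    FROM completion c
    WHERE c.student_id = s.id
    AND c.course_id = rc.course_id
  )
);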

Read Joe’s: “Divided We Stand: The SQL of Relational Division”

2. Alex Bolenok: “Happy New Year!”

Alex Bolenok (a.k.a. Quassnoi) blogs about various interesting SQL-related things, but one of his top contributions every year is his “happy new year” series. Alex paints “beautiful” (beauty is in the eye of the beholder), and certainly impressive, pictures into your SQL console.

Read Alex’s: “Christmas tree in SQL”

3. Markus Winand: “Clustering Data: The Second Power of Indexing”

Markus Winand is the author of the popular book SQL Performance Explained, parts of which you can also read on his blog “Use The Index Luke”. There is an incredible amount of very useful knowledge both in the book and on this page, but one of the most revealing and neat SQL tricks is to know about “covering indexes”, “clustering indexes”, or “index only scans”.

Read Markus’s: “Clustering Data: The Second Power of Indexing”

4. Dimitri Fontaine: “Understanding Window Functions”

There was SQL before window functions and SQL after window functions

Window functions are some of the most powerful and underused features of SQL. They’re available in all commercial databases, in PostgreSQL, and soon also in Firebird 3.0. We’ve blogged about window functions a couple of times ourselves, but one of the best summaries and explanations about what they really are and how they work has been written by Dimitri Fontaine.
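If you haven’t used them yet, here’s a small taste, sketched against Sakila’s film table: ranking films by length within each rating category, without collapsing any rows the way GROUP BY would:

SELECT title, rating, length,
       RANK() OVER (PARTITION BY rating ORDER BY length DESC) AS length_rank
FROM film;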

Read Dimitri’s: “Understanding Window Functions”

5. Lukas Eder: “10 Common Mistakes Java Developers Make when Writing SQL”

A bit of advertising for our own writing. We’ve collected 10 of the most common mistakes that Java developers make when writing SQL. These mistakes are actually not even specific to Java developers; they could happen to any developer. This article has generated so much traction on our blog that there must be some great truth in it.

Read Lukas’s: “10 Common Mistakes Java Developers Make when Writing SQL”

6. András Gábor’s “Techniques for Pagination in SQL”

Up until recently, offset pagination has been rather difficult to implement in commercial databases like Oracle, SQL Server, DB2, and Sybase, which didn’t feature the equivalent of MySQL’s / PostgreSQL’s LIMIT .. OFFSET clause. Pagination could be emulated, however, and there are a lot of techniques for doing that. Picking the right technique is essential for performance. If you’re using Oracle 11g or earlier, you should filter on ROWNUM.
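Here’s a sketch of that classic ROWNUM technique (reusing the Sakila actor table, just as an example), fetching “page 2”, i.e. rows 11-20:

SELECT *
FROM (
  SELECT t.*, ROWNUM rn
  FROM (
    SELECT first_name, last_name
    FROM actor
    ORDER BY last_name, first_name
  ) t
  WHERE ROWNUM <= 20  -- upper bound first, so Oracle can stop early
)
WHERE rn > 10;        -- then skip the first 10 rows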

Read András’s “Techniques for Pagination in SQL”

On a side-note, you should probably consider not using OFFSET at all. For details about the NO OFFSET movement, read…

7. Markus Winand: “We need tool support for keyset pagination”

If you think about OFFSET pagination, it’s actually a rather dumb thing from a technical perspective, and a useless thing from a business perspective. Here’s why.

From a technical perspective…

… you need to apply filtering, grouping, and ordering on a vast amount of data, skipping and throwing away all the data that appears before the offset until you reach the first row of interest. That is a lot of waste of resources given that…

From a business perspective…

… perhaps, pages 1-3 are interesting, but there is absolutely no point in offering users the option to navigate to page 1337. After a certain offset, the meaning of the offset from a business perspective has vanished. You might as well display random, unordered data samples. The user wouldn’t notice. Probably, when you reach a higher page on Google search results, this is exactly what happens. Random stuff.

Or on reddit. There, you get random stuff already on the first page – such as this popular display of a Stabilized head on green vine snake.

Much better than offset pagination is keyset pagination (which we’ve blogged about as well).
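The difference in a nutshell (hypothetical post table, LIMIT syntax as in PostgreSQL / MySQL): instead of skipping N rows, you remember the last row you’ve seen and seek directly past it:

-- Offset pagination: the database produces and throws away 10000 rows
SELECT id, title
FROM post
ORDER BY id
LIMIT 10 OFFSET 10000;

-- Keyset pagination: seek past the last row of the previous page
SELECT id, title
FROM post
WHERE id > :last_seen_id
ORDER BY id
LIMIT 10;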

Read Markus’s “We need tool support for keyset pagination”


8. Josh Berkus “Tag All The Things”

Implementing tagging in a relational database can be a beast from a performance perspective. Should you normalise (one-to-many)? Should you normalise heavily (many-to-many)? Should you use nested collections / arrays / or even JSON data structures?

Josh has written a very interesting write-up on the performance of heavy tagging in PostgreSQL, showing that normalisation isn’t always the best choice.
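One of the denormalised options mentioned above is keeping tags in an array column. A sketch of how that can look in PostgreSQL (the doc table is hypothetical):

CREATE TABLE doc (
  id   BIGINT PRIMARY KEY,
  tags TEXT[] NOT NULL
);

-- A GIN index makes containment queries on the array fast
CREATE INDEX doc_tags ON doc USING GIN (tags);

-- All documents tagged with both 'sql' and 'postgres':
SELECT id
FROM doc
WHERE tags @> ARRAY['sql', 'postgres'];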

Read Josh’s “Tag All The Things”

9. Alex Bolenok’s “10 things in SQL Server (which don’t work as expected)”

This is again Alex’s (Quassnoi’s) work. A very interesting set of things that happen inside of SQL Server, which you might not have expected when you’re used to other databases. Whether you’re using SQL Server or not, this is a must-read to re-raise awareness of the subtle little differences between SQL implementations.

Read Alex’s “10 things in SQL Server (which don’t work as expected)”

10. Aaron Bertrand: “Best approaches for running totals”

Running totals are a very typical use-case for SQL-based reporting. A running total is something that every project manager using Excel knows intuitively how to do. Just drag-and-drop that sweet sweet formula across your spreadsheet and done:

(Screenshot: a running total formula dragged across an Excel spreadsheet)

How to do the same in SQL? There are again tons of ways. Aaron Bertrand has summarised various solutions for SQL Server 2012.
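On databases that support window functions (e.g. SQL Server 2012+, Oracle, PostgreSQL), a typical formulation is a windowed SUM(); here’s a sketch against the Sakila payment table, just as an example:

SELECT payment_date,
       amount,
       SUM(amount) OVER (
         ORDER BY payment_date, payment_id
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM payment
ORDER BY payment_date, payment_id;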

Read Aaron’s “Best approaches for running totals”

Many other articles

There are, of course, many other very good articles providing deep insight into useful SQL tricks. If you’ve encountered an article that would nicely complement this list, please leave a link and description in the comments section. Future readers will appreciate the additional insight.