The Performance Difference Between SQL Row-by-row Updating, Batch Updating, and Bulk Updating

Something that has been said many times, but needs constant repeating until every developer is aware of the importance of this is the performance difference between row-by-row updating and bulk updating. If you cannot guess which one will be much faster, remember that row-by-row kinda rhymes with slow-by-slow (hint hint).

Disclaimer: This article will discuss only non-concurrent updates, which are much easier to reason about. In a concurrent update situation, a lot of additional factors will add complexity to the problem, including the locking strategy, transaction isolation levels, or simply how the database vendor implements things in detail. For the sake of simplicity, I’ll assume no concurrent updates are being made.

Example query

Let’s say we have a simple table for our blog posts (using Oracle syntax, but the effect is the same on all databases):

CREATE TABLE post (
  id INT NOT NULL PRIMARY KEY,
  text VARCHAR2(1000) NOT NULL,
  archived NUMBER(1) NOT NULL CHECK (archived IN (0, 1)),
  creation_date DATE NOT NULL
);

CREATE INDEX post_creation_date_i ON post (creation_date);

Now, let’s add some 10000 rows:

INSERT INTO post
SELECT 
  level,
  lpad('a', 1000, 'a'),
  0 AS archived,
  DATE '2017-01-01' + (level / 100)
FROM dual
CONNECT BY level <= 10000;

EXEC dbms_stats.gather_table_stats('TEST', 'POST');

Now imagine, we want to update this table and set all posts to ARCHIVED = 1 if they are from last year, e.g. CREATION_DATE < DATE '2018-01-01'. There are various ways to do this, but you should have built an intuition that doing the update in one single UPDATE statement is probably better than looping over each individual row and updating each individual row explicitly. Right?

Right.

Then, why do we keep doing it?

Let me ask this differently:

Does it matter?

The best way to find out is to benchmark. I’m doing two benchmarks for this:

  1. One that is run in PL/SQL, showing the performance difference between different approaches that are available to PL/SQL (namely looping, the FORALL syntax, and a single bulk UPDATE)
  2. One that is run in Java, doing JDBC calls, showing the performance difference between different approaches available to Java (namely looping, caching PreparedStatement but still looping, batching, and a single bulk UPDATE)

Benchmarking PL/SQL

The code of the benchmark can be found in this gist. I will also include it at the bottom of this blog post. The results are:

Run 1, Statement 1 : .01457 (avg : .0098)
Run 1, Statement 2 : .0133  (avg : .01291)
Run 1, Statement 3 : .02351 (avg : .02519)
Run 2, Statement 1 : .00882 (avg : .0098)
Run 2, Statement 2 : .01159 (avg : .01291)
Run 2, Statement 3 : .02348 (avg : .02519)
Run 3, Statement 1 : .01012 (avg : .0098)
Run 3, Statement 2 : .01453 (avg : .01291)
Run 3, Statement 3 : .02544 (avg : .02519)
Run 4, Statement 1 : .00799 (avg : .0098)
Run 4, Statement 2 : .01346 (avg : .01291)
Run 4, Statement 3 : .02958 (avg : .02519)
Run 5, Statement 1 : .00749 (avg : .0098)
Run 5, Statement 2 : .01166 (avg : .01291)
Run 5, Statement 3 : .02396 (avg : .02519)

The difference between Statement 1 and 3 is a factor of 2.5x

Showing the time it takes for each statement type to complete, each time updating 3649 / 10000 rows. The winner is:

Statement 1, running a bulk update

It looks like this:

UPDATE post
SET archived = 1
WHERE archived = 0 AND creation_date < DATE '2018-01-01';

Runner-up (not too far away) is:

Statement 2, using the PL/SQL FORALL syntax

It works like this:

DECLARE
  TYPE post_ids_t IS TABLE OF post.id%TYPE;
  v_post_ids post_ids_t;
BEGIN
  SELECT id 
  BULK COLLECT INTO v_post_ids
  FROM post 
  WHERE archived = 0 AND creation_date < DATE '2018-01-01';

  FORALL i IN 1 .. v_post_ids.count
    UPDATE post
    SET archived = 1
    WHERE id = v_post_ids(i);
END;

Loser (by a factor of 2.5x on our specific data set) is:

Statement 3, using an ordinary LOOP and running row-by-row updates

FOR rec IN (
  SELECT id 
  FROM post 
  WHERE archived = 0 AND creation_date < DATE '2018-01-01'
) LOOP
  UPDATE post
  SET archived = 1
  WHERE id = rec.id;
END LOOP;

It does not really come as a surprise. We’re switching between the PL/SQL engine and the SQL engine many many times, and also, instead of running through the post table only once in O(N) time, we’re looking up individual ID values in O(log N) time, N times, so the complexity went from

O(N) -> O(N log N)

We’d get far worse results for larger tables!

What about doing this from Java?

The difference is much more drastic if each call to the SQL engine has to be done over the network from another process. Again, the benchmark code is available from a gist, and I will paste it to the end of this blog post as well.

The result is (same time unit):

Run 0, Statement 1: PT4.546S
Run 0, Statement 2: PT3.52S
Run 0, Statement 3: PT0.144S
Run 0, Statement 4: PT0.028S
Run 1, Statement 1: PT3.712S
Run 1, Statement 2: PT3.185S
Run 1, Statement 3: PT0.138S
Run 1, Statement 4: PT0.025S
Run 2, Statement 1: PT3.481S
Run 2, Statement 2: PT3.007S
Run 2, Statement 3: PT0.122S
Run 2, Statement 4: PT0.026S
Run 3, Statement 1: PT3.518S
Run 3, Statement 2: PT3.077S
Run 3, Statement 3: PT0.113S
Run 3, Statement 4: PT0.027S
Run 4, Statement 1: PT3.54S
Run 4, Statement 2: PT2.94S
Run 4, Statement 3: PT0.123S
Run 4, Statement 4: PT0.03S

The difference between Statement 1 and 4 is a factor of 100x !!

So, who’s winning? Again (by far):

Statement 4, running the bulk update

In fact, the time is not too far away from the time taken by PL/SQL. With larger data sets being updated, the two results will converge. The code is:

try (Statement s = c.createStatement()) {
    s.executeUpdate(
        "UPDATE post\n" +
        "SET archived = 1\n" +
        "WHERE archived = 0\n" +
        "AND creation_date < DATE '2018-01-01'\n");
}

Followed by the not that much worse (but still 3.5x worse):

Statement 3, running the batch update

Batching can be compared to PL/SQL’s FORALL statement. While we’re running individual row-by-row updates, we’re sending all the update statements in one batch to the SQL engine. This does save a lot of time on the network and all the layers in between.

The code looks like this:

try (Statement s = c.createStatement();
    ResultSet rs = s.executeQuery(
        "SELECT id FROM post WHERE archived = 0\n"
      + "AND creation_date < DATE '2018-01-01'"
    );
    PreparedStatement u = c.prepareStatement(
        "UPDATE post SET archived = 1 WHERE id = ?"
    )) {

    while (rs.next()) {
        u.setInt(1, rs.getInt(1));
        u.addBatch();
    }

    u.executeBatch();
}

Followed by the losers:

Statement 1 and 2, running row by row updates

The difference between statement 1 and 2 is that 2 caches the PreparedStatement, which allows for reusing some resources. This can be a good thing, but didn’t have a very significant effect in our case, compared to the batch / bulk alternatives. The code is:

// Statement 1:
try (Statement s = c.createStatement();
    ResultSet rs = s.executeQuery(
        "SELECT id FROM post\n"
      + "WHERE archived = 0\n"
      + "AND creation_date < DATE '2018-01-01'"
    )) {

    while (rs.next()) {
        try (PreparedStatement u = c.prepareStatement(
            "UPDATE post SET archived = 1 WHERE id = ?"
        )) {
            u.setInt(1, rs.getInt(1));
            u.executeUpdate();
        }
    }
}

// Statement 2:
try (Statement s = c.createStatement();
    ResultSet rs = s.executeQuery(
        "SELECT id FROM post\n"
      + "WHERE archived = 0\n"
      + "AND creation_date < DATE '2018-01-01'"
    );
    PreparedStatement u = c.prepareStatement(
        "UPDATE post SET archived = 1 WHERE id = ?"
    )) {

    while (rs.next()) {
        u.setInt(1, rs.getInt(1));
        u.executeUpdate();
    }
}

Conclusion

As shown previously on this blog, there is a significant cost of JDBC server roundtrips, which can be seen in the JDBC benchmark. This cost is much more severe if we unnecessarily create many server roundtrips for a task that could be done in a single roundtrip, namely by using a SQL bulk UPDATE statement.

This is not only true for updates, but also for all the other statements, including SELECT, DELETE, INSERT, and MERGE. If doing everything in a single statement isn’t possible due to the limitations of SQL, we can still save roundtrips by grouping statements in a block, either by using an anonymous block in databases that support them:

BEGIN
  statement1;
  statement2;
  statement3;
END;

(you can easily send these anonymous blocks over JDBC, as well!)

Or, by emulating anonymous blocks using the JDBC batch API (has its limitations), or by writing stored procedures.

The performance gain is not always worth the trouble of moving logic from the client to the server, but very often (as in the above case), the move is a no-brainer and there’s absolutely no reason against it.

So, remember: Stop doing row-by-row (slow-by-slow) operations when you could run the same operation in bulk, in a single SQL statement.

Hint: Always know what your ORM (if you’re using one) is doing, because the ORM can help you with automatic batching / bulking in many cases. But it often cannot, or it is too difficult to make it do so, so resorting to SQL is the way to go.

Code

PL/SQL benchmark

SET SERVEROUTPUT ON

DROP TABLE post;

CREATE TABLE post (
  id INT NOT NULL PRIMARY KEY,
  text VARCHAR2(1000) NOT NULL,
  archived NUMBER(1) NOT NULL CHECK (archived IN (0, 1)),
  creation_date DATE NOT NULL
);

CREATE INDEX post_creation_date_i ON post (creation_date);

ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;

CREATE TABLE results (
  run     NUMBER(2),
  stmt    NUMBER(2),
  elapsed NUMBER
);

DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  
  PROCEDURE reset_post IS
  BEGIN
    EXECUTE IMMEDIATE 'TRUNCATE TABLE post';
    INSERT INTO post
    SELECT 
      level AS id,
      lpad('a', 1000, 'a') AS text,
      0 AS archived,
      DATE '2017-01-01' + (level / 100) AS creation_date
    FROM dual
    CONNECT BY level <= 10000;
    dbms_stats.gather_table_stats('TEST', 'POST');
  END reset_post;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
  
    reset_post;
    v_ts := SYSTIMESTAMP;
    
    UPDATE post
    SET archived = 1
    WHERE archived = 0 AND creation_date < DATE '2018-01-01';
  
    INSERT INTO results VALUES (r, 1, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    
    reset_post;
    v_ts := SYSTIMESTAMP;
    
    DECLARE
      TYPE post_ids_t IS TABLE OF post.id%TYPE;
      v_post_ids post_ids_t;
    BEGIN
      SELECT id 
      BULK COLLECT INTO v_post_ids
      FROM post 
      WHERE archived = 0 AND creation_date < DATE '2018-01-01';
    
      FORALL i IN 1 .. v_post_ids.count
        UPDATE post
        SET archived = 1
        WHERE id = v_post_ids(i);
    END;
    
    INSERT INTO results VALUES (r, 2, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    
    reset_post;
    v_ts := SYSTIMESTAMP;
      
    FOR rec IN (
      SELECT id 
      FROM post 
      WHERE archived = 0 AND creation_date < DATE '2018-01-01'
    ) LOOP
      UPDATE post
      SET archived = 1
      WHERE id = rec.id;
    END LOOP;
      
    INSERT INTO results VALUES (r, 3, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
  END LOOP;
  
  FOR rec IN (
    SELECT 
      run, stmt, 
      CAST(elapsed AS NUMBER(10, 5)) ratio,
      CAST(AVG(elapsed) OVER (PARTITION BY stmt) AS NUMBER(10, 5)) avg_ratio
    FROM results
    ORDER BY run, stmt
  )
  LOOP
    dbms_output.put_line('Run ' || rec.run || 
      ', Statement ' || rec.stmt || 
      ' : ' || rec.ratio || ' (avg : ' || rec.avg_ratio || ')');
  END LOOP;
  
  dbms_output.put_line('');
  dbms_output.put_line('Copyright Data Geekery GmbH');
  dbms_output.put_line('https://www.jooq.org/benchmark');
END;
/

DROP TABLE results;

JDBC benchmark

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.time.Duration;
import java.time.Instant;
import java.util.Properties;

public class OracleUpdate {

    public static void main(String[] args) throws Exception {
        Class.forName("oracle.jdbc.OracleDriver");

        String url = "jdbc:oracle:thin:@192.168.99.100:1521:ORCLCDB";
        String user = "TEST";
        String password = "TEST";

        Properties properties = new Properties();
        properties.setProperty("user", user);
        properties.setProperty("password", password);

        try (Connection c = DriverManager.getConnection(url, properties)) {
            for (int i = 0; i < 5; i++) {
                Instant ts;

                resetPost(c);
                ts = Instant.now();

                try (Statement s = c.createStatement();
                    ResultSet rs = s.executeQuery(
                        "SELECT id FROM post WHERE archived = 0 AND creation_date < DATE '2018-01-01'"
                    )) {

                    while (rs.next()) {
                        try (PreparedStatement u = c.prepareStatement(
                            "UPDATE post SET archived = 1 WHERE id = ?"
                        )) {
                            u.setInt(1, rs.getInt(1));
                            u.executeUpdate();
                        }
                    }
                }

                System.out.println("Run " + i + ", Statement 1: " + Duration.between(ts, Instant.now()));

                resetPost(c);
                ts = Instant.now();

                try (Statement s = c.createStatement();
                    ResultSet rs = s.executeQuery(
                        "SELECT id FROM post WHERE archived = 0 AND creation_date < DATE '2018-01-01'"
                    );
                    PreparedStatement u = c.prepareStatement(
                        "UPDATE post SET archived = 1 WHERE id = ?"
                    )) {

                    while (rs.next()) {
                        u.setInt(1, rs.getInt(1));
                        u.executeUpdate();
                    }
                }

                System.out.println("Run " + i + ", Statement 2: " + Duration.between(ts, Instant.now()));

                resetPost(c);
                ts = Instant.now();

                try (Statement s = c.createStatement();
                    ResultSet rs = s.executeQuery(
                        "SELECT id FROM post WHERE archived = 0 AND creation_date < DATE '2018-01-01'"
                    );
                    PreparedStatement u = c.prepareStatement(
                        "UPDATE post SET archived = 1 WHERE id = ?"
                    )) {

                    while (rs.next()) {
                        u.setInt(1, rs.getInt(1));
                        u.addBatch();
                    }

                    u.executeBatch();
                }
                System.out.println("Run " + i + ", Statement 3: " + Duration.between(ts, Instant.now()));

                resetPost(c);
                ts = Instant.now();

                try (Statement s = c.createStatement()) {
                    s.executeUpdate("UPDATE post\n" +
                        "SET archived = 1\n" +
                        "WHERE archived = 0 AND creation_date < DATE '2018-01-01'\n");
                }

                System.out.println("Run " + i + ", Statement 4: " + Duration.between(ts, Instant.now()));
            }
        }
    }

    static void resetPost(Connection c) throws SQLException {
        try (Statement s = c.createStatement()) {
            s.executeUpdate("TRUNCATE TABLE post");
            s.executeUpdate("INSERT INTO post\n" +
                "    SELECT \n" +
                "      level,\n" +
                "      lpad('a', 1000, 'a'),\n" +
                "      0,\n" +
                "      DATE '2017-01-01' + (level / 10)\n" +
                "    FROM dual\n" +
                "    CONNECT BY level <= 10000");
            s.executeUpdate("BEGIN dbms_stats.gather_table_stats('TEST', 'POST'); END;");
        }
    }
}

Why SQL Bind Variables are Important for Performance

A common problem with dynamic SQL is parsing performance in production. What makes matters worse is that many developers do not have access to production environments, so they are unaware of the problem (even if there’s nothing new about this topic). What exactly is the problem?

Execution plan caches

Most database vendors these days ship with an execution plan cache (Oracle calls it cursor cache), where previously parsed SQL statements are stored and their execution plan(s) is cached for reuse. This is the main reason why bind variables are so important (the other reason being SQL injection prevention). By using bind variables, we can make sure that the database will easily recognise an identical SQL statement from a previous execution and be able to re-execute the previously found execution plan. This is in fact one of my favourite topics from my SQL training.

Let’s see what happens in various databases, if we run the following queries:

-- First, run them with "inline values" or "constant literals"
SELECT first_name, last_name FROM actor WHERE actor_id = 1;
SELECT first_name, last_name FROM actor WHERE actor_id = 2;

-- Then, run the same queries again with bind values
SELECT first_name, last_name FROM actor WHERE actor_id = ?;
SELECT first_name, last_name FROM actor WHERE actor_id = ?;

Note, it doesn’t matter if the queries are run from JDBC, jOOQ, Hibernate, or the procedural language in the database, e.g. PL/SQL, T-SQL, pgplsql. The result is always the same.

Let’s run an example

I’ll run the following examples using Oracle, only. Other databases behave in a similar fashion.

We’ll run the following script, which includes the above queries, and a query to fetch all the execution plans:

SELECT first_name, last_name FROM actor WHERE actor_id = 1;
SELECT first_name, last_name FROM actor WHERE actor_id = 2;

SET SERVEROUTPUT ON
DECLARE
  v_first_name actor.first_name%TYPE;
  v_last_name  actor.last_name%TYPE;
BEGIN
  FOR i IN 1 .. 2 LOOP
    SELECT first_name, last_name 
    INTO v_first_name, v_last_name
    FROM actor 
    WHERE actor_id = i;
    
    dbms_output.put_line(v_first_name || ' ' || v_last_name);
  END LOOP;
END;
/

SELECT s.sql_id, p.*
FROM v$sql s, TABLE (
  dbms_xplan.display_cursor (
    s.sql_id, s.child_number, 'ALLSTATS LAST'
  )
) p
WHERE lower(s.sql_text) LIKE '%actor_id = %';

The output is:

SQL_ID  90rk04nhr45yz, child number 0
-------------------------------------
SELECT FIRST_NAME, LAST_NAME FROM ACTOR WHERE ACTOR_ID = :B1
 
Plan hash value: 457831946
 
---------------------------------------------------------
| Id  | Operation                   | Name     | E-Rows |
---------------------------------------------------------
|   0 | SELECT STATEMENT            |          |        |
|   1 |  TABLE ACCESS BY INDEX ROWID| ACTOR    |      1 |
|*  2 |   INDEX UNIQUE SCAN         | PK_ACTOR |      1 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("ACTOR_ID"=:B1)
 

SQL_ID  283s8m524c9rk, child number 0
-------------------------------------
SELECT first_name, last_name FROM actor WHERE actor_id = 2
 
Plan hash value: 457831946
 
---------------------------------------------------------
| Id  | Operation                   | Name     | E-Rows |
---------------------------------------------------------
|   0 | SELECT STATEMENT            |          |        |
|   1 |  TABLE ACCESS BY INDEX ROWID| ACTOR    |      1 |
|*  2 |   INDEX UNIQUE SCAN         | PK_ACTOR |      1 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("ACTOR_ID"=2)
 

SQL_ID  3mks715670mqw, child number 0
-------------------------------------
SELECT first_name, last_name FROM actor WHERE actor_id = 1
 
Plan hash value: 457831946
 
---------------------------------------------------------
| Id  | Operation                   | Name     | E-Rows |
---------------------------------------------------------
|   0 | SELECT STATEMENT            |          |        |
|   1 |  TABLE ACCESS BY INDEX ROWID| ACTOR    |      1 |
|*  2 |   INDEX UNIQUE SCAN         | PK_ACTOR |      1 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("ACTOR_ID"=1)

The plans are always the same and since we’re accessing primary key values, we’ll always get the same cardinalities, so there doesn’t seem to be anything wrong on each individual execution. But notice how the predicate information is slightly different. When querying for a constant value, then the predicate will include that value right there, whereas with the bind variable, we don’t know what the predicate value is, from the plan. This is perfectly expected, because we want to reuse that plan for both executions of the query.

With another query, we can see the number of executions of each statement:

SELECT sql_id, sql_text, executions
FROM v$sql
WHERE sql_id IN (
  '90rk04nhr45yz', 
  '283s8m524c9rk', 
  '3mks715670mqw'
);
SQL_ID          SQL_TEXT                                                        EXECUTIONS
------------------------------------------------------------------------------------------
90rk04nhr45yz	SELECT FIRST_NAME, LAST_NAME FROM ACTOR WHERE ACTOR_ID = :B1 	2
283s8m524c9rk	SELECT first_name, last_name FROM actor WHERE actor_id = 2	1
3mks715670mqw	SELECT first_name, last_name FROM actor WHERE actor_id = 1	1

This is where it gets more interesting. In the second case where we used a bind variable (which was generated by PL/SQL, automatically), we could reuse the statement, cache its plan, and run it twice.

Meh, does it matter?

It matters for two reasons:

  • Performance of individual executions
  • Performance of your entire system

How this affects individual executions

It seems very obvious that when being able to cache something, there’s a slight overhead to the cache maintenance compared to the gain in not having to do the work whose result is cached. The work in question here is parsing the SQL statement and creating an execution plan for it. Even if the plan is trivial, as in the above examples, there is overhead involved with calculating this plan.

This overhead can best be shown in a benchmark, a technique that I also display in my SQL training:

SET SERVEROUTPUT ON

-- Don't run these on production
-- But on your development environment, this guarantees clean caches
ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;

CREATE TABLE results (
  run     NUMBER(2),
  stmt    NUMBER(2),
  elapsed NUMBER
);

DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 2000;
  v_first_name actor.first_name%TYPE;
  v_last_name  actor.last_name%TYPE;
BEGIN

  -- Repeat whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      BEGIN
        EXECUTE IMMEDIATE '
          SELECT first_name, last_name 
          FROM actor 
          WHERE actor_id = ' || i 
          -- Just fixing a syntax highlighting bug of this blog '
        INTO v_first_name, v_last_name;
      EXCEPTION
        -- Please forgive me
        WHEN OTHERS THEN NULL;
      END;
    END LOOP;
  
    INSERT INTO results VALUES (
      r, 1, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      BEGIN
        EXECUTE IMMEDIATE '
          SELECT first_name, last_name 
          FROM actor 
          WHERE actor_id = :i'
        INTO v_first_name, v_last_name
        USING i;
      EXCEPTION
        -- Please forgive me
        WHEN OTHERS THEN NULL;
      END;
    END LOOP;
      
    INSERT INTO results VALUES (
      r, 2, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
  END LOOP;
  
  FOR rec IN (
    SELECT 
      run, stmt, 
      CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio 
    FROM results
  )
  LOOP
    dbms_output.put_line('Run ' || rec.run || 
      ', Statement ' || rec.stmt || 
      ' : ' || rec.ratio);
  END LOOP;
END;
/

DROP TABLE results;

As always, on the jOOQ blog, we don’t publish actual execution times to comply with license restrictions on benchmark publications, so we’re only comparing each execution with the fastest execution. This is the result of the above:

Run 1, Statement 1 : 83.39893
Run 1, Statement 2 : 1.1685
Run 2, Statement 1 : 3.02697
Run 2, Statement 2 : 1
Run 3, Statement 1 : 2.72028
Run 3, Statement 2 : 1.03996
Run 4, Statement 1 : 2.70929
Run 4, Statement 2 : 1.00866
Run 5, Statement 1 : 2.71895
Run 5, Statement 2 : 1.02198

We can see that consistently, the SQL version using a bind variable is 2.5x as fast as the one not using the bind variable. This overhead is very significant for trivial queries – it might be a bit less so for more complex queries, where the execution itself takes more time, compared to the parsing. But it should be obvious that the overhead is a price we do not want to pay. We want the query and its plan to be cached!

Notice also how the very first execution of the benchmark has a very significant overhead, because all of the 2000 queries will have been encountered for the first time before they’re cached for the second run. That’s a price we’re only paying during the first run, though.

How this affects your entire system

Not only do individual query executions suffer, your entire system does, too. After running the benchmark a few times, these are the execution statistics I’m getting from the Oracle cursor cache:

SELECT 
  count(*), 
  avg(executions), 
  min(executions), 
  max(executions)
FROM v$sql
WHERE lower(sql_text) LIKE '%actor_id = %'
AND sql_text NOT LIKE '%v$sql%';

Yielding:

count  avg      min  max
2001   9.9950   5    10000

There are now, currently, 2000 queries in my cache. The one that has been executed 10000 times (benchmark was repeated 5x and 2000 executions of the query per run), and 2000 queries that have been executed 5x (benchmark was repeated 5x).

If instead we run the query 20000 times (and remember, the query run corresponds to the filtered ACTOR_ID), then the result will be vastly different!

Run 1, Statement 1 : 86.85862
Run 1, Statement 2 : 1.13546
Run 2, Statement 1 : 78.39842
Run 2, Statement 2 : 1.01298
Run 3, Statement 1 : 72.45254
Run 3, Statement 2 : 1
Run 4, Statement 1 : 73.78357
Run 4, Statement 2 : 2.24365
Run 5, Statement 1 : 84.89842
Run 5, Statement 2 : 1.143

Oh, my. Why has this happened? Let’s check again the cursor cache stats:

SELECT 
  count(*), 
  avg(executions), 
  min(executions), 
  max(executions)
FROM v$sql
WHERE lower(sql_text) LIKE '%actor_id = %'
AND sql_text NOT LIKE '%v$sql%';

Yielding:

count  avg      min  max
15738  3.4144   1    20000

This is a vastly different result. We don’t have all of our 20000 queries in the cursor cache, only some of them. This means that some statements have been purged from the cache to make room for new ones (which is reasonable behaviour for any cache).

But purging them is problematic too, because the way the benchmark was designed, they will re-appear again in the second, third, fourth, and fifth run, so we should have kept them in the cache. And since we’re executing every query the same number of times, there really wasn’t any way of identifying a “more reasonable” (i.e. rare) query to purge.

Resources in a system are always limited, and so is the cursor cache size. The more distinct queries we’re running in a system, the less they can profit from the cursor cache.

This is not a problem for rarely run queries, including reports, analytics, or some special queries run only by some very few users. But the queries that are being run all the time should always be cached.

I cannot stress enough how serious this can be:

In the above case, what should have been a single query in the cursor cache exploded into 20000 queries, shoving a lot of much more useful queries out of the cache. Not only does this slow down the execution of this particular query, it will purge tons of completely unrelated queries from the cache, thus slowing down the entire system by similar factors. If everyone is slowed down drastically, everyone will start to queue up to have their SQL queries parsed, and you can bring down your entire server with this problem (in the worst case)!

Workaround

Some databases support enforcing parsing the constant literals to bind variables. In Oracle, you can specify CURSOR_SHARING = FORCE as a “quick fix”. In SQL Server, it’s called forced parametrization.

But this approach has its own limitations and overhead, as this means that additional parsing work needs to be performed every time to recognise constant literals and replace them by bind variables. This overhead will then apply to all queries!

Conclusion

Bind variables are very important for SQL performance. After tons of training people to use them for SQL injection reasons (which is already a good thing), we have now seen how important they are also for performance reasons.

Not using a bind variable for values like IDs, timestamps, names, or anything that is uniformly distributed and has many values in your column will produce the above problem. The exception are bind variables for columns with only very few distinct values (like true/false flags, codes that encode a given state, etc.), in case of which a constant literal can be a reasonable option (follow-up blog post coming, soon).

But bind variables should always be your default choice. If you’re using a client-side tool like jOOQ or Hibernate, bind variables tend to be the default, and you’re fine. If you’re using a stored procedural language like PL/SQL or T-SQL, bind variables are generated automatically and you’re fine as well. But if you’re using JDBC or any JDBC wrapper like Spring’s JdbcTemplates, or any other string based API, like JPA’s native query API, then you are on your own again, and you must explicitly take care of using bind variables every time you have variable input.

And in our next article, we’ll see how bind variables are actually not enough, when using dynamic IN lists, another topic that I borrowed from my SQL training.

The Cost of JDBC Server Roundtrips

Or: Move That Loop into the Server Already!

This article will illustrate the significance of something that I always thought to be common sense, but I keep seeing people getting this (very) wrong in their productive systems. Chances are, in fact, that most applications out there suffer from this performance problem – and the fix is really easy.

Transferring data in bulk vs. transferring it row by row

This blog post is inspired by a recent Stack Overflow question that was asking about how to call Oracle’s DBMS_OUTPUT.GET_LINES from JDBC. The answer to that specific question can also be seen in a previous blog post here.

This time, I don’t want to discuss that particular Oracle-specific technical problem, but the performance aspect of whether to call:

  • DBMS_OUTPUT.GET_LINES: Which allows for fetching a bulk of server output into an array
  • DBMS_OUTPUT.GET_LINE: Which fetches a single line of server output into a string

While I will continue to discuss this Oracle-specific functionality, this blog post is in no way strictly related to Oracle. It is generally applicable (and again, it should be common sense) for any database, and in fact, any client-server architecture, or even any distributed architecture. The solution to such performance problems will almost always be that transferring a bulk of data is better than transferring it row by row.

Let’s run a benchmark!

The full benchmark logic can be seen in this gist. It includes the boring parts of actually calling the GET_LINE[S] procedures. The beef of the benchmark is this:

int max = 50;
long[] getLines = new long[max];
long[] getLine = new long[max];

try (Connection c = DriverManager.getConnection(url, properties);
    Statement s = c.createStatement()) {

    for (int warmup = 0; warmup < 2; warmup++) {
        for (int i = 0; i < max; i++) {
            s.executeUpdate("begin dbms_output.enable(); end;");
            String sql =
                "begin "
              + "for i in 1 .. 100 loop "
              + "dbms_output.put_line('Message ' || i); "
              + "end loop; "
              + "end;";
            long t1 = System.nanoTime();
            logGetLines(c, 100, () -> s.executeUpdate(sql));
            long t2 = System.nanoTime();
            logGetLine(c, 100, () -> s.executeUpdate(sql));
            long t3 = System.nanoTime();
            s.executeUpdate("begin dbms_output.disable(); end;");

            if (warmup > 0) {
                getLines[i] = t2 - t1;
                getLine[i] = t3 - t2;
            }
        }
    }
}

System.out.println(LongStream.of(getLines).summaryStatistics());
System.out.println(LongStream.of(getLine).summaryStatistics());

What does it do in prose?

  • It contains a warmup loop whose first iteration doesn’t contribute to our measurements (this is always a good idea)
  • It runs the benchmarked logic 50 times
  • It generates 100 DBMS_OUTPUT.PUT_LINE messages for each run in an anonymous PL/SQL loop …
  • … and then fetches those 100 messages immediately with either 1 call to GET_LINES or 100 calls to GET_LINE
  • Finally, all the stored execution times are aggregated and printed out conveniently with Java 8 Stream’s summary statistics feature

So, in both cases, we’re generating and fetching 5000 messages.

Both methods using GET_LINES and GET_LINE respectively are functionally equivalent, i.e. the only difference is performance. Again the full benchmark can be seen here (which also benchmarks the effect of JDBC fetchSize, which is none, in this case).

The results are devastating:

{count=50, sum=  69120455, min= 1067521, average= 1382409.100000, max= 2454614}
{count=50, sum=2088201423, min=33737827, average=41764028.460000, max=64498375}

We’re gaining a factor of 30x in this benchmark run on my machine. The actual results may vary depending on your hardware and software (e.g. I’m running Oracle 12cR2 in Docker), but regardless of the setup, the results are always very significant.

Note that caching the prepared statement in the client yields better results (see the gist), but nowhere near as good as moving the loop to the server

Does this mean that GET_LINE is slow?

When looking at benchmark results, we must always be very very careful not to draw the wrong conclusions. There can be many reasons why benchmarks show differences in performance. One simple (albeit improbable) reason could be that the GET_LINE implementation simply sucks.

So, I’ve tried to re-implement this benchmark in pure PL/SQL. The full benchmark can be seen here. The beef of it is this:

FOR r IN 1..5 LOOP
  v_ts := SYSTIMESTAMP;
      
  FOR i IN 1..v_repeat LOOP
    m();
     
    v_i := v_max;
    dbms_output.get_lines(v_array, v_i);
  END LOOP;
      
  INSERT INTO results VALUES (1, (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    m();
    
    FOR j IN 1 .. v_max LOOP
      dbms_output.get_line(v_string, v_i);
    END LOOP;
  END LOOP;
     
  INSERT INTO results VALUES (2, (SYSTIMESTAMP - v_ts));
END LOOP;

Where m() is:

PROCEDURE m IS BEGIN
  FOR i IN 1 .. v_max LOOP 
    dbms_output.put_line('Message ' || i);
  END LOOP;
END m;

The results are now rather different:

stmt    sum     avg      min     max
1       0.0609  0.01218  0.0073  0.0303
2       0.0333  0.00666  0.0063  0.007

This time, calling GET_LINE individually seems to have been 2x faster than the GET_LINES version. Again, it is important not to draw the wrong conclusions! This could be due to:

  • GET_LINES allocating an additional array copy of the original lines, which resides in the PGA, might be costly
  • GET_LINE might have profited from some additional optimisation because we’re never actually consuming the result in the benchmark

But the one thing we can conclude with certainty is: There’s no problem in GET_LINE, so calling it is not inherently worse than calling GET_LINES.

Which brings us back to the JDBC calls

While guess work and hypotheses are usually dangerous, in this case I’m certain of the reason why the JDBC based approach shows such drastic differences. Just watch this excellent talk by Toon Koppelaars from Oracle “NoPLSql and Thick Database Approaches with Toon Koppelaars”, where he explains this with some impressive flame graphs:

The obvious difference between the JDBC benchmark and the PL/SQL one is the fact that the JDBC call has to traverse a vast amount of logic, APIs, “barriers” between the JVM and the Oracle kernel before it can actually invoke the really interesting part. This includes:

  • JVM overhead
  • JDBC logic
  • Network overhead
  • Various “outer” layers inside the Oracle database
  • Oracle’s API layers to get into the SQL and PL/SQL execution engines
  • The actual code running in the PL/SQL engine

In Toon’s talk (which again, you should definitely watch), the examples are running SQL code, not PL/SQL code, but the results are the same. The actual logic is relatively cheap inside of the database (as we’ve seen in the PL/SQL only benchmark), but the overhead is significant when calling database logic from outside the database.

Thus: It is very important to minimise that overhead

There are two ways to minimise that overhead:

  • The super hard way: Change and / or tweak the API technology, e.g. in Oracle’s case, using the C/OCI bindings can be much faster than JDBC
  • The easy way: Just move some data collection logic into the database and fetch data in bulk

Let me repeat this one more time:

Fetch (or send) data in bulk

… it’s almost always faster than processing things row-by-row (or as the Oracle folks call it: “slow-by-slow”).

And it does not matter at all, if that database logic is written in SQL or in a procedural language. The point is that accessing objects over the network (any network) is expensive, so you should minimise the access calls if ever possible.

As I’ve tweeted recently:

When calling logic over the network (any network), we should move logic to the data, not data to the logic. When working with RDBMS, we’re doing this through SQL (preferrably) or if SQL doesn’t suffice, we resort to using stored procedures.

When working with HTTP, we’re doing this with – well, it doesn’t have a name, but we should prefer making few physical HTTP calls that aggregate several logical API calls in order to transfer a lot of data in bulk.

When working with “map reduce” or “serverless” etc technology, we’re calling this “functions” or “lambdas”, which are just fancy, more modern names for stored procedures.

Conclusion

I’m not an expert in how to design complex distributed systems. This stuff is really hard. But I’ve spent many years working with RDBMS, which are also, in a very simple way, distributed systems (data on one server, client logic on another one).

A very significant amount of performance problems with RDBMS is related to the simple fact of clients making way too many calls to the database for what could be implemented in a single SQL query. Within the database, once your logic has reached the kernel, stuff gets executed really really fast. Adding more logic to a query is going to cause far less trouble than adding more queries.

Does your application take into account these things? Especially, if you’re using an ORM that generates the SQL for you, does it generate the right amount of queries, or are you suffering from “N+1 problems”? The main reason why ORM-based systems tend to be slow is because developers are not aware of the SQL their ORMs generate, e.g. due to excessive lazy loading, when a single SQL (or JPQL) query would have been a much better choice. It’s not the ORM’s fault. Most ORMs can get this right.

It’s the developer’s responsibility to think about where the logic should be executed. The rule of thumb is:

If you’re looping in the client to fetch individual things from the same server, you’re doing it wrong. Move the loop into the server.

And by doing so, you’ve written your first (ghasp) “stored procedure”. You’ll write many more, and you’ll love it, once you realise how much speed you’re gaining.

Or, in other words:

Update: Some criticism from the reddit discussion of this article

/u/cogman10 made good points in his comment warning about batching “too big” workloads, which is perfectly correct when batching write heavy tasks. Large batches may increase the contention inside of your database. If you’re working in an MVCC environment (such as Oracle), having transactions that span millions of updates will put a lot of pressure on the UNDO/REDO log, and all other sessions reading from the same data will feel that pain.

Also, when working with HTTP, beware of the fact that batches are harder to cache than individual requests. This article made the assumption that HTTP requests are:

  • Authorised – i.e. caching doesn’t even make sense as the authorisation might be revoked or the session terminated
  • Operating on similar resources, e.g. fetching a bulk of IDs in one go might be more sensible than fetching each ID individuall

… of course, as always, don’t follow advice you find on the internet blindly :) This article illustrated a common mistake. The fix isn’t always as simple as illustrated here, but often it really is.

How to Avoid Excessive Sorts in Window Functions

Usually, this blog is 100% pro window functions and advocates using them at any occasion. But like any tool, window functions come at a price and we must carefully evaluate if that’s a price we’re willing to pay. That price can be a sort operation. And as we all know, sort operations are expensive. They follow O(n log n) complexity, which must be avoided at all costs for large data sets.

In a previous post, I’ve described how to calculate a running total with window functions (among other ways). In this post, we’re going to calculate the cumulative revenue at each payment in our Sakila database.

SELECT
  customer_id,
  payment_date,
  amount,
  SUM(amount) OVER (
    PARTITION BY customer_id
    ORDER BY payment_date, payment_id
  ) cumulative_amount
FROM payment
ORDER BY customer_id, payment_date, payment_id;

The above will yield something like this:

customer_id |payment_date        |amount |cumulative_amount 
------------|--------------------|-------|------------------
1           |2005-05-25 11:30:37 |2.99   |2.99              
1           |2005-05-28 10:35:23 |0.99   |3.98              
1           |2005-06-15 00:54:12 |5.99   |9.97              
1           |2005-06-15 18:02:53 |0.99   |10.96             
1           |2005-06-15 21:08:46 |9.99   |20.95             
1           |2005-06-16 15:18:57 |4.99   |25.94             
...

As can be seen, in spread sheet notation, cumulative_amount[N] = cumulative_amount[N-1] + amount.

Reusing this calculation in several queries

As in any other language, we don’t want to repeat ourselves, so the SQL way of doing DRY is to create a view or a table valued function. Let’s create a view, first. Something like this:

CREATE VIEW payment_with_revenue AS
SELECT
  customer_id,
  payment_date,
  amount,
  SUM(amount) OVER (
    PARTITION BY customer_id
    ORDER BY payment_date, payment_id
  ) cumulative_amount
FROM payment

Now, we can do nice things like this:

SELECT 
  customer_id,
  payment_date,
  amount,
  cumulative_amount
FROM payment_with_revenue
WHERE customer_id IN (1, 2, 3)
AND payment_date 
  BETWEEN DATE '2005-05-25'
  AND     DATE '2005-05-29'
ORDER BY customer_id, payment_date

yielding:

customer_id |payment_date        |amount |cumulative_amount 
------------|--------------------|-------|------------------
1           |2005-05-25 11:30:37 |2.99   |2.99              
1           |2005-05-28 10:35:23 |0.99   |3.98              
2           |2005-05-27 00:09:24 |4.99   |4.99              
3           |2005-05-27 17:17:09 |1.99   |1.99              

What about performance?

Now, if we have an index on (CUSTOMER_ID, PAYMENT_DATE), we’d expect to be able to use it, right? Because it seems that our predicate should be able to profit from it:

SELECT 
  count(*),
  count(*) FILTER (
    WHERE customer_id IN (1, 2, 3)
  ),
  count(*) FILTER (
    WHERE customer_id IN (1, 2, 3)
    AND payment_date < DATE '2005-05-29'
  ) 
FROM payment;

yielding:

count |count |count 
------|------|-----
16049 |85    |4     

(To learn more about the cool FILTER clause, read this article here)

How could we best use the index? Let’s look again at our original query, but this time, with an inlined view (“inlined”):

SELECT 
  customer_id,
  payment_date,
  amount,
  cumulative_amount
FROM (
  SELECT
    customer_id,
    payment_date,
    amount,
    SUM(amount) OVER (
      PARTITION BY customer_id
      ORDER BY payment_date, payment_id
    ) cumulative_amount
  FROM payment
) inlined
WHERE customer_id IN (1, 2, 3)
AND payment_date 
  BETWEEN DATE '2005-05-25'
  AND     DATE '2005-05-29'
ORDER BY customer_id, payment_date;

We should be able to apply two transformations that benefit using the index:

CUSTOMER_ID IN (1, 2, 3) predicate

The CUSTOMER_ID IN (1, 2, 3) predicate should be pushed down into the view, “past” the window function, because it does not affect the window function calculation, which partitions the data set by CUSTOMER_ID. By being pushed “past” the window function, I mean the fact that window functions are calculated late in the order of SELECT clauses.

This means that our original query should be equivalent to this one:

SELECT 
  customer_id,
  payment_date,
  amount,
  cumulative_amount
FROM (
  SELECT
    customer_id,
    payment_date,
    amount,
    SUM(amount) OVER (
      PARTITION BY customer_id
      ORDER BY payment_date, payment_id
    ) cumulative_amount
  FROM payment
  WHERE customer_id IN (1, 2, 3) -- Pushed down
) inlined
WHERE payment_date 
  BETWEEN DATE '2005-05-25'
  AND     DATE '2005-05-29'
ORDER BY customer_id, payment_date;

The PAYMENT_DATE predicate

The PAYMENT_DATE predicate is a bit more tricky. It cannot be pushed “past” the window function completely, because that would alter the semantics of the window function, which calculates the cumulative amount in the RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW range (which is the default, if we do not specify it).

But intuitively (and if you want to spend the time: formally as well), we can show that we can at least push the upper bound of our range predicate into the view, like this:

SELECT 
  customer_id,
  payment_date,
  amount,
  cumulative_amount
FROM (
  SELECT
    customer_id,
    payment_date,
    amount,
    SUM(amount) OVER (
      PARTITION BY customer_id
      ORDER BY payment_date, payment_id
    ) cumulative_amount
  FROM payment
  WHERE customer_id IN (1, 2, 3)
  AND payment_date <= DATE '2005-05-29' -- Pushed down
) inlined
WHERE payment_date >= DATE '2005-05-25'
ORDER BY customer_id, payment_date;

And now, we can profit from the index very easily! But is this transformation being done by any database? Unfortunately not. Some databases manage to push down the “more obvious” CUSTOMER_ID predicate past the window function, but none can do the same with the “less obvious” range predicate on PAYMENT_DATE:

DB2 LUW 10.5

The CUSTOMER_ID predicate is pushed down into the view, which generates an index scan (blue) on the pre-existing foreign key index (which doesn’t contain the PAYMENT_DATE column), but the PAYMENT_DATE itself is only filtered much later using an in-memory filter (red):

Explain Plan                                                       
-------------------------------------------------------------------
ID | Operation                       |                  Rows | Cost
 1 | RETURN                          |                       |   40
 2 |  FILTER                         |     4 of 80 (  5.00%) |   40
 3 |   TBSCAN                        |    80 of 80 (100.00%) |   40
 4 |    SORT                         |    80 of 80 (100.00%) |   40
 5 |     NLJOIN                      |               80 of 3 |   40
 6 |      TBSCAN GENROW              |      3 of 3 (100.00%) |    0
 7 |      FETCH PAYMENT              |    27 of 27 (100.00%) |   13
 8 |       IXSCAN IDX_FK_CUSTOMER_ID | 27 of 16049 (   .17%) |    6
                                                                   
Predicate Information                                              
 2 - RESID (Q5.PAYMENT_DATE <= '2005-05-29')                       
     RESID ('2005-05-25' <= Q5.PAYMENT_DATE)                       
 5 - JOIN (Q3.CUSTOMER_ID = Q2.$C0)                                
 8 - START (Q3.CUSTOMER_ID = Q2.$C0)                               
      STOP (Q3.CUSTOMER_ID = Q2.$C0)                               

Conversely, see the plan of the manually optimised query:

Explain Plan                                                  
--------------------------------------------------------------
ID | Operation                   |                 Rows | Cost
 1 | RETURN                      |                      |   40
 2 |  FILTER                     |     4 of 4 (100.00%) |   40
 3 |   TBSCAN                    |     4 of 4 (100.00%) |   40
 4 |    SORT                     |     4 of 4 (100.00%) |   40
 5 |     NLJOIN                  |               4 of 1 |   40
 6 |      TBSCAN GENROW          |     3 of 3 (100.00%) |    0
 7 |      FETCH PAYMENT          |     1 of 1 (100.00%) |   13
 8 |       IXSCAN IDX_PAYMENT_I1 | 1 of 16049 (   .01%) |    6
                                                              
Predicate Information                                         
 2 - RESID ('2005-05-25' <= Q5.PAYMENT_DATE)                  
 5 - JOIN (Q3.CUSTOMER_ID = Q2.$C0)                           
 8 - START (Q3.CUSTOMER_ID = Q2.$C0)                          
      STOP (Q3.CUSTOMER_ID = Q2.$C0)                          
      STOP (Q3.PAYMENT_DATE <= '2005-05-29')                  

This is certainly a better plan.

MySQL 8.0.2

MySQL, very regrettably, doesn’t seem to show any effort at all in optimising this. We’re accessing the entire payment table to get this result.

id   table        type  rows    filtered    Extra
-----------------------------------------------------------------------
1    <derived2>   ALL   16086    3.33       Using where
2    payment      ALL   16086  100.00       Using filesort

Here’s the manually optimised plan:

id   table        type  key             rows  filtered    Extra
-------------------------------------------------------------------------------
1    <derived2>   ALL                   4     3.33        Using where
2    payment      range idx_payment_i1  4      100.00     Using index condition

Oracle 12.2.0.1

Oracle also cannot do this beyond pushing the more obvious CUSTOMER_ID predicate into the view:

-------------------------------------------------------------------------------
| Id  | Operation                              | Name                 | Rows  |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                      |       |
|*  1 |  VIEW                                  | PAYMENT_WITH_REVENUE |    80 |
|   2 |   WINDOW SORT                          |                      |    80 |
|   3 |    INLIST ITERATOR                     |                      |       |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| PAYMENT              |    80 |
|*  5 |      INDEX RANGE SCAN                  | IDX_FK_CUSTOMER_ID   |    80 |
-------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter(("PAYMENT_DATE">=TO_DATE('2005-05-25 00:00:00') AND 
              "PAYMENT_DATE"<=TO_DATE('2005-05-29 00:00:00')))
   5 - access(("CUSTOMER_ID"=1 OR "CUSTOMER_ID"=2 OR "CUSTOMER_ID"=3))

The manually optimised plan looks better:

-------------------------------------------------------------------------
| Id  | Operation                              | Name           | Rows  |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                |       |
|*  1 |  VIEW                                  |                |     1 |
|   2 |   WINDOW SORT                          |                |     1 |
|   3 |    INLIST ITERATOR                     |                |       |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| PAYMENT        |     1 |
|*  5 |      INDEX RANGE SCAN                  | IDX_PAYMENT_I1 |     1 |
-------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("PAYMENT_DATE">=TO_DATE('2005-05-25 00:00:00'))
   5 - access(("CUSTOMER_ID" IN (1, 2, 3)) AND 
              "PAYMENT_DATE"<=TO_DATE('2005-05-29 00:00:00'))

Much better cardinality estimates!

PostgreSQL 10

PostgreSQL’s version of the Sakila database uses a partitioned payment table, but that should be irrelevant for this analysis. The CUSTOMER_ID predicate could be pushed down…

QUERY PLAN                                                                                          
---------------------------------------------------------------------------------------------------
Subquery Scan on payment_with_revenue  (cost=117.06..124.45 rows=8 width=52)                       
  Filter: ((payment_date >= '2005-05-25') AND (payment_date <= '2005-05-29'))
-> WindowAgg  (cost=117.06..121.49 rows=197 width=56)                                               
   -> Sort  (cost=117.06..117.55 rows=197 width=24)                                              
      Sort Key: payment.customer_id, payment.payment_date, payment.payment_id                  
      -> Result  (cost=0.29..109.55 rows=197 width=24)                                        
         -> Append  (cost=0.29..107.58 rows=197 width=24)                                  
            -> Index Scan using idx_fk.. on payment  (cost=0.29..18.21 rows=77 width=20)
               Index Cond: (customer_id = ANY ('{1,2,3}'::integer[]))
               -> Bitmap Heap Scan on payment_p2007_01  (cost=4.62..14.90 rows=20 width=26)
                  Recheck Cond: (customer_id = ANY ('{1,2,3}'::integer[]))               
                  -> Bitmap Index Scan on idx_fk.. (cost=0.00..4.61 rows=20 width=0)
                     Index Cond: (customer_id = ANY ('{1,2,3}'::integer[]))           
                  -> Bitmap Heap Scan on payment_p2007_02  (cost=4.62..14.90 rows=20 width=26)
                     Recheck Cond: (customer_id = ANY ('{1,2,3}'::integer[]))               
                  -> Bitmap Index Scan on idx_fk.. (cost=0.00..4.61 rows=20 width=0)
                     Index Cond: (customer_id = ANY ('{1,2,3}'::integer[]))           
              ...

But manual optimisation is required to get better behaviour for the date range:

QUERY PLAN                                                                                           
-----------------------------------------------------------------------------------------------------
Subquery Scan on inlined  (cost=18.46..18.56 rows=3 width=48)                                        
  Filter: (inlined.payment_date >= '2005-05-25'::date)                    
-> WindowAgg  (cost=18.46..18.52 rows=3 width=52)                                                 
   -> Sort  (cost=18.46..18.46 rows=3 width=20)                                                
      Sort Key: payment.customer_id, payment.payment_date, payment.payment_id                
      -> Result  (cost=0.29..18.43 rows=3 width=20)                                         
         -> Append  (cost=0.29..18.40 rows=3 width=20)                                   
            -> Index Scan using idx_fk.. on payment  (cost=0.29..18.40 rows=3 width=20)
                Index Cond: (customer_id = ANY ('{1,2,3}'::integer[]))
                Filter: (payment_date <= '2005-05-29'::date)

Interestingly, the index still isn’t used optimally on both columns, which has nothing to do with the current discussion on window functions. PostgreSQL seems to be unable to think of the IN predicate as an equality predicate. See also this article about other optimisations (such as predicate merging) that are not possible (yet) in PostgreSQL.

But still, this is much better as it brings down the estimated cardinalities (in case this query is a subquery in a more sophisticated context), and more importantly, it filters out many many rows prior to calculating the window function.

SQL Server 2014

Another database that cannot push down this predicate past the window function optimally. Only the “obvious” part is pushed down:

|--Sort(ORDER BY:([payment_date] ASC))
   |--Filter(WHERE:([payment_date]>='2005-05-25' AND [payment_date]<='2005-05-29'))
      |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1004]=(0) THEN NULL ELSE [Expr1005] END))
         |--Stream Aggregate(GROUP BY:([WindowCount1009]) DEFINE:(..))
            |--Window Spool(RANGE BETWEEN:(UNBOUNDED, [[payment_date], [payment_id]]))
               |--Segment
                  |--Segment
                     |--Sort(ORDER BY:([customer_id] ASC, [payment_date] ASC, [payment_id] ASC))
                        |--Table Scan(OBJECT:([payment]), WHERE:([customer_id] IN (1, 2, 3)))

Interestingly, this doesn’t even use the index at all, but at least the data is filtered out prior to the calculation that relies on sorting. With the manual optimisation, again the same, much better effect:

|--Filter(WHERE:([payment_date]>='2005-05-25'))
   |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1004]=(0) THEN NULL ELSE [Expr1005] END))
      |--Stream Aggregate(GROUP BY:([WindowCount1011]) DEFINE:(..))
         |--Window Spool(RANGE BETWEEN:(UNBOUNDED, [[payment_date], [payment_id]]))
            |--Segment
               |--Segment
                  |--Sort(ORDER BY:([payment_date] ASC, [payment_id] ASC))
                     |--Nested Loops(Inner Join, OUTER REFERENCES:([Bmk1000]))
                        |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1007], [Expr1008], [Expr1006]))
                        |  |--Compute Scalar(DEFINE:(([Expr1007],[Expr1008],[Expr1006])=GetRangeWithMismatchedTypes(NULL,'2005-05-29',(42))))
                        |  |  |--Constant Scan
                        |  |--Index Seek(OBJECT:([idx_payment_i1]), SEEK:([customer_id] IN (1, 2, 3) AND [payment_date] > [Expr1007] AND [payment_date] < [Expr1008]))
                        |--RID Lookup(OBJECT:([payment]))

Certainly, this is a bit cryptic to read but it really means the same thing as always: The manual optimisation worked and we got a better plan.

Meh, does it matter?

I hope so! Let’s benchmark these things against each other! Some info about our benchmarking technique in our previous post and on this page here. Specifically, we don’t publish actual execution times, only relative times within the benchmark as we do not want to compare databases against each other but only against themselves.

DB2 LUW 10.5

RUN |STMT |RATIO  |
----|-----|-------|
1   |1    |3.0890 |
1   |2    |1.2272 |
2   |1    |3.0624 |
2   |2    |1.0100 |
3   |1    |3.0389 |
3   |2    |1.0000 |
4   |1    |3.1566 |
4   |2    |1.0948 |
5   |1    |3.1817 |
5   |2    |1.0905 |

The manually optimised statement is 3x faster in our benchmark. Do bear in mind that we’re operating on a rather small data set of a total of a few thousand rows! This gets worse in a larger data set.

MySQL 8.0.2

The difference is devastating in MySQL 8.0.2, which just recently introduced window functions. Surely, the MySQL team will be able to apply some further optimisations prior to GA – I’ve filed an issue for review:

0	1	431.1905
0	2	1.0000
1	1	372.4286
1	2	1.0000
2	1	413.4762
2	2	1.0000
3	1	381.2857
3	2	1.0000
4	1	400.1429
4	2	1.2857

Oracle 12.2.0.1

Another factor 4x can be observed in Oracle:

Run 1, Statement 1 : 4.58751
Run 1, Statement 2 : 1.37639
Run 2, Statement 1 : 4.71833
Run 2, Statement 2 : 1.03693
Run 3, Statement 1 : 4.05729
Run 3, Statement 2 : 1.04719
Run 4, Statement 1 : 3.86653
Run 4, Statement 2 : 1
Run 5, Statement 1 : 3.99603
Run 5, Statement 2 : 1.0212

PostgreSQL 10

PostgreSQL is quite bad too, here. A factor 7x can be observed:

RUN 1, Statement 1: 7.23373
RUN 1, Statement 2: 1.01438
RUN 2, Statement 1: 6.62028
RUN 2, Statement 2: 1.26183
RUN 3, Statement 1: 8.40322
RUN 3, Statement 2: 1.04074
RUN 4, Statement 1: 6.33401
RUN 4, Statement 2: 1.06750
RUN 5, Statement 1: 6.41649
RUN 5, Statement 2: 1.00000

SQL Server 2014

Another very significant penalty in SQL Server for the unoptimised version:

Run 1, Statement 1: 29.50000
Run 1, Statement 2: 1.07500
Run 2, Statement 1: 28.15000
Run 2, Statement 2: 1.00000
Run 3, Statement 1: 28.00000
Run 3, Statement 2: 1.00000
Run 4, Statement 1: 28.00000
Run 4, Statement 2: 1.00000
Run 5, Statement 1: 31.07500
Run 5, Statement 2: 1.00000

Bad news for views. Is there a better solution?

This is rather bad news for window functions inside of reusable views. None of the databases, not even DB2 or Oracle can push down range predicates past a derived table’s window function, if the column that is part of the range predicate doesn’t correspond to the window function’s PARTITION BY clause.

The problem described above can be easily fixed when the query is written manually, expanding all possible views into their calling SQL, but that kind of sucks – we’d love to make our code reusable. There’s one solution in databases that support inline table valued functions. Among the tested databases, these include:

  • DB2
  • PostgreSQL
  • SQL Server

MySQL doesn’t have table valued functions, and Oracle’s (very regrettably) are not inlineable because they have to be written in PL/SQL.

Here’s how to write these functions:

DB2

Function definition:

CREATE OR REPLACE FUNCTION f_payment_with_revenue (
  p_customer_id BIGINT,
  p_from_date DATE,
  p_to_date DATE
)
RETURNS TABLE (
  customer_id BIGINT,
  payment_date DATE,
  amount DECIMAL(10, 2),
  cumulative_amount DECIMAL(10, 2)
)
LANGUAGE SQL
RETURN
SELECT *
FROM (
  SELECT
    customer_id,
    payment_date,
    amount,
    SUM(amount) OVER (
      PARTITION BY customer_id
      ORDER BY payment_date, payment_id
    ) cumulative_amount
  FROM payment
  WHERE customer_id = p_customer_id
  AND payment_date <= p_to_date
) t
WHERE payment_date >= p_from_date;

Function call:

SELECT 
  payment_date,
  amount,
  cumulative_amount
FROM (
  SELECT customer_id FROM customer WHERE customer_id IN (1, 2, 3)
) c(customer_id),
TABLE(sakila.f_payment_with_revenue(
  c.customer_id,
  CAST('2005-05-25' AS DATE),
  CAST('2005-05-29' AS DATE)
))
ORDER BY payment_date;

Execution plan:

Explain Plan                                                    
----------------------------------------------------------------
ID | Operation                     |                 Rows | Cost
 1 | RETURN                        |                      |   33
 2 |  TBSCAN                       |     4 of 4 (100.00%) |   33
 3 |   SORT                        |     4 of 4 (100.00%) |   33
 4 |    NLJOIN                     |               4 of 1 |   33
 5 |     NLJOIN                    |               3 of 1 |   20
 6 |      TBSCAN GENROW            |     3 of 3 (100.00%) |    0
 7 |      IXSCAN PK_CUSTOMER       |   1 of 599 (   .17%) |    6
 8 |     FILTER                    |     1 of 1 (100.00%) |   13
 9 |      TBSCAN                   |     1 of 1 (100.00%) |   13
10 |       SORT                    |     1 of 1 (100.00%) |   13
11 |        FETCH PAYMENT          |     1 of 1 (100.00%) |   13
12 |         IXSCAN IDX_PAYMENT_I1 | 1 of 16049 (   .01%) |    6
                                                                
Predicate Information                                           
  5 - JOIN (Q3.CUSTOMER_ID = Q2.$C0)                            
  7 - START (Q3.CUSTOMER_ID = Q2.$C0)                           
       STOP (Q3.CUSTOMER_ID = Q2.$C0)                           
  8 - RESID ('2005-05-25' <= Q6.PAYMENT_DATE)                   
 12 - START (Q4.CUSTOMER_ID = Q3.CUSTOMER_ID)                   
       STOP (Q4.CUSTOMER_ID = Q3.CUSTOMER_ID)                   
       STOP (Q4.PAYMENT_DATE <= '2005-05-29')                   

Much better!

Benchmark result (Statement 1 = function call, Statement 2 = manually optimised):

RUN |STMT |RATIO  |
----|-----|-------|
1   |1    |1.5945 |
1   |2    |1.0080 |
2   |1    |1.6310 |
2   |2    |1.0768 |
3   |1    |1.5827 |
3   |2    |1.0090 |
4   |1    |1.5486 |
4   |2    |1.0084 |
5   |1    |1.5569 |
5   |2    |1.0000 |

Definitely a huge improvement. The comparison might not be entirely fair because

  • CROSS APPLY / LATERAL unnesting tends to generate nested loops that could be written more optimally with a classic join
  • We have an additional auxiliary customer table access (which could probably be tuned away with another rewrite)

PostgreSQL

Function definition:

CREATE OR REPLACE FUNCTION f_payment_with_revenue (
  p_customer_id BIGINT,
  p_from_date DATE,
  p_to_date DATE
)
RETURNS TABLE (
  customer_id SMALLINT,
  payment_date TIMESTAMP,
  amount DECIMAL(10, 2),
  cumulative_amount DECIMAL(10, 2)
)
AS $$
SELECT *
FROM (
  SELECT
    customer_id,
    payment_date,
    amount,
    SUM(amount) OVER (
      PARTITION BY customer_id
      ORDER BY payment_date, payment_id
    ) cumulative_amount
  FROM payment
  WHERE customer_id = p_customer_id
  AND payment_date <= p_to_date
) t
WHERE payment_date >= p_from_date
$$ LANGUAGE SQL;

Function call:

SELECT 
  payment_date,
  amount,
  cumulative_amount
FROM (
  SELECT customer_id FROM customer WHERE customer_id IN (1, 2, 3)
) c(customer_id)
CROSS JOIN LATERAL f_payment_with_revenue(
  c.customer_id,
  CAST('2005-05-25' AS DATE),
  CAST('2005-05-29' AS DATE)
)
ORDER BY payment_date;

Execution plan:

QUERY PLAN                                                                                    
----------------------------------------------------------------------------------------------
Sort  (cost=250.39..257.89 rows=3000 width=72)                                                
  Sort Key: f_payment_with_revenue.payment_date                                               
  ->  Nested Loop  (cost=0.53..77.13 rows=3000 width=72)                                      
        ->  Index Only Scan using customer_pkey on customer  (cost=0.28..16.88 rows=3 width=4)
              Index Cond: (customer_id = ANY ('{1,2,3}'::integer[]))                          
        ->  Function Scan on f_payment_with_revenue  (cost=0.25..10.25 rows=1000 width=72)    

Oops, no unnesting of the function is happening. The cardinality defaults to 1000. That’s bad news!

Benchmark result (Statement 1 = function call, Statement 2 = manually optimised):

RUN 1, Statement 1: 25.77538
RUN 1, Statement 2: 1.00000
RUN 2, Statement 1: 27.55197
RUN 2, Statement 2: 1.11581
RUN 3, Statement 1: 27.99331
RUN 3, Statement 2: 1.16463
RUN 4, Statement 1: 29.11022
RUN 4, Statement 2: 1.01159
RUN 5, Statement 1: 26.65781
RUN 5, Statement 2: 1.01654

Rats. This has gotten much worse than with the view. Not surprising, though. Table valued functions are not that good of an idea when they cannot be inlined! Oracle would have had a similar result if I wasn’t too lazy to translate my function to an ordinary PL/SQL table valued function, or a pipelined function.

SQL Server

Function definition:

CREATE FUNCTION f_payment_with_revenue (
  @customer_id BIGINT,
  @from_date DATE,
  @to_date DATE
)
RETURNS TABLE
AS RETURN
SELECT *
FROM (
  SELECT
    customer_id,
    payment_date,
    amount,
    SUM(amount) OVER (
      PARTITION BY customer_id
      ORDER BY payment_date, payment_id
    ) cumulative_amount
  FROM payment
  WHERE customer_id = @customer_id
  AND payment_date <= @to_date
) t
WHERE payment_date >= @from_date;

Function call:

SELECT 
  payment_date,
  amount,
  cumulative_amount
FROM (
  SELECT customer_id FROM customer WHERE customer_id IN (1, 2, 3)
) AS c(customer_id)
CROSS APPLY f_payment_with_revenue(
  c.customer_id,
  CAST('2005-05-25' AS DATE),
  CAST('2005-05-29' AS DATE)
)
ORDER BY payment_date;

Execution plan

|--Sort(ORDER BY:([payment_date] ASC))
   |--Nested Loops(Inner Join, OUTER REFERENCES:([customer_id]))
      |--Index Seek(OBJECT:([PK__customer__CD65CB84E826462D]), SEEK:([customer_id] IN (1, 2, 3))
      |--Filter(WHERE:([payment_date]>='2005-05-25'))
         |--Compute Scalar(DEFINE:([Expr1006]=CASE WHEN [Expr1007]=(0) THEN NULL ELSE [Expr1008] END))
            |--Stream Aggregate(GROUP BY:([WindowCount1014]) DEFINE:(..)))
               |--Window Spool(RANGE BETWEEN:(UNBOUNDED, [[payment_date], [payment_id]]))
                  |--Segment
                     |--Segment
                        |--Sort(ORDER BY:([payment_date] ASC, [payment_id] ASC))
                           |--Nested Loops(Inner Join, OUTER REFERENCES:([Bmk1003]))
                              |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1010], [Expr1011], [Expr1009]))
                              |  |--Compute Scalar(DEFINE:(([Expr1010],[Expr1011],[Expr1009])=GetRangeWithMismatchedTypes(NULL,'2005-05-29',(42))))
                              |  |  |--Constant Scan
                              |  |--Index Seek(OBJECT:([idx_payment_i1]), SEEK:([customer_id]=CONVERT_IMPLICIT(bigint,[customer_id],0) AND [payment_date] > [Expr1010] AND [payment_date] < [Expr1011]))
                              |--RID Lookup(OBJECT:([payment]), SEEK:([Bmk1003]=[Bmk1003]))

Again, super unreadable IMO, but after looking a bit more closely, we can see that the plan is almost the same as the manually optimised one, and the predicate is applied early on, where it belongs.

Benchmark result (Statement 1 = function call, Statement 2 = manually optimised):

Run 1, Statement 1: 2.50000
Run 1, Statement 2: 1.27778
Run 2, Statement 1: 2.11111
Run 2, Statement 2: 1.27778
Run 3, Statement 1: 2.11111
Run 3, Statement 2: 1.00000
Run 4, Statement 1: 2.22222
Run 4, Statement 2: 1.11111
Run 5, Statement 1: 2.02778
Run 5, Statement 2: 1.19444

Conclusion

Window functions are super cool and powerful. But they come at a price. They sort your data. Normally, when we write complex queries and reuse parts in views, we can profit from predicate push down operations into derived tables and views, which is something that most databases support (see also our previous blog post about such optimisations).

But when it comes to using window functions, they act like a “fence”, past which only few predicates can be pushed automatically. It’s not that it wouldn’t be possible, it simply isn’t done very well by most databases (and in the case of MySQL, not at all as of 8.0.2).

Inline table valued functions can be a remedy to avoid manual building of complex queries, such that at least some parts of your logic can be reused among queries. Unfortunately, they rely on CROSS APPLY or LATERAL JOIN, which can also cause performance issues in more complex setups. Besides, among the databases covered in this article, only DB2 and SQL Server support inline table valued functions. Oracle doesn’t support SQL functions at all, and PostgreSQL’s SQL functions are not inlinable (yet), which means that in these databases, in order to tune such queries, you might not be able to reuse the parts that use window functions in views or stored functions.

However, as always, do measure. Perhaps, a 4x waste of performance for a particular query is OK.

Squeezing Another 10% Speed Increase out of jOOQ using JMC and JMH

In this post, we’re going to discuss a couple of recent efforts to squeeze roughly 10% in terms of speed out of jOOQ by iterating on hotspots that were detected using JMC (Java Mission Control) and then validated using JMH (Java Microbenchmark Harness). This post shows how to apply micro optimisations to algorithms where the smallest improvement can have a significant effect.

While JMH is probably without competition, JMC could easily be replaced by JProfiler, YourKit, or even your own manual jstack sampling. I’ll just use JMC because it ships with the JDK and is free for use for development as of JDK 8 and 9 (if you’re unsure whether you’re “developing”, better ask Oracle). Rumours have it that JMC might be contributed to the OpenJDK in the near future.

Micro optimisations

Micro optimisations are a cool technique to squeeze a very small improvement out of a local algorithm (e.g. a loop) that has a significant effect on the entire application / library, because of the fact that the local algorithm is called many times. This is absolutely the case in jOOQ, which is essentially a library that always runs 4 nested loops:

  1. S: A “loop” over all possible SQL statements
  2. E: A “loop” over all executions of such a statement
  3. R: A loop over all rows in the result
  4. C: A loop over all columns in a row

Such four level nested loops result in what we could call a polynomial complexity of our algorithms, even if we cannot call the complexity O(N4) (as the 4 “N” are not all the same), it is certainly of O(S x E x R x C) (I’ll call this “S-E-R-C loops” further down). Even to the untrained eye, it becomes evident that anything that happens in the inner-most “C-loop” can have devastating effects. We better not be opening any files here, that could be opened outside of, e.g. the “S-loop”

In a previous blog post, we’ve discussed common techniques of optimising such situations. In this blog post, we’ll look into a couple of concrete examples.

How to discover flaws in these loops?

We’re looking for the problems that affect all users, the kind of problem that, once fixed, will improve jOOQ’s performance for everyone by e.g. 10%. This is similar to what the JIT does, by performing things like stack allocation, inlining, which don’t drastically improve things locally, but do so globally, and for everyone. Here’s an interesting guest post by Tagir Valeev on JIT optimisation, and how good it is.

Getting a large “S-loop”

The first option is to run profiling sessions on benchmarks. We could, for example, run the entire “S-E-R-C loops” in a JMC profiling session, where the “S-loop” is a loop over all our statements, or in other words, over all our integration tests. Unfortunately, with this approach, our “E-loop” (in the case of jOOQ’s integration tests) is a single execution per statement. We’d have to run the integration tests many, many times in order to get meaningful results.

Also, while the jOOQ integration tests run thousands of distinct queries, most queries are still rather simple, each one focusing on an individual SQL feature (e.g. lateral join). In a end user application, queries might use less specific features, but are much more complex, i.e. they have a lot of ordinary joins.

This technique is useful to find problems that appear in all queries, deep down inside of jOOQ – e.g. at the JDBC interface. But we cannot use this approach to test individual features.

Getting a large “E-loop”

Another option is to write a single test that runs a few statements (small “S-loop”) many times in an explicit loop (large “E-loop”). This has the advantage that a specific bottleneck can be found with a high confidence, but the drawback is: It’s specific. For instance, if we find a small bottleneck in the string concatenation function, well, that is certainly worth fixing, but doesn’t affect most users.

This approach is useful to test individual features. It can also be useful for finding issues that affect all queries, but with a lower confidence than the previous case, where the “S-loop” is maximised.

Getting large “R-loops” and “C-loops”

Creating large result sets is easy and should definitely be part of such benchmarks, because in the case of a large result set, any flaw will multiply drastically, so fixing these things is worthwhile. However, these problems only affect actual result sets, not the query building process or the execution process. Sure, most statements are probably queries, not insertions / updates, etc. But this needs to be kept in mind.

Optimising for problems in large “E-loops”

All of the above scenarios are different optimisation sessions and deserve their own blog posts. In this post, I’m describing what has been discovered and fixed when running a single query 3 million times on an H2 database. The H2 database is chosen here, because it can run in memory of the same process and thus has the least extra overhead compared to jOOQ – so jOOQ’s overhead contributions become significant in a profiling session / benchmark. In fact, it can be shown that in such a benchmark, jOOQ (or Hibernate, etc.) appears to perform quite poorly compared to a JDBC only solution, as many have done before.

This is an important moment to remind ourselves:

Benchmarks do not reflect real-world use cases! You will never run the exact same query 3 million times on a production system, and your production system doesn’t run on H2.

A benchmark profits from so much caching, buffering, you would never perform as fast as in a benchmark.

Always be careful not to draw any wrong conclusions from a benchmark!

This needs to be said, so take every benchmark you find on the web with a grain of salt. This includes our own!

The query being profiled is:

ctx.select(
      AUTHOR.FIRST_NAME,
      AUTHOR.LAST_NAME,
      BOOK.ID,
      BOOK.TITLE)
   .from(BOOK)
   .join(AUTHOR).on(BOOK.AUTHOR_ID.eq(AUTHOR.ID))
   .where(BOOK.ID.eq(1))
   .and(BOOK.TITLE.isNull().or(BOOK.TITLE.ne(randomValue)));

The trivial query returns a ridiculous 4 rows and 4 columns, so the “R-loop” and “C-loops” are negligible. This benchmark is really testing the overhead of jOOQ query execution in a case where the database does not contribute much to the execution time. Again, in a real world scenario, you will get much more overhead from your database.

In the following sections, I’ll show a few minor bottlenecks that could be found when drilling down into these such execution scenarios. As I’ve switched between JMC versions, the screenshots will not always be the same, I’m afraid.

1. Instance allocation of constant values

A very silly mistake was easily discovered right away:

The mistake didn’t contribute a whole lot of overhead, only 1.1% to the sampled time spent, but it made me curious. In version 3.10 of jOOQ, the SelectQueryImpl‘s Limit class, which encodes the jOOQ OFFSET / LIMIT behaviour kept allocating this DSL.val() thingy, which is a bind variable. Sure, limits do work with bind variables, but this happened when SelectQueryImpl was initialised, not when the LIMIT clause is added by the jOOQ API user.

As can be seen in the sources, the following logic was there:

private static final Field<Integer> ZERO              = zero();
private static final Field<Integer> ONE               = one();
private Field<Integer>              numberOfRowsOrMax = 
    DSL.inline(Integer.MAX_VALUE);

While the “special limits” ZERO and ONE were static members, the numberOfRowsOrMax value wasn’t. That’s the instantiation we were measuring in JMC. The member is not a constant, but the default value is. It is always initialised with Integer.MAX_VALUE wrapped in an DSL.inline() call. The solution is really simple:

private static final Param<Integer> MAX               = 
    DSL.inline(Integer.MAX_VALUE);
private Field<Integer>              numberOfRowsOrMax = MAX;

This is obviously better! Not only does it avoid the allocation of the bind variable, it also avoids the boxing of Integer.MAX_VALUE (which can also be seen in the sampling screenshot).

Note, a similar optimisation is available in the JDK’s ArrayList. When you look at the sources, you’ll see:

/**
 * Shared empty array instance used for empty instances.
 */
private static final Object[] EMPTY_ELEMENTDATA = {};

When you initialise an ArrayList without initial capacity, it will reference this shared instance, instead of creating a new, empty (or even non-empty) array. This delays the allocation of such an array until we actually add things to the ArrayList, just in case it stays empty.

jOOQ’s LIMIT is the same. Most queries might not have a LIMIT, so better not allocate that MAX_VALUE afresh!

This is done once per “E-loop” iteration

One issue down: https://github.com/jOOQ/jOOQ/issues/6635

2. Copying lists in internals

This is really a micro optimisation that you probably shouldn’t do in ordinary business logic. But it might be worthwhile in infrastructure logic, e.g. when you’re also in an “S-E-R-C loop”:

jOOQ (unfortunately) occasionally copies data around between arrays, e.g. wrapping Strings in jOOQ wrapper types, transforming numbers to strings, etc. These loops aren’t bad per se, but remember, we’re inside some level of the “S-E-R-C loop”, so these copying operations might be run hundreds of millions of times when we run a statement 3 million times.

The above loop didn’t contribute a lot of overhead, and possible the cloned object was stack allocated or the clone call eliminated by the JIT. But maybe it wasn’t. The QualifiedName class cloned its argument prior to returning it to make sure that no accidental modifications will have any side effect:

private static final String[] nonEmpty(String[] qualifiedName) {
    String[] result;
    ...
    if (nulls > 0) {
        result = new String[qualifiedName.length - nulls];
        ...
    }
    else {
        result = qualifiedName.clone();
    }
    return result;
}

So, the implementation of the method guaranteed a new array as a result.

After a bit of analysis, it could be seen that there is only a single consumer of this method, and it doesn’t leave that consumer. So, it’s safe to remove the clone call. Probably, the utility was refactored from a more general purpose method into this local usage.

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6640

3. Running checks in loops

This one is too silly to be true:

There’s a costly overhead in the CombinedCondition constructor (<init> method). Notice, how the samples drop from 0.47% to 0.32% between the constructor and the next method init(), that’s the time spent inside the constructor.

A tiny amount of time, but this time is spent every time someone combines two conditions / predicates with AND and OR. Every time. We can probably save this time. The problem is this:

CombinedCondition(Operator operator, Collection<? extends Condition> conditions) {
    ...
    for (Condition condition : conditions)
        if (condition == null)
            throw new IllegalArgumentException("The argument 'conditions' must not contain null");

    ...
    init(operator, conditions);
}

There’s a loop over the arguments to give some meaningful error messages. That’s a bit too defensive, I suspect. How about we simply live with the NPE when it arises, as this should be rather unexpected (for the context, jOOQ hardly ever checks on parameters like this, so this should also be removed for consistency reasons).

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6666 (nice number)

4. Lazy initialisation of lists

The nature of the JDBC API forces us to work with ThreadLocal variables, very unfortunately, as it is not possible to pass arguments from parent SQLData objects to children, especially when we combine nesting of Oracle TABLE/VARRAY and OBJECT types.

In this analysis, we’re combining the profiler’s CPU sampling with its memory sampling:

In the CPU sampling view above, we can see some overhead in the DefaultExecuteContext, which is instantiated once per “E-loop” iteration. Again, not a huge overhead, but let’s look at what this constructor does. It contributes to the overall allocations of ArrayList:

When we select the type in JMC, the other view will then display all the stack traces where ArrayList instances were allocated, among which, again, our dear DefaultExecuteContext constructor:

Where are those ArrayLists allocated? Right here:

BLOBS.set(new ArrayList<Blob>());
CLOBS.set(new ArrayList<Clob>());
SQLXMLS.set(new ArrayList<SQLXML>());
ARRAYS.set(new ArrayList<Array>());

Every time we start executing a query, we initialise a list for each ones of these types. All of our variable binding logic will then register any possibly allocated BLOB or CLOB, etc. such that we can clean these up at the end of the execution (a JDBC 4.0 feature that not everyone knows of!):

static final void register(Blob blob) {
    BLOBS.get().add(blob);
}
    
static final void clean() {
    List<Blob> blobs = BLOBS.get();

    if (blobs != null) {
        for (Blob blob : blobs)
            JDBCUtils.safeFree(blob);

        BLOBS.remove();
    }
    ...
}

Don’t forget calling Blob.free() et al, if you’re working with JDBC directly!

But the truth is, in most cases, we don’t really need these things. We need them only in Oracle, and only if we’re using TABLE / VARRAY or OBJECT types, due to some JDBC restrictions. Why punish all the users of other databases with this overhead? Instead of a sophisticated refactoring, which risks introducing regressions (https://github.com/jOOQ/jOOQ/issues/4205), we can simply initialise these lists lazily. We leave the clean() method as it is, remove the initialisation in the constructor, and replace the register() logic by this:

static final void register(Blob blob) {
    List<Blob> list = BLOBS.get();

    if (list == null) {
        list = new ArrayList<Blob>();
        BLOBS.set(list);
    }

    list.add(blob);
}

That was easy. And significant. Check out the new allocation measurements:

Note that every allocation, apart from the overhead of allocating things, also incurs additional overhead when the object is garbage collected. That’s a bit trickier to measure and correlate. In general, less allocations is almost always a good thing, except if the allocation is super short lived, in case of which stack allocation can happen, or the logic can even be eliminated by the JIT.

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6669

6. Using String.replace()

This is mostly a problem in JDK 8 only, JDK 9 fixed string replacing by no longer relying on regular expressions internally. In JDK 8, however (and jOOQ still supports Java 6, so this is relevant), string replacement works through regular expressions as can be seen here:

The Pattern implementation allocates quite a few int[] instances, even if that’s probably not strictly needed for non-regex patterns as those of String.replace():

I’ve already analysed this in a previous blog post, which can be seen here:

https://blog.jooq.org/2017/10/11/benchmarking-jdk-string-replace-vs-apache-commons-stringutils-replace/

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6672

7. Registering an SPI that is going to be inactive

This one was a bit more tricky to solve as it relies on a deeper analysis. Unfortunately, I have no profiling screenshots available anymore, but it is easy to explain with code. There’s an internal ExecuteListeners utility, which abstracts over the ExecuteListener SPIs. Users can register such a listener and listen to query rendering, variable binding, query execution, and other lifecycle events. By default, there is no such ExecuteListener by the users, but there’s always one internal ExecuteListener:

private static ExecuteListener[] listeners(ExecuteContext ctx) {
    List<ExecuteListener> result = new ArrayList<ExecuteListener>();

    for (ExecuteListenerProvider provider : ctx.configuration()
                                               .executeListenerProviders())
        if (provider != null)
            result.add(provider.provide());

    if (!FALSE.equals(ctx.settings().isExecuteLogging()))
        result.add(new LoggerListener());

    return result.toArray(EMPTY_EXECUTE_LISTENER);
}

The LoggerListener is added by default, unless users turn off that feature. Which means:

  • We’ll pretty much always get this ArrayList
  • We’ll pretty much always loop over this list
  • We’ll pretty much always clal this LoggerListener

But what does it do? It logs stuff on DEBUG and TRACE level. For instance:

@Override
public void executeEnd(ExecuteContext ctx) {
    if (ctx.rows() >= 0)
        if (log.isDebugEnabled())
            log.debug("Affected row(s)", ctx.rows());
}

That’s what it does by definition. It’s a debug logger. So, the improved logic for initialising this thing is the following:

private static final ExecuteListener[] listeners(ExecuteContext ctx) {
    List<ExecuteListener> result = null;

    for (ExecuteListenerProvider provider : ctx.configuration()
                                               .executeListenerProviders())
        if (provider != null)
            (result = init(result)).add(provider.provide());

    if (!FALSE.equals(ctx.settings().isExecuteLogging())) {
        if (LOGGER_LISTENER_LOGGER.isDebugEnabled())
            (result = init(result)).add(new LoggerListener());
    }

    return result == null ? null : result.toArray(EMPTY_EXECUTE_LISTENER);
}

We’re no longer allocating the ArrayList (that might be premature, the JIT might have rewritten this allocation to not happen, but OK), and we’re only adding the LoggerListener if it DEBUG or TRACE logging is enabled for it, i.e. if it would do any work at all.

That’s just a couple of CPU cycles we can save on every execution. Again, I don’t have the profiling measurements anymore, but trust me. It helped.

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6747

8. Eager allocation where lazy allocation works

Sometimes, we need two different representations of the same information. The “raw” representation, and a more useful, pre-processed representation for some purposes. This was done, for instance, in QualifiedField:

private final Name          name;
private final Table<Record> table;

QualifiedField(Name name, DataType<T> type) {
    super(name, type);

    this.name = name;
    this.table = name.qualified()
        ? DSL.table(name.qualifier())
        : null;
}

@Override
public final void accept(Context<?> ctx) {
    ctx.visit(name);
}

@Override
public final Table<Record> getTable() {
    return table;
}

As can be seen, the name is really the beef of this class. It’s a qualified name that generates itself on the SQL string. The Table representation is useful when navigating the meta model, but this is hardly ever done by jOOQ’s internals and/or user facing code.

However, this eager initialisation it is costly:

Quite a few UnqualifiedName[] arrays are allocated by the call to Name.qualifier(). We can easily make that table reference non-final and calculate it lazily:

private final Name              name;
private Table<Record>           table;

QualifiedField(Name name, DataType<T> type) {
    super(name, type);

    this.name = name;
}

@Override
public final Table<Record> getTable() {
    if (table == null)
        table = name.qualified() ? DSL.table(name.qualifier()) : null;

    return table;
}

Because name is final, we could call table “effectively final” (in a different meaning than the Java language’s) – we won’t have any thread safety issues because these particular types are immutable inside of jOOQ.

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6755

Results

Now, thus far, we’ve “improved” many low hanging fruit based on a profiler session (that was run, akhem, from outside of Eclipse on a rather busy machine). This wasn’t very scientific. Just tracking down “bottlenecks” which triggered my interest by having high enough numbers to even notice. This is called “micro optimisation”, and it is only worth the trouble if you’re in a “S-E-R-C loop”, meaning that the code you’re optimising is executed many many times. For me, developing jOOQ, this is almost always the case, because jOOQ is a library used by a lot of people who all profit from these optimisations. In many other cases, this might be called “premature optimisation”

But once we’ve optimised, we shouldn’t stop. I’ve done a couple of individual JMH benchmarks for many of the above problems, to see if they were really an improvement. But sometimes, in a JMH benchmark, something that doesn’t look like an improvement will still be an improvement in the bigger picture. The JVM doesn’t inline all methods 100 levels deep. If your algorithm is complex, perhaps a micro optimisation will still have an effect that would not have any effect on a JMH benchmark.

Unfortunately this isn’t very exact science, but with enough intuition, you’ll find the right spots to optimise.

In my case, I verified progress over two patch releases: 3.10.0 -> 3.10.1 -> 3.10.2 (not yet released) by running a JMH benchmark over the entire query execution (including H2’s part). The results of applying roughly 15 of the above and similar optimisations (~2 days’ worth of effort) is:

JDK 9 (9+181)

jOOQ 3.10.0 Open Source Edition

Benchmark                          Mode   Cnt       Score      Error  Units
ExecutionBenchmark.testExecution   thrpt   21  101891.108 ± 7283.832  ops/s

jOOQ 3.10.2 Open Source Edition

Benchmark                          Mode   Cnt       Score      Error  Units
ExecutionBenchmark.testExecution   thrpt   21  110982.940 ± 2374.504  ops/s

JDK 8 (1.8.0_145)

jOOQ 3.10.0 Open Source Edition

Benchmark                          Mode   Cnt       Score      Error  Units
ExecutionBenchmark.testExecution   thrpt   21  110178.873 ± 2134.894  ops/s

jOOQ 3.10.2 Open Source Edition

Benchmark                          Mode   Cnt       Score      Error  Units
ExecutionBenchmark.testExecution   thrpt   21  118795.922 ± 2661.653  ops/s

As can be seen, in both JDK versions, we’ve gotten roughly a 10% speed increase. What’s interesting is also that JDK 8 seemed to have been also 10% faster than JDK 9 in this benchmark, although this can be due to a variety of things that I haven’t considered yet, and which are out of scope for this discussion.

Conclusion

This iterative approach to tackling performance is definitely worth it for library authors:

  • run a representative benchmark (repeat a task millions of times)
  • profile it
  • track down “bottlenecks”
  • if they’re easy to fix without regression risk, do it
  • repeat
  • after a while, verify with JMH

Individual improvements are quite hard to measure, or measure correctly. But when you do 10-15 of them, they start adding up and become significant. 10% can make a difference.

Looking forward to your comments, alternative techniques, alternative tools, etc.!

If you liked this article, you will also like Top 10 Easy Performance Optimisations in Java

A Basic Programming Pattern: Filter First, Map Later

In recent days, I’ve seen a bit too much of this:

someCollection
    .stream()
    .map(e -> someFunction(e))
    .collect(Collectors.toList())
    .subList(0, 2);

Something is very wrong with the above example. Can you see it? No? Let me rename those variables for you.

hugeCollection
    .stream()
    .map(e -> superExpensiveMapping(e))
    .collect(Collectors.toList())
    .subList(0, 2);

Better now? Exactly. The above algorithm is O(N) when it could be O(1):

hugeCollection
    .stream()
    .limit(2)
    .map(e -> superExpensiveMapping(e))
    .collect(Collectors.toList());

(Let’s assume the lack of explicit ordering is irrelevant)

I’m working mostly with SQL and helping companies tune their SQL (check out our training, btw) and as such, I’m always very eager to reduce the algorithmic complexity of queries. To some people, this seems like wizardry, but honestly, most of the time, it just boils down to adding a well placed index.

Why?

Because an index reduces the algorithmic complexity of a query algorithm from O(N) (scanning the entire table for matches) to O(log N) (scanning a B-tree index for the same matches).

Sidenote: Reducing algorithmic complexity is almost never premature optimisation. As your data set grows, bad complexity will always bite you!

The same is true for the above examples. Why would you ever traverse and transform an entire list of a huge amount of elements (N) with an expensive operation (superExpensiveMapping), when you really only need to do this only for the first two values?

Conclusion

SQL is a declarative language where the query optimiser gets this right automatically for you: It will (almost) always filter the data first (WHERE clause), and only then transform it (JOIN, GROUP BY,
SELECT
, etc).

Just as in SQL, when we hand-write our queries using Streams (and also in imperative programming), do always:

Filter First, Map Later

Your production system will thank you.

Note, another interesting case with flatMap() is documented here.

How to Find Redundant Indexes in SQL

The following two indexes are redundant in most SQL databases:

CREATE INDEX i_actor_1 ON actor (last_name);
CREATE INDEX i_actor_2 ON actor (last_name, first_name);

It is usually safe to drop the first index, because all queries that query the LAST_NAME column only can still profit from the second index I_ACTOR_2. The reason being that LAST_NAME is the first column of the composite index I_ACTOR_2 (it would be a different story, if it weren’t the first column).

Note: It is usually safe to drop the first index, because the benefits probably outweigh the cost:

Benefits of dropping

Costs of dropping

  • Querying a composite index can be slightly slower as can be seen in the below benchmark

Let’s see the costs of dropping the index below for Oracle, PostgreSQL, and SQL Server in this particular case (beware as always when interpreting benchmarks, they heavily depend on a lot of context, especially data size!)

Oracle

Preparation:

CREATE TABLE t (
  a NUMBER(10) NOT NULL,
  b NUMBER(10) NOT NULL
);

INSERT INTO t (a, b)
SELECT level, level
FROM dual
CONNECT BY level <= 100000;

CREATE INDEX i1 ON t(a);
CREATE INDEX i2 ON t(a, b);

EXEC dbms_stats.gather_table_stats('TEST', 'T');

Benchmark:

SET SERVEROUTPUT ON
CREATE TABLE results (
  run     NUMBER(2),
  stmt    NUMBER(2),
  elapsed NUMBER
);

DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 2000;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT /*+INDEX(t i1)*/ * FROM t WHERE a = 1
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
  
    INSERT INTO results VALUES (r, 1, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT /*+INDEX(t i2)*/ * FROM t WHERE a = 1
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    INSERT INTO results VALUES (r, 2, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
  END LOOP;
  
  FOR rec IN (
    SELECT 
      run, stmt, 
      CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio 
    FROM results
  )
  LOOP
    dbms_output.put_line('Run ' || rec.run || 
      ', Statement ' || rec.stmt || 
      ' : ' || rec.ratio);
  END LOOP;
END;
/

DROP TABLE results;

The result being:

Run 1, Statement 1 : 1.4797
Run 1, Statement 2 : 1.45545

Run 2, Statement 1 : 1.1997
Run 2, Statement 2 : 1.01121

Run 3, Statement 1 : 1.13606
Run 3, Statement 2 : 1

Run 4, Statement 1 : 1.13455
Run 4, Statement 2 : 1.00242

Run 5, Statement 1 : 1.13303
Run 5, Statement 2 : 1.00606

Some notes on benchmarks here.

The fastest query execution in the above result yields 1, the other executions are multiples of 1. Yes, there’s a 10% difference in this case, so as you can see. The benefits (faster insertions) certainly should outweight the cost (slower queries), so, don’t apply this advice in a read-heavy / write-rarely database.

PostgreSQL

A similar difference can be seen in a PostgreSQL benchmark. No hints can be used to choose indexes, so we’re simply creating two tables:

CREATE TABLE t1 (
  a INT NOT NULL,
  b INT NOT NULL
);
CREATE TABLE t2 (
  a INT NOT NULL,
  b INT NOT NULL
);

INSERT INTO t1 (a, b)
SELECT s, s
FROM generate_series(1, 100000) AS s(s);

INSERT INTO t2 (a, b)
SELECT s, s
FROM generate_series(1, 100000) AS s(s);

CREATE INDEX i1 ON t1(a);
CREATE INDEX i2 ON t2(a, b);

ANALYZE t1;
ANALYZE t2;

Benchmark:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 10000;
  rec RECORD;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT * FROM t1 WHERE a = 1
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', r, 
      (clock_timestamp() - v_ts); 
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT * FROM t2 WHERE a = 1
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', r, 
      (clock_timestamp() - v_ts); 
    RAISE INFO '';
  END LOOP;
END$$;

Result:

INFO:  Run 1, Statement 1: 00:00:00.071891
INFO:  Run 1, Statement 2: 00:00:00.080833

INFO:  Run 2, Statement 1: 00:00:00.076329
INFO:  Run 2, Statement 2: 00:00:00.079772

INFO:  Run 3, Statement 1: 00:00:00.073137
INFO:  Run 3, Statement 2: 00:00:00.079483

INFO:  Run 4, Statement 1: 00:00:00.073456
INFO:  Run 4, Statement 2: 00:00:00.081508

INFO:  Run 5, Statement 1: 00:00:00.077148
INFO:  Run 5, Statement 2: 00:00:00.083535

SQL Server

Preparation:

CREATE TABLE t (
  a INT NOT NULL,
  b INT NOT NULL
);

WITH s(s) AS (
  SELECT 1
  UNION ALL
  SELECT s + 1 FROM s WHERE s < 100
)
INSERT INTO t
SELECT TOP 100000 
  row_number() over(ORDER BY (SELECT 1)), 
  row_number() over(ORDER BY (select 1)) 
FROM s AS s1, s AS s2, s AS s3;

CREATE INDEX i1 ON t(a);
CREATE INDEX i2 ON t(a, b);

UPDATE STATISTICS t;

Benchmark:

DECLARE @ts DATETIME;
DECLARE @repeat INT = 2000;
DECLARE @r INT;
DECLARE @i INT;
DECLARE @dummy VARCHAR;

DECLARE @s1 CURSOR;
DECLARE @s2 CURSOR;

DECLARE @results TABLE (
  run     INT,
  stmt    INT,
  elapsed DECIMAL
);

SET @r = 0;
WHILE @r < 5
BEGIN
  SET @r = @r + 1

  SET @s1 = CURSOR FOR 
    SELECT b FROM t WITH (INDEX (i1)) WHERE a = 1;

  SET @s2 = CURSOR FOR 
    SELECT b FROM t WITH (INDEX (i2)) WHERE a = 1;

  SET @ts = current_timestamp;
  SET @i = 0;
  WHILE @i < @repeat
  BEGIN
    SET @i = @i + 1

    OPEN @s1;
    FETCH NEXT FROM @s1 INTO @dummy;
    WHILE @@FETCH_STATUS = 0
    BEGIN
      FETCH NEXT FROM @s1 INTO @dummy;
    END;

    CLOSE @s1;
  END;

  DEALLOCATE @s1;
  INSERT INTO @results 
    VALUES (@r, 1, DATEDIFF(ms, @ts, current_timestamp));

  SET @ts = current_timestamp;
  SET @i = 0;
  WHILE @i < @repeat
  BEGIN
    SET @i = @i + 1

    OPEN @s2;
    FETCH NEXT FROM @s2 INTO @dummy;
    WHILE @@FETCH_STATUS = 0
    BEGIN
      FETCH NEXT FROM @s2 INTO @dummy;
    END;

    CLOSE @s2;
  END;

  DEALLOCATE @s2;
  INSERT INTO @results 
    VALUES (@r, 2, DATEDIFF(ms, @ts, current_timestamp));
END;

SELECT 'Run ' + CAST(run AS VARCHAR) + 
  ', Statement ' + CAST(stmt AS VARCHAR) + ': ' + 
  CAST(CAST(elapsed / MIN(elapsed) OVER() AS DECIMAL(10, 5)) AS VARCHAR)
FROM @results;

Result:

Run 1, Statement 1: 1.22368
Run 1, Statement 1: 1.09211

Run 2, Statement 1: 1.05263
Run 2, Statement 1: 1.09211

Run 3, Statement 1: 1.00000
Run 3, Statement 1: 1.05263

Run 4, Statement 1: 1.05263
Run 4, Statement 1: 1.00000

Run 5, Statement 1: 1.09211
Run 5, Statement 1: 1.05263

As can be seen, predictably, in all databases the smaller non-composite index is slightly faster for this type of query than the composite index. In this particular benchmark, this is specifically true because the composite index acts as a covering index.

Yet both indexes can be used for the query in a reasonable way, so if disk space / insertion speed is an issue, the redundant single-column index can be dropped.

How to find such indexes

The following query will help you detect such indexes in Oracle, PostgreSQL, and SQL Server:

Oracle

WITH indexes AS (
  SELECT
    i.owner,
    i.index_name,
    i.table_name,
    listagg(c.column_name, ', ')
      WITHIN GROUP (ORDER BY c.column_position)
      AS columns
  FROM all_indexes i
  JOIN all_ind_columns c
    ON i.owner = c.index_owner
    AND i.index_name = c.index_name
  GROUP BY i.owner, i.table_name, i.index_name, i.leaf_blocks
)
SELECT
  i.owner,
  i.table_name,
  i.index_name AS "Deletion candidate index",
  i.columns AS "Deletion candidate columns",
  j.index_name AS "Existing index",
  j.columns AS "Existing columns"
FROM indexes i
JOIN indexes j
  ON i.owner = j.owner
  AND i.table_name = j.table_name
  AND j.columns LIKE i.columns || ',%'

Result:

TABLE_NAME   delete index   columns   existing index   columns
-------------------------------------------------------------------------
T            I1             A         I2               A, B

In short, it lists all the indexes whose columns are a prefix of another index’s columns

PostgreSQL

Get ready for a really nifty query. Here’s how to discover redundant indexes in PostgreSQL, which unfortunately doesn’t seem to have an easy, out-of-the-box dictionary view to discover index columns:

WITH indexes AS (
  SELECT 
    tnsp.nspname AS schema_name,
    trel.relname AS table_name,
    irel.relname AS index_name,
    string_agg(a.attname, ', ' ORDER BY c.ordinality) AS columns
  FROM pg_index AS i
  JOIN pg_class AS trel ON trel.oid = i.indrelid
  JOIN pg_namespace AS tnsp ON trel.relnamespace = tnsp.oid
  JOIN pg_class AS irel ON irel.oid = i.indexrelid
  JOIN pg_attribute AS a ON trel.oid = a.attrelid
  JOIN LATERAL unnest(i.indkey) 
    WITH ORDINALITY AS c(colnum, ordinality)
      ON a.attnum = c.colnum
  GROUP BY i, tnsp.nspname, trel.relname, irel.relname
)
SELECT
  i.table_name,
  i.index_name AS "Deletion candidate index",
  i.columns AS "Deletion candidate columns",
  j.index_name AS "Existing index",
  j.columns AS "Existing columns"
FROM indexes i
JOIN indexes j
  ON i.schema_name = j.schema_name
  AND i.table_name = j.table_name
  AND j.columns LIKE i.columns || ',%';

This is a really nice case of lateral unnesting with ordinality, which you should definitely add to your PostgreSQL tool chain.

SQL Server

Now, SQL Server doesn’t have a nice STRING_AGG function (yet), but we can work around this using STUFF and XML to get the same query.

Of course, there are other solutions using recursive SQL, but I’m too lazy to translate the simple string pattern-matching approach to something recursive.

WITH 
  i AS (
    SELECT 
	  s.name AS schema_name,
      t.name AS table_name,
      i.name AS index_name,
      c.name AS column_name,
      ic.index_column_id
    FROM sys.indexes i 
    JOIN sys.index_columns ic 
      ON i.object_id = ic.object_id 
      AND i.index_id = ic.index_id 
    JOIN sys.columns c 
      ON ic.object_id = c.object_id 
      AND ic.column_id = c.column_id 
    JOIN sys.tables t 
      ON i.object_id = t.object_id 
	JOIN sys.schemas s
	  ON t.schema_id = s.schema_id
  ),
  indexes AS (
    SELECT
	  schema_name,
      table_name,
      index_name,
      STUFF((
        SELECT ',' + j.column_name 
        FROM i j
        WHERE i.table_name = j.table_name 
        AND i.index_name = j.index_name 
        FOR XML PATH('') -- Yay, XML in SQL!
      ), 1, 1, '') columns
    FROM i
    GROUP BY schema_name, table_name, index_name
  )
SELECT
  i.schema_name,
  i.table_name,
  i.index_name AS "Deletion candidate index",
  i.columns AS "Deletion candidate columns",
  j.index_name AS "Existing index",
  j.columns AS "Existing columns"
FROM indexes i
JOIN indexes j
  ON i.schema_name = j.schema_name
  AND i.table_name = j.table_name
  AND j.columns LIKE i.columns + ',%';

A note on partial indexes

SQL Server and PostgreSQL support “partial indexes”, i.e. indexes that contain only parts of your data (and Oracle can emulate them in various ways). Such indexes might appear in the resulting list – you may want to be careful to check if they’re really redundant or not. Chances are, they’re there for a very good reason.

Conclusion

Now go run the above query on your production database and… Very carefully and reasonably think about whether you really want to drop those indexes ;)