SQL IN Predicate: With IN List or With Array? Which is Faster?

Hah! Got nerd-sniped again:

http://stackoverflow.com/questions/43099226/how-to-make-jooq-to-use-arrays-in-the-in-clause/43102102

A jOOQ user was wondering why jOOQ would generate an IN list for a predicate like this:

Java

COLUMN.in(1, 2, 3, 4)

SQL

COLUMN in (?, ?, ?, ?)

… when in fact there could have been the following predicate being generated, instead:

COLUMN = any(?::int[])

In the second case, there would have been only one single bind variable instead of 4, and the SQL generation and parsing work would have been “much” less (maybe not for the IN list of size 4, but let’s imagine a list of 50 values).

A disclaimer

First off, a disclaimer: In databases that have a cursor cache / plan cache (e.g. Oracle or SQL Server), you should be careful with long IN lists, because they will probably trigger a hard parse every time you run them, as by the time you run the exact same predicate (with 371 elements in the list) again, the execution plan will have been purged from the cache. So, you cannot really profit from the cache.

I’m aware of this problem, and it will be topic of another blog post, soon. Let’s stick to PostgreSQL whose “plan cache” isn’t really that sophisticated.

Measure, don’t guess

The question was about improving the speed of parsing a SQL statement. Parsers are really fast, so parsing shouldn’t be a problem. Generating an execution plan certainly does cost more time, but again, since PostgreSQL’s plan cache isn’t very sophisticated, this won’t play into the issue here. So the question is really:

Is an IN list really that bad in PostgreSQL?

Would an array bind variable be much better?

Since our recent post about benchmarking, we now know that we shall never guess, but always measure. I’m using again the Sakila database to run these two queries:

-- IN list
SELECT * 
FROM film 
JOIN film_actor USING (film_id) 
JOIN actor USING (actor_id) 
WHERE film_id IN (?, ?, ?, ?)

-- Array
SELECT * 
FROM film 
JOIN film_actor USING (film_id) 
JOIN actor USING (actor_id) 
WHERE film_id = ANY(?)

Let’s try lists of length 4, first. The benchmark is here:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 1000;
  rec RECORD;
  v_e1 INT := 1;
  v_e2 INT := 2;
  v_e3 INT := 4;
  v_e4 INT := 8;
  v_any_arr INT[] := ARRAY[v_e1, v_e2, v_e3, v_e4];
BEGIN
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT * 
        FROM film 
        JOIN film_actor USING (film_id) 
        JOIN actor USING (actor_id) 
        WHERE film_id IN (v_e1, v_e2, v_e3, v_e4)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', 
      r, (clock_timestamp() - v_ts); 
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT * 
        FROM film 
        JOIN film_actor USING (film_id) 
        JOIN actor USING (actor_id) 
        WHERE film_id = ANY(v_any_arr)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', 
      r, (clock_timestamp() - v_ts); 
  END LOOP;
END$$;

The result being:

INFO:  Run 1, Statement 1: 00:00:00.112195
INFO:  Run 1, Statement 2: 00:00:00.450461
INFO:  Run 2, Statement 1: 00:00:00.109792
INFO:  Run 2, Statement 2: 00:00:00.446518
INFO:  Run 3, Statement 1: 00:00:00.105413
INFO:  Run 3, Statement 2: 00:00:00.44298
INFO:  Run 4, Statement 1: 00:00:00.108249
INFO:  Run 4, Statement 2: 00:00:00.476527
INFO:  Run 5, Statement 1: 00:00:00.120229
INFO:  Run 5, Statement 2: 00:00:00.448214

Interesting. So, the IN list outperforms the array bind variable every time by a factor of 4 (which is the size of the array / list!) So, let’s try 8 values, then. Here are the values and the adapted query 1:

-- values
  v_e1 INT := 1;
  v_e2 INT := 2;
  v_e3 INT := 4;
  v_e4 INT := 8;
  v_e5 INT := 16;
  v_e6 INT := 32;
  v_e7 INT := 64;
  v_e8 INT := 128;
  v_any_arr INT[] := ARRAY[v_e1, v_e2, v_e3, v_e4, v_e5, v_e6, v_e7, v_e8];

-- adapted query 1 ...
        WHERE film_id IN (v_e1, v_e2, v_e3, v_e4, v_e5, v_e6, v_e7, v_e8)
-- ...

The result is still impressive:

INFO:  Run 1, Statement 1: 00:00:00.182646
INFO:  Run 1, Statement 2: 00:00:00.63624
INFO:  Run 2, Statement 1: 00:00:00.184814
INFO:  Run 2, Statement 2: 00:00:00.685976
INFO:  Run 3, Statement 1: 00:00:00.188108
INFO:  Run 3, Statement 2: 00:00:00.634903
INFO:  Run 4, Statement 1: 00:00:00.184933
INFO:  Run 4, Statement 2: 00:00:00.626616
INFO:  Run 5, Statement 1: 00:00:00.185879
INFO:  Run 5, Statement 2: 00:00:00.636723

The IN list query now takes almost 2x as long (but not quite 2x), whereas the array query now takes around 1.5x as long. It looks as though arrays become the better choice when their size increases. So, let’s do this! With 32 bind variables in the IN list, or 32 array elements respectively:

INFO:  Run 1, Statement 1: 00:00:00.905064
INFO:  Run 1, Statement 2: 00:00:00.752819
INFO:  Run 2, Statement 1: 00:00:00.760475
INFO:  Run 2, Statement 2: 00:00:00.758247
INFO:  Run 3, Statement 1: 00:00:00.777667
INFO:  Run 3, Statement 2: 00:00:00.895875
INFO:  Run 4, Statement 1: 00:00:01.308167
INFO:  Run 4, Statement 2: 00:00:00.789537
INFO:  Run 5, Statement 1: 00:00:00.788606
INFO:  Run 5, Statement 2: 00:00:00.776159

Both are about equally fast. 64 bind values!

INFO:  Run 1, Statement 1: 00:00:00.915069
INFO:  Run 1, Statement 2: 00:00:01.058966
INFO:  Run 2, Statement 1: 00:00:00.951488
INFO:  Run 2, Statement 2: 00:00:00.906285
INFO:  Run 3, Statement 1: 00:00:00.907489
INFO:  Run 3, Statement 2: 00:00:00.892393
INFO:  Run 4, Statement 1: 00:00:00.900424
INFO:  Run 4, Statement 2: 00:00:00.903447
INFO:  Run 5, Statement 1: 00:00:00.961805
INFO:  Run 5, Statement 2: 00:00:00.951697

Still about the same. OK… INTERN! Get over here. I need you to “generate” 128 bind values on this query.

Yep, as expected. Finally, arrays start to outperform IN lists:

INFO:  Run 1, Statement 1: 00:00:01.122866
INFO:  Run 1, Statement 2: 00:00:01.083816
INFO:  Run 2, Statement 1: 00:00:01.416469
INFO:  Run 2, Statement 2: 00:00:01.134882
INFO:  Run 3, Statement 1: 00:00:01.122723
INFO:  Run 3, Statement 2: 00:00:01.087755
INFO:  Run 4, Statement 1: 00:00:01.143148
INFO:  Run 4, Statement 2: 00:00:01.124902
INFO:  Run 5, Statement 1: 00:00:01.236722
INFO:  Run 5, Statement 2: 00:00:01.113741

Using Oracle

Oracle also has array types (although you have to declare them as nominal types first, but that’s not a problem here).

Here are some benchmark results (as always, not actual benchmark results, but anonymised units of measurement. I.e. these aren’t seconds but… Larrys):

4 bind values

Run 1, Statement 1 : 01.911000000
Run 1, Statement 2 : 02.852000000
Run 2, Statement 1 : 01.659000000
Run 2, Statement 2 : 02.680000000
Run 3, Statement 1 : 01.628000000
Run 3, Statement 2 : 02.664000000
Run 4, Statement 1 : 01.629000000
Run 4, Statement 2 : 02.657000000
Run 5, Statement 1 : 01.636000000
Run 5, Statement 2 : 02.688000000

128 bind values

Run 1, Statement 1 : 04.010000000
Run 1, Statement 2 : 06.275000000
Run 2, Statement 1 : 03.749000000
Run 2, Statement 2 : 05.440000000
Run 3, Statement 1 : 03.985000000
Run 3, Statement 2 : 05.387000000
Run 4, Statement 1 : 03.807000000
Run 4, Statement 2 : 05.688000000
Run 5, Statement 1 : 03.782000000
Run 5, Statement 2 : 05.803000000

The size of the number of bind values doesn’t seem to matter really. There’s always a constant overhead of using the array bind variable compared to the IN list, but that might as well be a benchmarking error. For instance, when I add the /*+GATHER_PLAN_STATISTICS*/ hint to both queries, interestingly, the one with the array got significantly faster, whereas the IN list one was not affected… Weird?

Conclusion

This article doesn’t go into why there’s such a big difference for small lists when the benefit is only apparent for quite large lists.

But it has once again shown, that we must not optimise prematurely in SQL, but measure, measure, measure things. IN lists in dynamic SQL queries can be a big issue in production when they lead to cursor cache / plan cache saturation and a lot of “hard parsing”. So, the benefit of using the array is much more drastic when the content is big, as we can recycle execution plans much more often than with IN lists.

But chances are, that IN lists may be faster for single executions.

In any case: Choose carefully when following advice that you find somewhere on the Internet. Also, when following this advice. I ran the benchmark on PostgreSQL 9.5 and Oracle 11gR2 XE. Both are not the latest database versions. Try to measure things again on your side, to be sure that your “improvement” is really an actual improvement! And if in doubt, don’t optimise, until you’re sure you actually have a problem.

Are You Binding Your Oracle DATEs Correctly? I Bet You Aren’t

Oracle database has its ways. In my SQL talks at conferences, I love to confuse people with the following Oracle facts:

sql-trivia-1

… and the answer is, of course:

sql-trivia-2

Isn’t it horrible to make empty string the same thing as NULL? Please, Oracle…

The only actually reasonable slide to follow the previous two is this one:

sql-trivia-3

But the DATE type is much more subtle

So you think VARCHAR2 is weird?

Well, we all know that Oracle’s DATE is not really a date as in the SQL standard, or as in all the other databases, or as in java.sql.Date. Oracle’s DATE type is really a TIMESTAMP(0), i.e. a timestamp with a fractional second precision of zero.

Most legacy databases actually use DATE precisely for that, to store timestamps with no fractional seconds, such as:

  • 1970-01-01 00:00:00
  • 2000-02-20 20:00:20
  • 1337-01-01 13:37:00

So, it’s always a safe bet to use java.sql.Timestamp types in Java, when you’re operating with Oracle DATE.

But things can go very wrong when you bind such variables via JDBC as can be seen in this Stack Overflow question here. Let’s assume you have a range predicate like so:

// execute_at is of type DATE and there's an index
PreparedStatement stmt = connection.prepareStatement(
    "SELECT * " + 
    "FROM my_table " +
    "WHERE execute_at > ? AND execute_at < ?");

Now, naturally, we’d expect any index on execute_at to be a sensible choice to use for filtering out records from my_table, and that’s also what happens when we bind java.sql.Date

stmt.setDate(1, start);
stmt.setDate(2, end);

The execution plan is optimal:

-----------------------------------------------------
| Id  | Operation                    | Name         |
-----------------------------------------------------
|   0 | SELECT STATEMENT             |              |
|*  1 |  FILTER                      |              |
|   2 |   TABLE ACCESS BY INDEX ROWID| my_table     |
|*  3 |    INDEX RANGE SCAN          | my_index     |
-----------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(:1:1 AND ""EXECUTE_AT""<:2)

But let’s check out what happens if we assume execute_at to be a date with hours/minutes/seconds, i.e. an Oracle DATE. We’ll be binding java.sql.Timestamp

stmt.setTimestamp(1, start);
stmt.setTimestamp(2, end);

and the execution plan suddenly becomes very bad:

-----------------------------------------------------
| Id  | Operation                    | Name         |
-----------------------------------------------------
|   0 | SELECT STATEMENT             |              |
|*  1 |  FILTER                      |              |
|   2 |   TABLE ACCESS BY INDEX ROWID| my_table     |
|*  3 |    INDEX FULL SCAN           | my_index     |
-----------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(:1:1 AND 
               INTERNAL_FUNCTION(""EXECUTE_AT"")<:2))

What’s this INTERNAL_FUNCTION()

INTERNAL_FUNCTION() is Oracle’s way of silently converting values into other values in ways that are completely opaque. In fact, you cannot even place a function-based index on this pseudo-function to help the database choose a RANGE SCAN on it again. The following is not possible:

CREATE INDEX oracle_why_oh_why
  ON my_table(INTERNAL_FUNCTION(execute_at));

Nope. What the function really does, it widens the less precise DATE type of the execute_at column to the more precise TIMESTAMP type of the bind variable. Just in case.

Why? Because with exclusive range boundaries (> and <), chances are that the fractional seconds in your Timestamp may lead to the timestamp being stricly greater than the lower bound of the range, which would include it, when the same Timestamp with no fractional sections (i.e. an Oracle DATE) would have been excluded.

Duh. But we don’t care, we’re only using Timestamp as a silly workaround in the first place! Now that we know this, you might think that adding a function-based index on an explicit conversion would work, but that’s not the case either:

CREATE INDEX nope
  ON my_table(CAST(execute_at AS TIMESTAMP));

Perhaps, if you magically found the exact right precision of the implicitly used TIMESTAMP(n) type it could work, but that all feels shaky, and besides, I don’t want a second index on that same column!

The solution

The solution given by user APC is actually very simple (and it sucks). Again, you could bind a java.sql.Date, but that would make you lose all hour/minute/second information. No, you have to explicitly cast the bind variable to DATE in the database. Exactly!

PreparedStatement stmt = connection.prepareStatement(
    "SELECT * " + 
    "FROM my_table " +
    "WHERE execute_at > CAST(? AS DATE) " +
    "AND execute_at < CAST(? AS DATE)");

You have to do that every time you bind a java.sql.Timestamp variable to an Oracle DATE value, at least when used in predicates.

How to implement that?

If you’re using JDBC directly, you’re pretty much doomed. Of course, you could run AWR reports to find the worst statements in production and fix only those, but chances are that you won’t be able to fix your statements so easily and deploy them so quickly, so you might want to get it right in advance. And of course, this is production. Tomorrow, another statement would suddenly pop up in your DBA’s reports.

If you’re using JPA / Hibernate, you can only hope that they got it right, because you probably won’t be able to fix those queries, otherwise.

If you’re using jOOQ 3.5 or later, you can take advantage of jOOQ’s new custom type binding feature, which works out of the box with Oracle, and transparently renders that CAST(? AS DATE) for you, only on those columns that are really relevant.

jOOQ is the best way to write SQL in Java

Other databases

If you think that this is an Oracle issue, think again. Oracle is actually very lenient and nice to use when it comes to bind variables. Oracle can infer a lot of types for your bind variables, such that casting is almost never necessary. With other databases, that’s a different story. Read our article about RDBMS bind variable casting madness for more information.