SQL IN Predicate: With IN List or With Array? Which is Faster?

Hah! Got nerd-sniped again:

http://stackoverflow.com/questions/43099226/how-to-make-jooq-to-use-arrays-in-the-in-clause/43102102

A jOOQ user was wondering why jOOQ would generate an IN list for a predicate like this:

Java

COLUMN.in(1, 2, 3, 4)

SQL

COLUMN in (?, ?, ?, ?)

… when in fact there could have been the following predicate being generated, instead:

COLUMN = any(?::int[])

In the second case, there would have been only one single bind variable instead of 4, and the SQL generation and parsing work would have been “much” less (maybe not for the IN list of size 4, but let’s imagine a list of 50 values).

A disclaimer

First off, a disclaimer: In databases that have a cursor cache / plan cache (e.g. Oracle or SQL Server), you should be careful with long IN lists, because they will probably trigger a hard parse every time you run them, as by the time you run the exact same predicate (with 371 elements in the list) again, the execution plan will have been purged from the cache. So, you cannot really profit from the cache.

I’m aware of this problem, and it will be topic of another blog post, soon. Let’s stick to PostgreSQL whose “plan cache” isn’t really that sophisticated.

Measure, don’t guess

The question was about improving the speed of parsing a SQL statement. Parsers are really fast, so parsing shouldn’t be a problem. Generating an execution plan certainly does cost more time, but again, since PostgreSQL’s plan cache isn’t very sophisticated, this won’t play into the issue here. So the question is really:

Is an IN list really that bad in PostgreSQL?

Would an array bind variable be much better?

Since our recent post about benchmarking, we now know that we shall never guess, but always measure. I’m using again the Sakila database to run these two queries:

-- IN list
SELECT * 
FROM film 
JOIN film_actor USING (film_id) 
JOIN actor USING (actor_id) 
WHERE film_id IN (?, ?, ?, ?)

-- Array
SELECT * 
FROM film 
JOIN film_actor USING (film_id) 
JOIN actor USING (actor_id) 
WHERE film_id = ANY(?)

Let’s try lists of length 4, first. The benchmark is here:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 1000;
  rec RECORD;
  v_e1 INT := 1;
  v_e2 INT := 2;
  v_e3 INT := 4;
  v_e4 INT := 8;
  v_any_arr INT[] := ARRAY[v_e1, v_e2, v_e3, v_e4];
BEGIN
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT * 
        FROM film 
        JOIN film_actor USING (film_id) 
        JOIN actor USING (actor_id) 
        WHERE film_id IN (v_e1, v_e2, v_e3, v_e4)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', 
      r, (clock_timestamp() - v_ts); 
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT * 
        FROM film 
        JOIN film_actor USING (film_id) 
        JOIN actor USING (actor_id) 
        WHERE film_id = ANY(v_any_arr)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', 
      r, (clock_timestamp() - v_ts); 
  END LOOP;
END$$;

The result being:

INFO:  Run 1, Statement 1: 00:00:00.112195
INFO:  Run 1, Statement 2: 00:00:00.450461
INFO:  Run 2, Statement 1: 00:00:00.109792
INFO:  Run 2, Statement 2: 00:00:00.446518
INFO:  Run 3, Statement 1: 00:00:00.105413
INFO:  Run 3, Statement 2: 00:00:00.44298
INFO:  Run 4, Statement 1: 00:00:00.108249
INFO:  Run 4, Statement 2: 00:00:00.476527
INFO:  Run 5, Statement 1: 00:00:00.120229
INFO:  Run 5, Statement 2: 00:00:00.448214

Interesting. So, the IN list outperforms the array bind variable every time by a factor of 4 (which is the size of the array / list!) So, let’s try 8 values, then. Here are the values and the adapted query 1:

-- values
  v_e1 INT := 1;
  v_e2 INT := 2;
  v_e3 INT := 4;
  v_e4 INT := 8;
  v_e5 INT := 16;
  v_e6 INT := 32;
  v_e7 INT := 64;
  v_e8 INT := 128;
  v_any_arr INT[] := ARRAY[v_e1, v_e2, v_e3, v_e4, v_e5, v_e6, v_e7, v_e8];

-- adapted query 1 ...
        WHERE film_id IN (v_e1, v_e2, v_e3, v_e4, v_e5, v_e6, v_e7, v_e8)
-- ...

The result is still impressive:

INFO:  Run 1, Statement 1: 00:00:00.182646
INFO:  Run 1, Statement 2: 00:00:00.63624
INFO:  Run 2, Statement 1: 00:00:00.184814
INFO:  Run 2, Statement 2: 00:00:00.685976
INFO:  Run 3, Statement 1: 00:00:00.188108
INFO:  Run 3, Statement 2: 00:00:00.634903
INFO:  Run 4, Statement 1: 00:00:00.184933
INFO:  Run 4, Statement 2: 00:00:00.626616
INFO:  Run 5, Statement 1: 00:00:00.185879
INFO:  Run 5, Statement 2: 00:00:00.636723

The IN list query now takes almost 2x as long (but not quite 2x), whereas the array query now takes around 1.5x as long. It looks as though arrays become the better choice when their size increases. So, let’s do this! With 32 bind variables in the IN list, or 32 array elements respectively:

INFO:  Run 1, Statement 1: 00:00:00.905064
INFO:  Run 1, Statement 2: 00:00:00.752819
INFO:  Run 2, Statement 1: 00:00:00.760475
INFO:  Run 2, Statement 2: 00:00:00.758247
INFO:  Run 3, Statement 1: 00:00:00.777667
INFO:  Run 3, Statement 2: 00:00:00.895875
INFO:  Run 4, Statement 1: 00:00:01.308167
INFO:  Run 4, Statement 2: 00:00:00.789537
INFO:  Run 5, Statement 1: 00:00:00.788606
INFO:  Run 5, Statement 2: 00:00:00.776159

Both are about equally fast. 64 bind values!

INFO:  Run 1, Statement 1: 00:00:00.915069
INFO:  Run 1, Statement 2: 00:00:01.058966
INFO:  Run 2, Statement 1: 00:00:00.951488
INFO:  Run 2, Statement 2: 00:00:00.906285
INFO:  Run 3, Statement 1: 00:00:00.907489
INFO:  Run 3, Statement 2: 00:00:00.892393
INFO:  Run 4, Statement 1: 00:00:00.900424
INFO:  Run 4, Statement 2: 00:00:00.903447
INFO:  Run 5, Statement 1: 00:00:00.961805
INFO:  Run 5, Statement 2: 00:00:00.951697

Still about the same. OK… INTERN! Get over here. I need you to “generate” 128 bind values on this query.

Yep, as expected. Finally, arrays start to outperform IN lists:

INFO:  Run 1, Statement 1: 00:00:01.122866
INFO:  Run 1, Statement 2: 00:00:01.083816
INFO:  Run 2, Statement 1: 00:00:01.416469
INFO:  Run 2, Statement 2: 00:00:01.134882
INFO:  Run 3, Statement 1: 00:00:01.122723
INFO:  Run 3, Statement 2: 00:00:01.087755
INFO:  Run 4, Statement 1: 00:00:01.143148
INFO:  Run 4, Statement 2: 00:00:01.124902
INFO:  Run 5, Statement 1: 00:00:01.236722
INFO:  Run 5, Statement 2: 00:00:01.113741

Using Oracle

Oracle also has array types (although you have to declare them as nominal types first, but that’s not a problem here).

Here are some benchmark results (as always, not actual benchmark results, but anonymised units of measurement. I.e. these aren’t seconds but… Larrys):

4 bind values

Run 1, Statement 1 : 01.911000000
Run 1, Statement 2 : 02.852000000
Run 2, Statement 1 : 01.659000000
Run 2, Statement 2 : 02.680000000
Run 3, Statement 1 : 01.628000000
Run 3, Statement 2 : 02.664000000
Run 4, Statement 1 : 01.629000000
Run 4, Statement 2 : 02.657000000
Run 5, Statement 1 : 01.636000000
Run 5, Statement 2 : 02.688000000

128 bind values

Run 1, Statement 1 : 04.010000000
Run 1, Statement 2 : 06.275000000
Run 2, Statement 1 : 03.749000000
Run 2, Statement 2 : 05.440000000
Run 3, Statement 1 : 03.985000000
Run 3, Statement 2 : 05.387000000
Run 4, Statement 1 : 03.807000000
Run 4, Statement 2 : 05.688000000
Run 5, Statement 1 : 03.782000000
Run 5, Statement 2 : 05.803000000

The size of the number of bind values doesn’t seem to matter really. There’s always a constant overhead of using the array bind variable compared to the IN list, but that might as well be a benchmarking error. For instance, when I add the /*+GATHER_PLAN_STATISTICS*/ hint to both queries, interestingly, the one with the array got significantly faster, whereas the IN list one was not affected… Weird?

Conclusion

This article doesn’t go into why there’s such a big difference for small lists when the benefit is only apparent for quite large lists.

But it has once again shown, that we must not optimise prematurely in SQL, but measure, measure, measure things. IN lists in dynamic SQL queries can be a big issue in production when they lead to cursor cache / plan cache saturation and a lot of “hard parsing”. So, the benefit of using the array is much more drastic when the content is big, as we can recycle execution plans much more often than with IN lists.

But chances are, that IN lists may be faster for single executions.

In any case: Choose carefully when following advice that you find somewhere on the Internet. Also, when following this advice. I ran the benchmark on PostgreSQL 9.5 and Oracle 11gR2 XE. Both are not the latest database versions. Try to measure things again on your side, to be sure that your “improvement” is really an actual improvement! And if in doubt, don’t optimise, until you’re sure you actually have a problem.

What if every object was an array? No more NullPointerExceptions!

To NULL or not to NULL? Programming language designers inevitably have to decide whether they support NULLs or not. And they’ve proven to have a hard time getting this right. NULL is not intuitive in any language, because NULL is an axiom of that language, not a rule that can be derived from lower-level axioms. Take Java for instance, where

// This yields true:
null == null

// These throw an exception (or cannot be compiled)
null.toString();
int value = (Integer) null;

It’s not like there weren’t any alternatives. SQL, for instance, implements a more expressive but probably less intuitive three-value logic, which most developers get wrong in subtle ways once in a while.

At the same time, SQL doesn’t know “NULL” results, only “NULL” column values. From a set theory perspective, there are only empty sets, not NULL sets.

Other languages allow for dereferencing null through special operators, letting the compiler generate tedious null checks for you, behind the scenes. An example for this is Groovy with its null-safe dereferencing operator. This solution is far from being generally accepted, as can be seen in this discussion about a Scala equivalent. Scala uses Option, which Java 8 will imitate using Optional (or @Nullable).

Let’s think about a much broader solution

To me, nullability isn’t a first-class citizen. I personally dislike the fact that Scala’s Option[T] type pollutes my type system by introducing a generic wrapper type (even if it seems to implement similar array-features through the traversable trait). I don’t want to distinguish the types of Option[T] and T. This is specifically true when reasoning about types from a reflection API perspective, where Scala’s (and Java’s) legacy will forever keep me from accessing the type of T at runtime.

But much worse, most of the times, in my application I don’t really want to distinguish between “option” references and “some” references. Heck, I don’t even want to distinguish between having 1 reference and having dozens. jQuery got this quite right. One of the main reasons why jQuery is so popular is because everything you do, you do on a set of wrapped DOM elements. The API never distinguishes between matching 1 or 100 div’s. Check out the following code:

// This clearly operates on a single object or none
$('div#unique-id').html('new content')
                  .click(function() { ... });

// This possibly operates on several objects or none
$('div.any-class').html('new content')
                  .click(function() { ... });

This is possible because JavaScript allows you to override the prototype of the JavaScript Array type, modifying arrays in general, at least for the scope of the jQuery library. How more awesome can it get? .html() and .click() are actions performed on the array as a whole, no matter if you have zero, one, or 100 elements in your match. What would a more typesafe language look like, where everything behaves like an array (or an ArrayList)? Think about the following model:

class Customer {
  String firstNames;  // Read as String[] firstNames
  String lastName;    // Read as String[] lastName
  Order orders;       // Read as Order[] orders
}

class Order {
  int value;          // Read as int[] value
  boolean shipped() { // Read as boolean[] shipped
  }
}

Don’t rant (just yet). Let’s assume this wouldn’t lead to memory or computation overhead. Let’s continue thinking about the advantages of this. So, I want to see if a Customer’s orders have been shipped. Easy:

Customer customer = // ...
boolean shipped = customer.orders.shipped();

This doesn’t look spectacular (yet). But beware of the fact that a customer can have several orders, and the above check is really to see if all orders have been shipped. I really don’t want to write the loop, I find it quite obvious that I want to perform the shipped() check on every order. Consider:

// The length pseudo-field would still be
// present on orders
customer.orders.length;

// In fact, the length pseudo-field is also
// present on customer, in case there are several
customer.length;

// Let's add an order to the customer:
customer.orders.add(new Order());

// Let's reset order
customer.orders.clear();

// Let's calculate the sum of all values
// OO-style:
customer.orders.value.sum();
// Functional style:
sum(customer.orders.value);

Of course there would be a couple of caveats and the above choice of method names might not be the best one. But being able to deal with single references (nullable or non-nullable) or array references (empty, single-valued, multi-valued) in the same syntactic way is just pure syntax awesomeness. Null-checks would be replaced by length checks, but mostly you don’t even have to do those, because each method would always be called on every element in the array. The current single-reference vs. multi-reference semantics would be documented by naming conventions. Clearly, naming something “orders” indicates that multi-references are possible, whereas naming something “customer” indicates that multi-references are improbable.

As users have commented, this technique is commonly referred to as array programming, which is implemented in Matlab or R.

Convinced?

I’m curious to hear your thoughts!