SQL IN Predicate: With IN List or With Array? Which is Faster?

Hah! Got nerd-sniped again:

http://stackoverflow.com/questions/43099226/how-to-make-jooq-to-use-arrays-in-the-in-clause/43102102

A jOOQ user was wondering why jOOQ would generate an IN list for a predicate like this:

Java

COLUMN.in(1, 2, 3, 4)

SQL

COLUMN in (?, ?, ?, ?)

… when in fact there could have been the following predicate being generated, instead:

COLUMN = any(?::int[])

In the second case, there would have been only one single bind variable instead of 4, and the SQL generation and parsing work would have been “much” less (maybe not for the IN list of size 4, but let’s imagine a list of 50 values).

A disclaimer

First off, a disclaimer: In databases that have a cursor cache / plan cache (e.g. Oracle or SQL Server), you should be careful with long IN lists, because they will probably trigger a hard parse every time you run them, as by the time you run the exact same predicate (with 371 elements in the list) again, the execution plan will have been purged from the cache. So, you cannot really profit from the cache.

I’m aware of this problem, and it will be topic of another blog post, soon. Let’s stick to PostgreSQL whose “plan cache” isn’t really that sophisticated.

Measure, don’t guess

The question was about improving the speed of parsing a SQL statement. Parsers are really fast, so parsing shouldn’t be a problem. Generating an execution plan certainly does cost more time, but again, since PostgreSQL’s plan cache isn’t very sophisticated, this won’t play into the issue here. So the question is really:

Is an IN list really that bad in PostgreSQL?

Would an array bind variable be much better?

Since our recent post about benchmarking, we now know that we shall never guess, but always measure. I’m using again the Sakila database to run these two queries:

-- IN list
SELECT * 
FROM film 
JOIN film_actor USING (film_id) 
JOIN actor USING (actor_id) 
WHERE film_id IN (?, ?, ?, ?)

-- Array
SELECT * 
FROM film 
JOIN film_actor USING (film_id) 
JOIN actor USING (actor_id) 
WHERE film_id = ANY(?)

Let’s try lists of length 4, first. The benchmark is here:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 1000;
  rec RECORD;
  v_e1 INT := 1;
  v_e2 INT := 2;
  v_e3 INT := 4;
  v_e4 INT := 8;
  v_any_arr INT[] := ARRAY[v_e1, v_e2, v_e3, v_e4];
BEGIN
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT * 
        FROM film 
        JOIN film_actor USING (film_id) 
        JOIN actor USING (actor_id) 
        WHERE film_id IN (v_e1, v_e2, v_e3, v_e4)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', 
      r, (clock_timestamp() - v_ts); 
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT * 
        FROM film 
        JOIN film_actor USING (film_id) 
        JOIN actor USING (actor_id) 
        WHERE film_id = ANY(v_any_arr)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', 
      r, (clock_timestamp() - v_ts); 
  END LOOP;
END$$;

The result being:

INFO:  Run 1, Statement 1: 00:00:00.112195
INFO:  Run 1, Statement 2: 00:00:00.450461
INFO:  Run 2, Statement 1: 00:00:00.109792
INFO:  Run 2, Statement 2: 00:00:00.446518
INFO:  Run 3, Statement 1: 00:00:00.105413
INFO:  Run 3, Statement 2: 00:00:00.44298
INFO:  Run 4, Statement 1: 00:00:00.108249
INFO:  Run 4, Statement 2: 00:00:00.476527
INFO:  Run 5, Statement 1: 00:00:00.120229
INFO:  Run 5, Statement 2: 00:00:00.448214

Interesting. So, the IN list outperforms the array bind variable every time by a factor of 4 (which is the size of the array / list!) So, let’s try 8 values, then. Here are the values and the adapted query 1:

-- values
  v_e1 INT := 1;
  v_e2 INT := 2;
  v_e3 INT := 4;
  v_e4 INT := 8;
  v_e5 INT := 16;
  v_e6 INT := 32;
  v_e7 INT := 64;
  v_e8 INT := 128;
  v_any_arr INT[] := ARRAY[v_e1, v_e2, v_e3, v_e4, v_e5, v_e6, v_e7, v_e8];

-- adapted query 1 ...
        WHERE film_id IN (v_e1, v_e2, v_e3, v_e4, v_e5, v_e6, v_e7, v_e8)
-- ...

The result is still impressive:

INFO:  Run 1, Statement 1: 00:00:00.182646
INFO:  Run 1, Statement 2: 00:00:00.63624
INFO:  Run 2, Statement 1: 00:00:00.184814
INFO:  Run 2, Statement 2: 00:00:00.685976
INFO:  Run 3, Statement 1: 00:00:00.188108
INFO:  Run 3, Statement 2: 00:00:00.634903
INFO:  Run 4, Statement 1: 00:00:00.184933
INFO:  Run 4, Statement 2: 00:00:00.626616
INFO:  Run 5, Statement 1: 00:00:00.185879
INFO:  Run 5, Statement 2: 00:00:00.636723

The IN list query now takes almost 2x as long (but not quite 2x), whereas the array query now takes around 1.5x as long. It looks as though arrays become the better choice when their size increases. So, let’s do this! With 32 bind variables in the IN list, or 32 array elements respectively:

INFO:  Run 1, Statement 1: 00:00:00.905064
INFO:  Run 1, Statement 2: 00:00:00.752819
INFO:  Run 2, Statement 1: 00:00:00.760475
INFO:  Run 2, Statement 2: 00:00:00.758247
INFO:  Run 3, Statement 1: 00:00:00.777667
INFO:  Run 3, Statement 2: 00:00:00.895875
INFO:  Run 4, Statement 1: 00:00:01.308167
INFO:  Run 4, Statement 2: 00:00:00.789537
INFO:  Run 5, Statement 1: 00:00:00.788606
INFO:  Run 5, Statement 2: 00:00:00.776159

Both are about equally fast. 64 bind values!

INFO:  Run 1, Statement 1: 00:00:00.915069
INFO:  Run 1, Statement 2: 00:00:01.058966
INFO:  Run 2, Statement 1: 00:00:00.951488
INFO:  Run 2, Statement 2: 00:00:00.906285
INFO:  Run 3, Statement 1: 00:00:00.907489
INFO:  Run 3, Statement 2: 00:00:00.892393
INFO:  Run 4, Statement 1: 00:00:00.900424
INFO:  Run 4, Statement 2: 00:00:00.903447
INFO:  Run 5, Statement 1: 00:00:00.961805
INFO:  Run 5, Statement 2: 00:00:00.951697

Still about the same. OK… INTERN! Get over here. I need you to “generate” 128 bind values on this query.

Yep, as expected. Finally, arrays start to outperform IN lists:

INFO:  Run 1, Statement 1: 00:00:01.122866
INFO:  Run 1, Statement 2: 00:00:01.083816
INFO:  Run 2, Statement 1: 00:00:01.416469
INFO:  Run 2, Statement 2: 00:00:01.134882
INFO:  Run 3, Statement 1: 00:00:01.122723
INFO:  Run 3, Statement 2: 00:00:01.087755
INFO:  Run 4, Statement 1: 00:00:01.143148
INFO:  Run 4, Statement 2: 00:00:01.124902
INFO:  Run 5, Statement 1: 00:00:01.236722
INFO:  Run 5, Statement 2: 00:00:01.113741

Using Oracle

Oracle also has array types (although you have to declare them as nominal types first, but that’s not a problem here).

Here are some benchmark results (as always, not actual benchmark results, but anonymised units of measurement. I.e. these aren’t seconds but… Larrys):

4 bind values

Run 1, Statement 1 : 01.911000000
Run 1, Statement 2 : 02.852000000
Run 2, Statement 1 : 01.659000000
Run 2, Statement 2 : 02.680000000
Run 3, Statement 1 : 01.628000000
Run 3, Statement 2 : 02.664000000
Run 4, Statement 1 : 01.629000000
Run 4, Statement 2 : 02.657000000
Run 5, Statement 1 : 01.636000000
Run 5, Statement 2 : 02.688000000

128 bind values

Run 1, Statement 1 : 04.010000000
Run 1, Statement 2 : 06.275000000
Run 2, Statement 1 : 03.749000000
Run 2, Statement 2 : 05.440000000
Run 3, Statement 1 : 03.985000000
Run 3, Statement 2 : 05.387000000
Run 4, Statement 1 : 03.807000000
Run 4, Statement 2 : 05.688000000
Run 5, Statement 1 : 03.782000000
Run 5, Statement 2 : 05.803000000

The size of the number of bind values doesn’t seem to matter really. There’s always a constant overhead of using the array bind variable compared to the IN list, but that might as well be a benchmarking error. For instance, when I add the /*+GATHER_PLAN_STATISTICS*/ hint to both queries, interestingly, the one with the array got significantly faster, whereas the IN list one was not affected… Weird?

Conclusion

This article doesn’t go into why there’s such a big difference for small lists when the benefit is only apparent for quite large lists.

But it has once again shown, that we must not optimise prematurely in SQL, but measure, measure, measure things. IN lists in dynamic SQL queries can be a big issue in production when they lead to cursor cache / plan cache saturation and a lot of “hard parsing”. So, the benefit of using the array is much more drastic when the content is big, as we can recycle execution plans much more often than with IN lists.

But chances are, that IN lists may be faster for single executions.

In any case: Choose carefully when following advice that you find somewhere on the Internet. Also, when following this advice. I ran the benchmark on PostgreSQL 9.5 and Oracle 11gR2 XE. Both are not the latest database versions. Try to measure things again on your side, to be sure that your “improvement” is really an actual improvement! And if in doubt, don’t optimise, until you’re sure you actually have a problem.

SQL JOIN or EXISTS? Chances Are, You’re Doing it Wrong

I’ve noticed this very consistently with a lot of customers, and also with participants of our Data Geekery SQL Workshop (which I highly recommend to everyone, if you excuse the advertising): A lot of developers get the distinction between JOIN and SEMI-JOIN wrong. Let me explain…

What are JOIN and SEMI-JOIN

A little bit of relational algebra first. What is an (INNER) JOIN? An JOIN is nothing but a filtered cartesian product. And what is a cartesian product? Wikipedia explains this very nicely:

for sets A and B, the Cartesian product A × B is the set of all ordered pairs (a, b) where a ∈ A and b ∈ B.

That was the technical way of putting it. A more understandable way might be the following:

ranks = {A, K, Q, J, 10, 9, 8, 7, 6, 5, 4, 3, 2}
suits = {♠, ♥, ♦, ♣}

so, ranks × suits =
{(A, ♠), (A, ♥), (A, ♦), (A, ♣), (K, ♠),
…,
(3, ♣), (2, ♠), (2, ♥), (2, ♦), (2, ♣)}

Or, as an image:

By Trainler - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=7104281

By Trainler – Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=7104281

The above cartesian product models the combination of each rank with each suite. Simple, right?

In SQL, a cartesian product can be written as either a CROSS JOIN, or a table list in the FROM clause. The following query combines every customer with every staff member:

-- CROSS JOIN
SELECT c.last_name, s.last_name
FROM customer AS c
CROSS JOIN staff AS s

-- Table list
SELECT c.last_name, s.last_name
FROM customer AS c,
     staff AS s

Now, as I mentioned before, an (INNER) JOIN is nothing but a filtered CROSS JOIN, where the filter is applied in a dedicated USING or ON clause.

-- INNER JOIN with USING
SELECT c.last_name, s.last_name
FROM customer AS c
INNER JOIN staff AS s 
  USING (last_name)

-- INNER JOIN with ON
SELECT c.last_name, s.last_name
FROM customer AS c
INNER JOIN staff AS s 
  ON c.last_name = s.last_name

The above query will match only those customers with those users whose last_name are the same. As I’ve told you before, an (INNER) JOIN is just a filtered CROSS JOIN, so the below queries will be semantically equivalent to the above:

-- CROSS JOIN
SELECT c.last_name, s.last_name
FROM customer AS c
CROSS JOIN staff AS s
WHERE c.last_name = s.last_name

-- Table list
SELECT c.last_name, s.last_name
FROM customer AS c,
     staff AS s
WHERE c.last_name = s.last_name

Specifically the last version is still used in many SQL codebases, which have not yet migrated to the ANSI JOIN syntax (even if ANSI joins should be preferred for readability reasons).

But that might be wrong

Unfortunately, I’m seeing this mistake all the time, as I’ve mentioned before. JOIN might appear like a useful tool to match rows between tables. But remember one thing, and I’m starting to repeat myself:

(INNER) JOIN is just a filtered CROSS JOIN

This means that if you choose INNER JOIN to find those customers for which there are matching staff, you will create a cartesian product between customer and staff, and then apply a filter. Why is that a problem? Let’s assume the following:

Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| Alice      | Miller    |
| Max        | Doe       |
+------------+-----------+

Staff:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| Alice      | Peterson  |
| Jane       | Doe       |
+------------+-----------+

What happens when you run the above queries that use (INNER) JOIN to match customers with staff? Exactly. You’ll form a cartesian product first:

{ (John Doe, John Doe),
  (John Doe, Alice Peterson),
  (John Doe, Jane Doe),
  (Alice Miller, John Doe),
  (Alice Miller, Alice Peterson),
  (Alice Miller, Jane Doe),
  (Max Doe, John Doe),
  (Max Doe, Alice Peterson),
  (Max Doe, Jane Doe) }

… and then filter out the tuples that shouldn’t be in the result, i.e. the ones that don’t have matching last names (of course, the database might choose to optimise this and not materialise the entire cross product):

{ (John Doe, John Doe),
  (John Doe, Jane Doe),
  (Max Doe, John Doe),
  (Max Doe, Jane Doe) }

We’re now left with 4 tuples. That’s great, if that’s what you were after in the first place. A combination of all customers with all staff, for which the combination shares the same last name. But maybe you were asking yourself something else, namely:

Do we have any customers who are staff family members?

Use-case: Exclude such customers from a raffle (let’s assume that last names are a sufficient criteria here).

In that case, we’ll get “duplicate” records. Because the query that some of you might’ve written would have been:

-- INNER JOIN with USING
SELECT c.*
FROM customer AS c
INNER JOIN staff AS s 
  USING (last_name)

Yielding:

Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| John       | Doe       |
| Max        | Doe       |
| Max        | Doe       |
+------------+-----------+

Bummer. How to remove duplicates? With DISTINCT you might think:

-- INNER JOIN with USING
SELECT DISTINCT c.*
FROM customer AS c
INNER JOIN staff AS s 
  USING (last_name)

Yielding:

Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| Max        | Doe       |
+------------+-----------+

What’s wrong with DISTINCT?

Using DISTINCT in this situation is a big mistake. Why?

  • Your accidental cartesian product loads too many records from disk, and produces too many records in memory, which have to be removed again
  • DISTINCT can be expensive in some databases, that implement it via sorting, rather than via hashing
  • DISTINCT may change the semantics of your SELECT clause, with nasty side-effects
  • In order to prevent those side-effects, you might even resort to wrapping this DISTINCT query in a subselect, making performance even worse

That’s horrible. See also this list of common SQL mistakes:
https://blog.jooq.org/2013/07/30/10-common-mistakes-java-developers-make-when-writing-sql

How to do it right?

By using a SEMI-JOIN. It is called semi join (i.e. “half” join) in relational algebra, because we only care about one side of the JOIN operation in the results, not the other side. In this example, we only care about customers in the result. We don’t want to have any staff records. The relational algebra notation would be

Customer ⋉ Staff

Unfortunately, SQL doesn’t have SEMI JOIN keywords, so the following isn’t possible:

SELECT *
FROM customer AS c
LEFT SEMI JOIN staff AS s 
  USING (last_name)

The SQL way to express a SEMI JOIN is by using EXISTS () or IN (). The following two are equivalent:

-- Using EXISTS
SELECT *
FROM customer AS c
WHERE EXISTS (
  SELECT *
  FROM staff AS s
  WHERE c.last_name = s.last_name
)

-- Using IN
SELECT *
FROM customer
WHERE last_name IN (
  SELECT last_name
  FROM staff
)

(Note, that NOT EXISTS and NOT IN are NOT equivalent)

Not only are these queries more correct, they are also much faster in most SQL databases for a simple reason. The database can stop searching for staff as soon as it has encountered at least one staff for which there is a matching customer. This is also nicely explained in Dan Martensen’s article SQL Performance of JOIN and WHERE EXISTS. And we’ve blogged about a related topic here: SQL Tip of the Day: Be Wary of SELECT COUNT(*).

Semi Join and Anti Join in jOOQ

We believe that these useful relational operators should be first class citizens in SQL as we have stated in our blog post:
https://blog.jooq.org/2015/10/13/semi-join-and-anti-join-should-have-its-own-syntax-in-sql

Semi join

ctx.select()
   .from(Employee)
   .leftSemiJoin(Dept)
   .on(Employee.DeptName.eq(Dept.DeptName))
   .fetch();

Anti join

ctx.select()
   .from(Employee)
   .leftAntiJoin(Dept)
   .on(Employee.DeptName.eq(Dept.DeptName))
   .fetch();

The above is much easier to write, and will transform into the corresponding (NOT) EXISTS predicate.

Exception

There are some databases that may unfortunately show worse performance for some of these semi join / anti join operators. See, for instance this outdated article on MySQL performance:
http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql

Do measure first, before you believe any of these articles, though!

Another exception is when you have a primary key / foreign key relationship that guarantees that an (INNER) JOIN produces no duplicate values, i.e. when you’re joining a one-to-one or many-to-one relationship, then JOIN is a correct solution, but it is usually equally fast, so semi join will still be more readable.

Conclusion

If you need to check whether you have any matches between a table A and a table B, but you only really care about the results from table A, do make sure you’re using a SEMI-JOIN (i.e. an EXISTS or IN predicate), not an (INNER) JOIN.

Further reading:

INTERSECT – the Underestimated Two-Way IN Predicate

Have you ever wondered how you could express a predicate that “feels” like the following, in SQL:

WHERE Var1 OR Var2 IN (1, 2, 3)

/u/CyBerg90 has, on reddit. The idea was to create a predicate that yields true whenever both values Var1 and Var2 yield either 1, 2, or 3.

The canonical solution

The canonical solution would obviously be to write it all out as:

WHERE Var1 = 1 OR Var1 = 2 OR Var1 = 3
OR    Var2 = 1 OR Var2 = 2 OR Var2 = 3

A lot of duplication, though.

Using IN predicates

Most readers would just connect the two IN predicates:

WHERE Var1 IN (1, 2, 3)
OR    Var2 IN (1, 2, 3)

Or the clever ones might reverse the predicates as such, to form the equivalent:

WHERE 1 IN (Var1, Var2)
OR    2 IN (Var1, Var2)
OR    3 IN (Var1, Var2) 

Nicer solution using EXISTS and JOIN

All of the previous solutions require syntax / expression repetition to some extent. While this may not have any significant impact performance-wise, it can definitely explode in terms of expression length. Better solutions (from that perspective) make use of the EXISTS predicate, constructing ad-hoc sets that are non-empty when both Var1 and Var2 yield either 1, 2, or 3.

Here’s EXISTS with JOIN

WHERE EXISTS (
    SELECT 1
    FROM (VALUES (Var1), (Var2)) t1(v)
    JOIN (VALUES (1), (2), (3)) t2(v)
    ON t1.v = t2.v
)

This solution constructs two tables with a single value each, joining them on that value:

+------+    +------+
| t1.v |    | t2.v |
+------+    +------+
| Var1 |    |    1 |
| Var2 |    |    2 |
+------+    |    3 |
            +------+

Looking at a Venn Diagram, it is easy to see how JOIN will produce only those values from t1 and t2 that are present in both sets:

intersect

Nicest solution using EXISTS and INTERSECT

However, people might not think of a set intersection when they read JOIN. So why not make use of actual set intersection via INTERSECT? The following is the nicest solution in my opinion:

WHERE EXISTS (
    SELECT v
    FROM (VALUES (Var1), (Var2)) t1(v)
    INTERSECT
    SELECT v
    FROM (VALUES (1), (2), (3)) t2(v)
)

Observe, how the length of the SQL statement increases with O(m + n) (or simply O(N), where m, n = number of values in each set, whereas the original solutions using IN increase with O(m * n) (or simply O(N2)).

INTERSECT Support in popular RDBMS

INTERSECT is widely supported, both in the SQL standard as well as in any of the following RDBMS that are supported by jOOQ:

  • CUBRID
  • DB2
  • Derby
  • H2
  • HANA
  • HSQLDB
  • Informix
  • Ingres
  • Oracle
  • PostgreSQL
  • SQLite
  • SQL Server
  • Sybase ASE
  • Sybase SQL Anywhere

In fact, the following databases also support the less commonly used INTERSECT ALL, which doesn’t remove duplicate values from resulting bags (see also UNION vs. UNION ALL)

  • CUBRID
  • PostgreSQL

Happy intersecting!

ID Lists Aren’t the Best Solution for the N+1 Problem

In their eternal attempts to circumvent the N+1 problem, Hibernate users often resort to IN predicates with ID lists. In this post, we’ll see how those users might just be replacing a horrible thing with a bad one, which is better but not yet good. Here’s why:

The N+1 Problem

The N+1 problem is a well understood issue, documented in various blog posts. The previously linked article shows the following set of queries to explain the nature of this problem:

SELECT id, name FROM albums
SELECT id, name FROM songs WHERE album_id = 1
SELECT id, name FROM songs WHERE album_id = 2
SELECT id, name FROM songs WHERE album_id = 3
SELECT id, name FROM songs WHERE album_id = 4
SELECT id, name FROM songs WHERE album_id = 5

This set of queries is often produced by ORMs such as Hibernate, when entities are configured to be lazy fetched.

The article also tackles the problem by replacing the second set of N=5 queries by a single query with an IN predicate:

SELECT id, title, filename FROM songs
WHERE album_id IN (1, 2, 3, 4, 5)

This will reduce the number of queries from N+1 to 1+1, which is certainly faster.

But let’s look at things from the SQL side

Is such an IN predicate with an ID List a good solution? It is certainly viable for very small lists. But when your list grows, consider these things:

IN list size

Not all databases support arbitrary lengths of IN lists. In particular the following limitations exist:

  • Oracle IN predicate: 1000 elements
  • Ingres: 1024 total bind values
  • SQLite: 999 total bind values
  • Sybase ASE: 2000 total bind values
  • SQL Server 2008 R2: 2100 total bind values

This is quite annoying as developers have to probably learn the above the hard way. If you’re using jOOQ, you can “safely” ignore the above constraints as jOOQ will rewrite your query such that:

  • Large Oracle IN predicates are split into several OR-connected IN predicates, or AND-connected NOT IN predicates
  • Large amounts of bind values are detected at SQL rendering time and replaced by inline values

Variable binding speed

There’s quite a bit of work involved with variable binding in some JDBC drivers. Essentially, some database protocols will need to transfer many values one-by-one to the database. This doesn’t happen when you inline bind values, as the only thing transferred is a single SQL string. Having too many bind values (I’m talking about 10k or more), is certainly not a good idea.

Cursor cache misses

Sophisticated databases such as Oracle maintain cursor caches, which can be leveraged for cursor sharing. This means that subsequent executions of identical SQL statements will profit from expensive execution plan calculations having been done already, along with cursor statistics being collected. Think about it this way:

-- This is the first time Oracle encounters this 
-- query. The DB has to parse the query and 
-- calculate an execution plan, which can be quite 
-- expensive if you have lots of JOINs
SELECT id, name FROM songs 
WHERE album_id IN (?, ?)

-- This is the second time Oracle encounters this 
-- same query. The DB can now re-use the previous
-- execution plan as it is likely to be optimal 
-- again
SELECT id, name FROM songs 
WHERE album_id IN (?, ?)

-- This is not the same query as the previous ones
-- A new execution plan has to be calculated
SELECT id, name FROM songs 
WHERE album_id IN (?, ?, ?)

As you can quickly see, the above example shows that an ID list in an IN predicate is a moving target, which is likely to remove the usefulness of bind values entirely, as each query is prone to produce new cursors and new execution plans in your database. You might as well have inlined your bind values, which would have even helped you prevent bind value peeking issues.

So what’s better than ID lists?

There are a number of things that are better. Note that not all of them may be suitable for your concrete problem, and not all of them will always outperform ID lists. Use common sense and maybe a load and/or performance test to be sure, which is the best query in your situation.

Explicit “eager” fetching, using JOINs

Sometimes, it would just be easier to denormalise the data in the database. Instead of fetching songs one by one, just fetch them along with the albums:

SELECT
  a.id a_id, 
  a.name a_name,
  s.id s_id,
  s.name s_name
FROM albums a
JOIN songs s ON s.album_id = a.id

This will transfer more data over the wire (repeating album information) in exchange for executing only a single query (reducing N+1 to 1). This is only good for slight denormalisations. If you JOIN dozens of 1:N relationships, you might not be happy with this solution.

Semi-joining the original query

If you can access the original query’s SQL code, just semi-join it when fetching songs! It’s simple:

SELECT id, name FROM songs 
WHERE album_id IN (
  SELECT id FROM albums
)

-- Or using EXISTS
SELECT s.id, s.name FROM songs s
WHERE EXISTS (
  SELECT 1 FROM albums a
  WHERE a.id = s.album_id
)

This will require some SQL transformation. Again, using a typesafe query builder / SQL builder to compose queries, such as jOOQ, JaQu or Criteria API, you may be able to implement such SQL transformation / SQL composition more easily.

Note that this is probably the fastest solution that you can choose, at least in sophisticated databases with powerful query optimisers.

Using arrays for ID lists

If you really cannot query your songs without an ID list, at least, use a single array as a bind variable as such (Oracle dialect):

SELECT id, name FROM songs 
WHERE album_id IN (
  SELECT * FROM TABLE(?)
)

The above syntax is Oracle-specific. Check out this Stack Overflow question for other alternatives. Note that Oracle’s VARRAY and TABLE types are strongly typed, i.e. you will have to have such a type, first:

CREATE TYPE numbers AS TABLE OF NUMBER(38);

Alternatively, you can use one of these “built-in” table types:

  • ORA_MINING_NUMBER_NT
  • ORA_MINING_VARCHAR2_NT

Creating discrete-sized IN lists

If your database doesn’t support arrays, and you need to rely on ID lists, there is one last option that you may have to avoid too many cursor cache misses and hard parses. Create discrete-sized IN lists, filling up the bind values to the next discrete length. Let’s assume lengths 2, 3, 5, 8, 13. This is best explained by example:

-- Of course, this only makes sense with bind values
-- Inlining is done for the purpose of the example
-- only

-- Two IDs   fill up to 2
album_id IN (1, 2)

-- Three IDs fill up to 3
album_id IN (1, 2, 3)

-- Four IDs  fill up to 5
album_id IN (1, 2, 3, 4, 4)

-- Five IDs  fill up to 5
album_id IN (1, 2, 3, 4, 5)

-- Six IDs   fill up to 8
album_id IN (1, 2, 3, 4, 5, 6, 6, 6)

There is no rule of thumb at what steps your IN list sizes should increase, so you might want to actually measure this.

Note!: You may use NULL to fill up IN lists of an IN predicate, but not of a NOT IN predicate. To learn more about this, read this blog post about NULL and NOT IN predicates.

TL;DR: Get back in control of your SQL

As soon as a decent amount of data is involved with your data processing, common ORM models may not be sufficient anymore, as it is very hard to tune such ORMs. You may need to resort to SQL and explicitly express your SQL statements in the most optimal way for your problem domain.

SQL incompatibilities: NOT IN and NULL values

This is something where many hours of debugging have been spent in the lives of many SQL developers. The various situations where you can have NULL values in NOT IN predicates and anti-joins. Here’s a typical situation:

with data as (
  select 1 as id from dual union all
  select 2 as id from dual union all
  select 3 as id from dual
)
select * from data
where id not in (1, null)

What do you think this will return? Well, since “dual” indicates an Oracle database, you might say: “an empty result set”. And you would be right for Oracle. In fact, you would be right for any of these databases:

  • DB2
  • Derby
  • H2
  • Ingres
  • Oracle
  • Postgres
  • SQL Server
  • SQLite
  • Sybase

BUT! You would be wrong for any of these ones:

  • HSQLDB
  • MySQL
  • Sybase ASE

Why the discrepancy?

Intuitively, you’d say that all the big ones treat NULL specially in NOT IN predicates, and it is easy to understand, why:

-- This predicate here...
id not in (1, null)

-- Could be seen as equivalent to this one:
id != 1 and id != null

There’s no id that fulfills the above predicate id != null (not even null itself), hence an empty result set. MySQL is known for some strong abuse of SQL standards compliance, so it’s not surprising that they tweaked this syntax as well.

But wait!

HSQLDB 2.0 is one of the most standards-compliant databases out there, could they really have gotten it wrong? Let’s consider the standard: SQL 1992, chapter 8.4 <in predicate>:

<in predicate> ::=
   <row value constructor>
      [ NOT ] IN <in predicate value>

<in predicate value> ::=
   <table subquery>
      | <left paren> <in value list> <right paren>

<in value list> ::=
   <value expression> { <comma> <value expression> }...

 

And then, further down:

2) Let RVC be the <row value constructor> and 
   let IPV be the <in predicate value>.

3) The expression
     RVC NOT IN IPV

   is equivalent to
     NOT ( RVC IN IPV )

4) The expression
     RVC IN IPV

   is equivalent to
     RVC = ANY IPV

 

So in fact, this can be said:

ID NOT IN (1, NULL) is equivalent to
NOT (ID IN (1, NULL)), equivalent to
NOT (ID = ANY(1, NULL)), equivalent to
NOT (ID = 1 OR ID = NULL), equivalent to
NOT (ID = 1) AND NOT (ID = NULL), which is always UNKNOWN

Conclusion

It looks for once, that HSQLDB 2.0 is not standards-compliant in that evaluating the expression inside NOT() before applying NOT() has a different outcome from transforming NOT() into a normalised boolean expression, and then evaluating the expression. For SQL developers, all of this can just mean:

Keep NULL out of NOT IN predicates or be doomed!