When Using Bind Variables is not Enough: Dynamic IN Lists

In a previous blog post, I wrote about why you should (almost) always default to using bind variables. There are some exceptions, which I will cover in another follow-up post, but by default, bind variables are the right choice, both from a performance and from a security perspective.

In this article, I will show an example where regrettably, bind variables are not enough, and you can still run into significant performance issues in production. That’s when you create dynamic IN lists.

What’s the problem with dynamic IN lists

Again, this blog post is applicable for most databases, but I’ll use Oracle as an example, only.

An IN predicate that takes a list is a handy abbreviation for an equivalent OR predicate. The following two queries are semantically equivalent:

-- IN predicate
SELECT *
FROM actor
WHERE actor_id IN (1, 2, 3)

-- OR predicate
SELECT *
FROM actor
WHERE actor_id = 1
OR actor_id = 2
OR actor_id = 3

The execution plans are, respectively:

IN list

---------------------------------------------------------
| Id  | Operation                    | Name     | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT             |          |     3 |
|   1 |  INLIST ITERATOR             |          |       |
|   2 |   TABLE ACCESS BY INDEX ROWID| ACTOR    |     3 |
|*  3 |    INDEX UNIQUE SCAN         | PK_ACTOR |     3 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("ACTOR_ID"=1 OR "ACTOR_ID"=2 OR "ACTOR_ID"=3)

OR predicate

---------------------------------------------------------
| Id  | Operation                    | Name     | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT             |          |     3 |
|   1 |  INLIST ITERATOR             |          |       |
|   2 |   TABLE ACCESS BY INDEX ROWID| ACTOR    |     3 |
|*  3 |    INDEX UNIQUE SCAN         | PK_ACTOR |     3 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("ACTOR_ID"=1 OR "ACTOR_ID"=2 OR "ACTOR_ID"=3)

We get the exact same plan. And the plan is also optimal. When looking for 3 actors, we’ll run 3 primary key lookups and get a cardinality estimate of 3. Perfect, what could be wrong about this?

Remember execution plan caching

In that previous blog post, we’ve seen the detrimental effects we can get under load when execution plans are no longer cached in the execution plan cache (or in Oracle: cursor cache). But even when we’re using bind variables, we can run into these issues. Consider the following dynamic SQL queries:

SELECT count(*) FROM actor WHERE actor_id IN (?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?)

Now, let’s generate these SQL queries using dynamic SQL in PL/SQL. PL/SQL is not a very dynamic SQL friendly language as can be seen in the following snippet:

SET SERVEROUTPUT ON
DECLARE
  v_count NUMBER;
  
  FUNCTION binds (n NUMBER) RETURN VARCHAR2 IS
    v_result VARCHAR2(1000);
  BEGIN
    FOR i IN 1..n LOOP
      IF v_result IS NULL THEN
        v_result := ':b' || i;
      ELSE
        v_result := v_result || ', :b' || i;
      END IF;
    END LOOP;
    
    RETURN v_result;
  END binds;
  
  FUNCTION vals (n NUMBER) RETURN VARCHAR2 IS
    v_result VARCHAR2(1000);
  BEGIN
    FOR i IN 1..n LOOP
      IF v_result IS NULL THEN
        v_result := i;
      ELSE
        v_result := v_result || ', ' || i;
      END IF;
    END LOOP;
    
    RETURN v_result;
  END vals;
BEGIN
  FOR i IN 1..5 LOOP
    EXECUTE IMMEDIATE '
      DECLARE 
        v_count NUMBER;
      BEGIN
        EXECUTE IMMEDIATE ''
          SELECT count(*)
          FROM actor
          WHERE actor_id IN (' || binds(i) || ')
        ''
        INTO v_count
        USING ' || vals(i) || ';
        
        :v_count := v_count;
      END;
    '
    USING OUT v_count
    ;
    
    dbms_output.put_line(v_count);
  END LOOP;
END;
/

Running this snippet yields, as expected:

1
2
3
4
5

There are a few noteworthy things to say about the above:

  • The functions BINDS() and VALS() generate strings like :b1, :b2, :b3 and 1, 2, 3. We need these to generate dynamically a list of bind variables for the IN predicates, as well as a list of bind values for the USING clause of the nested EXECUTE IMMEDIATE statement
  • The EXECUTE IMMEDIATE .. USING clause does not take dynamically sized bind value lists, so we have to generate that as well, in a nested EXECUTE IMMEDIATE statement

An appropriate gif at this point when nesting EXECUTE IMMEDIATE like we did is this one:

Or also:

So, now that we got this covered…

… let’s look at execution plans:

SELECT s.sql_id, p.*
FROM v$sql s, TABLE (
  dbms_xplan.display_cursor (
    s.sql_id, s.child_number, 'ALLSTATS LAST'
  )
) p
WHERE lower(s.sql_text) LIKE '%actor_id in %'
AND lower(s.sql_text) NOT LIKE '%v$sql%'
AND lower(s.sql_text) NOT LIKE '%execute_immediate%';

And the result is (some additional formatting applied):

SQL_ID  cwm09k8zqp2at, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1)
 
Plan hash value: 1971314150
 
------------------------------------------------
| Id  | Operation          | Name     | E-Rows |
------------------------------------------------
|   0 | SELECT STATEMENT   |          |        |
|   1 |  SORT AGGREGATE    |          |      1 |
|*  2 |   INDEX UNIQUE SCAN| PK_ACTOR |      1 |
------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("ACTOR_ID"=:B1)
 
SQL_ID  1wypurx1x0mrp, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2)
 
Plan hash value: 577114894
 
-------------------------------------------------
| Id  | Operation           | Name     | E-Rows |
-------------------------------------------------
|   0 | SELECT STATEMENT    |          |        |
|   1 |  SORT AGGREGATE     |          |      1 |
|   2 |   INLIST ITERATOR   |          |        |
|*  3 |    INDEX UNIQUE SCAN| PK_ACTOR |      2 |
-------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2))

SQL_ID  2y06nwgpxqttn, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3)
 
Plan hash value: 577114894
 
-------------------------------------------------
| Id  | Operation           | Name     | E-Rows |
-------------------------------------------------
|   0 | SELECT STATEMENT    |          |        |
|   1 |  SORT AGGREGATE     |          |      1 |
|   2 |   INLIST ITERATOR   |          |        |
|*  3 |    INDEX UNIQUE SCAN| PK_ACTOR |      3 |
-------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2 OR "ACTOR_ID"=:B3))

SQL_ID  d4nn8qf9n22yt, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3, :b4)
 
Plan hash value: 577114894
 
-------------------------------------------------
| Id  | Operation           | Name     | E-Rows |
-------------------------------------------------
|   0 | SELECT STATEMENT    |          |        |
|   1 |  SORT AGGREGATE     |          |      1 |
|   2 |   INLIST ITERATOR   |          |        |
|*  3 |    INDEX UNIQUE SCAN| PK_ACTOR |      4 |
-------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2 OR "ACTOR_ID"=:B3 OR 
              "ACTOR_ID"=:B4))

SQL_ID  4n9b4zgxr5cwj, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3, :b4, :b5)
 
Plan hash value: 577114894
 
-------------------------------------------------
| Id  | Operation           | Name     | E-Rows |
-------------------------------------------------
|   0 | SELECT STATEMENT    |          |        |
|   1 |  SORT AGGREGATE     |          |      1 |
|   2 |   INLIST ITERATOR   |          |        |
|*  3 |    INDEX UNIQUE SCAN| PK_ACTOR |      5 |
-------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2 OR "ACTOR_ID"=:B3 OR 
              "ACTOR_ID"=:B4 OR "ACTOR_ID"=:B5))

So, again, we’re down to having 5 different SQL_IDs and statements in the execution plan cache / cursor cache, even if they all derive from a single logical statement. Now, in PL/SQL, this is so hard to do, you’re probably not doing it at all.

But if you’re using a query builder (or wrote your own), like JPA’s Criteria Query, or jOOQ, then you will be writing something like this (using jOOQ, because in fact, using the Criteria Query API, dynamic SQL is about as hard to do as in PL/SQL):

for (int i = 1; i <= 5; i++) {
    System.out.println(
        DSL.using(configuration)
           .select(count())
           .from(ACTOR)
           .where(ACTOR.ACTOR_ID.in(
                IntStream.rangeClosed(1, i)
                         .boxed()
                         .collect(toList())
           )
           .fetchOne(count())
    );
}

So this query will generate the same queries as the PL/SQL code, and this really wasn’t that hard. Perhaps, in this particular case, it’s not that much of a good thing that this isn’t so hard to do!

Let’s benchmark this, again

There are different ways to benchmark this problem. One mistake we could make here is to just benchmark small IN lists with large ones, but that’s not fair. If you have to send a large IN list to the server, you cannot easily change that.

Another option is to benchmark alternative approaches of generating those IN lists

  • The default way is to have a list of exactly size N
  • An alternative way would be to pad the list to a size M where M = 2 ^ X such that M(X) >= N and M(X - 1) < N

The latter approach would, instead of generating the original queries:

SELECT count(*) FROM actor WHERE actor_id IN (?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?)

… generate these ones:

SELECT count(*) FROM actor WHERE actor_id IN (?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?, ?, ?, ?)

Where the last bind variable is always repeated (padded) until the list is filled up to a length that is a power of 2 (and in Oracle, of course, we’d split lists longer than 1000 items into several lists).

This padding is what jOOQ already offers:
https://www.jooq.org/doc/latest/manual/sql-building/dsl-context/custom-settings/settings-in-list-padding/

And maybe, a future Hibernate will offer it too, after I’ve suggested the feature to Vlad Mihalcea from the Hibernate team:
https://hibernate.atlassian.net/browse/HHH-12469

Here’s the benchmark logic. It’s a bit more complex than previously:

SET SERVEROUTPUT ON

ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;

CREATE TABLE results (
  run     NUMBER(2),
  stmt    NUMBER(2),
  elapsed NUMBER
);

DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 20000;
  v_first_name actor.first_name%TYPE;
  v_last_name  actor.last_name%TYPE;
  
  FUNCTION binds (n NUMBER, pad BOOLEAN) RETURN VARCHAR2 IS
    v_result VARCHAR2(32767);
    v_n_pad  NUMBER(10) := power(2, ceil(ln(n)/ln(2)));
  BEGIN
    FOR i IN 1..n LOOP
      IF v_result IS NULL THEN
        v_result := ':b' || i;
      ELSIF mod(i, 1000) = 0 THEN
        v_result := v_result || ') OR actor_id IN (:b' || i;
      ELSE
        v_result := v_result || ', :b' || i;
      END IF;
    END LOOP;
    
    IF pad THEN
      FOR i IN n + 1 .. v_n_pad LOOP
        IF mod(i, 1000) = 0 THEN
          v_result := v_result || ') OR actor_id IN (:b' || i;
        ELSE
          v_result := v_result || ', :b' || i;
        END IF;
      END LOOP;
    END IF;
    
    RETURN v_result;
  END binds;
  
  FUNCTION vals (n NUMBER, pad BOOLEAN) RETURN VARCHAR2 IS
    v_result VARCHAR2(32767);
    v_n_pad  NUMBER(10) := power(2, ceil(ln(n)/ln(2)));
  BEGIN
    FOR i IN 1..n LOOP
      IF v_result IS NULL THEN
        v_result := i;
      ELSE
        v_result := v_result || ', ' || i;
      END IF;
    END LOOP;
    
    IF pad THEN
      FOR i IN n + 1 .. v_n_pad LOOP
        v_result := v_result || ', ' || n;
      END LOOP;
    END IF;
    
    RETURN v_result;
  END vals;
  
  PROCEDURE run(i NUMBER, pad BOOLEAN) IS
  BEGIN
    EXECUTE IMMEDIATE '
      DECLARE 
        v_count NUMBER;
      BEGIN
        EXECUTE IMMEDIATE ''
          SELECT count(*)
          FROM actor
          WHERE actor_id IN (' || binds(i, pad) || ')
        ''
        INTO v_count
        USING ' || vals(i, pad) || ';
      END;
    ';
  END run;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      run(mod(i, 100) + 1, FALSE);
    END LOOP;
  
    INSERT INTO results VALUES (r, 1, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      run(mod(i, 100) + 1, TRUE);
    END LOOP;
  
    INSERT INTO results VALUES (r, 2, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
  END LOOP;
  
  FOR rec IN (
    SELECT 
      run, stmt, 
      CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio 
    FROM results
  )
  LOOP
    dbms_output.put_line('Run ' || rec.run || 
      ', Statement ' || rec.stmt || 
      ' : ' || rec.ratio);
  END LOOP;
END;
/

DROP TABLE results;

Bear in mind the following:

  • We run the benchmark 5 times
  • In each run, we run 20000 queries with each technique
  • Those 20000 queries have between 1 and 100 elements in the IN list

The results show that both approaches fare equally well:

Run 1, Statement 1 : 1.16944
Run 1, Statement 2 : 1.11204
Run 2, Statement 1 : 1.06086
Run 2, Statement 2 : 1.03591
Run 3, Statement 1 : 1.03589
Run 3, Statement 2 : 1.03605
Run 4, Statement 1 : 1.33935
Run 4, Statement 2 : 1.2822
Run 5, Statement 1 : 1
Run 5, Statement 2 : 1.04648

This means that there is no significant overhead caused by the padding in this benchmark. That’s good news.

Running the following query shows the number of executions per statement:

SELECT executions, regexp_count(sql_text, ':'), sql_text
FROM v$sql
WHERE lower(sql_text) LIKE '%actor_id in %'
AND lower(sql_text) NOT LIKE '%v$sql%'
AND lower(sql_text) NOT LIKE '%execute_immediate%'
ORDER BY sql_text;

The result being:

EXECS   BINDS   SQL_TEXT
2000	1       SELECT count(*) FROM actor WHERE actor_id IN (:b1)         
2000	2       SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2)         
1000	3       SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3)         
3000	4       ...
1000	5       ...
1000	6       ...
1000	7       ...
5000	8       ...
1000	9       ...
1000	10      ...
1000	11      ...
1000	12      ...
1000	13	...
1000	14	...
1000	15	...
9000	16	...
...
1000	100	...
36000	128	...

The version that creates exactly sized IN lists is always executed 1000 times. The version that has padded IN lists sized to the next power of 2 (M = 2 ^ X) is executed 1000 * X times (plus another 1000 times because there’s also an IN list of that exact size)

At the end, there are some “lone” 128-element sized IN lists. This is expected. The number of different SQL statements is:

  • 100 for the exactly sized IN lists
  • 7 for the padded IN lists (128 = 2 ^ 7)

So, thus far, we haven’t gained anything, but also not lost much.

What if the lists get longer

The real problem isn’t the individual query execution time, which is fine in both cases, but the effect that dynamic IN lists have on a system. Let’s generate IN lists of up to size 4096, and bearing in mind that Oracle won’t allow this, the lists will be of the form:

SELECT count(*)
FROM actor
WHERE actor_id IN (:b1, :b2, ..., :b1000)
OR actor_id IN (:b1001, :b1002, ..., :b2000)
OR actor_id IN (:b2001, ...)

Now, with a maximum size of 4096, there are quite a few distinct statements that need to be parsed. As opposed to only 12 when we use the padded in lists. The padded size grows logarithmically, so we’ll never really run into this same problem again, at least not for a single IN list.

How does the comparison fare in another benchmark? We must be careful not to overdo things. Running 2x 20000 queries with variable sized IN lists can easily run for more than an hour on Oracle. Perhaps, it’s Oracle’s way of punishing you for doing something you shouldn’t: Having super-long IN lists. I’ve played around with different sets of parameters to the benchmark and the clear message here is that having long bind variable lists creates a bottleneck per se (i.e. the binding of the variables takes time), so the difference in parse times isn’t as drastic as it was before, for individual query executions. The overall improvement is this:

Run 1, Statement 1 : 1.42696
Run 1, Statement 2 : 1

So, we get around a 1.5x increase of performance when repeating the bind values. Yet still, the fact that we have far less cached execution plans does matter, especially when we start exhausting the cursor cache, because with this approach, at least this exhaustion is not going to happen.

Alternative options

The above shows that if you really cannot get around an IN list, a simple, viable workaround is to pad the lists. But often, you can get around them. In databases like Oracle and PostgreSQL, that offer array types, you can consider using those once your lists start growing. I’ve blogged about this in a previous blog post:
https://blog.jooq.org/2017/03/30/sql-in-predicate-with-in-list-or-with-array-which-is-faster

If your lists become huge, a bulk/batch inserting the IDs into a temporary table might be a far better option, and then semi-join that one instead:

SELECT count(*)
FROM actor
WHERE actor_id IN (
  SELECT id FROM temp_table
)

And ideally, if your list is really not dynamic, but originates from some other table, then please, stop passing around IDs and semi-join the original query that produced the IDs in the first place:

SELECT count(*)
FROM actor
WHERE actor_id IN (
  SELECT id FROM other_table WHERE ...
)

That last option should always be your first choice as it will very likely outperform all the others.

7 thoughts on “When Using Bind Variables is not Enough: Dynamic IN Lists

  1. Some other cons.
    Problem with skewed columns, bind peek done on parse phase only (diff ways to workaround).
    Slightly more involved to get query and values while looking at dynamic views.
    From dev side, and that is the big problem, writing raw SQL without binds is much easier and follows naturally.
    Some values should be hardcoded, dev should know to chose.
    Not all rdbms do big difference from perf side, which can be the problem when switching db type and have larger codebase.
    Likely you wrote already about that stuff, sorry if I jumped the gun here 🙂

  2. Normally, I’ll dynamically generate a huge UNION sub query to solve large IN parameter
    (Oracle)
    WHERE column IN (
    SELECT ‘1’ FROM DUAL
    UNION ALL
    SELECT ‘2’ FROM DUAL
    UNION ALL
    SELECT ‘3’ FROM DUAL
    UNION ALL



    UNION ALL
    SELECT ‘1001’ FROM DUAL
    )

      • OK, benchmarked it now. Code here:

        Results on my machine:

        Run 1, Statement 1 : +000000000 00:00:00.099910000
        Run 1, Statement 2 : +000000000 00:00:06.192357000
        
        Run 2, Statement 1 : +000000000 00:00:00.021193000
        Run 2, Statement 2 : +000000000 00:00:00.150208000
        
        Run 3, Statement 1 : +000000000 00:00:00.023189000
        Run 3, Statement 2 : +000000000 00:00:00.143014000
        
        Run 4, Statement 1 : +000000000 00:00:00.021395000
        Run 4, Statement 2 : +000000000 00:00:00.150450000
        
        Run 5, Statement 1 : +000000000 00:00:00.020566000
        Run 5, Statement 2 : +000000000 00:00:00.158801000
        

        Your solution is much worse than the OR-connected IN list. But anyway, at some point, both solutions suck as an array-based (or even temp table based) solution will be better:
        https://blog.jooq.org/2017/03/30/sql-in-predicate-with-in-list-or-with-array-which-is-faster

      • I tested both queries with a simple JDBC program, the result is vary. sometimes UNION ALL is faster than OR.

        import java.sql.*;
        import java.util.stream.Collectors;
        import java.util.stream.IntStream;
        
        public class OracleINPerf {
        
            private static final String URL = "";
            private static final String USER = "";
            private static final String PASSWORD = "";
        
            private static final int ITERATION = 5;
        
            public static void main(String[] args) throws SQLException {
                try (Connection conn = getConnection()) {
                    run("OR", conn, buildORSql());
                    System.out.println();
                    run("UNION", conn, buildUnionSql());
                }
            }
        
            private static void run(String name, Connection conn, String sql) throws SQLException {
                long total = 0L;
                for (int i = 0; i  i.toString())
                                     .collect(Collectors.joining(", "))
                        )
                        .append(") OR ACTOR_ID IN(")
                        .append(
                            IntStream.rangeClosed(1001, 2000)
                                     .boxed()
                                     .map(i -> i.toString())
                                     .collect(Collectors.joining(", "))
                        )
                        .append(')')
                        .toString();
            }
        
            private static String buildUnionSql() {
                return new StringBuilder()
                        .append("SELECT * FROM ACTOR WHERE ACTOR_ID IN(")
                        .append(
                            IntStream.rangeClosed(1, 2000)
                                     .boxed()
                                     .map(i -> "SELECT " + i + " FROM DUAL")
                                     .collect(Collectors.joining(" UNION ALL "))
                        )
                        .append(')')
                        .toString();
            }
        
            private static Connection getConnection() throws SQLException {
                return DriverManager.getConnection(URL, USER, PASSWORD);
            }
        }
        
        • Also, always make sure you run your benchmarks several times. You now ran the OR query first, which means that it gets all the penalty of warming up the buffer caches and other things. Clearly, the second run gets an unfair benefit from being second. Compare this to my benchmark results, where both initial runs had a small penalty, which is why the benchmark is run 5 times.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s