When Using Bind Variables is not Enough: Dynamic IN Lists

In a previous blog post, I wrote about why you should (almost) always default to using bind variables. There are some exceptions, which I will cover in another follow-up post, but by default, bind variables are the right choice, both from a performance and from a security perspective.

In this article, I will show an example where, regrettably, bind variables are not enough, and you can still run into significant performance issues in production: when you create dynamic IN lists.

What’s the problem with dynamic IN lists?

Again, this blog post applies to most databases, but I’ll use Oracle as the only example.

An IN predicate that takes a list is a handy abbreviation for an equivalent OR predicate. The following two queries are semantically equivalent:

-- IN predicate
SELECT *
FROM actor
WHERE actor_id IN (1, 2, 3)

-- OR predicate
SELECT *
FROM actor
WHERE actor_id = 1
OR actor_id = 2
OR actor_id = 3

The execution plans are, respectively:

IN list

---------------------------------------------------------
| Id  | Operation                    | Name     | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT             |          |     3 |
|   1 |  INLIST ITERATOR             |          |       |
|   2 |   TABLE ACCESS BY INDEX ROWID| ACTOR    |     3 |
|*  3 |    INDEX UNIQUE SCAN         | PK_ACTOR |     3 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("ACTOR_ID"=1 OR "ACTOR_ID"=2 OR "ACTOR_ID"=3)

OR predicate

---------------------------------------------------------
| Id  | Operation                    | Name     | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT             |          |     3 |
|   1 |  INLIST ITERATOR             |          |       |
|   2 |   TABLE ACCESS BY INDEX ROWID| ACTOR    |     3 |
|*  3 |    INDEX UNIQUE SCAN         | PK_ACTOR |     3 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("ACTOR_ID"=1 OR "ACTOR_ID"=2 OR "ACTOR_ID"=3)

We get the exact same plan, and the plan is also optimal. When looking for 3 actors, we’ll run 3 primary key lookups and get a cardinality estimate of 3. Perfect. What could be wrong with this?

Remember execution plan caching

In that previous blog post, we’ve seen the detrimental effects we can get under load when execution plans are no longer cached in the execution plan cache (or in Oracle: cursor cache). But even when we’re using bind variables, we can run into these issues. Consider the following dynamic SQL queries:

SELECT count(*) FROM actor WHERE actor_id IN (?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?)

Now, let’s generate these SQL queries using dynamic SQL in PL/SQL. PL/SQL is not a very dynamic-SQL-friendly language, as can be seen in the following snippet:

SET SERVEROUTPUT ON
DECLARE
  v_count NUMBER;
  
  FUNCTION binds (n NUMBER) RETURN VARCHAR2 IS
    v_result VARCHAR2(1000);
  BEGIN
    FOR i IN 1..n LOOP
      IF v_result IS NULL THEN
        v_result := ':b' || i;
      ELSE
        v_result := v_result || ', :b' || i;
      END IF;
    END LOOP;
    
    RETURN v_result;
  END binds;
  
  FUNCTION vals (n NUMBER) RETURN VARCHAR2 IS
    v_result VARCHAR2(1000);
  BEGIN
    FOR i IN 1..n LOOP
      IF v_result IS NULL THEN
        v_result := i;
      ELSE
        v_result := v_result || ', ' || i;
      END IF;
    END LOOP;
    
    RETURN v_result;
  END vals;
BEGIN
  FOR i IN 1..5 LOOP
    EXECUTE IMMEDIATE '
      DECLARE 
        v_count NUMBER;
      BEGIN
        EXECUTE IMMEDIATE ''
          SELECT count(*)
          FROM actor
          WHERE actor_id IN (' || binds(i) || ')
        ''
        INTO v_count
        USING ' || vals(i) || ';
        
        :v_count := v_count;
      END;
    '
    USING OUT v_count
    ;
    
    dbms_output.put_line(v_count);
  END LOOP;
END;
/

Running this snippet yields, as expected:

1
2
3
4
5

There are a few noteworthy things to say about the above:

  • The functions BINDS() and VALS() generate strings like :b1, :b2, :b3 and 1, 2, 3. We need these to dynamically generate the list of bind variables for the IN predicate, as well as the list of bind values for the USING clause of the nested EXECUTE IMMEDIATE statement
  • The EXECUTE IMMEDIATE .. USING clause does not take dynamically sized bind value lists, so we have to generate that clause as well, in a nested EXECUTE IMMEDIATE statement

So, now that we got this covered…

… let’s look at execution plans:

SELECT s.sql_id, p.*
FROM v$sql s, TABLE (
  dbms_xplan.display_cursor (
    s.sql_id, s.child_number, 'ALLSTATS LAST'
  )
) p
WHERE lower(s.sql_text) LIKE '%actor_id in %'
AND lower(s.sql_text) NOT LIKE '%v$sql%'
AND lower(s.sql_text) NOT LIKE '%execute_immediate%';

And the result is (some additional formatting applied):

SQL_ID  cwm09k8zqp2at, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1)
 
Plan hash value: 1971314150
 
------------------------------------------------
| Id  | Operation          | Name     | E-Rows |
------------------------------------------------
|   0 | SELECT STATEMENT   |          |        |
|   1 |  SORT AGGREGATE    |          |      1 |
|*  2 |   INDEX UNIQUE SCAN| PK_ACTOR |      1 |
------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("ACTOR_ID"=:B1)
 
SQL_ID  1wypurx1x0mrp, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2)
 
Plan hash value: 577114894
 
-------------------------------------------------
| Id  | Operation           | Name     | E-Rows |
-------------------------------------------------
|   0 | SELECT STATEMENT    |          |        |
|   1 |  SORT AGGREGATE     |          |      1 |
|   2 |   INLIST ITERATOR   |          |        |
|*  3 |    INDEX UNIQUE SCAN| PK_ACTOR |      2 |
-------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2))

SQL_ID  2y06nwgpxqttn, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3)
 
Plan hash value: 577114894
 
-------------------------------------------------
| Id  | Operation           | Name     | E-Rows |
-------------------------------------------------
|   0 | SELECT STATEMENT    |          |        |
|   1 |  SORT AGGREGATE     |          |      1 |
|   2 |   INLIST ITERATOR   |          |        |
|*  3 |    INDEX UNIQUE SCAN| PK_ACTOR |      3 |
-------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2 OR "ACTOR_ID"=:B3))

SQL_ID  d4nn8qf9n22yt, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3, :b4)
 
Plan hash value: 577114894
 
-------------------------------------------------
| Id  | Operation           | Name     | E-Rows |
-------------------------------------------------
|   0 | SELECT STATEMENT    |          |        |
|   1 |  SORT AGGREGATE     |          |      1 |
|   2 |   INLIST ITERATOR   |          |        |
|*  3 |    INDEX UNIQUE SCAN| PK_ACTOR |      4 |
-------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2 OR "ACTOR_ID"=:B3 OR 
              "ACTOR_ID"=:B4))

SQL_ID  4n9b4zgxr5cwj, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3, :b4, :b5)
 
Plan hash value: 577114894
 
-------------------------------------------------
| Id  | Operation           | Name     | E-Rows |
-------------------------------------------------
|   0 | SELECT STATEMENT    |          |        |
|   1 |  SORT AGGREGATE     |          |      1 |
|   2 |   INLIST ITERATOR   |          |        |
|*  3 |    INDEX UNIQUE SCAN| PK_ACTOR |      5 |
-------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2 OR "ACTOR_ID"=:B3 OR 
              "ACTOR_ID"=:B4 OR "ACTOR_ID"=:B5))

So, again, we’re down to having 5 different SQL_IDs and statements in the execution plan cache / cursor cache, even if they all derive from a single logical statement. Now, in PL/SQL, this is so hard to do, you’re probably not doing it at all.

But if you’re using a query builder (or wrote your own), like JPA’s Criteria API or jOOQ, then you will be writing something like this (using jOOQ here, because in fact, with the Criteria API, dynamic SQL is about as hard to do as in PL/SQL):

for (int i = 1; i <= 5; i++) {
    System.out.println(
        DSL.using(configuration)
           .select(count())
           .from(ACTOR)
           .where(ACTOR.ACTOR_ID.in(
                IntStream.rangeClosed(1, i)
                         .boxed()
                         .collect(toList())
           ))
           .fetchOne(count())
    );
}

So this loop will generate the same queries as the PL/SQL code, and it really wasn’t that hard. Perhaps, in this particular case, it’s not such a good thing that this is so easy to do!

Let’s benchmark this, again

There are different ways to benchmark this problem. One mistake we could make here is to compare queries with small IN lists against queries with large ones, but that’s not a fair comparison: if you have to send a large IN list to the server, you cannot easily avoid that.

Another option is to benchmark alternative approaches of generating those IN lists

  • The default way is to have a list of exactly size N
  • An alternative way would be to pad the list to a size M, where M = 2 ^ X is the smallest power of 2 such that M >= N

The latter approach would, instead of generating the original queries:

SELECT count(*) FROM actor WHERE actor_id IN (?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?)

… generate these ones:

SELECT count(*) FROM actor WHERE actor_id IN (?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?, ?, ?, ?)

Here, the last bind variable is simply repeated (padded) until the list is filled up to a length that is a power of 2 (and in Oracle, of course, we’d split lists longer than 1000 items into several IN lists).
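
In case you want to implement such padding yourself, here’s a minimal sketch of the idea in Java (class and method names are illustrative, not any particular library’s API):

import java.util.ArrayList;
import java.util.List;

public class InListPadding {

    // Pad a list to the next power of 2 by repeating its last element
    static <T> List<T> pad(List<T> list) {
        int n = list.size();
        int m = Integer.highestOneBit(n);   // largest power of 2 <= n
        if (m < n)
            m <<= 1;                        // next power of 2 >= n

        List<T> result = new ArrayList<>(list);
        while (result.size() < m)
            result.add(list.get(n - 1));    // repeat the last value
        return result;
    }
}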

This padding is what jOOQ already offers:
https://www.jooq.org/doc/latest/manual/sql-building/dsl-context/custom-settings/settings-in-list-padding/
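
Assuming you’re on a jOOQ version that supports this setting (see the manual page above for the authoritative API), enabling it might look like this sketch:

// Enable IN list padding in jOOQ's Settings
Settings settings = new Settings().withInListPadding(true);
DSLContext ctx = DSL.using(connection, SQLDialect.ORACLE, settings);

// A 3-element list is now rendered as a padded 4-element IN list:
// ... WHERE ACTOR_ID IN (?, ?, ?, ?)
ctx.selectFrom(ACTOR)
   .where(ACTOR.ACTOR_ID.in(1, 2, 3))
   .fetch();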

And maybe, a future Hibernate will offer it too, after I’ve suggested the feature to Vlad Mihalcea from the Hibernate team:
https://hibernate.atlassian.net/browse/HHH-12469

Here’s the benchmark logic. It’s a bit more complex than the previous one:

SET SERVEROUTPUT ON

ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;

CREATE TABLE results (
  run     NUMBER(2),
  stmt    NUMBER(2),
  elapsed NUMBER
);

DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 20000;
  v_first_name actor.first_name%TYPE;
  v_last_name  actor.last_name%TYPE;
  
  FUNCTION binds (n NUMBER, pad BOOLEAN) RETURN VARCHAR2 IS
    v_result VARCHAR2(32767);
    v_n_pad  NUMBER(10) := power(2, ceil(ln(n)/ln(2)));
  BEGIN
    FOR i IN 1..n LOOP
      IF v_result IS NULL THEN
        v_result := ':b' || i;
      ELSIF mod(i, 1000) = 0 THEN
        v_result := v_result || ') OR actor_id IN (:b' || i;
      ELSE
        v_result := v_result || ', :b' || i;
      END IF;
    END LOOP;
    
    IF pad THEN
      FOR i IN n + 1 .. v_n_pad LOOP
        IF mod(i, 1000) = 0 THEN
          v_result := v_result || ') OR actor_id IN (:b' || i;
        ELSE
          v_result := v_result || ', :b' || i;
        END IF;
      END LOOP;
    END IF;
    
    RETURN v_result;
  END binds;
  
  FUNCTION vals (n NUMBER, pad BOOLEAN) RETURN VARCHAR2 IS
    v_result VARCHAR2(32767);
    v_n_pad  NUMBER(10) := power(2, ceil(ln(n)/ln(2)));
  BEGIN
    FOR i IN 1..n LOOP
      IF v_result IS NULL THEN
        v_result := i;
      ELSE
        v_result := v_result || ', ' || i;
      END IF;
    END LOOP;
    
    IF pad THEN
      FOR i IN n + 1 .. v_n_pad LOOP
        v_result := v_result || ', ' || n;
      END LOOP;
    END IF;
    
    RETURN v_result;
  END vals;
  
  PROCEDURE run(i NUMBER, pad BOOLEAN) IS
  BEGIN
    EXECUTE IMMEDIATE '
      DECLARE 
        v_count NUMBER;
      BEGIN
        EXECUTE IMMEDIATE ''
          SELECT count(*)
          FROM actor
          WHERE actor_id IN (' || binds(i, pad) || ')
        ''
        INTO v_count
        USING ' || vals(i, pad) || ';
      END;
    ';
  END run;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      run(mod(i, 100) + 1, FALSE);
    END LOOP;
  
    INSERT INTO results VALUES (r, 1, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      run(mod(i, 100) + 1, TRUE);
    END LOOP;
  
    INSERT INTO results VALUES (r, 2, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
  END LOOP;
  
  FOR rec IN (
    SELECT 
      run, stmt, 
      CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio 
    FROM results
  )
  LOOP
    dbms_output.put_line('Run ' || rec.run || 
      ', Statement ' || rec.stmt || 
      ' : ' || rec.ratio);
  END LOOP;
END;
/

DROP TABLE results;

Bear in mind the following:

  • We run the benchmark 5 times
  • In each run, we run 20000 queries with each technique
  • Those 20000 queries have between 1 and 100 elements in the IN list

The results show that both approaches fare equally well:

Run 1, Statement 1 : 1.16944
Run 1, Statement 2 : 1.11204
Run 2, Statement 1 : 1.06086
Run 2, Statement 2 : 1.03591
Run 3, Statement 1 : 1.03589
Run 3, Statement 2 : 1.03605
Run 4, Statement 1 : 1.33935
Run 4, Statement 2 : 1.2822
Run 5, Statement 1 : 1
Run 5, Statement 2 : 1.04648

This means that there is no significant overhead caused by the padding in this benchmark. That’s good news.

Running the following query shows the number of executions per statement:

SELECT executions, regexp_count(sql_text, ':'), sql_text
FROM v$sql
WHERE lower(sql_text) LIKE '%actor_id in %'
AND lower(sql_text) NOT LIKE '%v$sql%'
AND lower(sql_text) NOT LIKE '%execute_immediate%'
ORDER BY sql_text;

The result being:

EXECS   BINDS   SQL_TEXT
2000    1       SELECT count(*) FROM actor WHERE actor_id IN (:b1)
2000    2       SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2)
1000    3       SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3)
3000    4       ...
1000    5       ...
1000    6       ...
1000    7       ...
5000    8       ...
1000    9       ...
1000    10      ...
1000    11      ...
1000    12      ...
1000    13      ...
1000    14      ...
1000    15      ...
9000    16      ...
...
1000    100     ...
36000   128     ...
The version that creates exactly sized IN lists is executed 1000 times per list size. The version that pads IN lists to the next power of 2 (M = 2 ^ X) collects, for each padded size M, the 1000 executions of every original size N with M / 2 < N <= M, i.e. 1000 * M / 2 executions, plus another 1000 executions from the exactly sized version, because there’s also an IN list of that exact size.

At the end, there are some “lone” 128-element sized IN lists. This is expected. The number of different SQL statements is:

  • 100 for the exactly sized IN lists
  • 8 for the padded IN lists (the powers of 2 from 1 = 2 ^ 0 up to 128 = 2 ^ 7)

So, thus far, we haven’t gained anything, but also not lost much.

What if the lists get longer?

The real problem isn’t the individual query execution time, which is fine in both cases, but the effect that dynamic IN lists have on a system. Let’s generate IN lists of up to size 4096. Bearing in mind that Oracle won’t allow more than 1000 elements per IN list, the generated lists will be of the form:

SELECT count(*)
FROM actor
WHERE actor_id IN (:b1, :b2, ..., :b1000)
OR actor_id IN (:b1001, :b1002, ..., :b2000)
OR actor_id IN (:b2001, ...)

Now, with a maximum size of 4096, there are quite a few distinct statements that need to be parsed, as opposed to only around a dozen when we use the padded IN lists. The number of distinct padded sizes grows only logarithmically, so we’ll never really run into this same problem again, at least not for a single IN list.

How does the comparison fare in another benchmark? We must be careful not to overdo things: running 2x 20000 queries with variable sized IN lists can easily run for more than an hour on Oracle. Perhaps it’s Oracle’s way of punishing you for doing something you shouldn’t: having super long IN lists. I’ve played around with different benchmark parameters, and the clear message is that long bind variable lists create a bottleneck per se (i.e. binding the variables takes time), so the difference in parse times isn’t as drastic as before for individual query executions. The overall improvement is this:

Run 1, Statement 1 : 1.42696
Run 1, Statement 2 : 1

So, we get roughly a 1.4x performance improvement when repeating the bind values. Still, the fact that we have far fewer cached execution plans does matter, especially once we start exhausting the cursor cache, because with the padding approach, at least that exhaustion is not going to happen.

Alternative options

The above shows that if you really cannot get around an IN list, a simple, viable workaround is to pad the lists. But often, you can get around them. In databases like Oracle and PostgreSQL, that offer array types, you can consider using those once your lists start growing. I’ve blogged about this in a previous blog post:
https://blog.jooq.org/2017/03/30/sql-in-predicate-with-in-list-or-with-array-which-is-faster

If your lists become huge, bulk/batch inserting the IDs into a temporary table might be a far better option; you can then semi-join that table instead:

SELECT count(*)
FROM actor
WHERE actor_id IN (
  SELECT id FROM temp_table
)
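
With JDBC, such an approach might look like the following sketch (it assumes a temporary table TEMP_TABLE(ID) exists, and the usual java.sql imports):

static int countActors(Connection c, List<Integer> ids)
throws SQLException {

    // Batch-insert the IDs into the temporary table...
    try (PreparedStatement ins = c.prepareStatement(
             "INSERT INTO temp_table (id) VALUES (?)")) {
        for (int id : ids) {
            ins.setInt(1, id);
            ins.addBatch();
        }
        ins.executeBatch();
    }

    // ... then semi-join it, using a single, constant SQL string
    try (PreparedStatement sel = c.prepareStatement(
             "SELECT count(*) FROM actor "
           + "WHERE actor_id IN (SELECT id FROM temp_table)");
         ResultSet rs = sel.executeQuery()) {
        rs.next();
        return rs.getInt(1);
    }
}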

And ideally, if your list is really not dynamic, but originates from some other table, then please, stop passing around IDs and semi-join the original query that produced the IDs in the first place:

SELECT count(*)
FROM actor
WHERE actor_id IN (
  SELECT id FROM other_table WHERE ...
)

That last option should always be your first choice as it will very likely outperform all the others.

Why SQL Bind Variables are Important for Performance

A common problem with dynamic SQL is parsing performance in production. What makes matters worse is that many developers do not have access to production environments, so they are unaware of the problem (even though there’s nothing new about this topic). What exactly is the problem?

Execution plan caches

Most database vendors these days ship with an execution plan cache (Oracle calls it cursor cache), where previously parsed SQL statements are stored and their execution plans are cached for reuse. This is the main reason why bind variables are so important (the other reason being SQL injection prevention). By using bind variables, we make sure that the database easily recognises an identical SQL statement from a previous execution and can reuse the previously calculated execution plan. This is in fact one of my favourite topics from my SQL training.

Let’s see what happens in various databases, if we run the following queries:

-- First, run them with "inline values" or "constant literals"
SELECT first_name, last_name FROM actor WHERE actor_id = 1;
SELECT first_name, last_name FROM actor WHERE actor_id = 2;

-- Then, run the same queries again with bind values
SELECT first_name, last_name FROM actor WHERE actor_id = ?;
SELECT first_name, last_name FROM actor WHERE actor_id = ?;

Note, it doesn’t matter if the queries are run from JDBC, jOOQ, Hibernate, or the procedural language in the database, e.g. PL/SQL, T-SQL, or PL/pgSQL. The result is always the same.

Let’s run an example

I’ll run the following examples using Oracle only. Other databases behave in a similar fashion.

We’ll run the following script, which includes the above queries, and a query to fetch all the execution plans:

SELECT first_name, last_name FROM actor WHERE actor_id = 1;
SELECT first_name, last_name FROM actor WHERE actor_id = 2;

SET SERVEROUTPUT ON
DECLARE
  v_first_name actor.first_name%TYPE;
  v_last_name  actor.last_name%TYPE;
BEGIN
  FOR i IN 1 .. 2 LOOP
    SELECT first_name, last_name 
    INTO v_first_name, v_last_name
    FROM actor 
    WHERE actor_id = i;
    
    dbms_output.put_line(v_first_name || ' ' || v_last_name);
  END LOOP;
END;
/

SELECT s.sql_id, p.*
FROM v$sql s, TABLE (
  dbms_xplan.display_cursor (
    s.sql_id, s.child_number, 'ALLSTATS LAST'
  )
) p
WHERE lower(s.sql_text) LIKE '%actor_id = %';

The output is:

SQL_ID  90rk04nhr45yz, child number 0
-------------------------------------
SELECT FIRST_NAME, LAST_NAME FROM ACTOR WHERE ACTOR_ID = :B1
 
Plan hash value: 457831946
 
---------------------------------------------------------
| Id  | Operation                   | Name     | E-Rows |
---------------------------------------------------------
|   0 | SELECT STATEMENT            |          |        |
|   1 |  TABLE ACCESS BY INDEX ROWID| ACTOR    |      1 |
|*  2 |   INDEX UNIQUE SCAN         | PK_ACTOR |      1 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("ACTOR_ID"=:B1)
 

SQL_ID  283s8m524c9rk, child number 0
-------------------------------------
SELECT first_name, last_name FROM actor WHERE actor_id = 2
 
Plan hash value: 457831946
 
---------------------------------------------------------
| Id  | Operation                   | Name     | E-Rows |
---------------------------------------------------------
|   0 | SELECT STATEMENT            |          |        |
|   1 |  TABLE ACCESS BY INDEX ROWID| ACTOR    |      1 |
|*  2 |   INDEX UNIQUE SCAN         | PK_ACTOR |      1 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("ACTOR_ID"=2)
 

SQL_ID  3mks715670mqw, child number 0
-------------------------------------
SELECT first_name, last_name FROM actor WHERE actor_id = 1
 
Plan hash value: 457831946
 
---------------------------------------------------------
| Id  | Operation                   | Name     | E-Rows |
---------------------------------------------------------
|   0 | SELECT STATEMENT            |          |        |
|   1 |  TABLE ACCESS BY INDEX ROWID| ACTOR    |      1 |
|*  2 |   INDEX UNIQUE SCAN         | PK_ACTOR |      1 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("ACTOR_ID"=1)

The plans are always the same, and since we’re accessing primary key values, we’ll always get the same cardinalities, so there doesn’t seem to be anything wrong with any individual execution. But notice how the predicate information differs slightly. When querying for a constant value, the predicate includes that value right there, whereas with the bind variable, we don’t know from the plan what the predicate value is. This is perfectly expected, because we want to reuse that plan for both executions of the query.

With another query, we can see the number of executions of each statement:

SELECT sql_id, sql_text, executions
FROM v$sql
WHERE sql_id IN (
  '90rk04nhr45yz', 
  '283s8m524c9rk', 
  '3mks715670mqw'
);

Yielding:

SQL_ID          SQL_TEXT                                                        EXECUTIONS
------------------------------------------------------------------------------------------
90rk04nhr45yz   SELECT FIRST_NAME, LAST_NAME FROM ACTOR WHERE ACTOR_ID = :B1   2
283s8m524c9rk   SELECT first_name, last_name FROM actor WHERE actor_id = 2     1
3mks715670mqw   SELECT first_name, last_name FROM actor WHERE actor_id = 1     1

This is where it gets more interesting. In the second case, where we used a bind variable (automatically generated by PL/SQL), we could reuse the statement, cache its plan, and run it twice.

Meh, does it matter?

It matters for two reasons:

  • Performance of individual executions
  • Performance of your entire system

How this affects individual executions

Obviously, maintaining a cache has a slight overhead of its own, but that overhead is small compared to the gain of not having to redo the work whose result is cached. The work in question here is parsing the SQL statement and creating an execution plan for it. Even if the plan is trivial, as in the above examples, there is overhead involved with calculating this plan.

This overhead can best be shown in a benchmark, a technique that I also demonstrate in my SQL training:

SET SERVEROUTPUT ON

-- Don't run these on production
-- But on your development environment, this guarantees clean caches
ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;

CREATE TABLE results (
  run     NUMBER(2),
  stmt    NUMBER(2),
  elapsed NUMBER
);

DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 2000;
  v_first_name actor.first_name%TYPE;
  v_last_name  actor.last_name%TYPE;
BEGIN

  -- Repeat whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      BEGIN
        EXECUTE IMMEDIATE '
          SELECT first_name, last_name 
          FROM actor 
          WHERE actor_id = ' || i 
        INTO v_first_name, v_last_name;
      EXCEPTION
        -- Please forgive me
        WHEN OTHERS THEN NULL;
      END;
    END LOOP;
  
    INSERT INTO results VALUES (
      r, 1, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      BEGIN
        EXECUTE IMMEDIATE '
          SELECT first_name, last_name 
          FROM actor 
          WHERE actor_id = :i'
        INTO v_first_name, v_last_name
        USING i;
      EXCEPTION
        -- Please forgive me
        WHEN OTHERS THEN NULL;
      END;
    END LOOP;
      
    INSERT INTO results VALUES (
      r, 2, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
  END LOOP;
  
  FOR rec IN (
    SELECT 
      run, stmt, 
      CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio 
    FROM results
  )
  LOOP
    dbms_output.put_line('Run ' || rec.run || 
      ', Statement ' || rec.stmt || 
      ' : ' || rec.ratio);
  END LOOP;
END;
/

DROP TABLE results;

As always, on the jOOQ blog, we don’t publish actual execution times to comply with license restrictions on benchmark publications, so we’re only comparing each execution with the fastest execution. This is the result of the above:

Run 1, Statement 1 : 83.39893
Run 1, Statement 2 : 1.1685
Run 2, Statement 1 : 3.02697
Run 2, Statement 2 : 1
Run 3, Statement 1 : 2.72028
Run 3, Statement 2 : 1.03996
Run 4, Statement 1 : 2.70929
Run 4, Statement 2 : 1.00866
Run 5, Statement 1 : 2.71895
Run 5, Statement 2 : 1.02198

We can see that, consistently, the SQL version using a bind variable is around 2.7x as fast as the one using a literal. This overhead is very significant for trivial queries; it might be a bit less so for more complex queries, where the execution itself takes more time compared to the parsing. But it should be obvious that the overhead is a price we do not want to pay. We want the query and its plan to be cached!

Notice also how the very first run of the benchmark has a very significant overhead, because all of the 2000 literal queries are encountered for the first time, before being cached for the subsequent runs. That’s a price we only pay during the first run, though.

How this affects your entire system

Not only do individual query executions suffer, your entire system does, too. After running the benchmark a few times, these are the execution statistics I’m getting from the Oracle cursor cache:

SELECT 
  count(*), 
  avg(executions), 
  min(executions), 
  max(executions)
FROM v$sql
WHERE lower(sql_text) LIKE '%actor_id = %'
AND sql_text NOT LIKE '%v$sql%';

Yielding:

count  avg      min  max
2001   9.9950   5    10000

There are now 2001 queries in my cache: the bind variable query, which has been executed 10000 times (the benchmark was repeated 5x, with 2000 executions per run), and 2000 literal queries that have been executed 5x each (once per benchmark run). That also explains the average: (10000 + 2000 * 5) / 2001 is roughly 9.995.

If instead we run the benchmark with 20000 query executions per run (and remember, each distinct filtered ACTOR_ID produces a distinct SQL string), then the result will be vastly different!

Run 1, Statement 1 : 86.85862
Run 1, Statement 2 : 1.13546
Run 2, Statement 1 : 78.39842
Run 2, Statement 2 : 1.01298
Run 3, Statement 1 : 72.45254
Run 3, Statement 2 : 1
Run 4, Statement 1 : 73.78357
Run 4, Statement 2 : 2.24365
Run 5, Statement 1 : 84.89842
Run 5, Statement 2 : 1.143

Oh, my. Why has this happened? Let’s check the cursor cache stats again:

SELECT 
  count(*), 
  avg(executions), 
  min(executions), 
  max(executions)
FROM v$sql
WHERE lower(sql_text) LIKE '%actor_id = %'
AND sql_text NOT LIKE '%v$sql%';

Yielding:

count  avg      min  max
15738  3.4144   1    20000

This is a vastly different result. We don’t have all of our 20000 queries in the cursor cache, only some of them. This means that some statements have been purged from the cache to make room for new ones (which is reasonable behaviour for any cache).

But purging them is problematic, too, because the way the benchmark was designed, they will re-appear in the second, third, fourth, and fifth run, so we should have kept them in the cache. And since we’re executing every query the same number of times, there really wasn’t any way of identifying a “more reasonable” (i.e. rare) query to purge.

Resources in a system are always limited, and so is the cursor cache size. The more distinct queries we’re running in a system, the less they can profit from the cursor cache.

This is not a problem for rarely run queries, including reports, analytics, or some special queries run only by some very few users. But the queries that are being run all the time should always be cached.

I cannot stress enough how serious this can be:

In the above case, what should have been a single query in the cursor cache exploded into 20000 queries, shoving many much more useful queries out of the cache. Not only does this slow down the execution of this particular query, it also purges tons of completely unrelated queries from the cache, thus slowing down the entire system by similar factors. If everyone is slowed down drastically, everyone starts queueing up to have their SQL queries parsed, and in the worst case, you can bring down your entire server with this problem!

Workaround

Some databases support automatically replacing constant literals with bind variables while parsing. In Oracle, you can specify CURSOR_SHARING = FORCE as a “quick fix”. In SQL Server, the feature is called forced parameterization.

But this approach has its own limitations and overhead, as it means that additional parsing work needs to be performed every time to recognise constant literals and replace them with bind variables. And this overhead applies to all queries!

Conclusion

Bind variables are very important for SQL performance. After years of training people to use them for SQL injection prevention (which is already a good thing), we have now seen how important they are for performance reasons, too.

Not using a bind variable for values like IDs, timestamps, names, or anything that is uniformly distributed and has many distinct values in your column will produce the above problem. The exception are columns with only very few distinct values (like true/false flags, or codes that encode a given state), in which case a constant literal can be a reasonable option (follow-up blog post coming soon).

But bind variables should always be your default choice. If you’re using a client-side tool like jOOQ or Hibernate, bind variables tend to be the default, and you’re fine. If you’re using a stored procedural language like PL/SQL or T-SQL, bind variables are generated automatically, and you’re fine as well. But if you’re using JDBC or any JDBC wrapper like Spring’s JdbcTemplate, or any other string based API, like JPA’s native query API, then you are on your own again, and you must explicitly take care of using bind variables every time you have variable input.
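
For example, with plain JDBC, the difference looks like this (a minimal sketch, assuming the usual java.sql imports and the actor table from above):

// Don't: the value is embedded in the SQL string, so every distinct ID
// produces a distinct statement that is parsed and cached separately
static String lastName(Connection c, int id) throws SQLException {
    try (Statement s = c.createStatement();
         ResultSet rs = s.executeQuery(
             "SELECT last_name FROM actor WHERE actor_id = " + id)) {
        return rs.next() ? rs.getString(1) : null;
    }
}

// Do: a single SQL string with a bind variable, whose plan can be reused
static String lastNameBind(Connection c, int id) throws SQLException {
    try (PreparedStatement ps = c.prepareStatement(
             "SELECT last_name FROM actor WHERE actor_id = ?")) {
        ps.setInt(1, id);
        try (ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getString(1) : null;
        }
    }
}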

And in our next article, we’ll see how bind variables are actually not enough, when using dynamic IN lists, another topic that I borrowed from my SQL training.

When to Use Bind Values, and When to Use Inline Values in SQL

Users of jOOQ, PL/SQL, or T-SQL are spoiled, as they hardly ever need to worry about bind values. Consider the following statements:

Using jOOQ

public int countActors(String firstName, String lastName) {
    return ctx.selectCount()
              .from(ACTOR)
              .where(ACTOR.FIRST_NAME.eq(firstName))
              .and(ACTOR.LAST_NAME.eq(lastName))
              .fetchOneInto(int.class);
}

The method parameters firstName and lastName will be automatically mapped to bind values in the generated SQL statement. Here’s the debug log output when running the above, where the first statement is sent to the JDBC driver and then to the database, whereas the second statement is generated for debugging purposes only:

Executing query          : 
    select count(*)
    from "SAKILA"."ACTOR"
    where (
      "SAKILA"."ACTOR"."FIRST_NAME" = ?
      and "SAKILA"."ACTOR"."LAST_NAME" = ?
    )
-> with bind values      : 
    select count(*)
    from "SAKILA"."ACTOR"
    where (
      "SAKILA"."ACTOR"."FIRST_NAME" = 'SUSAN'
      and "SAKILA"."ACTOR"."LAST_NAME" = 'DAVIS'
    )
Fetched result           : +-----+
                         : |count|
                         : +-----+
                         : |    2|
                         : +-----+

With this API design, users don’t have to worry about binding variables explicitly at all, nor about remembering bind variable indexes or the data type of a bind value. This works because the overloaded Field<T>.eq(T) method (as well as pretty much every other method that works in a similar way) internally delegates to the more generic Field<T>.eq(Field<T>) method by wrapping the argument value in an explicit bind variable expression DSL.val(T) in the jOOQ SQL expression tree.
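
In other words, the following two predicates are equivalent, the first one implicitly creating the bind variable that the second one creates explicitly:

// Convenience overload: the value is wrapped in a bind variable internally
Condition c1 = ACTOR.FIRST_NAME.eq("SUSAN");

// Explicit equivalent, using DSL.val()
Condition c2 = ACTOR.FIRST_NAME.eq(DSL.val("SUSAN"));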

Using PL/SQL

The same is true when you’re using PL/SQL (or any other stored procedure language of another database), for instance:

SET SERVEROUTPUT ON
DECLARE
  
  FUNCTION count_actors (
    p_first_name VARCHAR2, 
    p_last_name VARCHAR2
  ) RETURN NUMBER IS 
    v_result NUMBER(10);
  BEGIN
    SELECT count(*)
    INTO v_result
    FROM actor
    WHERE first_name = p_first_name
    AND last_name = p_last_name;
    
    RETURN v_result;
  END count_actors;
  
BEGIN
  dbms_output.put_line(count_actors('SUSAN', 'DAVIS'));
END;
/

To verify what happened, let’s look at the execution plan:

SELECT p.*
FROM (
  SELECT *
  FROM v$sql
  WHERE upper(sql_text) LIKE 'SELECT COUNT(*) FROM ACTOR%'
  ORDER BY last_active_time DESC
  FETCH NEXT 1 ROW ONLY
) s 
CROSS APPLY TABLE(dbms_xplan.display_cursor(sql_id => s.sql_id)) p;

As you can see below, bind variables were generated for the SQL query that was embedded in the PL/SQL function:

SQL_ID  9dgammbskx5tx, child number 0
-------------------------------------
SELECT COUNT(*) FROM ACTOR WHERE FIRST_NAME = :B2 AND LAST_NAME = :B1
 
Plan hash value: 3384208144
 
----------------------------------------------------
| Id  | Operation         | Name           | Rows  |
----------------------------------------------------
|   0 | SELECT STATEMENT  |                |       |
|   1 |  SORT AGGREGATE   |                |     1 |
|*  2 |   INDEX RANGE SCAN| IDX_ACTOR_NAME |     1 |
----------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("LAST_NAME"=:B1 AND "FIRST_NAME"=:B2)

Why is it important? 1: SQL injection

I’ve noticed this time and again when talking to Java developers: Many developers are aware of the risk of SQL injection when not using bind variables. This can happen when we write dynamic SQL like the following, e.g. using JDBC or jOOQ’s plain SQL API:

String sql = 
  "SELECT count(*) "
+ "FROM actor "
+ "WHERE 1 = 1 "
+ (firstName != null ? "AND first_name = '" + firstName + "' " : "")
+ (lastName != null ? "AND last_name = '" + lastName + "' " : "");

// JDBC
try (Statement s = connection.createStatement();
     ResultSet rs = s.executeQuery(sql)) {

    ...
}

// jOOQ
Result<?> result = ctx.fetch(sql);

Seriously. Don’t do this. Ever! Always use bind values for user input. There’s hardly any reason at all why you should inline the values. I mean, of course, you could if you always remember to manually escape all strings, e.g.:

public static String escape(String string) {
    // TODO: Handle MySQL's non-standard backslash escaping, too
    return string == null ? null : string.replace("'", "''");
}

And then:

String sql = 
  "SELECT count(*) "
+ "FROM actor "
+ "WHERE 1 = 1 "
+ (firstName != null ? "AND first_name = '" + escape(firstName) + "' " : "")
+ (lastName != null ? "AND last_name = '" + escape(lastName) + "' " : "");

Notice that there’s still a vulnerability risk in MySQL, which doesn’t necessarily conform to standard SQL string literal escaping. A very unfortunate MySQL “feature”, which is handled correctly by jOOQ:
https://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sqlmode_no_backslash_escapes

But why run the risk? It’s so much easier with bind values.

Why is it important? 2: Performance!

So, most Java developers, luckily, are aware of SQL injection vulnerabilities and mostly get this right. But there’s another thing that few Java developers are aware of, unfortunately. And that’s the performance aspect of using bind values. Let’s assume we’re not using bind values for the above query. Check this out:

SELECT count(*)
FROM actor
WHERE first_name = 'SUSAN'
AND last_name = 'DAVIS';

SELECT count(*)
FROM actor
WHERE first_name = 'NICK'
AND last_name = 'WAHLBERG';

And now, let’s find execution plans:

SELECT p.*
FROM (
  SELECT *
  FROM v$sql
  WHERE upper(sql_text) LIKE 'SELECT COUNT(*) FROM ACTOR%'
  ORDER BY last_active_time DESC
  FETCH NEXT 2 ROWS ONLY
) s 
CROSS APPLY TABLE(dbms_xplan.display_cursor(sql_id => s.sql_id)) p;

Result:

SQL_ID  12r8afykqkzbd, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE first_name = 'NICK' AND last_name = 'WAHLBERG'
 
Plan hash value: 3384208144
 
----------------------------------------------------
| Id  | Operation         | Name           | Rows  |
----------------------------------------------------
|   0 | SELECT STATEMENT  |                |       |
|   1 |  SORT AGGREGATE   |                |     1 |
|*  2 |   INDEX RANGE SCAN| IDX_ACTOR_NAME |     1 |
----------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("LAST_NAME"='WAHLBERG' AND "FIRST_NAME"='NICK')
 
SQL_ID  gfppqr9gpjqws, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE first_name = 'SUSAN' AND last_name = 'DAVIS'
 
Plan hash value: 3384208144
 
----------------------------------------------------
| Id  | Operation         | Name           | Rows  |
----------------------------------------------------
|   0 | SELECT STATEMENT  |                |       |
|   1 |  SORT AGGREGATE   |                |     1 |
|*  2 |   INDEX RANGE SCAN| IDX_ACTOR_NAME |     1 |
----------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("LAST_NAME"='DAVIS' AND "FIRST_NAME"='SUSAN')

As you can see, the two almost identical queries produced the exact same execution plan twice (same plan hash value), but they are stored under two distinct SQL_ID values. Yes, they’re two distinct SQL statements!

Why?

Most sophisticated databases (including Oracle, SQL Server, DB2, and others) implement an execution plan cache. The reason for this is simple: calculating an execution plan with cost based optimisation is expensive. Not for these trivial statements, but imagine you have dozens of JOINs, semi-joins, unions, and what not. There are thousands of candidate execution plans, and the database needs to find the best one for you. That can be tons of work, and we don’t want the database to do that work every time.

So, the database will run a soft-parse (Oracle speak) to identify a SQL query and translate its SQL string to a SQL_ID. It will then check if there’s already a suitable plan available for that particular SQL_ID, and if so, it will avoid the so-called hard-parse (Oracle speak) to calculate a new plan.

Let me repeat this one more time: Making best use of this plan cache (Oracle speak: Cursor cache) is extremely important. You can severely overload a system up to the point of bringing it down, if you’re not using bind variables.

There’s a workaround to use CURSOR_SHARING=FORCE, which will transform inline values to bind values, but I’m not even going to explain how it works, because most Oracle experts advise you not to use that feature due to the significant drawbacks it will bring.

What if I want to inline values?

As we’ve seen above, when using jOOQ or PL/SQL, the above problems are really non-issues, because it is quite unlikely that you accidentally inline your bind values:

  • In jOOQ, you’d have to use plain SQL
  • In PL/SQL, you’d have to use dynamic SQL using DBMS_SQL

Anyway, in rare cases, users want to inline their bind values so that they appear directly in the resulting SQL statement. This is never the case for user input, for ID values, or for ordinary search values. But it can be the case when you query for discriminators or “enum” values (enforced through a CHECK constraint, for instance), or when you run a report only once a year (in which case the plan is never available from the cache).

In these cases, it can be of advantage to not use bind values specifically to prevent the database from re-using a cached execution plan, because you know that the cached plan, which works well for 90% of the queries, won’t work well for this particular bind value (or in the case of the once-per-year report, you might get a slightly better plan by giving the database more information).
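
In jOOQ, for instance, you could opt out of the default bind variable by using DSL.inline() (a sketch, assuming generated code for the SKEWED table defined below, and static imports for val() and inline()):

// Renders a bind variable:   ... WHERE v = ?
ctx.selectCount().from(SKEWED).where(SKEWED.V.eq(val(1)));

// Renders an inline literal: ... WHERE v = 1
ctx.selectCount().from(SKEWED).where(SKEWED.V.eq(inline(1)));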

As an example, let’s insert one row with value 1 and 99999 rows with value 0 into a table:

CREATE TABLE skewed (
  v NUMBER(1)
);

INSERT INTO skewed
SELECT decode(level, 1, 1, 0)
FROM dual
CONNECT BY level <= 100000;

CREATE INDEX skewed_i ON skewed(v);

EXEC dbms_stats.gather_table_stats('TEST', 'SKEWED');

Now, clearly we see that when querying the table SKEWED for values V = 0, the index is useless, whereas it is very useful for values V = 1. Let’s run this statement here:

SET SERVEROUTPUT ON
DECLARE
  v_bind NUMBER(1);
  v_result NUMBER(10);
BEGIN
  v_bind := 0;
  
  SELECT count(*)
  INTO v_result
  FROM skewed
  WHERE v = v_bind;
  
  FOR rec IN (
    SELECT * FROM TABLE(dbms_xplan.display_cursor)
  ) LOOP
    dbms_output.put_line(rec.plan_table_output);
  END LOOP;

  v_bind := 1;
  
  SELECT count(*)
  INTO v_result
  FROM skewed
  WHERE v = v_bind;
  
  FOR rec IN (
    SELECT * FROM TABLE(dbms_xplan.display_cursor)
  ) LOOP
    dbms_output.put_line(rec.plan_table_output);
  END LOOP;
  
  SELECT count(*)
  INTO v_result
  FROM skewed
  WHERE v = 1;
  
  FOR rec IN (
    SELECT * FROM TABLE(dbms_xplan.display_cursor)
  ) LOOP
    dbms_output.put_line(rec.plan_table_output);
  END LOOP;
END;
/

The above block runs the exact same statement three times:

  1. With a bind variable of 0 (can’t really use the index)
  2. With a bind variable of 1 (should be using the index)
  3. With an inline value of 1 (should be using the index)

Here’s the result:

SQL_ID  1q0qjm8za06w3, child number 0
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = :B1
 
Plan hash value: 4055318479
 
--------------------------------------------------
| Id  | Operation             | Name     | Rows  |
--------------------------------------------------
|   0 | SELECT STATEMENT      |          |       |
|   1 |  SORT AGGREGATE       |          |     1 |
|*  2 |   INDEX FAST FULL SCAN| SKEWED_I | 99999 |
--------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("V"=:B1)
 
SQL_ID  1q0qjm8za06w3, child number 0
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = :B1
 
Plan hash value: 4055318479
 
--------------------------------------------------
| Id  | Operation             | Name     | Rows  |
--------------------------------------------------
|   0 | SELECT STATEMENT      |          |       |
|   1 |  SORT AGGREGATE       |          |     1 |
|*  2 |   INDEX FAST FULL SCAN| SKEWED_I | 99999 |
--------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("V"=:B1)
 
SQL_ID  bdpjxzqpg2416, child number 0
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = 1
 
Plan hash value: 276090370
 
----------------------------------------------
| Id  | Operation         | Name     | Rows  |
----------------------------------------------
|   0 | SELECT STATEMENT  |          |       |
|   1 |  SORT AGGREGATE   |          |     1 |
|*  2 |   INDEX RANGE SCAN| SKEWED_I |     1 |
----------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("V"=1)

Observe the estimated numbers of rows above:

  1. In the first case, the estimate is correct. With V = 0, we’ll get 99999 rows, so we might just as well scan the entire index to calculate the count value
  2. In the second case, we’ve inherited the cached execution plan from the first run, including the estimate of 99999 rows, which is clearly wrong in this case. The database should’ve estimated 1 row here
  3. In the third case, the estimate is again correct (note the different SQL_ID), and we get an optimal plan

We’re just out of luck. If we had reversed the order of the queries, we would have gotten the right estimate for V = 1, but a wrong estimate for V = 0.

Adaptive Cursor Sharing

Oracle has some features to remedy the above problems. Oracle 11g introduced adaptive cursor sharing, which means that if we re-execute the above statements, the database will have figured out that in this particular case, the plans should depend on the actual bind variable, because in hindsight, the second plan was wrong for V = 1.

On a second execution of the previous block, we’ll see:

SQL_ID  1q0qjm8za06w3, child number 1
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = :B1
 
Plan hash value: 4055318479
 
--------------------------------------------------
| Id  | Operation             | Name     | Rows  |
--------------------------------------------------
|   0 | SELECT STATEMENT      |          |       |
|   1 |  SORT AGGREGATE       |          |     1 |
|*  2 |   INDEX FAST FULL SCAN| SKEWED_I | 99999 |
--------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("V"=:B1)
 
SQL_ID  1q0qjm8za06w3, child number 2
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = :B1
 
Plan hash value: 276090370
 
----------------------------------------------
| Id  | Operation         | Name     | Rows  |
----------------------------------------------
|   0 | SELECT STATEMENT  |          |       |
|   1 |  SORT AGGREGATE   |          |     1 |
|*  2 |   INDEX RANGE SCAN| SKEWED_I |     1 |
----------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("V"=:B1)
 
SQL_ID  bdpjxzqpg2416, child number 0
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = 1
 
Plan hash value: 276090370
 
----------------------------------------------
| Id  | Operation         | Name     | Rows  |
----------------------------------------------
|   0 | SELECT STATEMENT  |          |       |
|   1 |  SORT AGGREGATE   |          |     1 |
|*  2 |   INDEX RANGE SCAN| SKEWED_I |     1 |
----------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("V"=1)

As you can see above, the first two executions are still using the same plan, but they now have a new “child number” value, indicating that for any given SQL_ID, there are now different candidate plan hash values, depending on the bind variable input that we’re getting.

This often works well with the following caveats:

  • It only works after the plan was wrong at least once. This translates to at least one poor user who suffers from a slow query
  • It only works if the plans remain in the cache (which isn’t the case for rarely executed queries). Once the plans are purged from the cache, we have to run the query again at least twice for the alternative “child numbers” to appear
  • It stops working well if your query is much more complex, i.e. when complicated correlations exist between the bind variables that should each produce different plans

So, if in doubt, in this particular case, it’s not a bad idea to simply use an inline value / constant literal in the query directly, to help the optimiser make the right choice.

Another interesting case where the optimiser will always be fooled by bind variables, and where you should always use inline values, is when querying discriminators from views.

Adaptive Execution Plans

Oracle 12c took adaptive cursor sharing one step further and now supports adaptive execution plans, meaning that certain execution plans are known to be “shaky” in advance and the optimiser already pre-calculated a fallback plan that applies if the estimates are wrong for a given execution. In that case, the plan can be changed “in flight” and an alternative plan, which is better for a particular execution, is applied.

This feature is still not too stable in Oracle 12cR1, which is why some Oracle experts generally recommend turning it off.

Conclusion

You should always use bind values by default. In 99% of all cases, they’re the right choice for two reasons:

  • SQL injection prevention (obvious)
  • Statement caching optimisation (less obvious)

The latter reason is one that is not really well known among developers, because it’s not a problem that appears in development environments. It’s a production-only problem that happens under heavy load. Yet, you should be aware of this problem, and always remember to avoid generating too many distinct SQL strings (see also the section about IN lists above).

In rare cases, it is better to use inline values / literals, as this will help the optimiser make a much better choice in a predictable manner. These cases include:

  • Querying skewed data (unless adaptive features can be expected to kick in)
  • Querying discriminators (in this case, it’s always advisable to use inline values)

Using languages like PL/SQL, T-SQL, or PL/pgSQL, or APIs like jOOQ, definitely helps you get this right, because you don’t have to think about this problem anymore. You’ll get it right automatically.

Side-note: Hibernate’s Criteria Query

Like jOOQ, Hibernate supports a type safe DSL for constructing JPQL queries, which to some extent covers basic SQL functionality when querying entities. Hibernate historically chose quite interesting defaults (as of version 5.2.10):

  • String values are always transformed to bind values, regardless of whether you’re using implicit values, explicit parameters, or explicit literals. The goal of this is to prevent SQL injection (because currently, Hibernate doesn’t auto-escape inline string literals)
  • Implicit integer values are always inlined
  • Explicit parameters are always sent as parameters
  • Explicit literals are usually sent as literals (unless they’re strings)

The above defaults are unfortunate, as we’ve seen before, for these reasons:

  • Bind values should always be the default, especially when using ID values, which are integers, and as such, often inlined in the current implementation. Luckily, this has been recognised as a bug and will be fixed, soon: https://hibernate.atlassian.net/browse/HHH-9576
  • Literals should be sent to the server as literals, because when users need literals (e.g. in the above edge cases), they don’t want the API to override this behaviour through bind values. This might also be solved soon: https://hibernate.atlassian.net/browse/HHH-11778
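
To make the distinction between explicit parameters and explicit literals concrete, here’s a sketch using the standard JPA Criteria API (the Actor entity and its firstName attribute are hypothetical):

CriteriaBuilder cb = em.getCriteriaBuilder();
CriteriaQuery<Long> q = cb.createQuery(Long.class);
Root<Actor> a = q.from(Actor.class);
q.select(cb.count(a));

// Explicit parameter: always sent to the server as a bind variable
ParameterExpression<String> p = cb.parameter(String.class, "fn");
q.where(cb.equal(a.get("firstName"), p));
Long count = em.createQuery(q)
               .setParameter("fn", "SUSAN")
               .getSingleResult();

// Explicit literal: requested to be rendered as a literal (though, as
// mentioned above, Hibernate 5.2.10 still binds string literals)
q.where(cb.equal(a.get("firstName"), cb.literal("SUSAN")));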

I’m currently working with the team to remedy these problems, such that the Criteria API won’t inhibit users from a performance perspective.

SQL IN Predicate: With IN List or With Array? Which is Faster?

Hah! Got nerd-sniped again:

http://stackoverflow.com/questions/43099226/how-to-make-jooq-to-use-arrays-in-the-in-clause/43102102

A jOOQ user was wondering why jOOQ would generate an IN list for a predicate like this:

Java

COLUMN.in(1, 2, 3, 4)

SQL

COLUMN in (?, ?, ?, ?)

… when in fact the following predicate could have been generated instead:

COLUMN = any(?::int[])

In the second case, there would have been only one single bind variable instead of 4, and the SQL generation and parsing work would have been “much” less (maybe not for the IN list of size 4, but let’s imagine a list of 50 values).
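
As an aside, this is how the array variant can be bound from JDBC (a sketch, assuming the usual java.sql imports; with setArray(), the ?::int[] cast may not even be needed):

static void filmsByIds(Connection c) throws SQLException {
    try (PreparedStatement ps = c.prepareStatement(
             "SELECT * FROM film WHERE film_id = ANY(?)")) {

        // A single bind variable holds the whole list
        Array ids = c.createArrayOf("integer", new Integer[] { 1, 2, 3, 4 });
        ps.setArray(1, ids);

        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // process row
            }
        }
    }
}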

A disclaimer

First off, a disclaimer: In databases that have a cursor cache / plan cache (e.g. Oracle or SQL Server), you should be careful with long IN lists, because they will probably trigger a hard parse every time you run them, as by the time you run the exact same predicate (with 371 elements in the list) again, the execution plan will have been purged from the cache. So, you cannot really profit from the cache.

I’m aware of this problem, and it will be the topic of another blog post, soon. Let’s stick to PostgreSQL, whose “plan cache” isn’t really that sophisticated.

Measure, don’t guess

The question was about improving the speed of parsing a SQL statement. Parsers are really fast, so parsing shouldn’t be a problem. Generating an execution plan certainly does cost more time, but again, since PostgreSQL’s plan cache isn’t very sophisticated, this won’t play into the issue here. So the question is really:

Is an IN list really that bad in PostgreSQL?

Would an array bind variable be much better?

Since our recent post about benchmarking, we know that we shall never guess, but always measure. I’m again using the Sakila database to run these two queries:

-- IN list
SELECT * 
FROM film 
JOIN film_actor USING (film_id) 
JOIN actor USING (actor_id) 
WHERE film_id IN (?, ?, ?, ?)

-- Array
SELECT * 
FROM film 
JOIN film_actor USING (film_id) 
JOIN actor USING (actor_id) 
WHERE film_id = ANY(?)

Let’s try lists of length 4, first. The benchmark is here:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 1000;
  rec RECORD;
  v_e1 INT := 1;
  v_e2 INT := 2;
  v_e3 INT := 4;
  v_e4 INT := 8;
  v_any_arr INT[] := ARRAY[v_e1, v_e2, v_e3, v_e4];
BEGIN
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT * 
        FROM film 
        JOIN film_actor USING (film_id) 
        JOIN actor USING (actor_id) 
        WHERE film_id IN (v_e1, v_e2, v_e3, v_e4)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', 
      r, (clock_timestamp() - v_ts); 
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT * 
        FROM film 
        JOIN film_actor USING (film_id) 
        JOIN actor USING (actor_id) 
        WHERE film_id = ANY(v_any_arr)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', 
      r, (clock_timestamp() - v_ts); 
  END LOOP;
END$$;

The result being:

INFO:  Run 1, Statement 1: 00:00:00.112195
INFO:  Run 1, Statement 2: 00:00:00.450461
INFO:  Run 2, Statement 1: 00:00:00.109792
INFO:  Run 2, Statement 2: 00:00:00.446518
INFO:  Run 3, Statement 1: 00:00:00.105413
INFO:  Run 3, Statement 2: 00:00:00.44298
INFO:  Run 4, Statement 1: 00:00:00.108249
INFO:  Run 4, Statement 2: 00:00:00.476527
INFO:  Run 5, Statement 1: 00:00:00.120229
INFO:  Run 5, Statement 2: 00:00:00.448214

Interesting. The IN list outperforms the array bind variable every time, by a factor of 4 (which happens to be the size of the array / list!). So, let’s try 8 values. Here are the values and the adapted query 1:

-- values
  v_e1 INT := 1;
  v_e2 INT := 2;
  v_e3 INT := 4;
  v_e4 INT := 8;
  v_e5 INT := 16;
  v_e6 INT := 32;
  v_e7 INT := 64;
  v_e8 INT := 128;
  v_any_arr INT[] := ARRAY[v_e1, v_e2, v_e3, v_e4, v_e5, v_e6, v_e7, v_e8];

-- adapted query 1 ...
        WHERE film_id IN (v_e1, v_e2, v_e3, v_e4, v_e5, v_e6, v_e7, v_e8)
-- ...

The result is still impressive:

INFO:  Run 1, Statement 1: 00:00:00.182646
INFO:  Run 1, Statement 2: 00:00:00.63624
INFO:  Run 2, Statement 1: 00:00:00.184814
INFO:  Run 2, Statement 2: 00:00:00.685976
INFO:  Run 3, Statement 1: 00:00:00.188108
INFO:  Run 3, Statement 2: 00:00:00.634903
INFO:  Run 4, Statement 1: 00:00:00.184933
INFO:  Run 4, Statement 2: 00:00:00.626616
INFO:  Run 5, Statement 1: 00:00:00.185879
INFO:  Run 5, Statement 2: 00:00:00.636723

The IN list query now takes almost 2x as long as before (but not quite 2x), whereas the array query takes only around 1.5x as long. It looks as though arrays become the better choice as their size increases. So, let’s do this! With 32 bind variables in the IN list, or 32 array elements, respectively:

INFO:  Run 1, Statement 1: 00:00:00.905064
INFO:  Run 1, Statement 2: 00:00:00.752819
INFO:  Run 2, Statement 1: 00:00:00.760475
INFO:  Run 2, Statement 2: 00:00:00.758247
INFO:  Run 3, Statement 1: 00:00:00.777667
INFO:  Run 3, Statement 2: 00:00:00.895875
INFO:  Run 4, Statement 1: 00:00:01.308167
INFO:  Run 4, Statement 2: 00:00:00.789537
INFO:  Run 5, Statement 1: 00:00:00.788606
INFO:  Run 5, Statement 2: 00:00:00.776159

Both are about equally fast. 64 bind values!

INFO:  Run 1, Statement 1: 00:00:00.915069
INFO:  Run 1, Statement 2: 00:00:01.058966
INFO:  Run 2, Statement 1: 00:00:00.951488
INFO:  Run 2, Statement 2: 00:00:00.906285
INFO:  Run 3, Statement 1: 00:00:00.907489
INFO:  Run 3, Statement 2: 00:00:00.892393
INFO:  Run 4, Statement 1: 00:00:00.900424
INFO:  Run 4, Statement 2: 00:00:00.903447
INFO:  Run 5, Statement 1: 00:00:00.961805
INFO:  Run 5, Statement 2: 00:00:00.951697

Still about the same. OK… INTERN! Get over here. I need you to “generate” 128 bind values on this query.

Yep, as expected. Finally, arrays start to outperform IN lists:

INFO:  Run 1, Statement 1: 00:00:01.122866
INFO:  Run 1, Statement 2: 00:00:01.083816
INFO:  Run 2, Statement 1: 00:00:01.416469
INFO:  Run 2, Statement 2: 00:00:01.134882
INFO:  Run 3, Statement 1: 00:00:01.122723
INFO:  Run 3, Statement 2: 00:00:01.087755
INFO:  Run 4, Statement 1: 00:00:01.143148
INFO:  Run 4, Statement 2: 00:00:01.124902
INFO:  Run 5, Statement 1: 00:00:01.236722
INFO:  Run 5, Statement 2: 00:00:01.113741

Using Oracle

Oracle also has array types (although you have to declare them as nominal types first, which is not a problem here).
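
As a sketch (the NUMBERS type name is an assumption, and createOracleArray() requires a recent ojdbc driver), the nominal type and the binding could look like this:

// Created once in the schema:
// CREATE TYPE numbers AS TABLE OF NUMBER;

Array ids = connection.unwrap(OracleConnection.class)
    .createOracleArray("NUMBERS", new int[] { 1, 2, 4, 8 });

try (PreparedStatement stmt = connection.prepareStatement(
    "SELECT * FROM film WHERE film_id IN (SELECT column_value FROM TABLE(?))")) {
    stmt.setArray(1, ids);
    // ... execute the query and consume the ResultSet
}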

Here are some benchmark results (as always, these aren’t actual benchmark results, but anonymised units of measurement, i.e. these aren’t seconds but… Larrys):

4 bind values

Run 1, Statement 1 : 01.911000000
Run 1, Statement 2 : 02.852000000
Run 2, Statement 1 : 01.659000000
Run 2, Statement 2 : 02.680000000
Run 3, Statement 1 : 01.628000000
Run 3, Statement 2 : 02.664000000
Run 4, Statement 1 : 01.629000000
Run 4, Statement 2 : 02.657000000
Run 5, Statement 1 : 01.636000000
Run 5, Statement 2 : 02.688000000

128 bind values

Run 1, Statement 1 : 04.010000000
Run 1, Statement 2 : 06.275000000
Run 2, Statement 1 : 03.749000000
Run 2, Statement 2 : 05.440000000
Run 3, Statement 1 : 03.985000000
Run 3, Statement 2 : 05.387000000
Run 4, Statement 1 : 03.807000000
Run 4, Statement 2 : 05.688000000
Run 5, Statement 1 : 03.782000000
Run 5, Statement 2 : 05.803000000

The number of bind values doesn’t really seem to matter. There’s always a constant overhead of using the array bind variable compared to the IN list, but that might just as well be a benchmarking error. For instance, when I added the /*+GATHER_PLAN_STATISTICS*/ hint to both queries, interestingly, the one with the array got significantly faster, whereas the IN list one was not affected… Weird? An explanation can be found in the comments, specifically in Kim Berg Hansen’s comment:

It looks like the default cardinality of the collection is assumed by the optimizer to be 8168 (at least in my 12.1.0.2 db). So probably full scan of actor with hash join.

Indeed, the TABLE() constructor (in this case) always seems to yield a constant cardinality estimate of 8168, despite the array containing far fewer elements. So, hinting approximate cardinalities might help here, to get nested loop joins for small arrays.
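
For example, the (undocumented, so use with care) CARDINALITY hint can override that estimate; a sketch, assuming an expected array size of about 4:

// Tell the optimiser to assume ~4 rows in the collection,
// instead of the default estimate of 8168:
PreparedStatement stmt = connection.prepareStatement(
    "SELECT * FROM film WHERE film_id IN (" +
    "SELECT /*+CARDINALITY(t 4)*/ t.column_value FROM TABLE(?) t)");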

Conclusion

This article doesn’t go into why there’s such a big difference in favour of IN lists for small lists, when the array’s benefit only becomes apparent for quite large lists.

But it has once again shown that we must not optimise prematurely in SQL, but measure, measure, measure things. IN lists in dynamic SQL queries can be a big issue in production when they lead to cursor cache / plan cache saturation and a lot of “hard parsing”. So, the benefit of using the array is much more drastic when the list is big, as we can recycle execution plans much more often than with IN lists.

But chances are that IN lists may be faster for single executions.

In any case: choose carefully when following advice that you find somewhere on the Internet, including this advice. I ran the benchmark on PostgreSQL 9.5 and Oracle 11gR2 XE. Neither is the latest database version. Try to measure things again on your side, to be sure that your “improvement” really is an actual improvement! And if in doubt, don’t optimise until you’re sure you actually have a problem.

Are You Binding Your Oracle DATEs Correctly? I Bet You Aren’t

Oracle database has its ways. In my SQL talks at conferences, I love to confuse people with the following Oracle facts:

[slide: sql-trivia-1]

… and the answer is, of course:

[slide: sql-trivia-2]

Isn’t it horrible to make empty string the same thing as NULL? Please, Oracle…

The only actually reasonable slide to follow the previous two is this one:

[slide: sql-trivia-3]

But the DATE type is much more subtle

So you think VARCHAR2 is weird?

Well, we all know that Oracle’s DATE is not really a date as in the SQL standard, or as in all the other databases, or as in java.sql.Date. Oracle’s DATE type is really a TIMESTAMP(0), i.e. a timestamp with a fractional second precision of zero.

Most legacy databases actually use DATE precisely for that, to store timestamps with no fractional seconds, such as:

  • 1970-01-01 00:00:00
  • 2000-02-20 20:00:20
  • 1337-01-01 13:37:00

So, it’s always a safe bet to use java.sql.Timestamp types in Java, when you’re operating with Oracle DATE.

But things can go very wrong when you bind such variables via JDBC, as can be seen in this Stack Overflow question. Let’s assume you have a range predicate like so:

// execute_at is of type DATE and there's an index
PreparedStatement stmt = connection.prepareStatement(
    "SELECT * " + 
    "FROM my_table " +
    "WHERE execute_at > ? AND execute_at < ?");

Now, naturally, we’d expect any index on execute_at to be a sensible choice for filtering out records from my_table, and that’s also what happens when we bind java.sql.Date:

stmt.setDate(1, start);
stmt.setDate(2, end);

The execution plan is optimal:

-----------------------------------------------------
| Id  | Operation                    | Name         |
-----------------------------------------------------
|   0 | SELECT STATEMENT             |              |
|*  1 |  FILTER                      |              |
|   2 |   TABLE ACCESS BY INDEX ROWID| my_table     |
|*  3 |    INDEX RANGE SCAN          | my_index     |
-----------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(:1<:2)
   3 - access("EXECUTE_AT">:1 AND "EXECUTE_AT"<:2)

But let’s check out what happens if we assume execute_at to be a date with hours/minutes/seconds, i.e. an Oracle DATE. We’ll be binding java.sql.Timestamp:

stmt.setTimestamp(1, start);
stmt.setTimestamp(2, end);

and the execution plan suddenly becomes very bad:

-----------------------------------------------------
| Id  | Operation                    | Name         |
-----------------------------------------------------
|   0 | SELECT STATEMENT             |              |
|*  1 |  FILTER                      |              |
|   2 |   TABLE ACCESS BY INDEX ROWID| my_table     |
|*  3 |    INDEX FULL SCAN           | my_index     |
-----------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(:1<:2)
   3 - filter(INTERNAL_FUNCTION("EXECUTE_AT")>:1 AND 
              INTERNAL_FUNCTION("EXECUTE_AT")<:2)

What’s this INTERNAL_FUNCTION()?

INTERNAL_FUNCTION() is Oracle’s way of silently converting values into other values in ways that are completely opaque. In fact, you cannot even place a function-based index on this pseudo-function to help the database choose a RANGE SCAN on it again. The following is not possible:

CREATE INDEX oracle_why_oh_why
  ON my_table(INTERNAL_FUNCTION(execute_at));

Nope. What the function really does is widen the less precise DATE type of the execute_at column to the more precise TIMESTAMP type of the bind variable. Just in case.

Why? Because with exclusive range boundaries (> and <), the fractional seconds in your Timestamp can make a difference to the comparison. For example, a DATE value of 2000-01-01 00:00:00 is strictly less than a Timestamp bound of 2000-01-01 00:00:00.5, but it is not strictly less than that same bound truncated to a DATE (i.e. 2000-01-01 00:00:00). So, Oracle cannot simply strip the bind variable’s fractional seconds without potentially changing the query’s results.

Duh. But we don’t care, we’re only using Timestamp as a silly workaround in the first place! Now that we know this, you might think that adding a function-based index on an explicit conversion would work, but that’s not the case either:

CREATE INDEX nope
  ON my_table(CAST(execute_at AS TIMESTAMP));

Perhaps, if you magically found the exact right precision of the implicitly used TIMESTAMP(n) type, it could work, but that all feels shaky, and besides, I don’t want a second index on the same column!

The solution

The solution given by user APC is actually very simple (and it sucks). Again, you could bind a java.sql.Date, but that would make you lose all hour/minute/second information. No, you have to explicitly cast the bind variable to DATE in the database. Exactly!

PreparedStatement stmt = connection.prepareStatement(
    "SELECT * " + 
    "FROM my_table " +
    "WHERE execute_at > CAST(? AS DATE) " +
    "AND execute_at < CAST(? AS DATE)");

You have to do that every time you bind a java.sql.Timestamp variable to an Oracle DATE value, at least when used in predicates.

How to implement that?

If you’re using JDBC directly, you’re pretty much doomed. Of course, you could run AWR reports to find the worst statements in production and fix only those, but chances are that you won’t be able to fix your statements easily and deploy them quickly, so you might want to get it right in advance. And of course, this is production. Tomorrow, another statement will suddenly pop up in your DBA’s reports.
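
If you do want to centralise the workaround in plain JDBC, about the best you can do is a trivial helper like the following (hypothetical) one, which you still have to remember to call everywhere:

// Hypothetical helper wrapping a DATE-bound placeholder in the required cast:
static String dateBind() {
    return "CAST(? AS DATE)";
}

PreparedStatement stmt = connection.prepareStatement(
    "SELECT * FROM my_table " +
    "WHERE execute_at > " + dateBind() + " " +
    "AND execute_at < " + dateBind());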

If you’re using JPA / Hibernate, you can only hope that they got it right, because you probably won’t be able to fix those queries otherwise.

If you’re using jOOQ 3.5 or later, you can take advantage of jOOQ’s new custom type binding feature, which works out of the box with Oracle, and transparently renders that CAST(? AS DATE) for you, only on those columns that are really relevant.


Other databases

If you think that this is an Oracle issue, think again. Oracle is actually very lenient and nice to use when it comes to bind variables. Oracle can infer a lot of types for your bind variables, such that casting is almost never necessary. With other databases, that’s a different story. Read our article about RDBMS bind variable casting madness for more information.