When Using Bind Variables is not Enough: Dynamic IN Lists

In a previous blog post, I wrote about why you should (almost) always default to using bind variables. There are some exceptions, which I will cover in another follow-up post, but by default, bind variables are the right choice, both from a performance and from a security perspective.

In this article, I will show an example where regrettably, bind variables are not enough, and you can still run into significant performance issues in production. That’s when you create dynamic IN lists.

What’s the problem with dynamic IN lists

Again, this blog post is applicable for most databases, but I’ll use Oracle as an example, only.

An IN predicate that takes a list is a handy abbreviation for an equivalent OR predicate. The following two queries are semantically equivalent:

-- IN predicate
SELECT *
FROM actor
WHERE actor_id IN (1, 2, 3)

-- OR predicate
SELECT *
FROM actor
WHERE actor_id = 1
OR actor_id = 2
OR actor_id = 3

The execution plans are, respectively:

IN list

---------------------------------------------------------
| Id  | Operation                    | Name     | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT             |          |     3 |
|   1 |  INLIST ITERATOR             |          |       |
|   2 |   TABLE ACCESS BY INDEX ROWID| ACTOR    |     3 |
|*  3 |    INDEX UNIQUE SCAN         | PK_ACTOR |     3 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("ACTOR_ID"=1 OR "ACTOR_ID"=2 OR "ACTOR_ID"=3)

OR predicate

---------------------------------------------------------
| Id  | Operation                    | Name     | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT             |          |     3 |
|   1 |  INLIST ITERATOR             |          |       |
|   2 |   TABLE ACCESS BY INDEX ROWID| ACTOR    |     3 |
|*  3 |    INDEX UNIQUE SCAN         | PK_ACTOR |     3 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("ACTOR_ID"=1 OR "ACTOR_ID"=2 OR "ACTOR_ID"=3)

We get the exact same plan. And the plan is also optimal. When looking for 3 actors, we’ll run 3 primary key lookups and get a cardinality estimate of 3. Perfect, what could be wrong about this?

Remember execution plan caching

In that previous blog post, we’ve seen the detrimental effects we can get under load when execution plans are no longer cached in the execution plan cache (or in Oracle: cursor cache). But even when we’re using bind variables, we can run into these issues. Consider the following dynamic SQL queries:

SELECT count(*) FROM actor WHERE actor_id IN (?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?)

Now, let’s generate these SQL queries using dynamic SQL in PL/SQL. PL/SQL is not a very dynamic SQL friendly language as can be seen in the following snippet:

SET SERVEROUTPUT ON
DECLARE
  v_count NUMBER;
  
  FUNCTION binds (n NUMBER) RETURN VARCHAR2 IS
    v_result VARCHAR2(1000);
  BEGIN
    FOR i IN 1..n LOOP
      IF v_result IS NULL THEN
        v_result := ':b' || i;
      ELSE
        v_result := v_result || ', :b' || i;
      END IF;
    END LOOP;
    
    RETURN v_result;
  END binds;
  
  FUNCTION vals (n NUMBER) RETURN VARCHAR2 IS
    v_result VARCHAR2(1000);
  BEGIN
    FOR i IN 1..n LOOP
      IF v_result IS NULL THEN
        v_result := i;
      ELSE
        v_result := v_result || ', ' || i;
      END IF;
    END LOOP;
    
    RETURN v_result;
  END vals;
BEGIN
  FOR i IN 1..5 LOOP
    EXECUTE IMMEDIATE '
      DECLARE 
        v_count NUMBER;
      BEGIN
        EXECUTE IMMEDIATE ''
          SELECT count(*)
          FROM actor
          WHERE actor_id IN (' || binds(i) || ')
        ''
        INTO v_count
        USING ' || vals(i) || ';
        
        :v_count := v_count;
      END;
    '
    USING OUT v_count
    ;
    
    dbms_output.put_line(v_count);
  END LOOP;
END;
/

Running this snippet yields, as expected:

1
2
3
4
5

There are a few noteworthy things to say about the above:

  • The functions BINDS() and VALS() generate strings like :b1, :b2, :b3 and 1, 2, 3. We need these to generate dynamically a list of bind variables for the IN predicates, as well as a list of bind values for the USING clause of the nested EXECUTE IMMEDIATE statement
  • The EXECUTE IMMEDIATE .. USING clause does not take dynamically sized bind value lists, so we have to generate that as well, in a nested EXECUTE IMMEDIATE statement

An appropriate gif at this point when nesting EXECUTE IMMEDIATE like we did is this one:

Or also:

So, now that we got this covered…

… let’s look at execution plans:

SELECT s.sql_id, p.*
FROM v$sql s, TABLE (
  dbms_xplan.display_cursor (
    s.sql_id, s.child_number, 'ALLSTATS LAST'
  )
) p
WHERE lower(s.sql_text) LIKE '%actor_id in %'
AND lower(s.sql_text) NOT LIKE '%v$sql%'
AND lower(s.sql_text) NOT LIKE '%execute_immediate%';

And the result is (some additional formatting applied):

SQL_ID  cwm09k8zqp2at, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1)
 
Plan hash value: 1971314150
 
------------------------------------------------
| Id  | Operation          | Name     | E-Rows |
------------------------------------------------
|   0 | SELECT STATEMENT   |          |        |
|   1 |  SORT AGGREGATE    |          |      1 |
|*  2 |   INDEX UNIQUE SCAN| PK_ACTOR |      1 |
------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("ACTOR_ID"=:B1)
 
SQL_ID  1wypurx1x0mrp, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2)
 
Plan hash value: 577114894
 
-------------------------------------------------
| Id  | Operation           | Name     | E-Rows |
-------------------------------------------------
|   0 | SELECT STATEMENT    |          |        |
|   1 |  SORT AGGREGATE     |          |      1 |
|   2 |   INLIST ITERATOR   |          |        |
|*  3 |    INDEX UNIQUE SCAN| PK_ACTOR |      2 |
-------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2))

SQL_ID  2y06nwgpxqttn, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3)
 
Plan hash value: 577114894
 
-------------------------------------------------
| Id  | Operation           | Name     | E-Rows |
-------------------------------------------------
|   0 | SELECT STATEMENT    |          |        |
|   1 |  SORT AGGREGATE     |          |      1 |
|   2 |   INLIST ITERATOR   |          |        |
|*  3 |    INDEX UNIQUE SCAN| PK_ACTOR |      3 |
-------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2 OR "ACTOR_ID"=:B3))

SQL_ID  d4nn8qf9n22yt, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3, :b4)
 
Plan hash value: 577114894
 
-------------------------------------------------
| Id  | Operation           | Name     | E-Rows |
-------------------------------------------------
|   0 | SELECT STATEMENT    |          |        |
|   1 |  SORT AGGREGATE     |          |      1 |
|   2 |   INLIST ITERATOR   |          |        |
|*  3 |    INDEX UNIQUE SCAN| PK_ACTOR |      4 |
-------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2 OR "ACTOR_ID"=:B3 OR 
              "ACTOR_ID"=:B4))

SQL_ID  4n9b4zgxr5cwj, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3, :b4, :b5)
 
Plan hash value: 577114894
 
-------------------------------------------------
| Id  | Operation           | Name     | E-Rows |
-------------------------------------------------
|   0 | SELECT STATEMENT    |          |        |
|   1 |  SORT AGGREGATE     |          |      1 |
|   2 |   INLIST ITERATOR   |          |        |
|*  3 |    INDEX UNIQUE SCAN| PK_ACTOR |      5 |
-------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2 OR "ACTOR_ID"=:B3 OR 
              "ACTOR_ID"=:B4 OR "ACTOR_ID"=:B5))

So, again, we’re down to having 5 different SQL_IDs and statements in the execution plan cache / cursor cache, even if they all derive from a single logical statement. Now, in PL/SQL, this is so hard to do, you’re probably not doing it at all.

But if you’re using a query builder (or wrote your own), like JPA’s Criteria Query, or jOOQ, then you will be writing something like this (using jOOQ, because in fact, using the Criteria Query API, dynamic SQL is about as hard to do as in PL/SQL):

for (int i = 1; i <= 5; i++) {
    System.out.println(
        DSL.using(configuration)
           .select(count())
           .from(ACTOR)
           .where(ACTOR.ACTOR_ID.in(
                IntStream.rangeClosed(1, i)
                         .boxed()
                         .collect(toList())
           )
           .fetchOne(count())
    );
}

So this query will generate the same queries as the PL/SQL code, and this really wasn’t that hard. Perhaps, in this particular case, it’s not that much of a good thing that this isn’t so hard to do!

Let’s benchmark this, again

There are different ways to benchmark this problem. One mistake we could make here is to just benchmark small IN lists with large ones, but that’s not fair. If you have to send a large IN list to the server, you cannot easily change that.

Another option is to benchmark alternative approaches of generating those IN lists

  • The default way is to have a list of exactly size N
  • An alternative way would be to pad the list to a size M where M = 2 ^ X such that M(X) >= N and M(X - 1) < N

The latter approach would, instead of generating the original queries:

SELECT count(*) FROM actor WHERE actor_id IN (?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?)

… generate these ones:

SELECT count(*) FROM actor WHERE actor_id IN (?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?, ?, ?, ?)

Where the last bind variable is always repeated (padded) until the list is filled up to a length that is a power of 2 (and in Oracle, of course, we’d split lists longer than 1000 items into several lists).

This padding is what jOOQ already offers:
https://www.jooq.org/doc/latest/manual/sql-building/dsl-context/custom-settings/settings-in-list-padding/

And maybe, a future Hibernate will offer it too, after I’ve suggested the feature to Vlad Mihalcea from the Hibernate team:
https://hibernate.atlassian.net/browse/HHH-12469

Here’s the benchmark logic. It’s a bit more complex than previously:

SET SERVEROUTPUT ON

ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;

CREATE TABLE results (
  run     NUMBER(2),
  stmt    NUMBER(2),
  elapsed NUMBER
);

DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 20000;
  v_first_name actor.first_name%TYPE;
  v_last_name  actor.last_name%TYPE;
  
  FUNCTION binds (n NUMBER, pad BOOLEAN) RETURN VARCHAR2 IS
    v_result VARCHAR2(32767);
    v_n_pad  NUMBER(10) := power(2, ceil(ln(n)/ln(2)));
  BEGIN
    FOR i IN 1..n LOOP
      IF v_result IS NULL THEN
        v_result := ':b' || i;
      ELSIF mod(i, 1000) = 0 THEN
        v_result := v_result || ') OR actor_id IN (:b' || i;
      ELSE
        v_result := v_result || ', :b' || i;
      END IF;
    END LOOP;
    
    IF pad THEN
      FOR i IN n + 1 .. v_n_pad LOOP
        IF mod(i, 1000) = 0 THEN
          v_result := v_result || ') OR actor_id IN (:b' || i;
        ELSE
          v_result := v_result || ', :b' || i;
        END IF;
      END LOOP;
    END IF;
    
    RETURN v_result;
  END binds;
  
  FUNCTION vals (n NUMBER, pad BOOLEAN) RETURN VARCHAR2 IS
    v_result VARCHAR2(32767);
    v_n_pad  NUMBER(10) := power(2, ceil(ln(n)/ln(2)));
  BEGIN
    FOR i IN 1..n LOOP
      IF v_result IS NULL THEN
        v_result := i;
      ELSE
        v_result := v_result || ', ' || i;
      END IF;
    END LOOP;
    
    IF pad THEN
      FOR i IN n + 1 .. v_n_pad LOOP
        v_result := v_result || ', ' || n;
      END LOOP;
    END IF;
    
    RETURN v_result;
  END vals;
  
  PROCEDURE run(i NUMBER, pad BOOLEAN) IS
  BEGIN
    EXECUTE IMMEDIATE '
      DECLARE 
        v_count NUMBER;
      BEGIN
        EXECUTE IMMEDIATE ''
          SELECT count(*)
          FROM actor
          WHERE actor_id IN (' || binds(i, pad) || ')
        ''
        INTO v_count
        USING ' || vals(i, pad) || ';
      END;
    ';
  END run;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      run(mod(i, 100) + 1, FALSE);
    END LOOP;
  
    INSERT INTO results VALUES (r, 1, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      run(mod(i, 100) + 1, TRUE);
    END LOOP;
  
    INSERT INTO results VALUES (r, 2, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
  END LOOP;
  
  FOR rec IN (
    SELECT 
      run, stmt, 
      CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio 
    FROM results
  )
  LOOP
    dbms_output.put_line('Run ' || rec.run || 
      ', Statement ' || rec.stmt || 
      ' : ' || rec.ratio);
  END LOOP;
END;
/

DROP TABLE results;

Bear in mind the following:

  • We run the benchmark 5 times
  • In each run, we run 20000 queries with each technique
  • Those 20000 queries have between 1 and 100 elements in the IN list

The results show that both approaches fare equally well:

Run 1, Statement 1 : 1.16944
Run 1, Statement 2 : 1.11204
Run 2, Statement 1 : 1.06086
Run 2, Statement 2 : 1.03591
Run 3, Statement 1 : 1.03589
Run 3, Statement 2 : 1.03605
Run 4, Statement 1 : 1.33935
Run 4, Statement 2 : 1.2822
Run 5, Statement 1 : 1
Run 5, Statement 2 : 1.04648

This means that there is no significant overhead caused by the padding in this benchmark. That’s good news.

Running the following query shows the number of executions per statement:

SELECT executions, regexp_count(sql_text, ':'), sql_text
FROM v$sql
WHERE lower(sql_text) LIKE '%actor_id in %'
AND lower(sql_text) NOT LIKE '%v$sql%'
AND lower(sql_text) NOT LIKE '%execute_immediate%'
ORDER BY sql_text;

The result being:

EXECS   BINDS   SQL_TEXT
2000	1       SELECT count(*) FROM actor WHERE actor_id IN (:b1)         
2000	2       SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2)         
1000	3       SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3)         
3000	4       ...
1000	5       ...
1000	6       ...
1000	7       ...
5000	8       ...
1000	9       ...
1000	10      ...
1000	11      ...
1000	12      ...
1000	13	...
1000	14	...
1000	15	...
9000	16	...
...
1000	100	...
36000	128	...

The version that creates exactly sized IN lists is always executed 1000 times. The version that has padded IN lists sized to the next power of 2 (M = 2 ^ X) is executed 1000 * X times (plus another 1000 times because there’s also an IN list of that exact size)

At the end, there are some “lone” 128-element sized IN lists. This is expected. The number of different SQL statements is:

  • 100 for the exactly sized IN lists
  • 7 for the padded IN lists (128 = 2 ^ 7)

So, thus far, we haven’t gained anything, but also not lost much.

What if the lists get longer

The real problem isn’t the individual query execution time, which is fine in both cases, but the effect that dynamic IN lists have on a system. Let’s generate IN lists of up to size 4096, and bearing in mind that Oracle won’t allow this, the lists will be of the form:

SELECT count(*)
FROM actor
WHERE actor_id IN (:b1, :b2, ..., :b1000)
OR actor_id IN (:b1001, :b1002, ..., :b2000)
OR actor_id IN (:b2001, ...)

Now, with a maximum size of 4096, there are quite a few distinct statements that need to be parsed. As opposed to only 12 when we use the padded in lists. The padded size grows logarithmically, so we’ll never really run into this same problem again, at least not for a single IN list.

How does the comparison fare in another benchmark? We must be careful not to overdo things. Running 2x 20000 queries with variable sized IN lists can easily run for more than an hour on Oracle. Perhaps, it’s Oracle’s way of punishing you for doing something you shouldn’t: Having super-long IN lists. I’ve played around with different sets of parameters to the benchmark and the clear message here is that having long bind variable lists creates a bottleneck per se (i.e. the binding of the variables takes time), so the difference in parse times isn’t as drastic as it was before, for individual query executions. The overall improvement is this:

Run 1, Statement 1 : 1.42696
Run 1, Statement 2 : 1

So, we get around a 1.5x increase of performance when repeating the bind values. Yet still, the fact that we have far less cached execution plans does matter, especially when we start exhausting the cursor cache, because with this approach, at least this exhaustion is not going to happen.

Alternative options

The above shows that if you really cannot get around an IN list, a simple, viable workaround is to pad the lists. But often, you can get around them. In databases like Oracle and PostgreSQL, that offer array types, you can consider using those once your lists start growing. I’ve blogged about this in a previous blog post:
https://blog.jooq.org/2017/03/30/sql-in-predicate-with-in-list-or-with-array-which-is-faster

If your lists become huge, a bulk/batch inserting the IDs into a temporary table might be a far better option, and then semi-join that one instead:

SELECT count(*)
FROM actor
WHERE actor_id IN (
  SELECT id FROM temp_table
)

And ideally, if your list is really not dynamic, but originates from some other table, then please, stop passing around IDs and semi-join the original query that produced the IDs in the first place:

SELECT count(*)
FROM actor
WHERE actor_id IN (
  SELECT id FROM other_table WHERE ...
)

That last option should always be your first choice as it will very likely outperform all the others.

Map Reducing a Set of Values Into a Dynamic SQL UNION Query

Sounds fancy, right? But it’s a really nice and reasonable approach to doing dynamic SQL with jOOQ.

This blog post is inspired by a Stack Overflow question, where a user wanted to turn a set of values into a dynamic UNION query like this:

SELECT T.COL1
FROM T
WHERE T.COL2 = 'V1'
UNION
SELECT T.COL1
FROM T
WHERE T.COL2 = 'V2'
...
UNION
SELECT T.COL1
FROM T
WHERE T.COL2 = 'VN'

Note, both the Stack Overflow user and I are well aware of the possibility of using IN predicates :-), let’s just assume for the sake of argument, that the UNION query indeed outperforms the IN predicate in the user’s particular MySQL version and database. If this cannot be accepted, just imagine a more complex use case.

The solution in Java is really very simple:

import static org.jooq.impl.DSL.*;
import java.util.*;
import org.jooq.*;

public class Unions {
    public static void main(String[] args) {
        List<String> list = Arrays.asList("V1", "V2", "V3", "V4");

        System.out.println(
            list.stream()
                .map(Unions::query)
                .reduce(Select::union));
    }

    // Dynamically construct a query from an input string
    private static Select<Record1<String>> query(String s) {
        return select(T.COL1).from(T).where(T.COL2.eq(s));
    }
}

The output is:

Optional[(
  select T.COL1
  from T
  where T.COL2 = 'V1'
)
union (
  select T.COL1
  from T
  where T.COL2 = 'V2'
)
union (
  select T.COL1
  from T
  where T.COL2 = 'V3'
)
union (
  select T.COL1
  from T
  where T.COL2 = 'V4'
)]

If you’re using JDK 9+ (which has Optional.stream()), you can further proceed to running the query fluently as follows:

List<String> list = Arrays.asList("V1", "V2", "V3", "V4");

try (Stream<Record1<String>> stream = list.stream()
    .map(Unions::query)
    .reduce(Select::union))
    .stream() // Optional.stream()!
    .flatMap(Select::fetchStream)) {
    ...
}

This way, if the list is empty, reduce will return an empty optional. Streaming that empty optional will result in not fetching any results from the database.

5 Things You May Not Have Known About jOOQ

jOOQ has been around for a while now (since 2009!) and by now we can say we’ve seen quite a bit of things about the SQL and Java languages. Some of our design decisions are particular in the way jOOQ thinks about programming with SQL. These include:

  • Nullability (let’s stop fighting it)
  • Value types (let’s stop pretending SQL has identities)
  • Everything is a table (this really helps get the most out of SQL)
  • Queries are side-effect free functions

jOOQ incorporates all of these ideas. Here are 5 Things You May Not Have Known About jOOQ:

1. Every Column Type is Nullable

SQL NULL is a subtly different beast from Java null, even if programmers often use it for the same thing: Something that is “uninitialised”, some value that we don’t “care about” (yet), or some value that we “don’t need”. A good example would be a middle name:

CREATE TABLE person (
  first_name  VARCHAR(50) NOT NULL,
  middle_name VARCHAR(50),
  last_name   VARCHAR(50) NOT NULL,
  ..
);

Of course, a sufficiently pessimistic programmer will immediately see tons of flaws with the above design. Go read this article about falsehoods programmers believe about names for details.

But anyway, the important thing to understand about NOT NULL constraints in SQL is the fact that they’re… constaints. Just like UNIQUE constraints, FOREIGN KEY constraints, or CHECK constraints.

In fact, they are CHECK constraints. We could rewrite the above table as such:

CREATE TABLE person (
  first_name  VARCHAR(50) CHECK (first_name IS NOT NULL),
  middle_name VARCHAR(50),
  last_name   VARCHAR(50) CHECK (last_name IS NOT NULL),
  ..
);

… and the table would be semantically equivalent. This constraint is just so common that it has a special syntax for it (which is also sometimes better optimised than the equivalent check constraint).

Sidenote: An even more sophisticated constraint type is the SQL standard assertion, which unfortunately hasn’t been implemented in any database I’m aware of yet. There are discussions of adding it to a future Oracle version, though. Assertions are like CHECK constraints, but they work on the entire table / schema / whatever scope. For instance, we could assert that every department of a company must have at least one manager. Currently, we can do this sort of thing only through triggers.

The important message here is that a constraint is a validation on the entire data set (or on a subset, down to an individual row). It is not a type modifier, because even if the NOT NULL constraint may have direct optimisation implications on the column type it is attached to, it is a separate construct that can even be deferred. While languages don’t have to be this way (e.g. Ceylon models constraints directly on types), SQL works like this.

Two examples:

  1. DEFAULT columns: When you have an identity column or some sort of GENERATED BY DEFAULT AS... clause on your column, then the value in the column may be generated by default (duh), which may include – depending on the RDBMS vendor – the generation of a value when it is NULL.
  2. DEFERRED constraints: Some databases (e.g. PostgreSQL) support deferred constraints, i.e. constraints that are validated only when the transaction is committed. This can be specified on individual constraints, or on the session. Which means that the value NULL is a totally acceptable value for a NOT NULL column for a certain amount of time.

Both of the above imply that we must not take NOT NULL as a type modifier, the way some languages have started doing it, like Ceylon or Kotlin:

val a: String? = null;
val b: String = a; // Error

In such languages, String? and String are distinct types, specifically in Ceylon where String? is just syntax sugar for the union type String|Null.

But not in SQL. If a Java API wants to properly reflect the SQL language the way jOOQ does, then all types must be nullable. It is a mistake to:

  • Use primitive types
  • Use Option(al) (there are other caveats with these related to generic type erasure)
  • Use non-null types in languages that have them
  • Use validation annotations (we made that mistake, unfortunately)
  • Use JSR-305 or JSR-308 annotations

Sidenote: If this constraint information should be annotated in Java classes, then JPA @Column(nullable=true) annotations are acceptable, because they simply map to the constraint without any implications on the type. The implications are applied on the persistence behaviour, which is reasonable.

Besides, even if at first, encoding nullability through e.g. Option(al) seems reasonable, it breaks as soon as you outer join anything, e.g.:

SELECT p.*
FROM dual
LEFT JOIN person p
ON p.first_name = 'Ooops, no one by that name'

The above query will produce a single person record with only NULL values in its columns. DESPITE the NOT NULL constraints. Ooops. We’ll get null in non-optional types.

Similar things can happen with unions, grouping sets, functions, and a few other operations.

Takeaway

In SQL, all types are always nullable. We simply have to deal with this. Every clever type safety is contrary to SQL logic. If your API does this, you may get some minor convenience in 80% of the use-cases for the price of a major annoyance in 20% of the use-cases. That’s not a reasonable tradeoff given that in Java, every non-primitive type is nullable as well, so we got a perfect and intuitive match.

2. SQL is a Set-Based, Values-Only Language

Values or Objects? That’s a tricky question for people who work with Java, a language that claims to be mainly object-oriented. Java has value support as well. There are 8 different value types as of Java 8:

  • byte
  • short
  • int
  • long
  • float
  • double
  • boolean
  • char

Values have a couple of nice properties:

  • They are immutable. It may be possible to mutate a variable holding such a value, but we cannot mutate the value itself. 42 will always stay 42
  • Two values that are equal are undistinguishable. 42 == 42 really means that they’re the exact same thing. Reusing == for value equality and identity equality has been a bit of an unfortunate choice in Java, because technically, a String is also a value, yet we cannot compare it with == because there’s a possibility of two identical strings having different identity. (True) values don’t have identity.

Java 8 introduced the notion of a “ValueBased” class, which is really a weird thing, because a “ValueBased” wrapper like Optional can reference a non-value based type, say, a java.sql.Connection. Not a good idea, but certainly possible.

A future Java might have more complex value types, for instance:

// Hypothetical syntax
value Point(int x, int y) {}
value Box(Point a, Point b) {
  int area() {
    return Math.abs(a.x - b.x * a.y - b.y);
  }
}

This will certainly be helpful (as soon as we’ll figure out how to model nullability in such scenarios).

In SQL, all records are values. They do not have a true identity (although most databases choose to provide implementation specific identities like ROWIDs). Do not confuse primary keys with identity descriptors. A primary key is a special value that is guaranteed to be unique within a table. It happens to be used as a logical identity (at least when using surrogate keys). But as NOT NULL constraints, PRIMARY KEY constraints are constraints, and they’re deferrable in some databases.

And there are many ways how we can produce results where primary keys are no longer meaningful, e.g. this:

SELECT * FROM person
UNION ALL
SELECT * FROM person

SQL, unlike relational algebra, doesn’t operate on sets but on bags (or multisets), i.e. data structures that allow for duplicate values. Multisets make analytics much more powerful, while making OLTP quite harder. As always, with useful things, they come at a price.

jOOQ, by consequence, also works in the value-oriented multi set paradigm. This is completely contrary to what Hibernate / JPA does, as Hibernate emulates entity identity through the primary key, which it has to do, being an object-graph persistence API. It doesn’t have to do this because of working with sets rather than multisets, although having identities does make things easier in that paradigm. If you want to read an interesting and entertaining discussion on the subject, check out these tweets between Gavin King and myself:

The importance here is to understand: Neither approach is absolutely better. Both have their advantages. If a RDBMS vendor had implemented a database following a set-based approach instead of SQL’s multiset-based approach, a lot of persistence problems would have been much easier to implement on that RDBMS. On the other hand, a lot of reporting and analytics would have been harder, because with sets being sets, we’d have to constantly prevent “duplicates” from being removed early by keeping primary keys around in queries until the final aggregation.

Now even if we could re-start this interesting discussion, fact is, that we have SQL and it is multiset-based. The default is SELECT "ALL", not SELECT DISTINCT (the ALL keyword being specified in the standard, but not available in most implementations).

When using jOOQ, a value-based record-centric programming approach is recommended, where result sets from jOOQ queries are really “just” streams of records, which will be further transformed without ever thinking about persisting any elements from those streams again. Sure there can be write operations as well, but a jOOQ (or SQL) write operation is also a multiset-based streaming of records (values) back into the database. That’s important to know, because all of

  • INSERT
  • UPDATE
  • DELETE
  • MERGE

statements are multiset-based, i.e. they take a set of values, not just a single row. For instance, INSERT:

-- Not all databases support this standard syntax:
INSERT INTO t (a, b, c)
VALUES (1, 2, 3),
       (4, 5, 6),
       (7, 8, 9);

-- But all databases support this one:
INSERT INTO t1 (a, b, c)
SELECT a, b, c
FROM t2;

Notice how this has absolutely nothing to do with identity-based object-graph persistence. In SQL, we’re always streaming a set of values from one place to another, possibly targeting a table where we store that set. The approach is really beautiful, try to think this way and it’ll open up a whole new world to the SQL-oriented programmer.

In a future article, I’ll even go a step further and claim that SQL is an (almost) completely side-effect free language (and this includes statements like INSERT – stay tuned).

Takeaway

In SQL, everything is a value. There is no identity. It is not needed, because SQL is a multiset-based language, where we’re always operating on the entire data set, not on individual records, even if CRUD operations may make you think otherwise. jOOQ encourages this way of thinking by putting the table and the “value-based” record into the center of the programming model.

3. ResultQuery is an Iterable

I’ve blogged about this before, and some users may have discovered it by accident, intrigued. A jOOQ ResultQuery is an Iterable, meaning that you can “foreach it”:

ResultQuery<?> query =
DSL.using(configuration)
   .select(PERSON.FIRST_NAME, PERSON.LAST_NAME)
   .from(PERSON);

// Java 5 style
for (Record record : query)
  System.out.println(record);

// Java 8 style
query.forEach(System.out::println);

It makes a lot of sense. A SQL query is a description of a set of tuples. SQL is a functional programming language, and if you forget about some concurrency aspects, it is, in principle, side-effect free. This means that the query really IS the set of tuples (another nice way to think about SQL!). With that thought in mind, we can simply iterate it.

To the procedural mind of many Java developers, this might be a bit funky and surprising, but give this a little thought and it might “click”. Consider also this previous article, claiming that streams, for comprehensions, and SQL are all the same:

Or also this fun tweet:

Takeaway

We’re not there yet in Java, we still explicitly iterate, but when we do, and the data source is a SQL query, make it a jOOQ query because that helps you forget about the difference between the query and the data, which are really the same thing in SQL.

4. Ordering is Nice When It’s Cheap. Let’s Retain It

You should avoid ORDER BY in SQL if you don’t really need it. Why? Because unless you can profit from an index that has already pre-ordered your result sets, sorting is a super expensive operation in all programming languages, including SQL. It’s essentially O(n log n).

But let’s assume you do have to sort your results, well, we better want to make sure that this ordering stays the same for as long as possible.

By default, jOOQ returns a Result type, or List types, but there are many utility methods like the ResultQuery.fetchMap() method, which can return something like this:

Map<Integer, String> people =
DSL.using(configuration)
   .select(PERSON.ID, PERSON.FIRST_NAME)
   .from(PERSON)
   .orderBy(PERSON.ID)
   .fetchMap(PERSON.ID, PERSON.FIRST_NAME);

Internally, jOOQ collects all data into a LinkedHashMap, which is a slightly more resource intensive map than the similar HashMap. In case you haven’t used this very often, it’s a map that preserves the insertion order when iterating the map using Map.entrySet() and all the other methods. Quite useful when displaying the map, too. After all, if you do specify the ordering, then you wanted that order to appear in the results, right?

In a similar way, when using Collections.sort() in Java, the sort algorithm guarantees that sorting is stable. If you sort a list twice, then the original ordering will be retained for elements that are not re-ordered. I.e. when sorting by first name, and then by last name, the first name ordering will be retained for equal last names.

Takeaway

ORDER BY is expensive, so if you go through the trouble of actually doing it, you want to retain that order.

5. Dynamic SQL is the Default

In the old days, people mostly wrote static SQL, e.g. using stored procedures in languages like PL/SQL. When you write an implicit cursor loop in PL/SQL:

FOR rec IN (SELECT * FROM person)
LOOP
  dbms_output.put_line(rec.first_name || ' ' || rec.last_name);
END LOOP;

… then, this SQL statement is compiled along with the surrounding procedural code and it will never be changed again. That’s useful for batch processing, reporting, etc. (Strictly speaking it isn’t really “static”, because the SQL statement will still be parsed by the SQL engine like any other query, but the PL/SQL programming model allows for hiding this from you).

In modern days, we require dynamic SQL very often, because the SQL code is often generated from user input. Mostly, because:

  • Users can add predicates through the UI
  • Users can specify aggregations through the UI
  • Users can specify ordering through the UI

In some more remote use-cases, users might also influence the JOIN tree and other parts of a dynamically created query.

From a JDBC perspective, all queries are dynamic, even if you’re doing something like this:

try (ResultSet rs = stmt.executeQuery(
  "SELECT * FROM person"
)) {
  while (rs.next())
    out.println(rs.getString(1) + " " + rs.getString(2));
}

Clearly, the SQL string seems “static” in the way that the Java compiler will compile it once and then never touch it again. The above program will always send the exact same SQL string to the server. Yet from a JDBC API perspective, the string is just an argument to the executeQuery() method, just as if we wrote it like this:

try (ResultSet rs = stmt.executeQuery(
  "SELECT * FROM person" + 
  (active ? " WHERE active = 1" : "")
)) {
  while (rs.next())
    out.println(rs.getString(1) + " " + rs.getString(2));
}

Yuck! String concatenation to build SQL strings. There’s a substantial risk of:

Of course, the above example is SQL injection “safe”, because the SQL string is entirely constructed from constants, not user input. But how quickly could the code be refactored to this?

try (ResultSet rs = stmt.executeQuery(
  "SELECT * FROM person" + 
  (active ? (" WHERE active = " + active) : "")
)) {
  while (rs.next())
    out.println(rs.getString(1) + " " + rs.getString(2));
}

SQL builders like jOOQ help prevent SQL injection, even for dynamic SQL queries. The above query will be written as follows in jOOQ:

for (PersonRecord rec : DSL.using(configuration)
        .selectFrom(person)
		.where(active
		    ? PERSON.ACTIVE.eq(active)
			: trueCondition()))
  out.println(rec.getFirstName() + " " + rec.getLastName());

The active flag check that is added to the SQL query dynamically will default to creating a bind variable, and even if it is inlined, it will be escaped, depending on its type.

The interesting bit here, however, is that the jOOQ query is always a dynamic SQL query. The above approach used an inline expression to decide whether a certain predicate needs to be added to the statement. If that predicate gets more complex, we can extract the construction of the predicate to a local variable, or a function.

Local variable

Condition condition = trueCondition();

if (active)
  condition = PERSON.ACTIVE.eq(active);
	
if (searchForFirstName)
  condition = condition.and(PERSON.FIRST_NAME.like(pattern));

for (PersonRecord rec : DSL.using(configuration)
        .selectFrom(person)
		.where(condition))
  out.println(rec.getFirstName() + " " + rec.getLastName());

This is quite neat.

Functions

Or, if things get even more complex, we might like to factor out the logic to a method, or a function. Some people have started calling such an approach “functional relational mapping”:

Condition personCondition(boolean active, String pattern) {
  Condition condition = trueCondition();

  if (active)
    condition = PERSON.ACTIVE.eq(active);
	
  if (pattern != null)
    condition = condition.and(PERSON.FIRST_NAME.like(pattern));
	
  return condition;
}

// And then:
for (PersonRecord rec : DSL.using(configuration)
        .selectFrom(person)
		.where(personCondition(active, pattern)))
  out.println(rec.getFirstName() + " " + rec.getLastName());

Or even:

BiFunction<Boolean, String, Condition> personCondition() {
  return (active, pattern) -> {
    Condition condition = trueCondition();

    if (active)
      condition = PERSON.ACTIVE.eq(active);
	
    if (pattern != null)
      condition = condition.and(PERSON.FIRST_NAME.like(pattern));
	
    return condition;
  };
}

// And then:
for (PersonRecord rec : DSL.using(configuration)
        .selectFrom(person)
		.where(personCondition.apply(active, pattern)))
  out.println(rec.getFirstName() + " " + rec.getLastName());

Not only is this approach to writing dynamic SQL extremely useful for client code that relies on dynamic SQL, the expression tree that is built behind the scenes is also available at runtime for more complex transformations, such as applying row level security to certain queries, or more simply to apply something like schema-based multi-tenancy. While the Java code stays exactly the same, the generated SQL string may be transformed by your own library code, behind the scenes.

Static SQL

Of course, jOOQ doesn’t imply that you have to write dynamic SQL. You can store jOOQ-generated SQL strings in caches, or you can use stored procedures with jOOQ. In fact, jOOQ encourages you to use stored procedures!

Takeaway

Dynamic SQL is really useful. jOOQ defaults to writing dynamic SQL in a way that you don’t even notice. A SQL query is a function just as much as it is a collection description. jOOQ helps you think about SQL this way.

Conclusion

SQL is a beautiful language with an interesting syntax. If we look at the concepts that are the foundation of the SQL language, we see that SQL queries are functional / declarative collection descriptions. With this paradigm in mind, we can write really powerful SQL statements, and jOOQ encourages this as this paradigm is at the core of the jOOQ API design.

Enjoy writing functional-relational mapping code.

Using Kotlin’s Apply Function for Dynamic SQL with jOOQ

It was hard to limit ourselves to 10 Nice Examples of Writing SQL in Kotlin With jOOQ, recently, because the Kotlin language has many nice little features that really help a lot when working with Java libraries. We’ve talked about the nice with() stdlib function, which allows to “import” a namespace for a local scope or closure:

with (AUTHOR) {
    ctx.select(FIRST_NAME, LAST_NAME)
       .from(AUTHOR)
       .where(ID.lt(5))
       .orderBy(ID)
       .fetch {
           println("${it[FIRST_NAME]} ${it[LAST_NAME]}")
       }
}

In the above example, the AUTHOR table is made available as the this reference in the closure following the with function, which works exactly like JavaScript’s with(). Everything in AUTHOR is available, without dereferencing it from AUTHOR.

Apply is very similar

A very similar feature is made available through apply(), although with different syntactic implications. Check out this Stack Overflow question for some details about with() vs. apply() in Kotlin.

When using jOOQ, apply() is most useful for dynamic SQL. Imagine you have local variables indicating whether some parts of a query should be added to the query:

val filtering = true;
val joining = true;

These boolean variables would be evaluated dynamically, of course. filtering specifies whether a dynamic filter / where clause is needed, whereas joining specifies whether an additional JOIN is required.

So, the following query will select authors, and:

  • if “filtering”, we’re selecting only author ID = 1
  • if “joining”, we’ll join the books table and count the number of books per author

Both of these predicates are independent. Enter the game: apply():

ctx.select(
      a.FIRST_NAME, 
      a.LAST_NAME, 
      if (joining) count() else value(""))
   .from(a)
   .apply { if (filtering) where(a.ID.eq(1)) }
   .apply { if (joining) join(b).on(a.ID.eq(b.AUTHOR_ID)) }
   .apply { if (joining) groupBy(a.FIRST_NAME, a.LAST_NAME) }
   .orderBy(a.ID)
   .fetch {
       println(it[a.FIRST_NAME] + " " + 
               it[a.LAST_NAME] +
               (if (joining) " " + it[count()] else ""))
   }

That’s neat! See, the jOOQ API doesn’t specify any apply() method / function, yet you can chain the apply() function to the jOOQ API as if it were natively supported.

Like with(), apply() makes a reference available to a closure as this, so it doesn’t have to be referenced explicitly anymore. Which means, we can write neat things like

   .apply { if (filtering) where(a.ID.eq(1)) }

Where a where() clause is added only if we’re filtering!

Of course, jOOQ (or any other query builder) lends itself to this kind of dynamic SQL, and it can be done in Java too:
https://www.jooq.org/doc/latest/manual/sql-building/dynamic-sql

But the Kotlin-specific fluent integration using apply() is exceptionally neat. Well done, Kotlin!

Side-note

This only works because the jOOQ DSL API of jOOQ 3.x is mutable and every operation returns the same this reference as was kindly pointed out by Ilya Ryzhenkov

In the future (e.g. version 4.0), we’re planning on making the jOOQ API more immutable – mutability is a historic legacy (although, often, it’s the desired behaviour for a query builder).

More nice Kotlin/jOOQ tricks in this article here.

A Functional Programming Approach to Dynamic SQL with jOOQ

Typesafe embedded DSLs like jOOQ are extremely powerful for dynamic SQL, because the query you’re constructing with the jOOQ DSL is a dynamic query by nature. You’re constructing a query expression tree using a convenient API (the “DSL”), even if you think your SQL statement is static. For instance:

for (Record rec : ctx.select(ACTOR.FIRST_NAME, ACTOR.LAST_NAME)
                     .from(ACTOR)
                     .where(ACTOR.FIRST_NAME.like("A%")))

    System.out.println(rec.get(ACTOR.FIRST_NAME) 
               + " " + rec.get(ACTOR.LAST_NAME));

The above query looks like a static SQL statement, the way you would write it in PL/SQL, for instance:

FOR rec IN (
  SELECT first_name, last_name
  FROM actor
  WHERE first_name LIKE 'A%'
) LOOP
  dbms_output.put_line(rec.first_name
             || ' ' || rec.last_name);
END LOOP;

The PL/SQL implicit cursor loop runs over the records produced by a pre-compiled SQL statement. That’s not the case with the jOOQ statement, in case of which the Java runtime re-creates the jOOQ statement expression tree every time afresh by dynamically creating an org.jooq.Select object, step by step (more about how the DSL works here).

Using jOOQ for actual dynamic SQL

As we’ve seen before, all jOOQ statements are dynamic statements, even if they “feel” static. Sometimes, you actually want a dynamic SQL query, e.g. when the user is allowed to specify custom predicates. In this case, you could do something like this:

// By default, make the dynamic predicate "TRUE"
Condition condition = DSL.trueCondition();

// If the user entered something in the text search field...
if (hasFirstNameSearch())
    condition = condition.and(FIRST_NAME.like(firstNameSearch()));

// If the user entered something in another text search field...
if (hasLastNameSearch())
    condition = condition.and(LAST_NAME.like(lastNameSearch()));

// The query now uses a dynamically created predicate
for (Record rec : ctx.select(ACTOR.FIRST_NAME, ACTOR.LAST_NAME)
                     .from(ACTOR)
                     .where(condition))

    System.out.println(rec.get(ACTOR.FIRST_NAME) 
               + " " + rec.get(ACTOR.LAST_NAME));

The above is not possible with PL/SQL easily, you’d have to resort to the dynamic SQL API called DBMS_SQL, which is about as verbose (and error-prone) as JDBC, as you’re concatenating SQL strings.

Adding functional programming to the mix

If you’re able to construct the entire query in a local scope, e.g. inside of a method, the above imperative style is quite sufficient. But sometimes, you may have something like a “base” query that you want to re-use all the time, and only sometimes, you want to add a custom predicate, or JOIN operation, etc.

In this case, using a more functional approach is optimal. For instance, you could offer a convenience API that produces a query fetching actor first and last names, with custom predicates:

// Higher order, SQL query producing function:
public static ResultQuery<Record2<String, String>> actors(
    Function<Actor, Condition> where
) {
    return ctx.select(ACTOR.FIRST_NAME, ACTOR.LAST_NAME)
              .from(ACTOR)
              .where(where.apply(ACTOR)));
}

The above utility method doesn’t actually execute the query, it just constructs it and takes a function as an argument. In the old days, this used to be called the “strategy pattern”, which is implemented much more easily with a function, than with an object oriented approach (see also Mario Fusco’s interesting blog series about the Gang of Four design patterns).

How to call the above utility? Easy!

// Get only actors whose first name starts with "A"
for (Record rec : actors(a -> a.FIRST_NAME.like("A%")))
    System.out.println(rec);

Now, this is not versatile enough yet, as we can pass only one function. How about this, instead:

@SafeVarargs
public static ResultQuery<Record2<String, String>> actors(
    Function<Actor, Condition>... where
) {
    return dsl().select(ACTOR.FIRST_NAME, ACTOR.LAST_NAME)
                .from(ACTOR)
                .where(Arrays.stream(where)
                             .map(f -> f.apply(ACTOR))
                             .collect(Collectors.toList()));
}

(notice how we can immediately execute and iterate over the ResultQuery, as it implements Iterable)

We can now call this with any number of input functions to form dynamic predicates. E.g.:

// Get all actors
for (Record rec : actors())
    System.out.println(rec);

// Get only actors whose first name starts with "A"
for (Record rec : actors(a -> a.FIRST_NAME.like("A%"))) {
    System.out.println(rec);

// Get actors whose first/last name matches "A% B%"
for (Record rec : actors(
        a -> a.FIRST_NAME.like("A%"),
        a -> a.LAST_NAME.like("B%"))) {
    System.out.println(rec);

You get the idea.

Conclusion

… the idea is that jOOQ is an extremely powerful SQL expression tree API, which allows you to dynamically construct SQL queries of arbitrary complexity. If you’re running a static query, this just means that all of your SQL expressions are constant every time you execute the query.

There are no limits to how far you can push this. We’ve seen jOOQ users write queries that dynamically assemble dozens of common table expressions with several levels of dynamically nested derived tables, too. If you have a crazy example to share, we’re looking forward to it!