How to Execute a SQL Query Only if Another SQL Query has no Results

I stumbled upon an interesting question on Stack Overflow recently. A user wanted to query a table for a given predicate. If that predicate returns no rows, they wanted to run another query using a different predicate. Preferably in a single query.

Challenge accepted!

Canonical Idea: Use a Common Table Expression

We’re querying the Sakila database and we’re trying to find films of length 120 minutes. If there are no such films, then let’s find films of length 130 minutes. The following query is formally correct and runs without any adaptations on all of Oracle, PostgreSQL and SQL Server (and probably on other DBs too, as it’s pretty standard):

WITH r AS (
  SELECT * FROM film WHERE length = 120
)
SELECT * FROM r
UNION ALL
SELECT * FROM film
WHERE length = 130
AND NOT EXISTS (
  SELECT * FROM r
)

How does it work?

The common table expression (WITH clause) wraps the first query that we want to execute no matter what. We then select from the first query, and use UNION ALL to combine the result with the result of the second query, which we’re executing only if the first query didn’t yield any results (through NOT EXISTS). We’re hoping here that the database will be smart enough to run the existence check on a pre-calculated set from the first subquery, in order to be able to avoid running the second subquery.

Let’s see, which database actually does this.

PostgreSQL

Running EXPLAIN ANALYZE

EXPLAIN ANALYZE
WITH r AS (
  SELECT * FROM film WHERE length = 120
)
SELECT * FROM r
UNION ALL
SELECT * FROM film
WHERE length = 130
AND NOT EXISTS (
  SELECT * FROM r
)

… we can see the following plan:

Append  (cost=68.50..137.26 rows=15 width=561) (actual time=0.052..0.300 rows=9 loops=1)
  CTE r
    ->  Seq Scan on film film_1  (cost=0.00..68.50 rows=9 width=394) (actual time=0.047..0.289 rows=9 loops=1)
          Filter: (length = 120)
          Rows Removed by Filter: 991
  ->  CTE Scan on r  (cost=0.00..0.18 rows=9 width=672) (actual time=0.051..0.297 rows=9 loops=1)
  ->  Result  (cost=0.02..68.52 rows=6 width=394) (actual time=0.002..0.002 rows=0 loops=1)
        One-Time Filter: (NOT $1)
        InitPlan 2 (returns $1)
          ->  CTE Scan on r r_1  (cost=0.00..0.18 rows=9 width=0) (actual time=0.000..0.000 rows=1 loops=1)
        ->  Seq Scan on film  (cost=0.00..68.50 rows=6 width=394) (never executed)
              Filter: (length = 130)
Planning time: 0.952 ms
Execution time: 0.391 ms

So, indeed, the database seems to be smart enough to avoid the second query, because the first one does yield 9 rows.

Can we see this in a benchmark as well? In principle, the complete query should take about as much time in a benchmark as the Common Table Expression alone. Here’s the benchmark logic:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 2000;
  rec RECORD;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT * FROM film WHERE length = 120
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', r, 
      (clock_timestamp() - v_ts); 
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        WITH r AS (
          SELECT * FROM film WHERE length = 120
        )
        SELECT * FROM r
        UNION ALL
        SELECT * FROM film
        WHERE length = 130
        AND NOT EXISTS (
          SELECT * FROM r
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', r, 
      (clock_timestamp() - v_ts); 
    RAISE INFO '';
  END LOOP;
END$$;

The result is:

INFO:  Run 1, Statement 1: 00:00:00.310325
INFO:  Run 1, Statement 2: 00:00:00.427744

INFO:  Run 2, Statement 1: 00:00:00.303202
INFO:  Run 2, Statement 2: 00:00:00.33568

INFO:  Run 3, Statement 1: 00:00:00.323699
INFO:  Run 3, Statement 2: 00:00:00.339835

INFO:  Run 4, Statement 1: 00:00:00.301084
INFO:  Run 4, Statement 2: 00:00:00.343838

INFO:  Run 5, Statement 1: 00:00:00.356343
INFO:  Run 5, Statement 2: 00:00:00.359891

As you can see, the second statement is consistently slower by around 5% – 10%. So we can safely say, the second subquery looking for length = 130 is not executed, but there’s still some overhead compared to making a decision in a client application to avoid that second subquery entirely. My guess here is that this is due to PostgreSQL’s Common Table Expression (CTE) being “optimisation fences”, i.e. the CTE is materialised every time. See also:
https://blog.2ndquadrant.com/postgresql-ctes-are-optimization-fences/

What about the inverse case?

In the above benchmark, we’ve measured how much time it takes when the first query succeeds (and the second query should be avoided). What about the inverse case, where the first query doesn’t match any rows and we have to run another query?

Benchmark time!

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 2000;
  rec RECORD;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT * FROM film WHERE length = 1200
      ) LOOP
        NULL;
      END LOOP;
      FOR rec IN (
        SELECT * FROM film WHERE length = 130
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', r, 
      (clock_timestamp() - v_ts); 
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        WITH r AS (
          SELECT * FROM film WHERE length = 1200
        )
        SELECT * FROM r
        UNION ALL
        SELECT * FROM film
        WHERE length = 130
        AND NOT EXISTS (
          SELECT * FROM r
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', r, 
      (clock_timestamp() - v_ts); 
    RAISE INFO '';
  END LOOP;
END$$;

The result is roughly the same:

INFO:  Run 1, Statement 1: 00:00:00.680222
INFO:  Run 1, Statement 2: 00:00:00.696036

INFO:  Run 2, Statement 1: 00:00:00.673141
INFO:  Run 2, Statement 2: 00:00:00.709034

INFO:  Run 3, Statement 1: 00:00:00.626873
INFO:  Run 3, Statement 2: 00:00:00.679469

INFO:  Run 4, Statement 1: 00:00:00.619584
INFO:  Run 4, Statement 2: 00:00:00.639092

INFO:  Run 5, Statement 1: 00:00:00.616275
INFO:  Run 5, Statement 2: 00:00:00.675317

A slight overhead in the single query case.

But what’s this? We didn’t even have an index on the LENGTH column. Let’s add one!

Now, the result is very different. Query 1 succeeds:

INFO:  Run 1, Statement 1: 00:00:00.055835
INFO:  Run 1, Statement 2: 00:00:00.093982

INFO:  Run 2, Statement 1: 00:00:00.038817
INFO:  Run 2, Statement 2: 00:00:00.084092

INFO:  Run 3, Statement 1: 00:00:00.041911
INFO:  Run 3, Statement 2: 00:00:00.078062

INFO:  Run 4, Statement 1: 00:00:00.039367
INFO:  Run 4, Statement 2: 00:00:00.081752

INFO:  Run 5, Statement 1: 00:00:00.039983
INFO:  Run 5, Statement 2: 00:00:00.081227

Query 1 fails:

INFO:  Run 1, Statement 1: 00:00:00.075469
INFO:  Run 1, Statement 2: 00:00:00.081766

INFO:  Run 2, Statement 1: 00:00:00.058276
INFO:  Run 2, Statement 2: 00:00:00.079613

INFO:  Run 3, Statement 1: 00:00:00.060492
INFO:  Run 3, Statement 2: 00:00:00.080672

INFO:  Run 4, Statement 1: 00:00:00.05877
INFO:  Run 4, Statement 2: 00:00:00.07936

INFO:  Run 5, Statement 1: 00:00:00.057584
INFO:  Run 5, Statement 2: 00:00:00.085798

Oracle

In Oracle, I couldn’t find any difference in execution speed (see below). The plan of a combined query also contains an element that prevents the execution of the second subquery. In this case, I’m using the /*+GATHER_PLAN_STATISTICS*/ hint to make sure we get actual execution values / times in our execution plan:

WITH r AS (
  SELECT * FROM film WHERE length = 120
)
SELECT /*+GATHER_PLAN_STATISTICS*/ * FROM r
UNION ALL
SELECT * FROM film
WHERE length = 130
AND NOT EXISTS (
  SELECT * FROM r
);

SELECT p.*
FROM (
  SELECT *
  FROM v$sql
  WHERE upper(sql_text) LIKE '%LENGTH = 120%'
  ORDER BY last_active_time DESC
  FETCH NEXT 1 ROW ONLY
) s 
CROSS APPLY TABLE(dbms_xplan.display_cursor(
  sql_id => s.sql_id, 
  format => 'ALLSTATS LAST'
)) p;
---------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |
---------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      9 |
|   1 |  UNION-ALL          |      |      1 |        |      9 |
|*  2 |   TABLE ACCESS FULL | FILM |      1 |      7 |      9 |
|*  3 |   FILTER            |      |      1 |        |      0 |
|*  4 |    TABLE ACCESS FULL| FILM |      0 |      7 |      0 |
|*  5 |    TABLE ACCESS FULL| FILM |      1 |      2 |      1 |
---------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("LENGTH"=120)
   3 - filter( IS NULL)
   4 - filter("LENGTH"=130)
   5 - filter("LENGTH"=120)

While the estimates are off just as in PostgreSQL (an error that can propagate, see conclusion), the actual rows for the second subquery is zero, and the second subquery is run zero times (“Starts”), because we don’t have to really access it at all. Excellent. Exactly what we expected!

Here, I’ve finally created a benchmark that anonymises the results properly by normalising them in order to comply with Oracle’s forbidding of publishing benchmark results. The fastest execution time is simply 1, and the other execution times are multiples of that value:

SET SERVEROUTPUT ON
CREATE TABLE results (
  run     NUMBER(2),
  stmt    NUMBER(2),
  elapsed NUMBER
);

DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 2000;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT * FROM film WHERE length = 120
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
  
    INSERT INTO results VALUES (r, 1, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        WITH r AS (
          SELECT * FROM film WHERE length = 120
        )
        SELECT * FROM r
        UNION ALL
        SELECT * FROM film
        WHERE length = 130
        AND NOT EXISTS (
          SELECT * FROM r
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    INSERT INTO results VALUES (r, 2, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
  END LOOP;
  
  FOR rec IN (
    SELECT 
      run, stmt, 
      CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(5, 4)) ratio 
    FROM results
  )
  LOOP
    dbms_output.put_line('Run ' || rec.run || 
      ', Statement ' || rec.stmt || 
      ' : ' || rec.ratio);
  END LOOP;
END;
/

DROP TABLE results;

The result being (query 1 succeeds, no index):

Run 1, Statement 1 : 1
Run 1, Statement 2 : 1.26901

Run 2, Statement 1 : 1.10218
Run 2, Statement 2 : 1.08792

Run 3, Statement 1 : 1.26038
Run 3, Statement 2 : 1.09426

Run 4, Statement 1 : 1.2245
Run 4, Statement 2 : 1.10829

Run 5, Statement 1 : 1.07164
Run 5, Statement 2 : 1.18562

Or in the inverse case (query 1 fails, no index):

Run 1, Statement 1 : 1
Run 1, Statement 2 : 1.17871

Run 2, Statement 1 : 1.07377
Run 2, Statement 2 : 1.12489

Run 3, Statement 1 : 1.05745
Run 3, Statement 2 : 1.13711

Run 4, Statement 1 : 1.11118
Run 4, Statement 2 : 1.23508

Run 5, Statement 1 : 1.08535
Run 5, Statement 2 : 1.11271

Adding an index doesn’t change much (query 1 succeeds):

Run 1, Statement 1 : 1.20699
Run 1, Statement 2 : 1.28221

Run 2, Statement 1 : 1
Run 2, Statement 2 : 1.21174

Run 3, Statement 1 : 1.0054
Run 3, Statement 2 : 1.2643

Run 4, Statement 1 : 1.0491
Run 4, Statement 2 : 1.31103

Run 5, Statement 1 : 1.02547
Run 5, Statement 2 : 1.23192

Yet, when query 1 fails:

Run 1, Statement 1 : 1.56287
Run 1, Statement 2 : 1.09471

Run 2, Statement 1 : 1.22219
Run 2, Statement 2 : 1.11227

Run 3, Statement 1 : 1.19739
Run 3, Statement 2 : 1.03929

Run 4, Statement 1 : 1.13503
Run 4, Statement 2 : 1

Run 5, Statement 1 : 1.14289
Run 5, Statement 2 : 1.01919

This time, the combined query is a bit faster!

As can be seen, both queries are executed in roughly the same time on Oracle 12c although again the single query seems to be a little bit slower, but not always. Which is an important reminder to do benchmarking properly! Meaning:

  • Repeat benchmarks several times
  • Beware of warmup penalties (the first run is often the slowest)
  • Beware of excessive caching effects in benchmarks
  • Don’t trust performance differences that aren’t significant
  • Don’t compile any Scala code or chat on Slack while benchmarking. Your system should be idle, otherwise
  • Remember to benchmark the right data set. We only have 600 films in this table. What would happen with 60 million films?

SQL Server

Same exercise again:

DECLARE @ts DATETIME;
DECLARE @repeat INT = 2000;
DECLARE @r INT;
DECLARE @i INT;
DECLARE @dummy VARCHAR;

DECLARE @s1 CURSOR;
DECLARE @s2 CURSOR;

DECLARE @results TABLE (
  run     INT,
  stmt    INT,
  elapsed DECIMAL
);

SET @r = 0;
WHILE @r < 5
BEGIN
  SET @r = @r + 1

  SET @s1 = CURSOR FOR 
    SELECT title FROM film WHERE length = 120;

  SET @s2 = CURSOR FOR 
    WITH r AS (
      SELECT * FROM film WHERE length = 120
    )
    SELECT title FROM r
    UNION ALL
    SELECT title FROM film
    WHERE length = 130
    AND NOT EXISTS (
      SELECT * FROM r
    );

  SET @ts = current_timestamp;
  SET @i = 0;
  WHILE @i < @repeat
  BEGIN
    SET @i = @i + 1

    OPEN @s1;
    FETCH NEXT FROM @s1 INTO @dummy;
    WHILE @@FETCH_STATUS = 0
    BEGIN
      FETCH NEXT FROM @s1 INTO @dummy;
    END;

    CLOSE @s1;
  END;

  DEALLOCATE @s1;
  INSERT INTO @results VALUES (@r, 2, DATEDIFF(ms, @ts, current_timestamp));

  SET @ts = current_timestamp;
  SET @i = 0;
  WHILE @i < @repeat
  BEGIN
    SET @i = @i + 1

    OPEN @s2;
    FETCH NEXT FROM @s2 INTO @dummy;
    WHILE @@FETCH_STATUS = 0
    BEGIN
      FETCH NEXT FROM @s2 INTO @dummy;
    END;

    CLOSE @s2;
  END;

  DEALLOCATE @s2;
  INSERT INTO @results VALUES (@r, 1, DATEDIFF(ms, @ts, current_timestamp));
END;

SELECT 'Run ' + CAST(run AS VARCHAR) + 
  ', Statement ' + CAST(stmt AS VARCHAR) + 
  ': ' + CAST(CAST(elapsed / MIN(elapsed) OVER() AS DECIMAL(10, 5)) AS VARCHAR)
FROM @results;

The result, this time, is more drastic (no index, query 1 succeeds):

Run 1, Statement 1: 1.07292
Run 1, Statement 2: 1.35000

Run 2, Statement 1: 1.07604
Run 2, Statement 2: 1.40625

Run 3, Statement 1: 1.08333
Run 3, Statement 2: 1.40208

Run 4, Statement 1: 1.09375
Run 4, Statement 2: 1.34375

Run 5, Statement 1: 1.00000
Run 5, Statement 2: 1.46458

There is a 30% – 40% overhead for the CTE solution over the two query solution. If we don’t find any rows in the first query (no index):

Run 1, Statement 1: 1.08256
Run 1, Statement 2: 1.27546

Run 2, Statement 1: 1.16512
Run 2, Statement 2: 1.27778

Run 3, Statement 1: 1.00000
Run 3, Statement 2: 1.26235

Run 4, Statement 1: 1.04167
Run 4, Statement 2: 1.26003

Run 5, Statement 1: 1.05401
Run 5, Statement 2: 1.34259

… then the difference is slightly less drastic but still clear. The reason here is that SQL Server doesn’t avoid the unnecessary subquery:

Too bad! (Note I was using SQL Server 2014. Perhaps in 2016, this optimisation is implemented)

Note, you can trust me that adding an index doesn’t change much in this case.

Conclusion

We’ve seen that we can easily solve the original problem with SQL only: Select some data from a table using predicate A, and if we don’t find any data for predicate A, then try finding data using predicate B from the same table.

Oracle and PostgreSQL can both optimise away the unnecessary query 2 by inserting a “probe” in their execution plans that knows whether the query 2 needs to be executed or not. In Oracle, we’ve even seen a situation where the combined query outperforms two individual queries. SQL Server 2014 surprisingly does not have such an optimisation.

While the performance impact was negligible in all benchmarks (even in SQL Server), we should be careful with these kinds of queries and not entirely rely on the optimiser to “get it right”. In all three databases, the cardinality estimates were off. We’re working with small data sets, but if data sets grow larger, and queries like the above are embedded in more complex queries, then the wrong cardinality estimates can easily produce wrong execution plans (e.g. favouring hash join over nested loop joins because of a high number of estimated rows). An example of this was given in a previous blog post.

Nevertheless, we can get quite far with SQL, without resorting to procedural client languages and if I had conducted my benchmark with a JDBC client instead of procedural blocks directly inside of the database, perhaps the single query would have outperformed the double query case – at least in those cases where query 1 yielded no rows and query 2 had to be executed from a remote client. Probably in Oracle.

Ultimately, I can only repeat myself. Measure! Measure! Measure! There’s no point in guessing. Truth can only be found by measuring actual executions.

When to Use Bind Values, and When to Use Inline Values in SQL

Users of jOOQ, PL/SQL, T-SQL are spoiled as they hardly ever need to worry about bind values. Consider the following statements:

Using jOOQ

public int countActors(String firstName, String lastName) {
    return ctx.selectCount()
              .from(ACTOR)
              .where(ACTOR.FIRST_NAME.eq(firstName))
              .and(ACTOR.LAST_NAME.eq(lastName))
              .fetchOneInto(int.class);
    );
}

The method parameters firstName and lastName will be automatically mapped to bind values in the generated SQL statement. Here’s the debug log output when running the above, where the first statement is sent to the JDBC driver and then to the database, wheas the second statement is generated for debugging purposes only:

Executing query          : 
    select count(*)
    from "SAKILA"."ACTOR"
    where (
      "SAKILA"."ACTOR"."FIRST_NAME" = ?
      and "SAKILA"."ACTOR"."LAST_NAME" = ?
    )
-> with bind values      : 
    select count(*)
    from "SAKILA"."ACTOR"
    where (
      "SAKILA"."ACTOR"."FIRST_NAME" = 'SUSAN'
      and "SAKILA"."ACTOR"."LAST_NAME" = 'DAVIS'
    )
Fetched result           : +-----+
                         : |count|
                         : +-----+
                         : |    2|
                         : +-----+

With this API design, users don’t have to worry about binding variables explicitly at all, nor about remembering bind variable indexes or the data type of a bind value. This works because the overloaded Field<T>.eq(T) method (as well as pretty much every other method that works in a similar way) internally delegates to the more generic Field<T>.eq(Field<T>) method by wrapping the argument value in an explicit bind variable expression DSL.val(T) in the jOOQ SQL expression tree.

Using PL/SQL

The same is true when you’re using PL/SQL (or any other stored procedure language of another database), for instance:

SET SERVEROUTPUT ON
DECLARE
  
  FUNCTION count_actors (
    p_first_name VARCHAR2, 
    p_last_name VARCHAR2
  ) RETURN NUMBER IS 
    v_result NUMBER(10);
  BEGIN
    SELECT count(*)
    INTO v_result
    FROM actor
    WHERE first_name = p_first_name
    AND last_name = p_last_name;
    
    RETURN v_result;
  END count_actors;
  
BEGIN
  dbms_output.put_line(count_actors('SUSAN', 'DAVIS'));
END;
/

To be sure what happened, let’s consider the execution plan:

SELECT p.*
FROM (
  SELECT *
  FROM v$sql
  WHERE upper(sql_text) LIKE 'SELECT COUNT(*) FROM ACTOR%'
  ORDER BY last_active_time DESC
  FETCH NEXT 1 ROW ONLY
) s 
CROSS APPLY TABLE(dbms_xplan.display_cursor(sql_id => s.sql_id)) p;

As you can see below, bind variables were generated for the SQL query that was embedded in the PL/SQL function:

SQL_ID  9dgammbskx5tx, child number 0
-------------------------------------
SELECT COUNT(*) FROM ACTOR WHERE FIRST_NAME = :B2 AND LAST_NAME = :B1
 
Plan hash value: 3384208144
 
----------------------------------------------------
| Id  | Operation         | Name           | Rows  |
----------------------------------------------------
|   0 | SELECT STATEMENT  |                |       |
|   1 |  SORT AGGREGATE   |                |     1 |
|*  2 |   INDEX RANGE SCAN| IDX_ACTOR_NAME |     1 |
----------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("LAST_NAME"=:B1 AND "FIRST_NAME"=:B2)

Why is it important? 1: SQL injection

I’ve noticed this time and again when talking to Java developers: Many developers are aware of the risk of SQL injection when not using bind variables. This can happen when we would write dynamic SQL like this, e.g. using JDBC or jOOQ’s plain SQL API:

String sql = 
  "SELECT count(*) "
+ "FROM actor "
+ "WHERE 1 = 1 "
+ (firstName != null ? "AND first_name = " + firstName + " " : "")
+ (lastName != null ? "AND last_name = " + lastName + " " : "");

// JDBC
try (Statement s = connection.createStatement();
     ResultSet rs = s.executeQuery(sql)) {

    ...
}

// jOOQ
Result<?> result = ctx.fetch(sql);

Seriously. Don’t do this. Ever! Always use bind values for user input. There’s hardly any reason at all why you should inline the values. I mean, of course, you could if you always remember to manually escape all strings, e.g.:

public static String escape(String string) {
    // TODO: Handle MySQL's non-standard backslash escaping, too
    return string == null ? null : string.replace("'", "''");
}

And then:

String sql = 
  "SELECT count(*) "
+ "FROM actor "
+ "WHERE 1 = 1 "
+ (firstName != null ? "AND first_name = " + escape(firstName) + " " : "")
+ (lastName != null ? "AND last_name = " + escape(lastName) + " " : "");

Notice that there’s still a vulnerability risk in MySQL, which doesn’t necessarily conform to standard SQL string literal escaping. A very unfortunate MySQL “feature”, which is handled correctly by jOOQ:
https://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sqlmode_no_backslash_escapes

But why run the risk? It’s so much easier with bind values.

Why is it important? 2: Performance!

So, most Java developers, luckily, are aware of SQL injection vulnerabilities and mostly get this right. But there’s another thing that few Java developers are aware of, unfortunately. And that’s the performance aspect of using bind values. Let’s assume we’re not using bind values for the above query. Check this out:

SELECT count(*)
FROM actor
WHERE first_name = 'SUSAN'
AND last_name = 'DAVIS';

SELECT count(*)
FROM actor
WHERE first_name = 'NICK'
AND last_name = 'WAHLBERG';

And now, let’s find execution plans:

SELECT p.*
FROM (
  SELECT *
  FROM v$sql
  WHERE upper(sql_text) LIKE 'SELECT COUNT(*) FROM ACTOR%'
  ORDER BY last_active_time DESC
  FETCH NEXT 2 ROWS ONLY
) s 
CROSS APPLY TABLE(dbms_xplan.display_cursor(sql_id => s.sql_id)) p;

Result:

SQL_ID  12r8afykqkzbd, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE first_name = 'NICK' AND last_name = 'WAHLBERG'
 
Plan hash value: 3384208144
 
----------------------------------------------------
| Id  | Operation         | Name           | Rows  |
----------------------------------------------------
|   0 | SELECT STATEMENT  |                |       |
|   1 |  SORT AGGREGATE   |                |     1 |
|*  2 |   INDEX RANGE SCAN| IDX_ACTOR_NAME |     1 |
----------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("LAST_NAME"='WAHLBERG' AND "FIRST_NAME"='NICK')
 
SQL_ID  gfppqr9gpjqws, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE first_name = 'SUSAN' AND last_name = 'DAVIS'
 
Plan hash value: 3384208144
 
----------------------------------------------------
| Id  | Operation         | Name           | Rows  |
----------------------------------------------------
|   0 | SELECT STATEMENT  |                |       |
|   1 |  SORT AGGREGATE   |                |     1 |
|*  2 |   INDEX RANGE SCAN| IDX_ACTOR_NAME |     1 |
----------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("LAST_NAME"='DAVIS' AND "FIRST_NAME"='SUSAN')

As you can see, the two almost identical queries produced two times the exact same execution plan (same plan hash value), but they are stored under two distinct SQL_ID values. Yes, they’re distinct SQL statements!

Why?

Most sophisticated databases (including Oracle, SQL Server, DB2 and others) implement an execution plan cache. The reason for this is simple. Calculating an execution plan with cost based optimisation is expensive. Not for these trivial statements, but imagine you have dozens of JOINs, semi joins, unions, and what not. There are thousands of candidate execution plans, and the database needs to find the best one for you. That can be tons of work and we don’t want to let the database do that work every time.

So, the database will run a soft-parse (Oracle speak) to identify a SQL query and translate its SQL string to a SQL_ID. It will then check if there’s already a suitable plan available for that particular SQL_ID, and if so, it will avoid the so-called hard-parse (Oracle speak) to calculate a new plan.

Let me repeat this one more time: Making best use of this plan cache (Oracle speak: Cursor cache) is extremely important. You can severely overload a system up to the point of bringing it down, if you’re not using bind variables.

There’s a workaround to use CURSOR_SHARING=FORCE, which will transform inline values to bind values, but I’m not even going to explain how it works, because most Oracle experts advise you not to use that feature due to the significant drawbacks it will bring.

What if I want to inline values?

As we’ve seen above, when using jOOQ or PL/SQL, the above problems are really non-discussions, because it is quite unlikely that you run into a situation of accidentally inlining your bind values:

  • In jOOQ, you’d have to use plain SQL
  • In PL/SQL, you’d have to use dynamic SQL using DBMS_SQL

Anyway, in rare cases, users want to inline their bind variables for them to appear directly in the resulting SQL statement. This is never the case for user input, for ID values, or for ordinary search values. But it can be the case when you query for discriminators or “enum” values (enforced through a CHECK constraint, for instance), or when you run a report only once a year (plan is never available from the cache).

In these cases, it can be of advantage to not use bind values specifically to prevent the database from re-using a cached execution plan, because you know that the cached plan, which works well for 90% of the queries, won’t work well for this particular bind value (or in the case of the once-per-year report, you might get a slightly better plan by giving the database more information).

An example, let’s insert 1×1 and 99999×0 into a table:

CREATE TABLE skewed (
  v NUMBER(1)
);

INSERT INTO skewed
SELECT decode(level, 1, 1, 0)
FROM dual
CONNECT BY level <= 100000;

CREATE INDEX skewed_i ON skewed(v);

EXEC dbms_stats.gather_table_stats('TEST', 'SKEWED');

Now, clearly we see that when querying the table SKEWED for values V = 0, the index is useless, whereas it is very useful for values V = 1. Let’s run this statement here:

SET SERVEROUTPUT ON
DECLARE
  v_bind NUMBER(1);
  v_result NUMBER(10);
BEGIN
  v_bind := 0;
  
  SELECT count(*)
  INTO v_result
  FROM skewed
  WHERE v = v_bind;
  
  FOR rec IN (
    SELECT * FROM TABLE(dbms_xplan.display_cursor)
  ) LOOP
    dbms_output.put_line(rec.plan_table_output);
  END LOOP;

  v_bind := 1;
  
  SELECT count(*)
  INTO v_result
  FROM skewed
  WHERE v = v_bind;
  
  FOR rec IN (
    SELECT * FROM TABLE(dbms_xplan.display_cursor)
  ) LOOP
    dbms_output.put_line(rec.plan_table_output);
  END LOOP;
  
  SELECT count(*)
  INTO v_result
  FROM skewed
  WHERE v = 1;
  
  FOR rec IN (
    SELECT * FROM TABLE(dbms_xplan.display_cursor)
  ) LOOP
    dbms_output.put_line(rec.plan_table_output);
  END LOOP;
END;

The above block runs the exact same statement three times:

  1. With a bind variable of 0 (can’t really use the index)
  2. With a bind variable of 1 (should be using the index)
  3. With an inline value of 1 (should be using the index)

Here’s the result:

SQL_ID  1q0qjm8za06w3, child number 0
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = :B1
 
Plan hash value: 4055318479
 
--------------------------------------------------
| Id  | Operation             | Name     | Rows  |
--------------------------------------------------
|   0 | SELECT STATEMENT      |          |       |
|   1 |  SORT AGGREGATE       |          |     1 |
|*  2 |   INDEX FAST FULL SCAN| SKEWED_I | 99999 |
--------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("V"=:B1)
 
SQL_ID  1q0qjm8za06w3, child number 0
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = :B1
 
Plan hash value: 4055318479
 
--------------------------------------------------
| Id  | Operation             | Name     | Rows  |
--------------------------------------------------
|   0 | SELECT STATEMENT      |          |       |
|   1 |  SORT AGGREGATE       |          |     1 |
|*  2 |   INDEX FAST FULL SCAN| SKEWED_I | 99999 |
--------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("V"=:B1)
 
SQL_ID  bdpjxzqpg2416, child number 0
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = 1
 
Plan hash value: 276090370
 
----------------------------------------------
| Id  | Operation         | Name     | Rows  |
----------------------------------------------
|   0 | SELECT STATEMENT  |          |       |
|   1 |  SORT AGGREGATE   |          |     1 |
|*  2 |   INDEX RANGE SCAN| SKEWED_I |     1 |
----------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("V"=1)

Observe the estimated numbers of rows above (in red):

  1. In the first case, the estimate is correct. With V = 0, we’ll get 99999 rows, so we might just as well scan the entire index to calculate the count value
  2. In the second case, we’ve inherited the cached execution plan from the first run, including the estimate of 99999 rows, which is clearly wrong in this case. The database should’ve estimated 1 row here
  3. In the third case, the estimate is again correct (note: different SQL_ID, and we get an optimal plan

We’re just out of luck. If we had inversed the order of queries, we would have gotten the right estimate for V = 1 but a wrong estimate for V = 0

Adaptive Cursor Sharing

Oracle knows some features to remedy the above problems. Oracle 11g introduced adaptive cursor sharing, which means that if we re-execute the above statements, the database will have already figured out that in this particular case, the plans should depend on the actual bind variable, because in hindsight, the second plan was wrong for V = 1

On a second execution of the previous block, we’ll see:

SQL_ID  1q0qjm8za06w3, child number 1
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = :B1
 
Plan hash value: 4055318479
 
--------------------------------------------------
| Id  | Operation             | Name     | Rows  |
--------------------------------------------------
|   0 | SELECT STATEMENT      |          |       |
|   1 |  SORT AGGREGATE       |          |     1 |
|*  2 |   INDEX FAST FULL SCAN| SKEWED_I | 99999 |
--------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("V"=:B1)
 
SQL_ID  1q0qjm8za06w3, child number 2
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = :B1
 
Plan hash value: 276090370
 
----------------------------------------------
| Id  | Operation         | Name     | Rows  |
----------------------------------------------
|   0 | SELECT STATEMENT  |          |       |
|   1 |  SORT AGGREGATE   |          |     1 |
|*  2 |   INDEX RANGE SCAN| SKEWED_I |     1 |
----------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("V"=:B1)
 
SQL_ID  bdpjxzqpg2416, child number 0
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = 1
 
Plan hash value: 276090370
 
----------------------------------------------
| Id  | Operation         | Name     | Rows  |
----------------------------------------------
|   0 | SELECT STATEMENT  |          |       |
|   1 |  SORT AGGREGATE   |          |     1 |
|*  2 |   INDEX RANGE SCAN| SKEWED_I |     1 |
----------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("V"=1)

As you can see above, the first two executions are still using the same plan, but they now have a new “child number” value, indicating that for any given SQL_ID, there are now different candidate plan hash values, depending on the bind variable input that we’re getting.

This often works well with the following caveats:

  • It only works after the plan was wrong at least once. This translates to at least one poor user who suffers from a slow query
  • It only works if the plans remain in the cache (which isn’t the case for rarely executed queries). Once the plans are purged from the cache, we have to run the query again at least twice for the alternative “child numbers” to appear
  • It stops working well if your query is much more complex, i.e. when complicated correlations exist between the bind variables that should each produce different plans

So, if in doubt, in this particular case, it’s not a bad idea to simply use an inline value / constant literal in the query directly, to help the optimiser make the right choice.

Another interesting article about this topic is here, where the optimiser will always be fooled by bind variables, and where you should always use inline values: When querying discriminators from views.

Adaptive Execution Plans

Oracle 12c took adaptive cursor sharing one step further and now supports adaptive execution plans, meaning that certain execution plans are known to be “shaky” in advance and the optimiser already pre-calculated a fallback plan that applies if the estimates are wrong for a given execution. In that case, the plan can be changed “in flight” and an alternative plan, which is better for a particular execution, is applied.

This feature is still not too stable in Oracle 12cR1, which is why some Oracle experts generally recommend turning it off.

Conclusion

You should always use bind values by default. In 99% of all cases, they’re the right choice for two reasons:

  • SQL injection prevention (obvious)
  • Statement caching optimisation (less obvious)

The latter reason is one that is not really well known among developers, because it’s not a problem that appears on development environments. It’s a production-only problem that happens under heavy load. Yet, you should be aware of this problem, and always remember to avoid generating too many distinct SQL strings (see also this related article about IN lists)

In rare cases, it is better to use inline values / literals, as this will help the optimiser make a much better choice in a predictable manner. These cases include:

  • Querying skewed data (unless adaptive features can be expected to kick in)
  • Querying discriminators (in this case, it’s always advisable to use inline values)

Using languages like PL/SQL, T-SQL, pgplsql, or APIs like jOOQ definitely helps you get this right, because you don’t have to think about this problem anymore. You’ll get it right automatically.

Side-note: Hibernate’s Criteria Query

Like jOOQ, Hibernate supports a type safe DSL for constructing JPQL queries, which to some extent cover basic SQL functionality when querying entities. Hibernate historically chose quite interesting defaults (as of version 5.2.10):

  • String values are always transformed to bind values, regardless if you’re using implicit values, explicit parameters, or explicit literals. The goal of this is to prevent SQL injection (because currently, Hibernate doesn’t auto-escape inline string literals)
  • Implicit integer values are always inlined
  • Explicit parameters are always sent as parameters
  • Explicit literals are usually sent as literals (unless they’re strings)

The above implementation is unfortunate as we’ve seen before for these reasons:

  • Bind values should always be the default, especially when using ID values, which are integers, and as such, often inlined in the current implementation. Luckily, this has been recognised as a bug and will be fixed, soon: https://hibernate.atlassian.net/browse/HHH-9576
  • Literals should be sent to the server as literals, because when users need literals (e.g. in the above edge cases), they don’t want the API to override this behaviour through bind values. This might also be solved soon: https://hibernate.atlassian.net/browse/HHH-11778

I’m currently working with the team to remedy these problems, such that the criteria API won’t inhibit users from a performance perspective:

How to Generate at Least One Row in SQL

There are some situations where you would like to have at least one (empty) row in your result set in SQL.

Imagine the following situation. We’re querying the Sakila database for actors and their films:

SELECT first_name, last_name, title
FROM actor
JOIN film_actor USING (actor_id)
JOIN film USING (film_id)
ORDER BY 1, 2, 3

yielding something like:

+------------+-----------+---------------------+
| FIRST_NAME | LAST_NAME | TITLE               |
+------------+-----------+---------------------+
| ...        | ...       | ...                 |
| ADAM       | GRANT     | SEABISCUIT PUNK     |
| ADAM       | GRANT     | SPLENDOR PATTON     |
| ADAM       | GRANT     | TADPOLE PARK        |
| ADAM       | GRANT     | TWISTED PIRATES     |
| ADAM       | GRANT     | WANDA CHAMBER       |
| ADAM       | HOPPER    | BLINDNESS GUN       |
| ADAM       | HOPPER    | BLOOD ARGONAUTS     |
| ADAM       | HOPPER    | CHAMBER ITALIAN     |
| ...        | ...       | ...                 |
+------------+-----------+---------------------+

Now, let’s find actors called SUSAN, and in fact, let’s not care if they played in any films (I’ve added them to the Sakila database for the sake of the example):

SELECT actor_id, first_name, last_name, title
FROM actor
LEFT JOIN film_actor USING (actor_id)
LEFT JOIN film USING (film_id)
WHERE first_name = 'SUSAN'
ORDER BY 1, 2, 3

Interesting, there are now two actors without any films:

+----------+------------+-----------+-----------------+
| ACTOR_ID | FIRST_NAME | LAST_NAME | TITLE           |
+----------+------------+-----------+-----------------+
| 110      | SUSAN      | DAVIS     | TROJAN TOMORROW |
| 110      | SUSAN      | DAVIS     | WASH HEAVENLY   |
| 110      | SUSAN      | DAVIS     | WORDS HUNTER    |
| 201      | SUSAN      | DAVIS     |                 |
| 202      | SUSAN      | SMITH     |                 |
+----------+------------+-----------+-----------------+

This worked, because I have changed the JOIN type from INNER JOIN to LEFT JOIN. That’s neat. But what if we hadn’t found any actor called SUSAN? What if we were looking for SUSANNE instead?

SELECT actor_id, first_name, last_name, title
FROM actor
LEFT JOIN film_actor USING (actor_id)
LEFT JOIN film USING (film_id)
WHERE first_name = 'SUSANNE'
ORDER BY 1, 2, 3

Empty. Void. Nothing:

+----------+------------+-----------+-----------------+
| ACTOR_ID | FIRST_NAME | LAST_NAME | TITLE           |
+----------+------------+-----------+-----------------+

That’s fine in most cases, because the way we wrote this query, we’re expecting:

All actors called Susanne and their films, if any

But what if we wanted to have the same behaviour as we got for Films through LEFT JOIN also with the actors? I.e. if we wanted this, instead (i.e. a collection with 1..N cardinality):

+----------+------------+-----------+-----------------+
| ACTOR_ID | FIRST_NAME | LAST_NAME | TITLE           |
|          |            |           |                 |
+----------+------------+-----------+-----------------+

What would you need this for? Well, sometimes, we can simply not handle the depressing sadness of emptiness.

How to do this? We need another LEFT JOIN prepended to the ACTOR table, but not just to the ACTOR table itself, we need to prepend it to everything. E.g. like this:

SELECT actor_id, first_name, last_name, title

-- This dummy table will always generate exactly one row
FROM (
  SELECT 1 a
) a

-- This is the original query, including the WHERE clause
LEFT JOIN (
  SELECT *
  FROM actor
  LEFT JOIN film_actor USING (actor_id)
  LEFT JOIN film USING (film_id)
  WHERE first_name = 'SUSANNE'
) b ON 1 = 1
ORDER BY 1, 2, 3

The above query is guaranteed to produce at least one row, because the left side of the LEFT JOIN always produces exactly one row, which is joined to every row on the right side, if there is any row on the right side.

Caveats:

  • The WHERE clause (and potentially other clauses, like GROUP BY) must now go inside of the new derived table B. Otherwise, we’d be removing that single row from A again using WHERE. (This is because of the order of SQL operations. We must ensure WHERE “happens-before” LEFT JOIN)
  • The LEFT JOIN between A and B needs an ON clause for syntactic reasons, even if we don’t really need that here. Just put something that is always true (like TRUE in PostgreSQL).
  • Our result now has an additional, useless column A, which might bother us, e.g. when using SELECT *

Alternative: OUTER APPLY

If you’re using SQL Server or Oracle 12c, there’s an even more elegant solution using OUTER APPLY:

SQL Server

SELECT actor_id, first_name, last_name, title
FROM (SELECT 1 a) a
OUTER APPLY (
  SELECT a.actor_id, a.first_name, a.last_name, f.title
  FROM actor a
  LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
  LEFT JOIN film f ON fa.film_id = f.film_id
  WHERE first_name = 'SUSANNE'
) b
ORDER BY 1, 2, 3

Oracle 12c

SELECT actor_id, first_name, last_name, title
FROM (SELECT 1 a FROM dual) a
OUTER APPLY (
  SELECT *
  FROM actor
  LEFT JOIN film_actor USING (actor_id)
  LEFT JOIN film USING (film_id)
  WHERE first_name = 'SUSANNE'
) b
ORDER BY 1, 2, 3

While we don’t actually need the nice APPLY feature here, it just allows us to omit the ON clause and still have the LEFT OUTER semantics. Neat ey?

Geek bonus

And if you really want to geek out on this functionality, consider using the dee table from the dum/dee PostgreSQL example. Remember, the dee table is a table with exactly one row and no columns! This means we can use SELECT * without getting this dummy row!

SELECT *

-- This dummy table will always generate exactly one row
FROM dee

-- This is the original query, including the WHERE clause
LEFT JOIN (
  SELECT *
  FROM actor
  LEFT JOIN film_actor USING (actor_id)
  LEFT JOIN film USING (film_id)
  WHERE first_name = 'SUSANNE'
) b ON 1 = 1
ORDER BY 1, 2, 3

Ahh. Beautiful SQL!

10 Nice Examples of Writing SQL in Kotlin With jOOQ

Kotlin is the next big thing. With Google announcing official support for Kotlin on Android, we’ll see a lot more traction for this lovely language.

We’ve already blogged about the Kotlin language recently: 10 Features I Wish Java Would Steal From the Kotlin Language.

Should you migrate your application to Kotlin? Perhaps – there are many questions. The most important question is (of course): What are you going to do with your awesome SQL code?

The answer is: Use jOOQ! Need convincing? Here are 10 nice examples of writing SQL in Kotlin with jOOQ:

(Note: All the code is available on GitHub)

1. Using Kotlin’s Map Access Literals With jOOQ Records

This is a free feature that you’re getting from Kotlin thanks to their smart move of supporting certain operators on custom types by convention. Check out how easy it is to access a value from a record using a column reference:

val a = AUTHOR
val b = BOOK

ctx.select(a.FIRST_NAME, a.LAST_NAME, b.TITLE)
   .from(a)
   .join(b).on(a.ID.eq(b.AUTHOR_ID))
   .orderBy(1, 2, 3)
   .forEach {
       println("${it[b.TITLE]} by "
             + "${it[a.FIRST_NAME]} ${it[a.LAST_NAME]}")
   }

The output from our sample code is this:

1984 by George Orwell
Animal Farm by George Orwell
Brida by Paulo Coelho
O Alquimista by Paulo Coelho

We’re already using a couple of useful language features here. First off:

val for local variable type inference

We’re renaming AUTHOR to a and BOOK to b in our code to shorten table names a little bit. Using val we don’t even need to explicitly refer to the type of those tables.

As it looks right now, we’re going to get that feature in the Java language as well, so stay tuned.

(More about val in item #3)

More info about this feature here.

String interpolation

As you can see, inside of the String literal, we’re using the ${dollar.notation} to access variables from the context.

More info about this feature here.

(More about string interpolation in item #5)

Implicit lambda parameter “it”

When writing single-parameter lambda expressions, which is typical for loops, mapping, and many other standard library features, we don’t need to actually name that parameter in Kotlin. The default name is “it”. Groovy developers will rejoice, as they can profit from the same feature.

More info about this feature here.

Finally: Operator overloading

Kotlin supports a nice and pragmatic approach to operator overloading. We’re making use of the fact that the following two things mean exactly the same in Kotlin:

val x1 = anything[key];
val x2 = anything.get(key);

anything[key] = value;
anything.set(key, value);

Yeah, why not? In jOOQ, we’ve already seen this coming in version 3.9, and we’ve thus added get() and set() methods on our ubiquitous Record type, so you can access record values by:

  • Field reference (with type safety)
  • Column name (no type safety)
  • Index (no type safety)

Note that operator overloading can be used as well with a couple of arithmetic operators:

ctx.select(TRANSACTION.AMOUNT + 10) // Translates to Field.plus(10)
   .from(TRANSACTION)
   .fetch();

More info about this feature here.

2. Loop Over Record Contents

Another really nice feature of the language is destructuring of “tuples” into several local variables. Quite a few languages support this now, and so does Kotlin, even without first-level language support for real tuples. But often, we don’t actually need tuples, we just need their syntax. This can be done with any Map, for instance. Consider this code:

for (r in ctx.fetch(b))
    for ((k, v) in r.intoMap())
        println("${r[b.ID]}: ${k.padEnd(20)} = $v")

The output from our sample code is this:

1: ID                   = 1
1: AUTHOR_ID            = 1
1: TITLE                = 1984
1: PUBLISHED_IN         = 1948
2: ID                   = 2
2: AUTHOR_ID            = 1
2: TITLE                = Animal Farm
2: PUBLISHED_IN         = 1945
3: ID                   = 3
3: AUTHOR_ID            = 2
3: TITLE                = O Alquimista
3: PUBLISHED_IN         = 1988
4: ID                   = 4
4: AUTHOR_ID            = 2
4: TITLE                = Brida
4: PUBLISHED_IN         = 1990

This is really nice! Every jOOQ record can be represented as a Map using Record.intoMap(), which can be destructured automatically in loops as can be seen above.

More info about this feature here.

3. Local Variable Type Inference

An API that makes use of generics as heavily as jOOQ can leave quite some burden on the user occasionally. Type safety is great for the DSL, e.g. in jOOQ, the Java / Kotlin compilers can type-check things like:

// Type checks: Both sides are of type string
AUTHOR.FIRST_NAME.in("Alice", "Bob");

// Type checks: 
// - Both sides are of type string
// - Both sides are of degree one
AUTHOR.FIRST_NAME.in(
  select(CUSTOMER.FIRST_NAME).from(CUSTOMER)
);

// Doesn't type check: Wrong degree
AUTHOR.FIRST_NAME.in(
  select(CUSTOMER.FIRST_NAME, CUSTOMER.LAST_NAME).from(CUSTOMER)
)

This works through fancy types like:

Select<Record2<String, String>> subselect =
  select(CUSTOMER.FIRST_NAME, CUSTOMER.LAST_NAME).from(CUSTOMER);

These types can get quite hairy as they go up to 22 (higher degrees are supported without type safety). So, in Kotlin, we can simply write:

val subselect =
  select(CUSTOMER.FIRST_NAME, CUSTOMER.LAST_NAME).from(CUSTOMER);

// Still doesn't type check:
AUTHOR.FIRST_NAME.in(subselect)

// This works:
row(AUTHOR.FIRST_NAME, AUTHOR.LAST_NAME).in(subselect)

More info about this feature here.

4. Inline Functions for New jOOQ Features

Are you missing a feature in jOOQ, for instance the PostgreSQL ilike operator? Well, it is supported through the Field.likeIgnoreCase() method, but you might prefer the more native looking ilike operator. No problem with Kotlin. Just write an inline function (which works similar to an extension function but they’re not exactly the same thing):

inline fun <F: Field<String>> F.ilike(field: String): Condition {
    return condition("{0} ilike {1}", this, field)
}

These inline functions make use of jOOQ’s plain SQL API which is heavily used by Java developers as well, but in this case, the keyword inline on the above functions indicates that the function works like an inline function. We can now write a query like this:

println("${ctx.select(b.TITLE)
              .from(b)
              .where(b.TITLE.ilike("%animal%"))
              .fetchOne(b.TITLE)}")

The result being:

Animal Farm

Notice how we can simply write b.TITLE.ilike("..") even if the jOOQ API doesn’t have such a method!

More info about this feature here

5. String Interpolation and Multiline Strings

How many times have we wished for multiline strings in Java? So many other languages have it, even SQL and PL/SQL. When working with string-based embedded SQL, this is just a killer feature. Good for Kotlin! Not only does Kotlin support multiline strings, but also string interpolation.

Here’s how to use the jOOQ plain SQL API with Kotlin:

ctx.resultQuery("""
    SELECT *
    FROM (
      VALUES (1, 'a'),
             (2, 'b')
    ) AS t
    """)
    .fetch()
    .forEach {
        println("${it.intoMap()}")
    }

Just copy-paste any arbitrary SQL statement right from your SQL console (in this case: H2 syntax) and you’re done. The output of the above is:

{C1=1, C2=a}
{C1=2, C2=b}

Bonus feature if you’re using the upcoming jOOQ 3.10 parser: You can standardise on SQL features that are not available from your database:

val colX = field("x")
val colY = field("y")
ctx.parser()
   .parseResultQuery("""
    SELECT *
    FROM (
      VALUES (1, 'a'),
             (2, 'b')
      -- This feature (derived column lists) 
      -- isn't really available in H2 yet it works!
    ) AS t(${colX.name}, ${colY.name})
    """)
   .fetch()
   .forEach {
       println("${it[colX]}, ${it[colY]}")
   }

The above statement makes use of derived column lists (renaming tables and columns in one go). Unfortunately, H2 doesn’t support this feature but jOOQ can emulate it, transparently. Note that we’re again using string interpolation, this time in order to reuse column names.

More info about the feature here.

6. Null Safe Dereferencing

Null is the mother of all bikesheds. Of course, Kotlin is opinionated about null as well

You’re not sure if your query will produce a result? No problem with Kotlin’s various null-related operator:

println("${ctx.fetchOne(b, b.ID.eq(5))?.title ?: "not found"}")

The above will print:

not found

Why? Because fetchOne() won’t return a record from our database, it will return null. We cannot dereference ?.title from null, but the ?. operator won’t throw an exception, it will just return null again. Finally, the Elvis operator will provide a default value if the expression on the left of it is null. Perfect!

More info about the feature here.

7. Getters and setters feel like properties

Remember the last 20 years when we emulated properties in Java using the obnoxious JavaBeans convention? Tired of writing getters and setters? Well, at least, with Kotlin, it pays off. We can syntactically pretend that we have a property where there are only getters and setters. Check this out:

val author1 = ctx.selectFrom(a).where(a.ID.eq(1)).fetchOne();
println("${author1.firstName} ${author1.lastName}")

val author2 = ctx.newRecord(a);
author2.firstName = "Alice"
author2.lastName = "Doe"
author2.store()
println("${author2.firstName} ${author2.lastName}")

Isn’t that awesome? we can just dereference author.firstName, which simply translates to author.getFirstName(). Likewise, we can assign a value to author2.lastName = "Doe", which simply translates to author2.setLastName("Doe")

It seems so obvious, no?

More info about the feature here

8. With to Abbreviate Setter Calls

We can take this feature one step further and use the nice with() inline function (not a keyword!). In the above example, when storing a new author, we have to repeat the author reference all the time. This is no longer true with… with!

val author3 = ctx.newRecord(a);
with (author3) {
    firstName = "Bob"
    lastName = "Doe"
    store()
}

All the code that is inside of the block supplied to the with() function is going to operate on the with() function’s first argument. We already know this interesting feature from JavaScript, but in Kotlin, it’s typesafe!

And notice, the store() call is also affected!

More info about this feature here

9. With to Locally Import Tables

We can use with() also with queries! In jOOQ, columns are instance members inside of tables, and as such, must be dereferenced from a concrete table. That can be tedious (SQL doesn’t have this requirement), especially when selecting only from a single table. You can turn on static member generation in jOOQ’s code generator, but then table aliasing is much less easy.

Or you use Kotlin again! The with() function can help:

with (AUTHOR) {
    ctx.select(FIRST_NAME, LAST_NAME)
       .from(AUTHOR)
       .where(ID.lt(5))
       .orderBy(ID)
       .fetch {
           println("${it[FIRST_NAME]} ${it[LAST_NAME]}")
       }
}

Resulting in

George Orwell
Paulo Coelho

Isn’t that nice? Now, all the FIRST_NAME, LAST_NAME, ID column references are dereferenced from AUTHOR in a much more lenient way.

More info about this feature here

10. Destructuring jOOQ records into sets of local variables

This is so great, we’re going to add more native support for it in jOOQ 3.10 right away.

Kotlin’s type destructuring feature is implemented following convention. This means that any type that has methods with the name component[N] can be destructured. We currently don’t have those methods, but they can be added really easily using inline operators:

operator fun <T, R : Record3<T, *, *>> R.component1() : T {
    return this.value1();
}

operator fun <T, R : Record3<*, T, *>> R.component2() : T {
    return this.value2();
}

operator fun <T, R : Record3<*, *, T>> R.component3() : T {
    return this.value3();
}

These inline functions are added to the R type, which is any subtype of Record3, and then they can be used as such:

for ((first, last, title) in ctx
   .select(a.FIRST_NAME, a.LAST_NAME, b.TITLE)
   .from(a)
   .join(b).on(a.ID.eq(b.AUTHOR_ID))
   .orderBy(1, 2, 3))
       println("$title by $first $last")

The result is again:

1984 by George Orwell
Animal Farm by George Orwell
Brida by Paulo Coelho
O Alquimista by Paulo Coelho

As can be seen, our query returns the type Record3<String, String, String>. Our inline functions will add component1(), component2(), and component3() methods to the type, which are used by the Kotlin language to destructure the record into three local loop variables: (first, last, title).

This is incredibly useful!

(note also that we don’t have to execute the query explicitly in the loop because any jOOQ ResultQuery is Iterable)

More info about this feature here

Conclusion

Kotlin is getting increasingly popular. It contains a lot of smart yet pragmatic features that make daily work with programming on the JVM a lot easier. Kotlin’s high focus on Java interoperability makes it really really powerful, and working with Kotlin and jOOQ is a really productive way of writing SQL on the JVM.

Do give it a shot. The sources are here:

https://github.com/jOOQ/jOOQ/tree/master/jOOQ-examples/jOOQ-kotlin-example

10 Tips on How to be a Great Programmer

I was recently asked in an interview about my opinion on how to be a great programmer. That’s an interesting question, and I think we can all be great programmers, regardless of our talent, if we follow a couple of rules that – I believe – should be common sense. In fact, these rules don’t all apply to programmers only, but to any professional.

Not everything in this list is meant to be taken entirely seriously, some things are just my opinions and yours may vary, and not all descriptions of programmer personalities match real-world situations I may have encountered, so if in doubt, please do not take offense. I didn’t mean you 🙂

Here they are:

1. Learn how to ask questions

There are essentially these types of programmers asking questions:

  • The perfectionist: Especially when asking a question about some open source tool, they may have already debugged through the code and found the real cause of their problem. But even if not, the perfectionist will write an introduction to the problem, steps to reproduce, potentially a suggested workaround, and as I said, potentially a suggested fix. In fact, the perfectionist doesn’t have questions. Only answers.
  • The chatterbox: This person will not really ask questions. Rather, they arrange their thoughts publicly and occasionally put a rhetorical question mark here and there. What may appear to be a question is really just a stream of thoughts and if you wait with the answer, they may either find the answer themselves, or ask the real question in email number 37. Or not. Or maybe, if we try it this way. You know what? It turned out that the requirement is completely wrong, I solved it with some other technique. Oh, actually, I changed libraries entirely. Hehe. No more questions.
  • The slacker: Here’s the code. What’s wrong? Halp plz.
  • The manager: For this type, time is money. The questions must be short and the answers ASAP. There’s something ironic about this approach, because by keeping the questions short (read: incomplete, not concise), most often, a lot of important details are missing, which only leads to requests for details. Then, the manager (naturally being disappointed because the reply is not an answer but a new question) will send yet again a short message and demand an answer with even more urgency. This can go back and forth for quite a while. In the end, it may take 1-2 weeks until what could’ve been the answer will actually be answered.
  • The complainer: This person doesn’t ask questions. They complain. Until the question goes away. Perhaps. If it doesn’t all the better. More reasons to complain.

By now, it should be clear that a well prepared question (concise, simple, short, yet with enough details) will yield much better answers. If you know what exactly you want to learn with your question, chances are, you’ll get exactly what you wanted.

2. Learn how to avoid asking questions

In fact, mostly, it’s better to try to avoid asking questions. Perhaps you can figure it out yourself? Not always, of course. Many things, you simply cannot know and by asking a domain expert, you’ll find the quickest and most efficient path to success. But often, trying something yourself has these benefits:

  • You learn it the “hard way” which is the way that sticks to our memory much better – we’ll remember what we’ve learned
  • It’s more rewarding to figure out stuff yourself
  • You don’t create “noise”. Remember the chatterbox? Unless the person you’re asking the question(s) is routinely answering questions (and thus postponing their answer), they may not see through your stream of thoughts and try to answer each individual incomplete “question”. That doesn’t help anyone.
  • By postponing a question (for a while, at least), you can gather more relevant information that you can provide to someone who might be able to answer your question. Think of the “perfectionist”. Spend more time looking for details first, then answer the question.
  • You’ll get better at asking questions by training to ask good questions. And that takes a bit more time.

3. Don’t leave broken windows

There was a very interesting article on reddit, recently about not leaving broken windows. The essence of the article is to never compromise on quality. To never become a slacker. To never leave … broken windows. Here’s a quote from the article

When we take certain shortcuts to deliver something in the shortest period of time, and our code reflects how careless we’ve been, developers coming after us (from the same team, a future team, and even ourselves!), will reach an important conclusion: it’s not important to pay enough attention to the code we produce. Little by little our application will start deteriorating in what it becomes an unstoppable process.

This isn’t about being a perfectionist, in fact. Sometimes, fixing broken windows can be postponed (and must be, of course). Oftentimes, by allowing windows to break and stay broken, we’re introducing a situation where no one is happy. Not us the programmers, not our customers, not our users, not our project managers. This is an attitude thing and thus at the core of being professional. How did Benjamin Franklin say?

The bitterness of poor quality remains long after the sweetness of low price is forgotten

That’s true for everything. In this case, “low price” is the quick win we might get by implementing something in a sloppy way.

4. Software ought to be deterministic. Aim for it!

In an ideal world, everything in software is “deterministic”. We’d all be functional programmers, writing side-effect free, pure functions. Like String.contains(). No matter how many times you execute the following:

assertTrue("abcde".contains("bc"));

… the result is always the same, expected result. The universe and all its statefulness will have absolutely no impact on this computation. It’s deterministic.

We can do that, too in our own programs, not just in standard libraries. We can try to write side-effect free, deterministic modules for as often as possible. This isn’t really a matter of what technology we choose. Deterministic programming can be done in any language – even if functional languages have more tools to prevent accidental side-effects through more sophisticated type systems. But the example I’ve shown is a Java example. Object orientation permits determinism. Heck, procedural languages like PL/SQL allow for determinism. It’s a requirement for a function to be deterministic, if it is to be used in an index:

CREATE INDEX upper_first_name ON customer (upper (first_name));
-- Deterministic function here: -----------^^^^^^^^^^^^^^^^^^

So again, this is a matter of discipline. You could see a side-effectful procedure / method / “function” as a “broken window”. Perhaps it was easier to maintain a side-effect, hoping that eventually, it can be removed. But that’s usually a lie. The price will be paid later, when non-determinism suddenly strikes. And it will.

5. Expect the unexpected

That previous link is key. Murphy’s Law is something we programmers should observe all the time. Everything can break. And it will. Come on, as software engineers, we know it will break. Because our world is not deterministic, neither are the business requirements that we’re implementing. We can implement tip #4 (determinism) only until it is no longer possible. From then on, we’ll inevitably enter the world of non-determinism (the “real world”), where stuff will go wrong. So, aim for it. Expect the unexpected. Train your inner pessimist to foresee all kinds of things.

How to write that pessimistic code in a concise manner is another story, of course. And how to distinguish things that will fail (and need to be dealt with) from things that might fail (and don’t need to be dealt with), that takes some practice 🙂

6. Never cargo cult. Never follow dogma. Always embrace: “It depends”

Everything you’ve been taught is potentially wrong. Including (or probably: especially) when someone really popular says it. Here’s a nice quote:

Our profession is full of hypocrisy. We like to think of ourselves as mathematicians, where only the purest of ideas persist, and they’re necessarily correct.

That’s wrong. Our profession is built on top of mathematics, but unless you go into the funky world of category theory or relational algebra (and even then, I doubt everything is “correct”), you’re in the pragmatic world of real-world business requirements. And that’s, quite frankly, very far from perfect. Let’s look at some of the most popular programming languages out there:

  • C
  • Java
  • SQL

Do you really think that these languages resemble mathematics in the least bit? If so, let’s discuss segmentation faults, Java generics, or SQL three-valued logic. These languages are platforms built by pragmatists. There is some really cool theoretical background in all of them, but ultimately, they had to do the job.

The same is true for everything built on top of languages. Libraries. Frameworks. Heck, even architectures. Design patterns. Nothing is right or wrong. Everything is a tool that was designed for some context. Think of the tool in its context. Never think of the tool as a standalone raison d’être. We’re not doing art for art’s sake.

So, say no to unquestioned:

  • XML
  • JSON
  • Functional programming
  • Object oriented programming
  • Design patterns
  • Microservices
  • Three tier architectures
  • DDD
  • TDD
  • In fact: *DD

All of these are nice tools given a certain context. They don’t always hold true. By staying curious and thinking out of the box, you’ll be a better programmer and know when to use which one of these tools.

7. Do it

True. There are luminaries out there who outperform everyone.

But most programmers are simply “good”. Or they have the potential to be “good”. How to be a good programmer? By doing it. Great software wasn’t written in a day, and popular people aren’t the only heroes of our time. I’ve met many great programmers no one knows about, because they lead private lives, solve private problems of small companies.

But great programmers all have one thing in common: They just do it. They practice. They work every day to get better and better.

Want to get better at SQL? Do it! Try to write a SQL statement with some new features in them, every day. Use window functions. Grouping sets. Recursion. Partitioned outer join. The MODEL and/or MATCH_RECOGNIZE clauses. It doesn’t have to ship to production every time, but the practice will be worth it.

8. Focus on one subject (in the long run)

There might be only very few “good” full stack developers out there. Most full stack developers will rather be mediocre at everything. Sure, a small team might need only a few of them and they can cover a lot of business logic to quickly bootstrap a piece of new software. But everything will be quite sloppy and “kinda works”. Perhaps that’s good enough for the minimum viable product phase, but in the long run, there will be more complex problems that a full stack developer will not have the time to properly analyse (or foresee!).

Focus on one subject mostly, and get really good at that. A specialist will always be in demand as long as that specialist’s niche exists, and many niches will outlive us all (hello COBOL or SQL). So, do your career a favour and do something really weel, rather than many things “just ok”.

9. Play

While you should mostly focus on one subject, you shouldn’t forget the other subjects completely. You may never be really really really good at all of SQL, scaling up, scaling out, low level performance, CSS (who’s good at that anyway!?), object orientation, requirements engineering, architecture, etc. all at once (see tip #8). It’s just not possible.

But you should at least understand the essence of each of these. You should understand when SQL is the right choice (and when it isn’t). When low level performance tuning is important (and when it isn’t). How CSS works in principle. The advantage of object orientation, FP, etc.

You should spend some time playing with these (and many more) concepts, technologies to better understand why they’re important. To know when to apply them, and then perhaps to find an expert to actually execute the work.

By playing around with new paradigms and technologies, you’ll open up your mind to entirely different ways of thinking, and chances are, you’ll be able to use that in your everyday work in one way or another.

10. Keep it simple, stupid

Albert Einstein said:

Everything should be made as simple as possible, but no simpler

No one is able to handle huge complexity. Not in software, not in any other aspect of life. Complexity is the killer of good software and thus simplicity is the enabler. Easy to understand. Hard to implement. Simplicity is something that takes a lot of time and practice to recognise and produce. There are many rules you can follow, of course.

One of the simplest rules is to have methods/functions with only a few parameters. Let’s look at that. Surely, the aforementioned String.contains() method qualifies. We can write "abcde".contains("bcd") and without reading any documentation, everyone will immediately understand what this does and why. The method does one thing and one thing only. There are no complicated context/settings/other arguments that can be passed to the method. There are no “special cases”, nor any caveats.

Again, producing simplicity in a library is much easier than doing that in business logic. Can we achieve it? Perhaps. By practicing. By refactoring. But like great software, simplicity is not built in a day.

(Protip: Conway’s Law applies. It is utterly impossible to write good, simple software in an environment where the business is super complex. Either, you like complexity and ugly legacy, or you better get out of that business).

The Difficulty of Tuning Queries Over a Database Link – Or How I Learned to Stop Worrying and Love the DUAL@LINK Table

A large-ish customer in banking (largest tables on that particular system: ~1 billion rows) once decided to separate the OLTP database from the “log database” in order to better use resources and prevent contention on some tables, as the append-only log database is used heavily for analytic querying of all sorts. That seems to make perfect sense. Except that sometimes, joins need to be done between “main database” and “log database” tables. This is when things get really hard to tune in Oracle – and probably in other databases too.

In this article, however, I’d like to focus on a much simpler example. One that seems to cause no trouble to the optimiser because all joined tables are from the “log database” only. Let’s use the following setup:

-- This is the database link
CREATE PUBLIC DATABASE LINK LOOPBACK 
CONNECT TO TEST IDENTIFIED BY TEST 
USING 'ORCLCDB';

-- Just making sure we get all statistics in execution plans
ALTER SESSION SET statistics_level = ALL;

And then, create this schema:

CREATE TABLE t (
  a INT NOT NULL,
  b INT NOT NULL,
  CONSTRAINT pk_t PRIMARY KEY (a)
);
CREATE TABLE u (
  a INT NOT NULL,
  b INT NOT NULL,
  CONSTRAINT pk_u PRIMARY KEY (a)
);

INSERT INTO t
SELECT
  level,
  level
FROM dual
CONNECT BY level <= 500000;

INSERT INTO u
SELECT
  level,
  level
FROM dual
CONNECT BY level <= 500000;

CREATE INDEX i_t ON t(b);

ALTER TABLE u ADD CONSTRAINT fk_u FOREIGN KEY (a) REFERENCES t;

EXEC dbms_stats.gather_table_stats('TEST', 'T');
EXEC dbms_stats.gather_table_stats('TEST', 'U');

It’s a really boring emulation of the real schema, and it doesn’t have nearly as many columns / rows. But the essence is:

  • There are (at least) two tables
  • Both have quite a few rows (that’s important here. I’ll show why, later)
  • We’ll use an index for searching rows
  • We’ll join by a one-to-many relationship

There may be other setups to reproduce the same issue, of course.

Now, let’s consider the following query (not using the database link yet).

SELECT CASE WHEN EXISTS (
  SELECT *
  FROM t 
  JOIN u USING (a)
  WHERE t.b BETWEEN 0 AND 1000
) THEN 1 ELSE 0 END
FROM dual

Unfortunately, Oracle doesn’t support boolean types and always requires a FROM clause. Otherwise, we could be writing this more concise version, as in PostgreSQL:

SELECT EXISTS (
  SELECT *
  FROM t 
  JOIN u USING (a)
  WHERE t.b BETWEEN 0 AND 1000
)

We’re checking for the existence of rows in both tables, given a predicate that runs on the previously created index.

As shown in a previous article, it’s much better to use EXISTS rather than COUNT(*), in pretty much all databases. The algorithm is optimal, because the usage of the EXISTS predicate hints to the optimiser that a SEMI JOIN can be used instead of an INNER JOIN

--------------------------------------------------------------------------
| Operation                            | Name | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------------
| SELECT STATEMENT                     |      |      1 |        |      1 |
|  NESTED LOOPS SEMI                   |      |      1 |      4 |      1 |
|   TABLE ACCESS BY INDEX ROWID BATCHED| T    |      1 |   1000 |      1 |
|    INDEX RANGE SCAN                  | I_T  |      1 |   1000 |      1 |
|   INDEX UNIQUE SCAN                  | PK_U |      1 |    333K|      1 |
|  FAST DUAL                           |      |      1 |      1 |      1 |
--------------------------------------------------------------------------

Some observations:

So, this is optimal (until I learn a new trick, of course).

Let’s bring in database links

Assuming that these two tables are on a remote database, we might naively proceed with writing this query:

SELECT CASE WHEN EXISTS (
  SELECT *
  FROM t@loopback 
  JOIN u@loopback USING (a)
  WHERE t.b BETWEEN 0 AND 1000
) THEN 1 ELSE 0 END
FROM dual;

So now, we’re selecting from T@LOOPBACK and U@LOOPBACK, but the rest is exactly the same. For the sake of simplicity, I’m running this reproduction on the same instance, thus “LOOPBACK”. The logical impact is the same, though.

------------------------------------------------------
| Operation        | Name | Starts | E-Rows | A-Rows |
------------------------------------------------------
| SELECT STATEMENT |      |      1 |        |      1 |
|  REMOTE          |      |      1 |        |      1 |
|  FAST DUAL       |      |      1 |      1 |      1 |
------------------------------------------------------

Interesting. Or rather: Not too interesting. Sure, our own database knows the correct estimate: 1 row that comes out of the EXISTS() predicate. But the interesting thing happens at the remote database. Let’s look at the plan there. The query being executed on the remote database is this:

SQL_ID  80fczs4r1c9yd, child number 0
-------------------------------------

SELECT 0 FROM "T" "A2","U" "A1" WHERE "A2"."B">=0 AND "A2"."B"=0

So, the EXISTS() predicate is not propagated to the remote database. Thus, the plan:

Plan hash value: 165433672
 
--------------------------------------------------------------------------
| Operation                            | Name | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------------
| SELECT STATEMENT                     |      |      1 |        |      1 |
|  HASH JOIN                           |      |      1 |   1000 |      1 |
|   TABLE ACCESS BY INDEX ROWID BATCHED| T    |      1 |   1000 |   1000 |
|    INDEX RANGE SCAN                  | I_T  |      1 |   1000 |   1000 |
|   INDEX FAST FULL SCAN               | PK_U |      1 |    500K|      1 |
--------------------------------------------------------------------------

Oops. Observations:

  • We’re now running a hash join (as expected, given the query that the remote database knows of)
  • We’re materialising the expected 1000 rows from the predicate on T.B
  • But we’re still not fetching all the expected 500,000 rows from the U table because the database that calls this query will abort as soon as it finds a single row

Huh. Bummer. So while we’re not running into a major catastrophe (of materialising all the rows from U), this is still far from optimal. The remote database has no knowledge at all of the fact that we’re going to be selecting 0 or 1 rows only, and that it thus should always run a SEMI JOIN.

You can try adding a /*+FIRST_ROWS(1)*/ hint, but that doesn’t work. It won’t make it to the remote database.

Arcane DUAL@LINK to the rescue

This is when I had an idea. The problem might just be the fact that Oracle always needs a FROM clause, even if it doesn’t make any sense here. So what if we use DUAL@LOOPBACK instead of DUAL. Because that DUAL table, technically, is a table on our own database, so even if it looks as though the entire query can be run on the remote database, that seems not to be true here! So let’s try this:

SELECT CASE WHEN EXISTS (
  SELECT *
  FROM t@loopback 
  JOIN u@loopback USING (a)
  WHERE t.b BETWEEN 0 AND 1000
) THEN 1 ELSE 0 END
FROM dual@loopback; -- Subtle difference here!

As I hoped, this subtle change leads to the EXISTS() predicate being sent to the remote database. The query executed on the remote database is now:

SQL_ID  9bz87xw0zc23c, child number 0
-------------------------------------
SELECT CASE  WHEN  EXISTS (SELECT 0 FROM "T" "A3","U" "A2" WHERE 
"A3"."B">=0 AND "A3"."B"<=1000 AND "A3"."A"="A2"."A") THEN 1 ELSE 0 END 
 FROM "DUAL" "A1"

And the plan, now again including the desired SEMI JOIN:

Plan hash value: 1561559448
 
--------------------------------------------------------------------------
| Operation                            | Name | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------------
| SELECT STATEMENT                     |      |      1 |        |      1 |
|  NESTED LOOPS SEMI                   |      |      1 |      4 |      1 |
|   TABLE ACCESS BY INDEX ROWID BATCHED| T    |      1 |   1000 |      1 |
|    INDEX RANGE SCAN                  | I_T  |      1 |   1000 |      1 |
|   INDEX UNIQUE SCAN                  | PK_U |      1 |    333K|      1 |
|  FAST DUAL                           |      |      1 |      1 |      1 |
--------------------------------------------------------------------------

Excellent!

Benchmark time

Plans and estimates are one thing. What ultimately counts to business is wall clock time. So, let’s try this again using a benchmark:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 100;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT CASE WHEN EXISTS (
          SELECT *
          FROM t 
          JOIN u USING (a)
          WHERE t.b BETWEEN 0 AND 1000
        ) THEN 1 ELSE 0 END
        FROM dual
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 1 : ' 
      || (SYSTIMESTAMP - v_ts));
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT CASE WHEN EXISTS (
          SELECT *
          FROM t@loopback 
          JOIN u@loopback USING (a)
          WHERE t.b BETWEEN 0 AND 1000
        ) THEN 1 ELSE 0 END
        FROM dual
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 2 : ' 
      || (SYSTIMESTAMP - v_ts));
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT CASE WHEN EXISTS (
          SELECT *
          FROM t@loopback 
          JOIN u@loopback USING (a)
          WHERE t.b BETWEEN 0 AND 1000
        ) THEN 1 ELSE 0 END
        FROM dual@loopback
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 3 : ' 
      || (SYSTIMESTAMP - v_ts));
    dbms_output.put_line('');
  END LOOP;
END;
/

And here are the resulting times:

Run 1, Statement 1 : +000000000 00:00:00.008110000
Run 1, Statement 2 : +000000000 00:00:00.213404000
Run 1, Statement 3 : +000000000 00:00:00.043044000

Run 2, Statement 1 : +000000000 00:00:00.003466000
Run 2, Statement 2 : +000000000 00:00:00.198487000
Run 2, Statement 3 : +000000000 00:00:00.042717000

Run 3, Statement 1 : +000000000 00:00:00.003077000
Run 3, Statement 2 : +000000000 00:00:00.191802000
Run 3, Statement 3 : +000000000 00:00:00.048740000

Run 4, Statement 1 : +000000000 00:00:00.005008000
Run 4, Statement 2 : +000000000 00:00:00.192828000
Run 4, Statement 3 : +000000000 00:00:00.043461000

Run 5, Statement 1 : +000000000 00:00:00.002970000
Run 5, Statement 2 : +000000000 00:00:00.190786000
Run 5, Statement 3 : +000000000 00:00:00.043910000

Clearly, not using the database link is always the fastest, roughly by a factor of 10 compared to the DUAL@LOOPBACK solution. But due to the system design, we don’t have this choice. Nonetheless, you can still see that DUAL@LOOPBACK consistently outperforms DUAL by another factor of around 5 as it still prevents the HASH JOIN!

Caveat: Small data != Big data

There, I said it. “Big Data”. Before, we had a predicate that ran on 1,000 rows in a 500,000 row strong table. Our customer had millions of rows. But what happens if you query small data sets? Let’s reduce the predicate to this:

WHERE t.b BETWEEN 0 AND 10

The benchmark result is now completely different:

Run 1, Statement 1 : +000000000 00:00:00.007093000
Run 1, Statement 2 : +000000000 00:00:00.047539000
Run 1, Statement 3 : +000000000 00:00:00.071546000

Run 2, Statement 1 : +000000000 00:00:00.003023000
Run 2, Statement 2 : +000000000 00:00:00.041259000
Run 2, Statement 3 : +000000000 00:00:00.052132000

Run 3, Statement 1 : +000000000 00:00:00.002767000
Run 3, Statement 2 : +000000000 00:00:00.034190000
Run 3, Statement 3 : +000000000 00:00:00.054023000

Run 4, Statement 1 : +000000000 00:00:00.003468000
Run 4, Statement 2 : +000000000 00:00:00.026141000
Run 4, Statement 3 : +000000000 00:00:00.047415000

Run 5, Statement 1 : +000000000 00:00:00.002818000
Run 5, Statement 2 : +000000000 00:00:00.026100000
Run 5, Statement 3 : +000000000 00:00:00.046875000

And as you can see, the DUAL@LOOPBACK solution actually worsens performance for these queries. The reason for this is that we’re now running, again, a NESTED LOOP JOIN (but not SEMI JOIN) rather than a HASH JOIN on the remote database:

Query on remote database:

SQL_ID  7349t2363uc9m, child number 0
-------------------------------------
SELECT 0 FROM "T" "A2","U" "A1" WHERE "A2"."B">=0 AND "A2"."B"=0

Plan on remote database:

Plan hash value: 2558931407
 
--------------------------------------------------------------------------
| Operation                            | Name | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------------
| SELECT STATEMENT                     |      |      1 |        |      1 |
|  NESTED LOOPS                        |      |      1 |     10 |      1 |
|   TABLE ACCESS BY INDEX ROWID BATCHED| T    |      1 |     10 |      1 |
|    INDEX RANGE SCAN                  | I_T  |      1 |     10 |      1 |
|   INDEX UNIQUE SCAN                  | PK_U |      1 |      1 |      1 |
--------------------------------------------------------------------------

I haven’t analysed what the reason for this difference is, as the difference is not significant enough, compared to the improvement for large data sets.

Conclusion

Tuning queries over database links is hard. Much much harder than tuning “ordinary” queries. Ideally, you’ll simply avoid database links and run all queries on a single instance. But sometimes that’s not possible.

In that case, the best solution is to move the logic to the remote query completely and collect only the result. Ideally, this is done using a stored procedure on the remote database and calculating this 1/0 result completely remotely. I think, hipsters these days call this a Microservice, or better, a Lambda:

CREATE FUNCTION t_u RETURN NUMBER IS
  v_result NUMBER;
BEGIN
  SELECT CASE WHEN EXISTS (
    SELECT *
    FROM t 
    JOIN u USING (a)
    WHERE t.b BETWEEN 0 AND 1000
  ) THEN 1 ELSE 0 END
  INTO v_result
  FROM dual;
  
  RETURN v_result;
END t_u;
/

Comparing the benchmark call with the other options:

T.B BETWEEN 0 AND 10

Statement 1 : +000000000 00:00:00.003022000 -- Local query
Statement 2 : +000000000 00:00:00.027416000 -- DUAL
Statement 3 : +000000000 00:00:00.043823000 -- Remote DUAL
Statement 4 : +000000000 00:00:00.022181000 -- Remote stored procedure

T.B BETWEEN 0 AND 1000

Statement 1 : +000000000 00:00:00.002877000 -- Local query
Statement 2 : +000000000 00:00:00.188588000 -- Local DUAL
Statement 3 : +000000000 00:00:00.050163000 -- Remote DUAL
Statement 4 : +000000000 00:00:00.018736000 -- Remote stored procedure

But that, too, might not be possible as you may not have the required rights to create stored procedures on that database. You could call DBMS_SQL on the remote database and run a PL/SQL block dynamically on the remote database (didn’t try that in my benchmark).

Or, you simply use an occasional DUAL@LINK, which might already do the trick with a minimal change to the original query.

Don’t Use the String Concatenation “Trick” in SQL Predicates

In SQL, quite often, we want to compare several values with each other. For instance, when we’re looking for a specific user by their first and last names, we’ll write a query like this one:

SELECT *
FROM customer
WHERE first_name = 'SUSAN'
AND last_name = 'WILSON';

We’re getting:

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          8   SUSAN        WILSON

Surely, everyone agrees that this is correct and perfectly fine as we probably have an index on these two columns (or on at least one of them) to speed up such queries:

CREATE INDEX idx_customer_name ON customer (last_name, first_name);

The execution plan is thus optimal, e.g. with Oracle:

-------------------------------------------------------------------------
| Id  | Operation                           | Name              | Rows  |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                   |       |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER          |     1 |
|*  2 |   INDEX RANGE SCAN                  | IDX_CUSTOMER_NAME |     1 |
-------------------------------------------------------------------------

But sometimes, we cannot use AND to connect two predicates. In particular, that’s not possible with an IN predicate, so people sometimes resort to using string concatenation, because that seems to work and make sense.

For instance, let’s find all customers whose first and last names matches those of an actor (as always, using the Sakila database)

SELECT *
FROM customer
WHERE first_name || last_name IN (
  SELECT first_name || last_name
  FROM actor
)

And yes indeed, what we’re getting here is the correct answer:

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          6   JENNIFER     DAVIS

But that answer is only accidentally correct!

Because we weren’t looking for customers called

first_name = 'JENNIFER' AND last_name = 'DAVIS'

We were looking for customers called

first_name || last_name = 'JENNIFERDAVIS'

Want proof? Let’s add a new customer:

INSERT INTO customer (customer_id, first_name, last_name )
VALUES               (600        , 'JENNI'   , 'FERDAVIS');

Yeah right? No one is called FERDAVIS. Or are they? As good programmers, we closely observe Murphy’s Law (i.e. always look both left and right when crossing a street).

In any case, let’s run our query again:

SELECT *
FROM customer
WHERE first_name || last_name IN (
  SELECT first_name || last_name
  FROM actor
)

And observe the result!

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          6   JENNIFER     DAVIS
        600   JENNI        FERDAVIS

Of course, because our predicate was really looking for customers called

first_name || last_name = 'JENNIFERDAVIS'

Which matches in both cases:

-- What we expected
first_name || last_name = 'JENNIFER' || 'DAVIS'

-- What we got
first_name || last_name = 'JENNI' || 'FERDAVIS'

Notice that I only added this customer to the customer table, not to the actor table. There’s no actor by the name FERDAVIS, so the result is clearly wrong.

AHA! Let’s use an “impossible” separator

So, we might proceed to fixing this as such:

SELECT *
FROM customer
WHERE first_name || '###' || last_name IN (
  SELECT first_name || '###' || last_name
  FROM actor
)

And now, the result is again correct. We get only JENNIFER DAVIS because we were looking for:

first_name || '###' || last_name = 'JENNIFER###DAVIS'

This works quite well for a while, as the separator is quite “impossible” (i.e. improbable) to be encountered in actual data. But we shouldn’t trust our judgement, because… Murphy’s Law. So you might think: better use a more rare separator, e.g. (if your database supports proper character sets)

SELECT *
FROM customer
WHERE first_name || '🙈🙉🙊' || last_name IN (
  SELECT first_name || '🙈🙉🙊' || last_name
  FROM actor
)

The use of emojis should indicate what my opinion of this approach is.

Too bad for performance, though

Remember that index we’ve created? Fact is, we also have such an index on the ACTOR table:

CREATE INDEX idx_actor_name ON actor (last_name, first_name);

And now, let’s assume our query is a bit different. We’ll be looking only for customers whose address_id is 10:

SELECT *
FROM customer
WHERE address_id = 10
AND first_name || '🙈🙉🙊' || last_name IN (
  SELECT first_name || '🙈🙉🙊' || last_name
  FROM actor
)

Now, our querymoji is using the index indeed, but for an INDEX FULL SCAN, so it’s only slightly faster than scanning the entire actor table:

-----------------------------------------------------------------------------------
| Id  | Operation                            | Name                       | Rows  |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                            |       |
|*  1 |  HASH JOIN SEMI                      |                            |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER                   |     1 |
|*  3 |    INDEX RANGE SCAN                  | IDX_CUSTOMER_FK_ADDRESS_ID |     1 |
|   4 |   INDEX FULL SCAN                    | IDX_ACTOR_NAME             |     2 |
-----------------------------------------------------------------------------------

And what’s worse, even if all the cardinality estimates correctly indicate only 1-2 rows, we’ll perform a HASH JOIN and load the full index for it! We should be running a NESTED LOOP instead.

Is there a better way? Yes! Use row constructors to compare several values at once:

SELECT *
FROM customer
WHERE address_id = 10
AND (first_name, last_name) IN (
  SELECT first_name, last_name
  FROM actor
);

Or, if your database doesn’t support this syntax (luckily, Oracle and PostgreSQL do, for instance), then you can resort to an equivalent EXISTS predicate

SELECT *
FROM customer c
WHERE address_id = 10
AND EXISTS (
  SELECT 1
  FROM actor a
  WHERE c.first_name = a.first_name
  AND c.last_name = a.last_name
);

Both of these queries are exactly equivalent and result in a nested loop semi join, rather than the previous hash join, which is perfectly reasonable for these small tables. We can now use the IDX_ACTOR_NAME for a quick INDEX RANGE SCAN operation:

-----------------------------------------------------------------------------------
| Id  | Operation                            | Name                       | Rows  |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                            |       |
|   1 |  NESTED LOOPS SEMI                   |                            |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER                   |     1 |
|*  3 |    INDEX RANGE SCAN                  | IDX_CUSTOMER_FK_ADDRESS_ID |     1 |
|*  4 |   INDEX RANGE SCAN                   | IDX_ACTOR_NAME             |     1 |
-----------------------------------------------------------------------------------

But let’s not trust the estimated plans. Let’s benchmark (more info about benchmarking SQL here)

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 2500;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name
        FROM customer
        WHERE address_id = 10
        AND first_name || '###' || last_name IN (
          SELECT first_name || '###' || last_name
          FROM actor
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 1 : ' 
      || (SYSTIMESTAMP - v_ts));
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name
        FROM customer
        WHERE address_id = 10
        AND (first_name, last_name) IN (
          SELECT first_name, last_name
          FROM actor
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 2 : '
      || (SYSTIMESTAMP - v_ts));
  END LOOP;
END;
/

As can be seen here, the benchmark shows that the query using the row constructor is drastically faster as it can properly use the index as it should:

Run 1, Statement 1 : +000000000 00:00:00.374471000
Run 1, Statement 2 : +000000000 00:00:00.062830000
Run 2, Statement 1 : +000000000 00:00:00.364168000
Run 2, Statement 2 : +000000000 00:00:00.066252000
Run 3, Statement 1 : +000000000 00:00:00.359559000
Run 3, Statement 2 : +000000000 00:00:00.063898000
Run 4, Statement 1 : +000000000 00:00:00.344775000
Run 4, Statement 2 : +000000000 00:00:00.086060000
Run 5, Statement 1 : +000000000 00:00:00.394163000
Run 5, Statement 2 : +000000000 00:00:00.063176000

Now, imagine we were running this against some much more impressive data sets than the Sakila database

Conclusion

If you’re ever thinking about concatenating two fields for a comparison, try again. There are two major caveats that should indicate you’re about to do something silly:

  • There’s a major risk of your query being subtly wrong (accidental matches between JENNIFER DAVIS and JENNI FERDAVIS)
  • There’s a major risk of your query being quite slow

So, as a rule of thumb, don’t use concatenation in predicates. There’s (almost) always a better way.

Read also: Why You Should (Sometimes) Avoid Expressions in SQL Predicates