How to Prevent Execution Plan Troubles when Querying Skewed Data, with jOOQ

One of the biggest advantages of using jOOQ is that you can change all of your complex application’s generated SQL with just a few lines of code. In this article, we’ll look into how to solve some common bind peeking issues just like that, without touching your application code, without the need to explain this advanced SQL performance issue to every team member, and best of all: for good.

What are Bind Values Good For?

First of all, bind variables are a good thing. They:

The latter bullet doesn’t affect all dialects, just like this article doesn’t. Commercial DBMS like Oracle, SQL Server, etc. ship with a powerful execution plan cache. They are designed to run thousands of distinct queries with very complex plans. Planning these queries takes time (I’ve seen Oracle SQL being planned for 5 seconds!) and you don’t want the DBMS to re-execute this planning work every time the query is run, which may be thousands of times per second.

Usually, this execution plan cache takes the SQL string (or a hash of it), and associates meta data like alternative possible execution plans to it. When the SQL string changes, the cache lookup fails and the “new” query has to be planned again.

I say “new”, because it might be the “same” query to the user, even if the string is different. For example:

SELECT * FROM book WHERE id = 1;
SELECT * FROM book WHERE id = 2;

Now we have two times the “same” query, but each of them are “new”. Oracle will plan them both afresh. So, we use bind variables, instead:

SELECT * FROM book WHERE id = ?;
SELECT * FROM book WHERE id = ?;

What is the Bind Peeking Issue?

In some cases, the cached plan is not optimal. This is the case when the actual bind value is significant to the planning, e.g. the value 1 would produce a vastly different plan than the value 2, or much more likely the value DELETED produces a different plan from PROCESSED or NEW.

This problem has been discussed in our previous blog post Why You Should Design Your Database to Optimise for Statistics.

“Bind Peeking” is a technique by the Oracle database (it is done by others, too, but maybe not called “Bind Peeking”) to have a “peek” at the bind variable to get a more accurate plan than the average plan if we don’t know the bind value. This can go both ways, good or bad, so there were numerous fixes / patches / workarounds in previous Oracle versions. Some interesting articles on the topic:

Databases slowly get to truly adaptive query execution models where execution plans can be fixed in-flight when the estimates were clearly wrong. Db2 is quite strong at this, and Oracle is getting better.

But even then, some times planners get it wrong, simply because they can’t reasonably estimate the cardinalities produced by a simple predicate like

WHERE x = ?

… just because the overall query is very complex and some SQL transformation did not apply.

Preventing the Problem by Avoiding Bind Values

Again. Please use bind values by default. They’re a good thing by default. Not all data is as skewed as the one I presented in my other blog post. But some data is almost always skewed: Enumeration types.

When you have an enum like:

enum ProcessingState {
  NEW,
  PROCESSING,
  EXECUTED,
  DELETED
}

Or in PostgreSQL:

CREATE TYPE processing_state AS ENUM (
  'new',
  'processing',
  'executed',
  'deleted'
);

Or even just encoded as a CHECK constraint:

CREATE TABLE transaction (
  -- ...
  processing_state VARCHAR(10) CHECK (processing_state IN (
    'new',
    'processing',
    'executed',
    'deleted'
  ))
  -- ...
);

In that case, you will very likely have highly skewed data. For example, a quick query might yield:

SELECT processing_state, count(*)
FROM transaction
GROUP BY processing_state

Resulting in:

+------------------+----------+
| processing_state |    count |
+------------------+----------+
| new              |    10234 |
| processing       |       15 |
| executed         | 17581684 |
| deleted          |    83193 |
+------------------+----------+

Now, do you think you’ll profit from indexing PROCESSING_STATE equally, when looking for NEW or PROCESSING versus when looking for EXECUTED values? Do you want the same plans? Do you want an average plan, which might not use the index, when in fact you should use it (looking for PROCESSING)?

Not only that, but it’s also quite unlikely that your queries are so generic as that the individual PROCESSING_STATE values can be used interchangeably. For example, a query looking for DELETED states might be run by a housekeeping batch job that wants to remove the logically deleted transactions for good. It will never query anything other than DELETED states. So, might as well inline, right?

Now, if you write a query like this:

SELECT *
FROM transaction
WHERE processing_state = 'processing';

With jOOQ, you can create an “inline” on a per-query basis using DSL.inline("processing") (as opposed to DSL.val("processing"), which is used by default, implicitly:

// These generate a ? bind value
ctx.selectFrom(TRANSACTION)
   .where(TRANSACTION.PROCESSING_STATE.eq("processing"))
   .fetch();

ctx.selectFrom(TRANSACTION)
   .where(TRANSACTION.PROCESSING_STATE.eq(val("processing")))
   .fetch();

// This creates an inline 'processing' literal
ctx.selectFrom(TRANSACTION)
   .where(TRANSACTION.PROCESSING_STATE.eq(inline("processing")))
   .fetch();

As always, assuming this static import:

import static org.jooq.impl.DSL.*;

But now, you have to think of doing that every single time you query PROCESSING_STATE, and all the other similar columns.

Prevent it for Good

Much better, prevent it for good. You can create a very simple Binding like this:

class AlwaysInlineStringBinding implements Binding<String, String> {

    final Binding<?, String> delegate = VARCHAR.getBinding();

    @Override
    public Converter<String, String> converter() {
        return Converters.identity(String.class);
    }

    @Override
    public void sql(BindingSQLContext<String> ctx) 
    throws SQLException {
        ctx.render().visit(inline(ctx.value()));
    }

    @Override
    public void register(BindingRegisterContext<String> ctx) 
    throws SQLException {
        delegate.register(ctx);
    }

    // No need to set anything
    @Override
    public void set(BindingSetStatementContext<String> ctx) 
    throws SQLException {}

    @Override
    public void set(BindingSetSQLOutputContext<String> ctx) 
    throws SQLException {
        delegate.set(ctx);
    }

    @Override
    public void get(BindingGetResultSetContext<String> ctx) 
    throws SQLException {
        delegate.get(ctx);
    }

    @Override
    public void get(BindingGetStatementContext<String> ctx) 
    throws SQLException {
        delegate.get(ctx);
    }

    @Override
    public void get(BindingGetSQLInputContext<String> ctx) 
    throws SQLException {
        delegate.get(ctx);
    }
}

Or, starting from jOOQ 3.15, even simpler, and generic:

class AlwaysInlineStringBinding 
extends DefaultBinding<String, String> {
    public AlwaysInlineStringBinding() {
        super(DefaultBinding.binding(VARCHAR));
    }

    @Override
    public void sql(BindingSQLContext<String> ctx) 
    throws SQLException {
        ctx.render().visit(inline(ctx.value()));
    }

    // No need to set anything
    @Override
    public void set(BindingSetStatementContext<T> ctx) 
    throws SQLException {}
}

Or even generic:

class AlwaysInlineBinding<T> extends DefaultBinding<T, T> {
    public AlwaysInlineBinding(DataType<T> type) {
        super(DefaultBinding.binding(type));
    }

    @Override
    public void sql(BindingSQLContext<T> ctx) 
    throws SQLException {
        ctx.render().visit(inline(ctx.value()));
    }

    // No need to set anything
    @Override
    public void set(BindingSetStatementContext<T> ctx) 
    throws SQLException {}
}

All this does is generate inline values instead of the ? bind parameter marker, and skip setting any value to the JDBC PreparedStatement (or reactive R2DBC Statement, starting from jOOQ 3.15. This will work just the same!)

Try it very easily, yourself (using the jOOQ 3.15 version):

@Test
public void testAlwaysInlineBinding() {
    DSLContext ctx = DSL.using(DEFAULT);
    DataType<Integer> t = INTEGER.asConvertedDataType(
        new AlwaysInlineBinding<>(INTEGER));

    Field<Integer> i = field("i", INTEGER);
    Field<Integer> j = field("j", t);
    Param<Integer> a = val(1);
    Param<Integer> b = val(1, INTEGER.asConvertedDataType(
        new AlwaysInlineBinding<>(INTEGER)));

    // Bind value by default
    assertEquals("?", ctx.render(a));
    assertEquals("1", ctx.renderInlined(a));
    assertEquals("1", ctx.render(b));
    assertEquals("1", ctx.renderInlined(b));

    // Bind value by default in predicates
    assertEquals("i = ?", ctx.render(i.eq(a)));
    assertEquals("i = 1", ctx.renderInlined(i.eq(a)));
    assertEquals("i = 1", ctx.render(i.eq(b)));
    assertEquals("i = 1", ctx.renderInlined(i.eq(b)));
    assertEquals("i = ?", ctx.render(i.eq(1)));
    assertEquals("i = 1", ctx.renderInlined(i.eq(1)));

    // No more bind values in predicates!
    assertEquals("j = 1", ctx.render(j.eq(a)));
    assertEquals("j = 1", ctx.renderInlined(j.eq(a)));
    assertEquals("j = 1", ctx.render(j.eq(b)));
    assertEquals("j = 1", ctx.renderInlined(j.eq(b)));
    assertEquals("j = 1", ctx.render(j.eq(1)));
    assertEquals("j = 1", ctx.renderInlined(j.eq(1)));
}

Of course, instead of doing the above programmatically, you’ll attach this Binding to all relevant columns using the code generator’s forced type configuration

Conclusion

Please use bind values by default. Both in jOOQ and elsewhere. It’s a very good default.

But sometimes, your data is skewed, and you as a developer, you probably know that. In those cases, sometimes, “inline values” as we call them (or constants, literals, etc.) may be the better choice to help the optimiser get better at estimating. Even if the optimiser’s estimates are good the first time, the plan may switch in production for some weird reason, including some plans being purged because the cache is full, or the DBA clicked a button, or whatever.

And that’s when your query might be unnecessarily slow all of a sudden. No more need. When you have enum types, or similar, just use the above simple trick, apply to all of your schema where it makes sense, and forget this problem for good.

Side note:

Of course, the other way round is just as easy. When you have inline literals that you want to switch over to bind values, you can do it just the same way, e.g. when you use jOOQ’s parsing connection to translate between dialects, or to patch your wrong ORM-generated SQL!

PostgreSQL’s Best-Kept Secret, and how to Use it with jOOQ

PostgreSQL has a lot of secret data types. In recent times, PostgreSQL’s JSON and JSONB support was hyped as being the NoSQL on SQL secret (e.g. as advertised by ToroDB) that allows you to get the best out of both worlds. But there are many other useful data types, among which the range type.

How does the range type work?

Ranges are very useful for things like age, date, price intervals etc. Let’s assume the following table:

CREATE TABLE age_categories (
  name VARCHAR(50),
  ages INT4RANGE
);

We can now fill the above table as such:

INSERT INTO age_categories
VALUES ('Foetus',   int4range(-1, 0)),
       ('Newborn',  int4range(0, 1)),
       ('Infant',   int4range(1, 2)),
       ('Kid',      int4range(2, 12)),
       ('Teenager', int4range(12, 20)),
       ('Tween',    int4range(20, 30)),
       ('Adult',    int4range(30, 60)),
       ('Senior',   int4range(60, 90)),
       ('Ancient',  int4range(90, 999));

And query it, e.g. by taking my age:

SELECT name
FROM age_categories
WHERE range_contains_elem(ages, 33);

… yielding

name
-----
Adult

There are a lot of other useful functions involved with range calculations. For instance, we can get the age span of a set of categories. Let’s assume we want to have the age span of Kid, Teenager, Senior. Let’s query:

SELECT int4range(min(lower(ages)), max(upper(ages)))
FROM age_categories
WHERE name IN ('Kid', 'Teenager', 'Senior');

… yielding

int4range
---------
[2,90]

And now, let’s go back to fetching all the categories that are within that range:

SELECT name
FROM age_categories
WHERE range_overlaps(ages, (
  SELECT int4range(min(lower(ages)), max(upper(ages)))
  FROM age_categories
  WHERE name IN ('Kid', 'Teenager', 'Senior')
));

… yielding

name
--------
Kid
Teenager
Tween
Adult
Senior

All of this could have been implemented with values contained in two separate columns, but using ranges is just much more expressive for range arithmetic.

How to use these types with jOOQ?

jOOQ doesn’t include built-in support for these advanced data types, but it allows you to bind these data types to your own, custom representation.

A good representation of PostgreSQL’s range types in Java would be jOOλ’s org.jooq.lambda.tuple.Range type, but you could also simply use int[] or Map.Entry for ranges. When we’re using jOOλ’s Range type, the idea is to be able to run the following statements using jOOQ:

// Assuming this static import:
import static org.jooq.lambda.tuple.Tuple.*;

DSL.using(configuration)
   .insertInto(AGE_CATEGORIES)
   .columns(AGE_CATEGORIES.NAME, AGE_CATEGORIES.AGES)
   .values("Foetus",   range(-1, 0))
   .values("Newborn",  range(0, 1))
   .values("Infant",   range(1, 2))
   .values("Kid",      range(2, 12))
   .values("Teenager", range(12, 20))
   .values("Tween",    range(20, 30))
   .values("Adult",    range(30, 60))
   .values("Senior",   range(60, 90))
   .values("Ancient",  range(90, 999))
   .execute();

And querying…

DSL.using(configuration)
   .select(AGE_CATEGORIES.NAME)
   .from(AGE_CATEGORIES)
   .where(rangeContainsElem(AGE_CATEGORIES.AGES, 33))
   .fetch();

DSL.using(configuration)
   .select(AGE_CATEGORIES.NAME)
   .where(rangeOverlaps(AGE_CATEGORIES.AGES,
      select(int4range(min(lower(AGE_CATEGORIES.AGES)),
                       max(upper(AGE_CATEGORIES.AGES))))
     .from(AGE_CATEGORIES)
     .where(AGE_CATEGORIES.NAME.in(
       "Kid", "Teenager", "Senior"
     ))
   ))
   .fetch();

As always, the idea with jOOQ is that the SQL you want to get as output is the SQL you want to write in Java, type safely. In order to be able to write the above, we will need to implement 1-2 missing pieces. First off, we’ll need to create a data type binding (org.jooq.Binding) as described in this section of the manual. The binding can be written as such, using the following Converter:

public class Int4RangeConverter 
implements Converter<Object, Range<Integer>> {
    private static final Pattern PATTERN = 
        Pattern.compile("\\[(.*?),(.*?)\\)");

    @Override
    public Range<Integer> from(Object t) {
        if (t == null)
            return null;

        Matcher m = PATTERN.matcher("" + t);
        if (m.find())
            return Tuple.range(
                Integer.valueOf(m.group(1)), 
                Integer.valueOf(m.group(2)));

        throw new IllegalArgumentException(
            "Unsupported range : " + t);
    }

    @Override
    public Object to(Range<Integer> u) {
        return u == null 
            ? null 
            : "[" + u.v1 + "," + u.v2 + ")";
    }

    @Override
    public Class<Object> fromType() {
        return Object.class;
    }

    @SuppressWarnings({ "unchecked", "rawtypes" })
    @Override
    public Class<Range<Integer>> toType() {
        return (Class) Range.class;
    }
}

… and the Converter can then be re-used in a Binding:

public class PostgresInt4RangeBinding 
implements Binding<Object, Range<Integer>> {

    @Override
    public Converter<Object, Range<Integer>> converter() {
        return new Int4RangeConverter();
    }

    @Override
    public void sql(BindingSQLContext<Range<Integer>> ctx) throws SQLException {
        ctx.render()
           .visit(DSL.val(ctx.convert(converter()).value()))
           .sql("::int4range");
    }

    // ...
}

The important thing in the binding is just that every bind variable needs to be encoded in PostgreSQL’s string format for range types (i.e. [lower, upper)) and cast explicitly to ?::int4range, that’s all.

You can then configure your code generator to use those types, e.g. on all columns that are called [xxx]_RANGE

<customType>
    <name>com.example.PostgresInt4RangeBinding</name>
    <type>org.jooq.lambda.tuple.Range&lt;Integer></type>
    <binding>com.example.PostgresInt4RangeBinding</binding>
</customType>
<forcedType>
    <name>com.example.PostgresInt4RangeBinding</name>
    <expression>.*?_RANGE</expression>
</forcedType>

The last missing thing now are the functions that you need to compare ranges, i.e.:

  • rangeContainsElem()
  • rangeOverlaps()
  • int4range()
  • lower()
  • upper()

These are written quickly:

static <T extends Comparable<T>> Condition 
    rangeContainsElem(Field<Range<T>> f1, T e) {
    return DSL.condition("range_contains_elem({0}, {1})", f1, val(e));
}

static <T extends Comparable<T>> Condition 
    rangeOverlaps(Field<Range<T>> f1, Range<T> f2) {
    return DSL.condition("range_overlaps({0}, {1})", f1, val(f2, f1.getDataType()));
}

Conclusion

Writing an extension for jOOQ data types takes a bit of time and effort, but it is really easy to achieve and will allow you to write very powerful queries in a type safe way. Once you’ve set up all the data types and the bindings, your generated source code will reflect the actual data type from your database and you will be able to write powerful queries in a type safe way directly in Java. Let’s consider again the plain SQL and the jOOQ version:

SQL

SELECT name
FROM age_categories
WHERE range_overlaps(ages, (
  SELECT int4range(min(lower(ages)), max(upper(ages)))
  FROM age_categories
  WHERE name IN ('Kid', 'Teenager', 'Senior')
));

jOOQ

DSL.using(configuration)
   .select(AGE_CATEGORIES.NAME)
   .where(rangeOverlaps(AGE_CATEGORIES.AGES,
      select(int4range(min(lower(AGE_CATEGORIES.AGES)),
                       max(upper(AGE_CATEGORIES.AGES))))
     .from(AGE_CATEGORIES)
     .where(AGE_CATEGORIES.NAME.in(
       "Kid", "Teenager", "Senior"
     ))
   ))
   .fetch();

More information about data type bindings can be found in the jOOQ manual.