Squeezing Another 10% Speed Increase out of jOOQ using JMC and JMH

In this post, we’re going to discuss a couple of recent efforts to squeeze roughly 10% in terms of speed out of jOOQ by iterating on hotspots that were detected using JMC (Java Mission Control) and then validated using JMH (Java Microbenchmark Harness). This post shows how to apply micro optimisations to algorithms where the smallest improvement can have a significant effect.

While JMH is probably without competition, JMC could easily be replaced by JProfiler, YourKit, or even your own manual jstack sampling. I’ll just use JMC because it ships with the JDK and is free for use for development as of JDK 8 and 9 (if you’re unsure whether you’re “developing”, better ask Oracle). Rumours have it that JMC might be contributed to the OpenJDK in the near future.

Micro optimisations

Micro optimisations are a cool technique to squeeze a very small improvement out of a local algorithm (e.g. a loop) that has a significant effect on the entire application / library, because of the fact that the local algorithm is called many times. This is absolutely the case in jOOQ, which is essentially a library that always runs 4 nested loops:

  1. S: A “loop” over all possible SQL statements
  2. E: A “loop” over all executions of such a statement
  3. R: A loop over all rows in the result
  4. C: A loop over all columns in a row

Such four level nested loops result in what we could call a polynomial complexity of our algorithms, even if we cannot call the complexity O(N4) (as the 4 “N” are not all the same), it is certainly of O(S x E x R x C) (I’ll call this “S-E-R-C loops” further down). Even to the untrained eye, it becomes evident that anything that happens in the inner-most “C-loop” can have devastating effects. We better not be opening any files here, that could be opened outside of, e.g. the “S-loop”

In a previous blog post, we’ve discussed common techniques of optimising such situations. In this blog post, we’ll look into a couple of concrete examples.

How to discover flaws in these loops?

We’re looking for the problems that affect all users, the kind of problem that, once fixed, will improve jOOQ’s performance for everyone by e.g. 10%. This is similar to what the JIT does, by performing things like stack allocation, inlining, which don’t drastically improve things locally, but do so globally, and for everyone. Here’s an interesting guest post by Tagir Valeev on JIT optimisation, and how good it is.

Getting a large “S-loop”

The first option is to run profiling sessions on benchmarks. We could, for example, run the entire “S-E-R-C loops” in a JMC profiling session, where the “S-loop” is a loop over all our statements, or in other words, over all our integration tests. Unfortunately, with this approach, our “E-loop” (in the case of jOOQ’s integration tests) is a single execution per statement. We’d have to run the integration tests many, many times in order to get meaningful results.

Also, while the jOOQ integration tests run thousands of distinct queries, most queries are still rather simple, each one focusing on an individual SQL feature (e.g. lateral join). In a end user application, queries might use less specific features, but are much more complex, i.e. they have a lot of ordinary joins.

This technique is useful to find problems that appear in all queries, deep down inside of jOOQ – e.g. at the JDBC interface. But we cannot use this approach to test individual features.

Getting a large “E-loop”

Another option is to write a single test that runs a few statements (small “S-loop”) many times in an explicit loop (large “E-loop”). This has the advantage that a specific bottleneck can be found with a high confidence, but the drawback is: It’s specific. For instance, if we find a small bottleneck in the string concatenation function, well, that is certainly worth fixing, but doesn’t affect most users.

This approach is useful to test individual features. It can also be useful for finding issues that affect all queries, but with a lower confidence than the previous case, where the “S-loop” is maximised.

Getting large “R-loops” and “C-loops”

Creating large result sets is easy and should definitely be part of such benchmarks, because in the case of a large result set, any flaw will multiply drastically, so fixing these things is worthwhile. However, these problems only affect actual result sets, not the query building process or the execution process. Sure, most statements are probably queries, not insertions / updates, etc. But this needs to be kept in mind.

Optimising for problems in large “E-loops”

All of the above scenarios are different optimisation sessions and deserve their own blog posts. In this post, I’m describing what has been discovered and fixed when running a single query 3 million times on an H2 database. The H2 database is chosen here, because it can run in memory of the same process and thus has the least extra overhead compared to jOOQ – so jOOQ’s overhead contributions become significant in a profiling session / benchmark. In fact, it can be shown that in such a benchmark, jOOQ (or Hibernate, etc.) appears to perform quite poorly compared to a JDBC only solution, as many have done before.

This is an important moment to remind ourselves:

Benchmarks do not reflect real-world use cases! You will never run the exact same query 3 million times on a production system, and your production system doesn’t run on H2.

A benchmark profits from so much caching, buffering, you would never perform as fast as in a benchmark.

Always be careful not to draw any wrong conclusions from a benchmark!

This needs to be said, so take every benchmark you find on the web with a grain of salt. This includes our own!

The query being profiled is:

ctx.select(
      AUTHOR.FIRST_NAME,
      AUTHOR.LAST_NAME,
      BOOK.ID,
      BOOK.TITLE)
   .from(BOOK)
   .join(AUTHOR).on(BOOK.AUTHOR_ID.eq(AUTHOR.ID))
   .where(BOOK.ID.eq(1))
   .and(BOOK.TITLE.isNull().or(BOOK.TITLE.ne(randomValue)));

The trivial query returns a ridiculous 4 rows and 4 columns, so the “R-loop” and “C-loops” are negligible. This benchmark is really testing the overhead of jOOQ query execution in a case where the database does not contribute much to the execution time. Again, in a real world scenario, you will get much more overhead from your database.

In the following sections, I’ll show a few minor bottlenecks that could be found when drilling down into these such execution scenarios. As I’ve switched between JMC versions, the screenshots will not always be the same, I’m afraid.

1. Instance allocation of constant values

A very silly mistake was easily discovered right away:

The mistake didn’t contribute a whole lot of overhead, only 1.1% to the sampled time spent, but it made me curious. In version 3.10 of jOOQ, the SelectQueryImpl‘s Limit class, which encodes the jOOQ OFFSET / LIMIT behaviour kept allocating this DSL.val() thingy, which is a bind variable. Sure, limits do work with bind variables, but this happened when SelectQueryImpl was initialised, not when the LIMIT clause is added by the jOOQ API user.

As can be seen in the sources, the following logic was there:

private static final Field<Integer> ZERO              = zero();
private static final Field<Integer> ONE               = one();
private Field<Integer>              numberOfRowsOrMax = 
    DSL.inline(Integer.MAX_VALUE);

While the “special limits” ZERO and ONE were static members, the numberOfRowsOrMax value wasn’t. That’s the instantiation we were measuring in JMC. The member is not a constant, but the default value is. It is always initialised with Integer.MAX_VALUE wrapped in an DSL.inline() call. The solution is really simple:

private static final Param<Integer> MAX               = 
    DSL.inline(Integer.MAX_VALUE);
private Field<Integer>              numberOfRowsOrMax = MAX;

This is obviously better! Not only does it avoid the allocation of the bind variable, it also avoids the boxing of Integer.MAX_VALUE (which can also be seen in the sampling screenshot).

Note, a similar optimisation is available in the JDK’s ArrayList. When you look at the sources, you’ll see:

/**
 * Shared empty array instance used for empty instances.
 */
private static final Object[] EMPTY_ELEMENTDATA = {};

When you initialise an ArrayList without initial capacity, it will reference this shared instance, instead of creating a new, empty (or even non-empty) array. This delays the allocation of such an array until we actually add things to the ArrayList, just in case it stays empty.

jOOQ’s LIMIT is the same. Most queries might not have a LIMIT, so better not allocate that MAX_VALUE afresh!

This is done once per “E-loop” iteration

One issue down: https://github.com/jOOQ/jOOQ/issues/6635

2. Copying lists in internals

This is really a micro optimisation that you probably shouldn’t do in ordinary business logic. But it might be worthwhile in infrastructure logic, e.g. when you’re also in an “S-E-R-C loop”:

jOOQ (unfortunately) occasionally copies data around between arrays, e.g. wrapping Strings in jOOQ wrapper types, transforming numbers to strings, etc. These loops aren’t bad per se, but remember, we’re inside some level of the “S-E-R-C loop”, so these copying operations might be run hundreds of millions of times when we run a statement 3 million times.

The above loop didn’t contribute a lot of overhead, and possible the cloned object was stack allocated or the clone call eliminated by the JIT. But maybe it wasn’t. The QualifiedName class cloned its argument prior to returning it to make sure that no accidental modifications will have any side effect:

private static final String[] nonEmpty(String[] qualifiedName) {
    String[] result;
    ...
    if (nulls > 0) {
        result = new String[qualifiedName.length - nulls];
        ...
    }
    else {
        result = qualifiedName.clone();
    }
    return result;
}

So, the implementation of the method guaranteed a new array as a result.

After a bit of analysis, it could be seen that there is only a single consumer of this method, and it doesn’t leave that consumer. So, it’s safe to remove the clone call. Probably, the utility was refactored from a more general purpose method into this local usage.

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6640

3. Running checks in loops

This one is too silly to be true:

There’s a costly overhead in the CombinedCondition constructor (<init> method). Notice, how the samples drop from 0.47% to 0.32% between the constructor and the next method init(), that’s the time spent inside the constructor.

A tiny amount of time, but this time is spent every time someone combines two conditions / predicates with AND and OR. Every time. We can probably save this time. The problem is this:

CombinedCondition(Operator operator, Collection<? extends Condition> conditions) {
    ...
    for (Condition condition : conditions)
        if (condition == null)
            throw new IllegalArgumentException("The argument 'conditions' must not contain null");

    ...
    init(operator, conditions);
}

There’s a loop over the arguments to give some meaningful error messages. That’s a bit too defensive, I suspect. How about we simply live with the NPE when it arises, as this should be rather unexpected (for the context, jOOQ hardly ever checks on parameters like this, so this should also be removed for consistency reasons).

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6666 (nice number)

4. Lazy initialisation of lists

The nature of the JDBC API forces us to work with ThreadLocal variables, very unfortunately, as it is not possible to pass arguments from parent SQLData objects to children, especially when we combine nesting of Oracle TABLE/VARRAY and OBJECT types.

In this analysis, we’re combining the profiler’s CPU sampling with its memory sampling:

In the CPU sampling view above, we can see some overhead in the DefaultExecuteContext, which is instantiated once per “E-loop” iteration. Again, not a huge overhead, but let’s look at what this constructor does. It contributes to the overall allocations of ArrayList:

When we select the type in JMC, the other view will then display all the stack traces where ArrayList instances were allocated, among which, again, our dear DefaultExecuteContext constructor:

Where are those ArrayLists allocated? Right here:

BLOBS.set(new ArrayList<Blob>());
CLOBS.set(new ArrayList<Clob>());
SQLXMLS.set(new ArrayList<SQLXML>());
ARRAYS.set(new ArrayList<Array>());

Every time we start executing a query, we initialise a list for each ones of these types. All of our variable binding logic will then register any possibly allocated BLOB or CLOB, etc. such that we can clean these up at the end of the execution (a JDBC 4.0 feature that not everyone knows of!):

static final void register(Blob blob) {
    BLOBS.get().add(blob);
}
    
static final void clean() {
    List<Blob> blobs = BLOBS.get();

    if (blobs != null) {
        for (Blob blob : blobs)
            JDBCUtils.safeFree(blob);

        BLOBS.remove();
    }
    ...
}

Don’t forget calling Blob.free() et al, if you’re working with JDBC directly!

But the truth is, in most cases, we don’t really need these things. We need them only in Oracle, and only if we’re using TABLE / VARRAY or OBJECT types, due to some JDBC restrictions. Why punish all the users of other databases with this overhead? Instead of a sophisticated refactoring, which risks introducing regressions (https://github.com/jOOQ/jOOQ/issues/4205), we can simply initialise these lists lazily. We leave the clean() method as it is, remove the initialisation in the constructor, and replace the register() logic by this:

static final void register(Blob blob) {
    List<Blob> list = BLOBS.get();

    if (list == null) {
        list = new ArrayList<Blob>();
        BLOBS.set(list);
    }

    list.add(blob);
}

That was easy. And significant. Check out the new allocation measurements:

Note that every allocation, apart from the overhead of allocating things, also incurs additional overhead when the object is garbage collected. That’s a bit trickier to measure and correlate. In general, less allocations is almost always a good thing, except if the allocation is super short lived, in case of which stack allocation can happen, or the logic can even be eliminated by the JIT.

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6669

6. Using String.replace()

This is mostly a problem in JDK 8 only, JDK 9 fixed string replacing by no longer relying on regular expressions internally. In JDK 8, however (and jOOQ still supports Java 6, so this is relevant), string replacement works through regular expressions as can be seen here:

The Pattern implementation allocates quite a few int[] instances, even if that’s probably not strictly needed for non-regex patterns as those of String.replace():

I’ve already analysed this in a previous blog post, which can be seen here:

https://blog.jooq.org/2017/10/11/benchmarking-jdk-string-replace-vs-apache-commons-stringutils-replace/

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6672

7. Registering an SPI that is going to be inactive

This one was a bit more tricky to solve as it relies on a deeper analysis. Unfortunately, I have no profiling screenshots available anymore, but it is easy to explain with code. There’s an internal ExecuteListeners utility, which abstracts over the ExecuteListener SPIs. Users can register such a listener and listen to query rendering, variable binding, query execution, and other lifecycle events. By default, there is no such ExecuteListener by the users, but there’s always one internal ExecuteListener:

private static ExecuteListener[] listeners(ExecuteContext ctx) {
    List<ExecuteListener> result = new ArrayList<ExecuteListener>();

    for (ExecuteListenerProvider provider : ctx.configuration()
                                               .executeListenerProviders())
        if (provider != null)
            result.add(provider.provide());

    if (!FALSE.equals(ctx.settings().isExecuteLogging()))
        result.add(new LoggerListener());

    return result.toArray(EMPTY_EXECUTE_LISTENER);
}

The LoggerListener is added by default, unless users turn off that feature. Which means:

  • We’ll pretty much always get this ArrayList
  • We’ll pretty much always loop over this list
  • We’ll pretty much always clal this LoggerListener

But what does it do? It logs stuff on DEBUG and TRACE level. For instance:

@Override
public void executeEnd(ExecuteContext ctx) {
    if (ctx.rows() >= 0)
        if (log.isDebugEnabled())
            log.debug("Affected row(s)", ctx.rows());
}

That’s what it does by definition. It’s a debug logger. So, the improved logic for initialising this thing is the following:

private static final ExecuteListener[] listeners(ExecuteContext ctx) {
    List<ExecuteListener> result = null;

    for (ExecuteListenerProvider provider : ctx.configuration()
                                               .executeListenerProviders())
        if (provider != null)
            (result = init(result)).add(provider.provide());

    if (!FALSE.equals(ctx.settings().isExecuteLogging())) {
        if (LOGGER_LISTENER_LOGGER.isDebugEnabled())
            (result = init(result)).add(new LoggerListener());
    }

    return result == null ? null : result.toArray(EMPTY_EXECUTE_LISTENER);
}

We’re no longer allocating the ArrayList (that might be premature, the JIT might have rewritten this allocation to not happen, but OK), and we’re only adding the LoggerListener if it DEBUG or TRACE logging is enabled for it, i.e. if it would do any work at all.

That’s just a couple of CPU cycles we can save on every execution. Again, I don’t have the profiling measurements anymore, but trust me. It helped.

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6747

8. Eager allocation where lazy allocation works

Sometimes, we need two different representations of the same information. The “raw” representation, and a more useful, pre-processed representation for some purposes. This was done, for instance, in QualifiedField:

private final Name          name;
private final Table<Record> table;

QualifiedField(Name name, DataType<T> type) {
    super(name, type);

    this.name = name;
    this.table = name.qualified()
        ? DSL.table(name.qualifier())
        : null;
}

@Override
public final void accept(Context<?> ctx) {
    ctx.visit(name);
}

@Override
public final Table<Record> getTable() {
    return table;
}

As can be seen, the name is really the beef of this class. It’s a qualified name that generates itself on the SQL string. The Table representation is useful when navigating the meta model, but this is hardly ever done by jOOQ’s internals and/or user facing code.

However, this eager initialisation it is costly:

Quite a few UnqualifiedName[] arrays are allocated by the call to Name.qualifier(). We can easily make that table reference non-final and calculate it lazily:

private final Name              name;
private Table<Record>           table;

QualifiedField(Name name, DataType<T> type) {
    super(name, type);

    this.name = name;
}

@Override
public final Table<Record> getTable() {
    if (table == null)
        table = name.qualified() ? DSL.table(name.qualifier()) : null;

    return table;
}

Because name is final, we could call table “effectively final” (in a different meaning than the Java language’s) – we won’t have any thread safety issues because these particular types are immutable inside of jOOQ.

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6755

Results

Now, thus far, we’ve “improved” many low hanging fruit based on a profiler session (that was run, akhem, from outside of Eclipse on a rather busy machine). This wasn’t very scientific. Just tracking down “bottlenecks” which triggered my interest by having high enough numbers to even notice. This is called “micro optimisation”, and it is only worth the trouble if you’re in a “S-E-R-C loop”, meaning that the code you’re optimising is executed many many times. For me, developing jOOQ, this is almost always the case, because jOOQ is a library used by a lot of people who all profit from these optimisations. In many other cases, this might be called “premature optimisation”

But once we’ve optimised, we shouldn’t stop. I’ve done a couple of individual JMH benchmarks for many of the above problems, to see if they were really an improvement. But sometimes, in a JMH benchmark, something that doesn’t look like an improvement will still be an improvement in the bigger picture. The JVM doesn’t inline all methods 100 levels deep. If your algorithm is complex, perhaps a micro optimisation will still have an effect that would not have any effect on a JMH benchmark.

Unfortunately this isn’t very exact science, but with enough intuition, you’ll find the right spots to optimise.

In my case, I verified progress over two patch releases: 3.10.0 -> 3.10.1 -> 3.10.2 (not yet released) by running a JMH benchmark over the entire query execution (including H2’s part). The results of applying roughly 15 of the above and similar optimisations (~2 days’ worth of effort) is:

JDK 9 (9+181)

jOOQ 3.10.0 Open Source Edition

Benchmark                          Mode   Cnt       Score      Error  Units
ExecutionBenchmark.testExecution   thrpt   21  101891.108 ± 7283.832  ops/s

jOOQ 3.10.2 Open Source Edition

Benchmark                          Mode   Cnt       Score      Error  Units
ExecutionBenchmark.testExecution   thrpt   21  110982.940 ± 2374.504  ops/s

JDK 8 (1.8.0_145)

jOOQ 3.10.0 Open Source Edition

Benchmark                          Mode   Cnt       Score      Error  Units
ExecutionBenchmark.testExecution   thrpt   21  110178.873 ± 2134.894  ops/s

jOOQ 3.10.2 Open Source Edition

Benchmark                          Mode   Cnt       Score      Error  Units
ExecutionBenchmark.testExecution   thrpt   21  118795.922 ± 2661.653  ops/s

As can be seen, in both JDK versions, we’ve gotten roughly a 10% speed increase. What’s interesting is also that JDK 8 seemed to have been also 10% faster than JDK 9 in this benchmark, although this can be due to a variety of things that I haven’t considered yet, and which are out of scope for this discussion.

Conclusion

This iterative approach to tackling performance is definitely worth it for library authors:

  • run a representative benchmark (repeat a task millions of times)
  • profile it
  • track down “bottlenecks”
  • if they’re easy to fix without regression risk, do it
  • repeat
  • after a while, verify with JMH

Individual improvements are quite hard to measure, or measure correctly. But when you do 10-15 of them, they start adding up and become significant. 10% can make a difference.

Looking forward to your comments, alternative techniques, alternative tools, etc.!

If you liked this article, you will also like Top 10 Easy Performance Optimisations in Java

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s