Using JDK 10’s Local Variable Type Inference with jOOQ

After the successful release of JDK 9, we can already look forward to, and play around with, early access releases of JDK 10. The list of JEPs currently targeted for JDK 10 is quite manageable so far. JEP 286 is probably the most exciting one for most Java developers: Local variable type inference (which we’ve blogged about before). You can read the JEP yourself, or just go get the early access release and play around with it.

One of the nice things about this new feature is the fact that we now get access to non-denotable types that were previously rather clumsy to work with.

A rarely used feature is methods in anonymous classes that do not override / implement a super type’s method. They are available only in a very narrow scope. Prior to Java 10, we could only call either m() or n() on such a class, using the following syntax:

(new Object() {
    void m() { 
        System.out.println("m"); 
    }
    void n() { 
        System.out.println("n"); 
    }
}).m();

// Now, how to call n()?

The language feature wasn’t too useful. Only one method could be called from the “outside” of the anonymous class, because the instance reference is gone right after that call. With Java 10, we can assign the whole expression to a local variable, without losing the anonymous type.
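For example, mirroring the snippet above, both m() and n() now remain accessible through the inferred anonymous type:

var o = new Object() {
    void m() { 
        System.out.println("m"); 
    }
    void n() { 
        System.out.println("n"); 
    }
};

// The type of o is the anonymous subtype of Object, so both methods can be called:
o.m();
o.n();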

On a side-note, Java always had a funky and weird love-hate relationship with structural typing, trying to be a mostly nominally typed language. Yet, as we can see in this example, another new kind of structural type has snuck into the language. Cool!

What does this mean for jOOQ?

jOOQ has some cool types. Just look at the API.

Ultimately, depending on how many columns you want to project in your SELECT statement, you’ll get a different Record[N]<T1, T2, ..., T[N]> type, e.g.

// Assuming "con" is a JDBC Connection and "c" refers to an
// INFORMATION_SCHEMA.COLUMNS table:
for (Record3<String, String, String> r : using(con)
        .select(c.TABLE_SCHEMA, c.TABLE_NAME, c.COLUMN_NAME)
        .from(c))
  System.out.println(
    r.value1() + "." + r.value2() + "." + r.value3());

What’s nice is the fact that there is record-level type safety, i.e. you know that the record has 3 columns and that they’re all of type String. What’s less nice is that in order to profit from this type safety, you have to actually write down the type, which can get laborious (both when writing and when reading it), e.g. when you select 16 columns or more.

Java 10 changes this. It’s now possible to simply write

for (var r : using(con)
        .select(c.TABLE_SCHEMA, c.TABLE_NAME, c.COLUMN_NAME)
        .from(c))
  System.out.println(
    r.value1() + "." + r.value2() + "." + r.value3());

That is, we use the keyword “var” (or “final var”, if you prefer) to create the loop variable, and it will still be type safe. For instance, you cannot call r.value4() on it:

jshell> for (var r : using(con)
   ...>         .select(c.TABLE_SCHEMA, c.TABLE_NAME, c.COLUMN_NAME)
   ...>         .from(c))
   ...>   System.out.println(r.value1() + "." + r.value2() + "." + r.value4());
|  Error:
|  cannot find symbol
|    symbol:   method value4()
|      System.out.println(r.value1() + "." + r.value2() + "." + r.value4());
|                                                               ^------^

This isn’t a game changer, but for folks coming from Kotlin or Scala, it is a big relief to see that this option is now given to Java developers too.

And this isn’t just useful for results in jOOQ. You can also use it for creating dynamic SQL, e.g.:

// Create a subquery listing all tables called TABLES in any schema
var subq = select(t.TABLE_SCHEMA, t.TABLE_NAME)
          .from(t)
          .where(t.TABLE_NAME.eq("TABLES"));

// Create a predicate that uses the above subquery:
var pred = row(c.TABLE_SCHEMA, c.TABLE_NAME).in(subq);

// Use the above predicate in an actual query:
var q = using(con).selectFrom(c).where(pred);

So, clearly, this is going to be a really really useful Java release for jOOQ folks.

How to Avoid Excessive Sorts in Window Functions

Usually, this blog is 100% pro window functions and advocates using them at any occasion. But like any tool, window functions come at a price and we must carefully evaluate if that’s a price we’re willing to pay. That price can be a sort operation. And as we all know, sort operations are expensive. They follow O(n log n) complexity, which must be avoided at all costs for large data sets.

In a previous post, I’ve described how to calculate a running total with window functions (among other ways). In this post, we’re going to calculate the cumulative revenue at each payment in our Sakila database.

SELECT
  customer_id,
  payment_date,
  amount,
  SUM(amount) OVER (
    PARTITION BY customer_id
    ORDER BY payment_date, payment_id
  ) cumulative_amount
FROM payment
ORDER BY customer_id, payment_date, payment_id;

The above will yield something like this:

customer_id |payment_date        |amount |cumulative_amount 
------------|--------------------|-------|------------------
1           |2005-05-25 11:30:37 |2.99   |2.99              
1           |2005-05-28 10:35:23 |0.99   |3.98              
1           |2005-06-15 00:54:12 |5.99   |9.97              
1           |2005-06-15 18:02:53 |0.99   |10.96             
1           |2005-06-15 21:08:46 |9.99   |20.95             
1           |2005-06-16 15:18:57 |4.99   |25.94             
...

As can be seen, in spreadsheet notation, cumulative_amount[N] = cumulative_amount[N-1] + amount.

Reusing this calculation in several queries

As in any other language, we don’t want to repeat ourselves, so the SQL way of doing DRY is to create a view or a table valued function. Let’s create a view, first. Something like this:

CREATE VIEW payment_with_revenue AS
SELECT
  customer_id,
  payment_date,
  amount,
  SUM(amount) OVER (
    PARTITION BY customer_id
    ORDER BY payment_date, payment_id
  ) cumulative_amount
FROM payment

Now, we can do nice things like this:

SELECT 
  customer_id,
  payment_date,
  amount,
  cumulative_amount
FROM payment_with_revenue
WHERE customer_id IN (1, 2, 3)
AND payment_date 
  BETWEEN DATE '2005-05-25'
  AND     DATE '2005-05-29'
ORDER BY customer_id, payment_date

yielding:

customer_id |payment_date        |amount |cumulative_amount 
------------|--------------------|-------|------------------
1           |2005-05-25 11:30:37 |2.99   |2.99              
1           |2005-05-28 10:35:23 |0.99   |3.98              
2           |2005-05-27 00:09:24 |4.99   |4.99              
3           |2005-05-27 17:17:09 |1.99   |1.99              

What about performance?

Now, if we have an index on (CUSTOMER_ID, PAYMENT_DATE), we’d expect to be able to use it, right? After all, our predicate is extremely selective, as the following query shows:

SELECT 
  count(*),
  count(*) FILTER (
    WHERE customer_id IN (1, 2, 3)
  ),
  count(*) FILTER (
    WHERE customer_id IN (1, 2, 3)
    AND payment_date < DATE '2005-05-29'
  ) 
FROM payment;

yielding:

count |count |count 
------|------|-----
16049 |85    |4     

(To learn more about the cool FILTER clause, read this article here)
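Note that the execution plans below reference an index called IDX_PAYMENT_I1 on exactly those two columns. It is not part of the stock Sakila schema, so let’s assume it was created up front, e.g. with the jOOQ API (the index name is taken from the plans; the exact definition is an assumption):

// Assumed definition of the idx_payment_i1 index used further down
// (static imports from org.jooq.impl.DSL; "con" is a JDBC connection)
using(con)
    .createIndex("idx_payment_i1")
    .on(table("payment"), field("customer_id"), field("payment_date"))
    .execute();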

How could we best use the index? Let’s look again at our original query, but this time, with an inlined view (“inlined”):

SELECT 
  customer_id,
  payment_date,
  amount,
  cumulative_amount
FROM (
  SELECT
    customer_id,
    payment_date,
    amount,
    SUM(amount) OVER (
      PARTITION BY customer_id
      ORDER BY payment_date, payment_id
    ) cumulative_amount
  FROM payment
) inlined
WHERE customer_id IN (1, 2, 3)
AND payment_date 
  BETWEEN DATE '2005-05-25'
  AND     DATE '2005-05-29'
ORDER BY customer_id, payment_date;

We should be able to apply two transformations that would allow the query to use the index:

CUSTOMER_ID IN (1, 2, 3) predicate

The CUSTOMER_ID IN (1, 2, 3) predicate should be pushed down into the view, “past” the window function, because it does not affect the window function calculation, which partitions the data set by CUSTOMER_ID. By being pushed “past” the window function, I mean the fact that window functions are calculated late in the order of SELECT clauses.

This means that our original query should be equivalent to this one:

SELECT 
  customer_id,
  payment_date,
  amount,
  cumulative_amount
FROM (
  SELECT
    customer_id,
    payment_date,
    amount,
    SUM(amount) OVER (
      PARTITION BY customer_id
      ORDER BY payment_date, payment_id
    ) cumulative_amount
  FROM payment
  WHERE customer_id IN (1, 2, 3) -- Pushed down
) inlined
WHERE payment_date 
  BETWEEN DATE '2005-05-25'
  AND     DATE '2005-05-29'
ORDER BY customer_id, payment_date;

The PAYMENT_DATE predicate

The PAYMENT_DATE predicate is a bit more tricky. It cannot be pushed “past” the window function completely, because that would alter the semantics of the window function, which calculates the cumulative amount in the RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW range (which is the default, if we do not specify it).

But intuitively (and if you want to spend the time: formally as well), we can show that we can at least push the upper bound of our range predicate into the view, like this:

SELECT 
  customer_id,
  payment_date,
  amount,
  cumulative_amount
FROM (
  SELECT
    customer_id,
    payment_date,
    amount,
    SUM(amount) OVER (
      PARTITION BY customer_id
      ORDER BY payment_date, payment_id
    ) cumulative_amount
  FROM payment
  WHERE customer_id IN (1, 2, 3)
  AND payment_date <= DATE '2005-05-29' -- Pushed down
) inlined
WHERE payment_date >= DATE '2005-05-25'
ORDER BY customer_id, payment_date;

And now, we can profit from the index very easily! But is this transformation being done by any database? Unfortunately not. Some databases manage to push down the “more obvious” CUSTOMER_ID predicate past the window function, but none can do the same with the “less obvious” range predicate on PAYMENT_DATE:

DB2 LUW 10.5

The CUSTOMER_ID predicate is pushed down into the view, which generates an index scan (blue) on the pre-existing foreign key index (which doesn’t contain the PAYMENT_DATE column), but the PAYMENT_DATE itself is only filtered much later using an in-memory filter (red):

Explain Plan                                                       
-------------------------------------------------------------------
ID | Operation                       |                  Rows | Cost
 1 | RETURN                          |                       |   40
 2 |  FILTER                         |     4 of 80 (  5.00%) |   40
 3 |   TBSCAN                        |    80 of 80 (100.00%) |   40
 4 |    SORT                         |    80 of 80 (100.00%) |   40
 5 |     NLJOIN                      |               80 of 3 |   40
 6 |      TBSCAN GENROW              |      3 of 3 (100.00%) |    0
 7 |      FETCH PAYMENT              |    27 of 27 (100.00%) |   13
 8 |       IXSCAN IDX_FK_CUSTOMER_ID | 27 of 16049 (   .17%) |    6
                                                                   
Predicate Information                                              
 2 - RESID (Q5.PAYMENT_DATE <= '2005-05-29')                       
     RESID ('2005-05-25' <= Q5.PAYMENT_DATE)                       
 5 - JOIN (Q3.CUSTOMER_ID = Q2.$C0)                                
 8 - START (Q3.CUSTOMER_ID = Q2.$C0)                               
      STOP (Q3.CUSTOMER_ID = Q2.$C0)                               

Conversely, see the plan of the manually optimised query:

Explain Plan                                                  
--------------------------------------------------------------
ID | Operation                   |                 Rows | Cost
 1 | RETURN                      |                      |   40
 2 |  FILTER                     |     4 of 4 (100.00%) |   40
 3 |   TBSCAN                    |     4 of 4 (100.00%) |   40
 4 |    SORT                     |     4 of 4 (100.00%) |   40
 5 |     NLJOIN                  |               4 of 1 |   40
 6 |      TBSCAN GENROW          |     3 of 3 (100.00%) |    0
 7 |      FETCH PAYMENT          |     1 of 1 (100.00%) |   13
 8 |       IXSCAN IDX_PAYMENT_I1 | 1 of 16049 (   .01%) |    6
                                                              
Predicate Information                                         
 2 - RESID ('2005-05-25' <= Q5.PAYMENT_DATE)                  
 5 - JOIN (Q3.CUSTOMER_ID = Q2.$C0)                           
 8 - START (Q3.CUSTOMER_ID = Q2.$C0)                          
      STOP (Q3.CUSTOMER_ID = Q2.$C0)                          
      STOP (Q3.PAYMENT_DATE <= '2005-05-29')                  

This is certainly a better plan.

MySQL 8.0.2

MySQL, very regrettably, doesn’t seem to show any effort at all in optimising this. We’re accessing the entire payment table to get this result.

id   table        type  rows    filtered    Extra
-----------------------------------------------------------------------
1    <derived2>   ALL   16086    3.33       Using where
2    payment      ALL   16086  100.00       Using filesort

Here’s the manually optimised plan:

id   table        type  key             rows  filtered    Extra
-------------------------------------------------------------------------------
1    <derived2>   ALL                   4     3.33        Using where
2    payment      range idx_payment_i1  4      100.00     Using index condition

Oracle 12.2.0.1

Oracle also cannot do this beyond pushing the more obvious CUSTOMER_ID predicate into the view:

-------------------------------------------------------------------------------
| Id  | Operation                              | Name                 | Rows  |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                      |       |
|*  1 |  VIEW                                  | PAYMENT_WITH_REVENUE |    80 |
|   2 |   WINDOW SORT                          |                      |    80 |
|   3 |    INLIST ITERATOR                     |                      |       |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| PAYMENT              |    80 |
|*  5 |      INDEX RANGE SCAN                  | IDX_FK_CUSTOMER_ID   |    80 |
-------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter(("PAYMENT_DATE">=TO_DATE('2005-05-25 00:00:00') AND 
              "PAYMENT_DATE"<=TO_DATE('2005-05-29 00:00:00')))
   5 - access(("CUSTOMER_ID"=1 OR "CUSTOMER_ID"=2 OR "CUSTOMER_ID"=3))

The manually optimised plan looks better:

-------------------------------------------------------------------------
| Id  | Operation                              | Name           | Rows  |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                |       |
|*  1 |  VIEW                                  |                |     1 |
|   2 |   WINDOW SORT                          |                |     1 |
|   3 |    INLIST ITERATOR                     |                |       |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| PAYMENT        |     1 |
|*  5 |      INDEX RANGE SCAN                  | IDX_PAYMENT_I1 |     1 |
-------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("PAYMENT_DATE">=TO_DATE('2005-05-25 00:00:00'))
   5 - access(("CUSTOMER_ID" IN (1, 2, 3)) AND 
              "PAYMENT_DATE"<=TO_DATE('2005-05-29 00:00:00'))

Much better cardinality estimates!

PostgreSQL 10

PostgreSQL’s version of the Sakila database uses a partitioned payment table, but that should be irrelevant for this analysis. The CUSTOMER_ID predicate could be pushed down…

QUERY PLAN                                                                                          
---------------------------------------------------------------------------------------------------
Subquery Scan on payment_with_revenue  (cost=117.06..124.45 rows=8 width=52)
  Filter: ((payment_date >= '2005-05-25') AND (payment_date <= '2005-05-29'))
  -> WindowAgg  (cost=117.06..121.49 rows=197 width=56)
     -> Sort  (cost=117.06..117.55 rows=197 width=24)
        Sort Key: payment.customer_id, payment.payment_date, payment.payment_id
        -> Result  (cost=0.29..109.55 rows=197 width=24)
           -> Append  (cost=0.29..107.58 rows=197 width=24)
              -> Index Scan using idx_fk.. on payment  (cost=0.29..18.21 rows=77 width=20)
                 Index Cond: (customer_id = ANY ('{1,2,3}'::integer[]))
              -> Bitmap Heap Scan on payment_p2007_01  (cost=4.62..14.90 rows=20 width=26)
                 Recheck Cond: (customer_id = ANY ('{1,2,3}'::integer[]))
                 -> Bitmap Index Scan on idx_fk..  (cost=0.00..4.61 rows=20 width=0)
                    Index Cond: (customer_id = ANY ('{1,2,3}'::integer[]))
              -> Bitmap Heap Scan on payment_p2007_02  (cost=4.62..14.90 rows=20 width=26)
                 Recheck Cond: (customer_id = ANY ('{1,2,3}'::integer[]))
                 -> Bitmap Index Scan on idx_fk..  (cost=0.00..4.61 rows=20 width=0)
                    Index Cond: (customer_id = ANY ('{1,2,3}'::integer[]))
              ...

But manual optimisation is required to get better behaviour for the date range:

QUERY PLAN                                                                                           
-----------------------------------------------------------------------------------------------------
Subquery Scan on inlined  (cost=18.46..18.56 rows=3 width=48)
  Filter: (inlined.payment_date >= '2005-05-25'::date)
  -> WindowAgg  (cost=18.46..18.52 rows=3 width=52)
     -> Sort  (cost=18.46..18.46 rows=3 width=20)
        Sort Key: payment.customer_id, payment.payment_date, payment.payment_id
        -> Result  (cost=0.29..18.43 rows=3 width=20)
           -> Append  (cost=0.29..18.40 rows=3 width=20)
              -> Index Scan using idx_fk.. on payment  (cost=0.29..18.40 rows=3 width=20)
                 Index Cond: (customer_id = ANY ('{1,2,3}'::integer[]))
                 Filter: (payment_date <= '2005-05-29'::date)

Interestingly, the index still isn’t used optimally on both columns, which has nothing to do with the current discussion on window functions. PostgreSQL seems to be unable to think of the IN predicate as an equality predicate. See also this article about other optimisations (such as predicate merging) that are not possible (yet) in PostgreSQL.

But still, this is much better as it brings down the estimated cardinalities (in case this query is a subquery in a more sophisticated context), and more importantly, it filters out many many rows prior to calculating the window function.

SQL Server 2014

Another database that cannot push down this predicate past the window function optimally. Only the “obvious” part is pushed down:

|--Sort(ORDER BY:([payment_date] ASC))
   |--Filter(WHERE:([payment_date]>='2005-05-25' AND [payment_date]<='2005-05-29'))
      |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1004]=(0) THEN NULL ELSE [Expr1005] END))
         |--Stream Aggregate(GROUP BY:([WindowCount1009]) DEFINE:(..))
            |--Window Spool(RANGE BETWEEN:(UNBOUNDED, [[payment_date], [payment_id]]))
               |--Segment
                  |--Segment
                     |--Sort(ORDER BY:([customer_id] ASC, [payment_date] ASC, [payment_id] ASC))
                        |--Table Scan(OBJECT:([payment]), WHERE:([customer_id] IN (1, 2, 3)))

Interestingly, this doesn’t even use the index at all, but at least the data is filtered out prior to the calculation that relies on sorting. With the manual optimisation, again the same, much better effect:

|--Filter(WHERE:([payment_date]>='2005-05-25'))
   |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1004]=(0) THEN NULL ELSE [Expr1005] END))
      |--Stream Aggregate(GROUP BY:([WindowCount1011]) DEFINE:(..))
         |--Window Spool(RANGE BETWEEN:(UNBOUNDED, [[payment_date], [payment_id]]))
            |--Segment
               |--Segment
                  |--Sort(ORDER BY:([payment_date] ASC, [payment_id] ASC))
                     |--Nested Loops(Inner Join, OUTER REFERENCES:([Bmk1000]))
                        |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1007], [Expr1008], [Expr1006]))
                        |  |--Compute Scalar(DEFINE:(([Expr1007],[Expr1008],[Expr1006])=GetRangeWithMismatchedTypes(NULL,'2005-05-29',(42))))
                        |  |  |--Constant Scan
                        |  |--Index Seek(OBJECT:([idx_payment_i1]), SEEK:([customer_id] IN (1, 2, 3) AND [payment_date] > [Expr1007] AND [payment_date] < [Expr1008]))
                        |--RID Lookup(OBJECT:([payment]))

Certainly, this is a bit cryptic to read but it really means the same thing as always: The manual optimisation worked and we got a better plan.

Meh, does it matter?

I hope so! Let’s benchmark these things against each other! Some info about our benchmarking technique in our previous post and on this page here. Specifically, we don’t publish actual execution times, only relative times within the benchmark as we do not want to compare databases against each other but only against themselves.

DB2 LUW 10.5

RUN |STMT |RATIO  |
----|-----|-------|
1   |1    |3.0890 |
1   |2    |1.2272 |
2   |1    |3.0624 |
2   |2    |1.0100 |
3   |1    |3.0389 |
3   |2    |1.0000 |
4   |1    |3.1566 |
4   |2    |1.0948 |
5   |1    |3.1817 |
5   |2    |1.0905 |

The manually optimised statement is 3x faster in our benchmark. Do bear in mind that we’re operating on a rather small data set of a total of a few thousand rows! This gets worse in a larger data set.

MySQL 8.0.2

The difference is devastating in MySQL 8.0.2, which just recently introduced window functions. Surely, the MySQL team will be able to apply some further optimisations prior to GA – I’ve filed an issue for review:

RUN |STMT |RATIO    |
----|-----|---------|
0   |1    |431.1905 |
0   |2    |1.0000   |
1   |1    |372.4286 |
1   |2    |1.0000   |
2   |1    |413.4762 |
2   |2    |1.0000   |
3   |1    |381.2857 |
3   |2    |1.0000   |
4   |1    |400.1429 |
4   |2    |1.2857   |

Oracle 12.2.0.1

Another factor 4x can be observed in Oracle:

Run 1, Statement 1 : 4.58751
Run 1, Statement 2 : 1.37639
Run 2, Statement 1 : 4.71833
Run 2, Statement 2 : 1.03693
Run 3, Statement 1 : 4.05729
Run 3, Statement 2 : 1.04719
Run 4, Statement 1 : 3.86653
Run 4, Statement 2 : 1
Run 5, Statement 1 : 3.99603
Run 5, Statement 2 : 1.0212

PostgreSQL 10

PostgreSQL is quite bad too, here. A factor 7x can be observed:

RUN 1, Statement 1: 7.23373
RUN 1, Statement 2: 1.01438
RUN 2, Statement 1: 6.62028
RUN 2, Statement 2: 1.26183
RUN 3, Statement 1: 8.40322
RUN 3, Statement 2: 1.04074
RUN 4, Statement 1: 6.33401
RUN 4, Statement 2: 1.06750
RUN 5, Statement 1: 6.41649
RUN 5, Statement 2: 1.00000

SQL Server 2014

Another very significant penalty in SQL Server for the unoptimised version:

Run 1, Statement 1: 29.50000
Run 1, Statement 2: 1.07500
Run 2, Statement 1: 28.15000
Run 2, Statement 2: 1.00000
Run 3, Statement 1: 28.00000
Run 3, Statement 2: 1.00000
Run 4, Statement 1: 28.00000
Run 4, Statement 2: 1.00000
Run 5, Statement 1: 31.07500
Run 5, Statement 2: 1.00000

Bad news for views. Is there a better solution?

This is rather bad news for window functions inside of reusable views. None of the databases, not even DB2 or Oracle, can push down range predicates past a derived table’s window function if the column that is part of the range predicate doesn’t correspond to the window function’s PARTITION BY clause.

The problem described above can be easily fixed when the query is written manually, expanding all possible views into their calling SQL, but that kind of sucks – we’d love to make our code reusable. There’s one solution in databases that support inline table valued functions. Among the tested databases, these include:

  • DB2
  • PostgreSQL
  • SQL Server

MySQL doesn’t have table valued functions, and Oracle’s (very regrettably) are not inlineable because they have to be written in PL/SQL.

Here’s how to write these functions:

DB2

Function definition:

CREATE OR REPLACE FUNCTION f_payment_with_revenue (
  p_customer_id BIGINT,
  p_from_date DATE,
  p_to_date DATE
)
RETURNS TABLE (
  customer_id BIGINT,
  payment_date DATE,
  amount DECIMAL(10, 2),
  cumulative_amount DECIMAL(10, 2)
)
LANGUAGE SQL
RETURN
SELECT *
FROM (
  SELECT
    customer_id,
    payment_date,
    amount,
    SUM(amount) OVER (
      PARTITION BY customer_id
      ORDER BY payment_date, payment_id
    ) cumulative_amount
  FROM payment
  WHERE customer_id = p_customer_id
  AND payment_date <= p_to_date
) t
WHERE payment_date >= p_from_date;

Function call:

SELECT 
  payment_date,
  amount,
  cumulative_amount
FROM (
  SELECT customer_id FROM customer WHERE customer_id IN (1, 2, 3)
) c(customer_id),
TABLE(sakila.f_payment_with_revenue(
  c.customer_id,
  CAST('2005-05-25' AS DATE),
  CAST('2005-05-29' AS DATE)
))
ORDER BY payment_date;

Execution plan:

Explain Plan                                                    
----------------------------------------------------------------
ID | Operation                     |                 Rows | Cost
 1 | RETURN                        |                      |   33
 2 |  TBSCAN                       |     4 of 4 (100.00%) |   33
 3 |   SORT                        |     4 of 4 (100.00%) |   33
 4 |    NLJOIN                     |               4 of 1 |   33
 5 |     NLJOIN                    |               3 of 1 |   20
 6 |      TBSCAN GENROW            |     3 of 3 (100.00%) |    0
 7 |      IXSCAN PK_CUSTOMER       |   1 of 599 (   .17%) |    6
 8 |     FILTER                    |     1 of 1 (100.00%) |   13
 9 |      TBSCAN                   |     1 of 1 (100.00%) |   13
10 |       SORT                    |     1 of 1 (100.00%) |   13
11 |        FETCH PAYMENT          |     1 of 1 (100.00%) |   13
12 |         IXSCAN IDX_PAYMENT_I1 | 1 of 16049 (   .01%) |    6
                                                                
Predicate Information                                           
  5 - JOIN (Q3.CUSTOMER_ID = Q2.$C0)                            
  7 - START (Q3.CUSTOMER_ID = Q2.$C0)                           
       STOP (Q3.CUSTOMER_ID = Q2.$C0)                           
  8 - RESID ('2005-05-25' <= Q6.PAYMENT_DATE)                   
 12 - START (Q4.CUSTOMER_ID = Q3.CUSTOMER_ID)                   
       STOP (Q4.CUSTOMER_ID = Q3.CUSTOMER_ID)                   
       STOP (Q4.PAYMENT_DATE <= '2005-05-29')                   

Much better!

Benchmark result (Statement 1 = function call, Statement 2 = manually optimised):

RUN |STMT |RATIO  |
----|-----|-------|
1   |1    |1.5945 |
1   |2    |1.0080 |
2   |1    |1.6310 |
2   |2    |1.0768 |
3   |1    |1.5827 |
3   |2    |1.0090 |
4   |1    |1.5486 |
4   |2    |1.0084 |
5   |1    |1.5569 |
5   |2    |1.0000 |

Definitely a huge improvement. The comparison might not be entirely fair because

  • CROSS APPLY / LATERAL unnesting tends to generate nested loops that could be written more optimally with a classic join
  • We have an additional auxiliary customer table access (which could probably be tuned away with another rewrite)

PostgreSQL

Function definition:

CREATE OR REPLACE FUNCTION f_payment_with_revenue (
  p_customer_id BIGINT,
  p_from_date DATE,
  p_to_date DATE
)
RETURNS TABLE (
  customer_id SMALLINT,
  payment_date TIMESTAMP,
  amount DECIMAL(10, 2),
  cumulative_amount DECIMAL(10, 2)
)
AS $$
SELECT *
FROM (
  SELECT
    customer_id,
    payment_date,
    amount,
    SUM(amount) OVER (
      PARTITION BY customer_id
      ORDER BY payment_date, payment_id
    ) cumulative_amount
  FROM payment
  WHERE customer_id = p_customer_id
  AND payment_date <= p_to_date
) t
WHERE payment_date >= p_from_date
$$ LANGUAGE SQL;

Function call:

SELECT 
  payment_date,
  amount,
  cumulative_amount
FROM (
  SELECT customer_id FROM customer WHERE customer_id IN (1, 2, 3)
) c(customer_id)
CROSS JOIN LATERAL f_payment_with_revenue(
  c.customer_id,
  CAST('2005-05-25' AS DATE),
  CAST('2005-05-29' AS DATE)
)
ORDER BY payment_date;

Execution plan:

QUERY PLAN                                                                                    
----------------------------------------------------------------------------------------------
Sort  (cost=250.39..257.89 rows=3000 width=72)                                                
  Sort Key: f_payment_with_revenue.payment_date                                               
  ->  Nested Loop  (cost=0.53..77.13 rows=3000 width=72)                                      
        ->  Index Only Scan using customer_pkey on customer  (cost=0.28..16.88 rows=3 width=4)
              Index Cond: (customer_id = ANY ('{1,2,3}'::integer[]))                          
        ->  Function Scan on f_payment_with_revenue  (cost=0.25..10.25 rows=1000 width=72)    

Oops, no unnesting of the function is happening. The cardinality defaults to 1000. That’s bad news!

Benchmark result (Statement 1 = function call, Statement 2 = manually optimised):

RUN 1, Statement 1: 25.77538
RUN 1, Statement 2: 1.00000
RUN 2, Statement 1: 27.55197
RUN 2, Statement 2: 1.11581
RUN 3, Statement 1: 27.99331
RUN 3, Statement 2: 1.16463
RUN 4, Statement 1: 29.11022
RUN 4, Statement 2: 1.01159
RUN 5, Statement 1: 26.65781
RUN 5, Statement 2: 1.01654

Rats. This has gotten much worse than with the view. Not surprising, though. Table valued functions are not that good of an idea when they cannot be inlined! Oracle would have had a similar result if I wasn’t too lazy to translate my function to an ordinary PL/SQL table valued function, or a pipelined function.

SQL Server

Function definition:

CREATE FUNCTION f_payment_with_revenue (
  @customer_id BIGINT,
  @from_date DATE,
  @to_date DATE
)
RETURNS TABLE
AS RETURN
SELECT *
FROM (
  SELECT
    customer_id,
    payment_date,
    amount,
    SUM(amount) OVER (
      PARTITION BY customer_id
      ORDER BY payment_date, payment_id
    ) cumulative_amount
  FROM payment
  WHERE customer_id = @customer_id
  AND payment_date <= @to_date
) t
WHERE payment_date >= @from_date;

Function call:

SELECT 
  payment_date,
  amount,
  cumulative_amount
FROM (
  SELECT customer_id FROM customer WHERE customer_id IN (1, 2, 3)
) AS c(customer_id)
CROSS APPLY f_payment_with_revenue(
  c.customer_id,
  CAST('2005-05-25' AS DATE),
  CAST('2005-05-29' AS DATE)
)
ORDER BY payment_date;

Execution plan:

|--Sort(ORDER BY:([payment_date] ASC))
   |--Nested Loops(Inner Join, OUTER REFERENCES:([customer_id]))
      |--Index Seek(OBJECT:([PK__customer__CD65CB84E826462D]), SEEK:([customer_id] IN (1, 2, 3))
      |--Filter(WHERE:([payment_date]>='2005-05-25'))
         |--Compute Scalar(DEFINE:([Expr1006]=CASE WHEN [Expr1007]=(0) THEN NULL ELSE [Expr1008] END))
            |--Stream Aggregate(GROUP BY:([WindowCount1014]) DEFINE:(..)))
               |--Window Spool(RANGE BETWEEN:(UNBOUNDED, [[payment_date], [payment_id]]))
                  |--Segment
                     |--Segment
                        |--Sort(ORDER BY:([payment_date] ASC, [payment_id] ASC))
                           |--Nested Loops(Inner Join, OUTER REFERENCES:([Bmk1003]))
                              |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1010], [Expr1011], [Expr1009]))
                              |  |--Compute Scalar(DEFINE:(([Expr1010],[Expr1011],[Expr1009])=GetRangeWithMismatchedTypes(NULL,'2005-05-29',(42))))
                              |  |  |--Constant Scan
                              |  |--Index Seek(OBJECT:([idx_payment_i1]), SEEK:([customer_id]=CONVERT_IMPLICIT(bigint,[customer_id],0) AND [payment_date] > [Expr1010] AND [payment_date] < [Expr1011]))
                              |--RID Lookup(OBJECT:([payment]), SEEK:([Bmk1003]=[Bmk1003]))

Again, super unreadable IMO, but after looking a bit more closely, we can see that the plan is almost the same as the manually optimised one, and the predicate is applied early on, where it belongs.

Benchmark result (Statement 1 = function call, Statement 2 = manually optimised):

Run 1, Statement 1: 2.50000
Run 1, Statement 2: 1.27778
Run 2, Statement 1: 2.11111
Run 2, Statement 2: 1.27778
Run 3, Statement 1: 2.11111
Run 3, Statement 2: 1.00000
Run 4, Statement 1: 2.22222
Run 4, Statement 2: 1.11111
Run 5, Statement 1: 2.02778
Run 5, Statement 2: 1.19444

Conclusion

Window functions are super cool and powerful. But they come at a price. They sort your data. Normally, when we write complex queries and reuse parts in views, we can profit from predicate push down operations into derived tables and views, which is something that most databases support (see also our previous blog post about such optimisations).

But when it comes to using window functions, they act like a “fence”, past which only few predicates can be pushed automatically. It’s not that it wouldn’t be possible, it simply isn’t done very well by most databases (and in the case of MySQL, not at all as of 8.0.2).

Inline table valued functions can be a remedy to avoid manual building of complex queries, such that at least some parts of your logic can be reused among queries. Unfortunately, they rely on CROSS APPLY or LATERAL JOIN, which can also cause performance issues in more complex setups. Besides, among the databases covered in this article, only DB2 and SQL Server support inline table valued functions. Oracle doesn’t support SQL functions at all, and PostgreSQL’s SQL functions are not inlinable (yet), which means that in these databases, in order to tune such queries, you might not be able to reuse the parts that use window functions in views or stored functions.

However, as always, do measure. Perhaps, a 4x waste of performance for a particular query is OK.

Squeezing Another 10% Speed Increase out of jOOQ using JMC and JMH

In this post, we’re going to discuss a couple of recent efforts to squeeze roughly 10% in terms of speed out of jOOQ by iterating on hotspots that were detected using JMC (Java Mission Control) and then validated using JMH (Java Microbenchmark Harness). This post shows how to apply micro optimisations to algorithms where the smallest improvement can have a significant effect.

While JMH is probably without competition, JMC could easily be replaced by JProfiler, YourKit, or even your own manual jstack sampling. I’ll just use JMC because it ships with the JDK and is free for use for development as of JDK 8 and 9 (if you’re unsure whether you’re “developing”, better ask Oracle). Rumours have it that JMC might be contributed to the OpenJDK in the near future.

Micro optimisations

Micro optimisations are a cool technique to squeeze a very small improvement out of a local algorithm (e.g. a loop) that has a significant effect on the entire application / library, because of the fact that the local algorithm is called many times. This is absolutely the case in jOOQ, which is essentially a library that always runs 4 nested loops:

  1. S: A “loop” over all possible SQL statements
  2. E: A “loop” over all executions of such a statement
  3. R: A loop over all rows in the result
  4. C: A loop over all columns in a row

Such four-level nested loops result in what we could call a polynomial complexity of our algorithms. Even if we cannot call the complexity O(N^4) (as the four “N” are not all the same), it is certainly of O(S x E x R x C) (I’ll call this “S-E-R-C loops” further down). Even to the untrained eye, it becomes evident that anything that happens in the inner-most “C-loop” can have devastating effects. We had better not be opening any files here that could be opened outside of, e.g. the “S-loop”.

In a previous blog post, we’ve discussed common techniques of optimising such situations. In this blog post, we’ll look into a couple of concrete examples.

How to discover flaws in these loops?

We’re looking for the problems that affect all users, the kind of problem that, once fixed, will improve jOOQ’s performance for everyone by e.g. 10%. This is similar to what the JIT does, by performing things like stack allocation, inlining, which don’t drastically improve things locally, but do so globally, and for everyone. Here’s an interesting guest post by Tagir Valeev on JIT optimisation, and how good it is.

Getting a large “S-loop”

The first option is to run profiling sessions on benchmarks. We could, for example, run the entire “S-E-R-C loops” in a JMC profiling session, where the “S-loop” is a loop over all our statements, or in other words, over all our integration tests. Unfortunately, with this approach, our “E-loop” (in the case of jOOQ’s integration tests) is a single execution per statement. We’d have to run the integration tests many, many times in order to get meaningful results.

Also, while the jOOQ integration tests run thousands of distinct queries, most queries are still rather simple, each one focusing on an individual SQL feature (e.g. lateral join). In an end user application, queries might use less specific features, but are much more complex, i.e. they have a lot of ordinary joins.

This technique is useful to find problems that appear in all queries, deep down inside of jOOQ – e.g. at the JDBC interface. But we cannot use this approach to test individual features.

Getting a large “E-loop”

Another option is to write a single test that runs a few statements (small “S-loop”) many times in an explicit loop (large “E-loop”). This has the advantage that a specific bottleneck can be found with a high confidence, but the drawback is: It’s specific. For instance, if we find a small bottleneck in the string concatenation function, well, that is certainly worth fixing, but doesn’t affect most users.

This approach is useful to test individual features. It can also be useful for finding issues that affect all queries, but with a lower confidence than the previous case, where the “S-loop” is maximised.

Getting large “R-loops” and “C-loops”

Creating large result sets is easy and should definitely be part of such benchmarks, because in the case of a large result set, any flaw will multiply drastically, so fixing these things is worthwhile. However, these problems only affect actual result sets, not the query building process or the execution process. Sure, most statements are probably queries, not insertions / updates, etc. But this needs to be kept in mind.

Optimising for problems in large “E-loops”

All of the above scenarios are different optimisation sessions and deserve their own blog posts. In this post, I’m describing what has been discovered and fixed when running a single query 3 million times on an H2 database. The H2 database is chosen here because it can run in memory in the same process and thus has the least extra overhead compared to jOOQ – so jOOQ’s overhead contributions become significant in a profiling session / benchmark. In fact, it can be shown that in such a benchmark, jOOQ (or Hibernate, etc.) appears to perform quite poorly compared to a JDBC only solution, as many have done before.

This is an important moment to remind ourselves:

Benchmarks do not reflect real-world use cases! You will never run the exact same query 3 million times on a production system, and your production system doesn’t run on H2.

A benchmark profits from so much caching and buffering that you will never perform as fast in a real application.

Always be careful not to draw any wrong conclusions from a benchmark!

This needs to be said, so take every benchmark you find on the web with a grain of salt. This includes our own!

The query being profiled is:

ctx.select(
      AUTHOR.FIRST_NAME,
      AUTHOR.LAST_NAME,
      BOOK.ID,
      BOOK.TITLE)
   .from(BOOK)
   .join(AUTHOR).on(BOOK.AUTHOR_ID.eq(AUTHOR.ID))
   .where(BOOK.ID.eq(1))
   .and(BOOK.TITLE.isNull().or(BOOK.TITLE.ne(randomValue)));

The trivial query returns a ridiculous 4 rows and 4 columns, so the “R-loop” and “C-loops” are negligible. This benchmark is really testing the overhead of jOOQ query execution in a case where the database does not contribute much to the execution time. Again, in a real world scenario, you will get much more overhead from your database.
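For reference, a minimal JMH skeleton for this kind of execution benchmark could look roughly like the following. The benchmark class and method names match the results shown at the end of this post; the generated table imports and the schema setup are assumptions, not jOOQ’s actual benchmark code:

import java.sql.Connection;
import java.sql.DriverManager;

import org.jooq.DSLContext;
import org.jooq.impl.DSL;
import org.openjdk.jmh.annotations.*;

// Assumption: table references generated by the jOOQ code generator
import static org.jooq.example.db.h2.Tables.AUTHOR;
import static org.jooq.example.db.h2.Tables.BOOK;

@State(Scope.Benchmark)
public class ExecutionBenchmark {

    Connection connection;
    DSLContext ctx;
    String randomValue;

    @Setup
    public void setup() throws Exception {

        // In-memory H2, so the database contributes as little overhead as possible
        connection = DriverManager.getConnection("jdbc:h2:mem:bench", "sa", "");
        ctx = DSL.using(connection);
        randomValue = "abc";

        // Minimal schema matching the generated classes above; seed data elided
        ctx.execute("create table author (id int primary key, first_name varchar(50), last_name varchar(50))");
        ctx.execute("create table book (id int primary key, author_id int, title varchar(100))");
        // ... insert representative AUTHOR / BOOK rows here
    }

    @TearDown
    public void tearDown() throws Exception {
        connection.close();
    }

    @Benchmark
    public Object testExecution() {
        return ctx.select(AUTHOR.FIRST_NAME, AUTHOR.LAST_NAME, BOOK.ID, BOOK.TITLE)
                  .from(BOOK)
                  .join(AUTHOR).on(BOOK.AUTHOR_ID.eq(AUTHOR.ID))
                  .where(BOOK.ID.eq(1))
                  .and(BOOK.TITLE.isNull().or(BOOK.TITLE.ne(randomValue)))
                  .fetch();
    }
}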

In the following sections, I’ll show a few minor bottlenecks that could be found when drilling down into such execution scenarios. As I’ve switched between JMC versions, the screenshots will not always be the same, I’m afraid.

1. Instance allocation of constant values

A very silly mistake was easily discovered right away:

The mistake didn’t contribute a whole lot of overhead, only 1.1% to the sampled time spent, but it made me curious. In version 3.10 of jOOQ, the SelectQueryImpl‘s Limit class, which encodes the jOOQ OFFSET / LIMIT behaviour, kept allocating this DSL.val() thingy, which is a bind variable. Sure, limits do work with bind variables, but this happened when SelectQueryImpl was initialised, not when the LIMIT clause is added by the jOOQ API user.

As can be seen in the sources, the following logic was there:

private static final Field<Integer> ZERO              = zero();
private static final Field<Integer> ONE               = one();
private Field<Integer>              numberOfRowsOrMax = 
    DSL.inline(Integer.MAX_VALUE);

While the “special limits” ZERO and ONE were static members, the numberOfRowsOrMax value wasn’t. That’s the instantiation we were measuring in JMC. The member is not a constant, but the default value is. It is always initialised with Integer.MAX_VALUE wrapped in a DSL.inline() call. The solution is really simple:

private static final Param<Integer> MAX               = 
    DSL.inline(Integer.MAX_VALUE);
private Field<Integer>              numberOfRowsOrMax = MAX;

This is obviously better! Not only does it avoid the allocation of the bind variable, it also avoids the boxing of Integer.MAX_VALUE (which can also be seen in the sampling screenshot).

Note, a similar optimisation is available in the JDK’s ArrayList. When you look at the sources, you’ll see:

/**
 * Shared empty array instance used for empty instances.
 */
private static final Object[] EMPTY_ELEMENTDATA = {};

When you initialise an ArrayList without initial capacity, it will reference this shared instance, instead of creating a new, empty (or even non-empty) array. This delays the allocation of such an array until we actually add things to the ArrayList, just in case it stays empty.

jOOQ’s LIMIT is the same. Most queries might not have a LIMIT, so better not allocate that MAX_VALUE afresh!

This is done once per “E-loop” iteration

One issue down: https://github.com/jOOQ/jOOQ/issues/6635

2. Copying lists in internals

This is really a micro optimisation that you probably shouldn’t do in ordinary business logic. But it might be worthwhile in infrastructure logic, e.g. when you’re also in an “S-E-R-C loop”:

jOOQ (unfortunately) occasionally copies data around between arrays, e.g. wrapping Strings in jOOQ wrapper types, transforming numbers to strings, etc. These loops aren’t bad per se, but remember, we’re inside some level of the “S-E-R-C loop”, so these copying operations might be run hundreds of millions of times when we run a statement 3 million times.

The above loop didn’t contribute a lot of overhead, and possibly the cloned object was stack allocated or the clone call eliminated by the JIT. But maybe it wasn’t. The QualifiedName class cloned its argument prior to returning it to make sure that no accidental modifications will have any side effect:

private static final String[] nonEmpty(String[] qualifiedName) {
    String[] result;
    ...
    if (nulls > 0) {
        result = new String[qualifiedName.length - nulls];
        ...
    }
    else {
        result = qualifiedName.clone();
    }
    return result;
}

So, the implementation of the method guaranteed a new array as a result.

After a bit of analysis, it could be seen that there is only a single consumer of this method, and it doesn’t leave that consumer. So, it’s safe to remove the clone call. Probably, the utility was refactored from a more general purpose method into this local usage.

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6640

3. Running checks in loops

This one is too silly to be true:

There’s a costly overhead in the CombinedCondition constructor (the <init> method). Notice how the samples drop from 0.47% to 0.32% between the constructor and the next method init() – that’s the time spent inside the constructor.

A tiny amount of time, but this time is spent every time someone combines two conditions / predicates with AND and OR. Every time. We can probably save this time. The problem is this:

CombinedCondition(Operator operator, Collection<? extends Condition> conditions) {
    ...
    for (Condition condition : conditions)
        if (condition == null)
            throw new IllegalArgumentException("The argument 'conditions' must not contain null");

    ...
    init(operator, conditions);
}

There’s a loop over the arguments to give some meaningful error messages. That’s a bit too defensive, I suspect. How about we simply live with the NPE when it arises? It should be rather unexpected anyway (and for context: jOOQ hardly ever checks parameters like this, so the check should also be removed for consistency reasons).

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6666 (nice number)

4. Lazy initialisation of lists

The nature of the JDBC API forces us to work with ThreadLocal variables, very unfortunately, as it is not possible to pass arguments from parent SQLData objects to children, especially when we combine nesting of Oracle TABLE/VARRAY and OBJECT types.

In this analysis, we’re combining the profiler’s CPU sampling with its memory sampling:

In the CPU sampling view above, we can see some overhead in the DefaultExecuteContext, which is instantiated once per “E-loop” iteration. Again, not a huge overhead, but let’s look at what this constructor does. It contributes to the overall allocations of ArrayList:

When we select the type in JMC, the other view will then display all the stack traces where ArrayList instances were allocated, among which, again, our dear DefaultExecuteContext constructor:

Where are those ArrayLists allocated? Right here:

BLOBS.set(new ArrayList<Blob>());
CLOBS.set(new ArrayList<Clob>());
SQLXMLS.set(new ArrayList<SQLXML>());
ARRAYS.set(new ArrayList<Array>());

Every time we start executing a query, we initialise a list for each ones of these types. All of our variable binding logic will then register any possibly allocated BLOB or CLOB, etc. such that we can clean these up at the end of the execution (a JDBC 4.0 feature that not everyone knows of!):

static final void register(Blob blob) {
    BLOBS.get().add(blob);
}
    
static final void clean() {
    List<Blob> blobs = BLOBS.get();

    if (blobs != null) {
        for (Blob blob : blobs)
            JDBCUtils.safeFree(blob);

        BLOBS.remove();
    }
    ...
}

Don’t forget calling Blob.free() et al, if you’re working with JDBC directly!

But the truth is, in most cases, we don’t really need these things. We need them only in Oracle, and only if we’re using TABLE / VARRAY or OBJECT types, due to some JDBC restrictions. Why punish all the users of other databases with this overhead? Instead of a sophisticated refactoring, which risks introducing regressions (https://github.com/jOOQ/jOOQ/issues/4205), we can simply initialise these lists lazily. We leave the clean() method as it is, remove the initialisation in the constructor, and replace the register() logic by this:

static final void register(Blob blob) {
    List<Blob> list = BLOBS.get();

    if (list == null) {
        list = new ArrayList<Blob>();
        BLOBS.set(list);
    }

    list.add(blob);
}

That was easy. And significant. Check out the new allocation measurements:

Note that every allocation, apart from the overhead of allocating things, also incurs additional overhead when the object is garbage collected. That’s a bit trickier to measure and correlate. In general, fewer allocations are almost always a good thing, except if the allocation is super short lived, in which case stack allocation can happen, or the logic can even be eliminated by the JIT.

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6669

6. Using String.replace()

This is mostly a problem in JDK 8 only; JDK 9 fixed string replacement by no longer relying on regular expressions internally. In JDK 8, however (and jOOQ still supports Java 6, so this is relevant), string replacement works through regular expressions, as can be seen here:

The Pattern implementation allocates quite a few int[] instances, even if that’s probably not strictly needed for non-regex patterns such as those of String.replace():

I’ve already analysed this in a previous blog post, which can be seen here:

https://blog.jooq.org/2017/10/11/benchmarking-jdk-string-replace-vs-apache-commons-stringutils-replace/
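For reference, the kind of regex-free replacement that Apache Commons StringUtils.replace() performs looks roughly like this – a sketch, not the actual Commons or jOOQ implementation:

public class Replace {

    // Replaces all occurrences of "search" in "in" without compiling a Pattern
    public static String replace(String in, String search, String replacement) {
        int i = in.indexOf(search);
        if (search.isEmpty() || i == -1)
            return in;

        StringBuilder sb = new StringBuilder(in.length());
        int start = 0;
        while (i != -1) {
            sb.append(in, start, i).append(replacement);
            start = i + search.length();
            i = in.indexOf(search, start);
        }

        return sb.append(in, start, in.length()).toString();
    }

    public static void main(String[] args) {
        // Prints: select ? from dual
        System.out.println(replace("select :1 from dual", ":1", "?"));
    }
}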

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6672

7. Registering an SPI that is going to be inactive

This one was a bit more tricky to solve as it relies on a deeper analysis. Unfortunately, I have no profiling screenshots available anymore, but it is easy to explain with code. There’s an internal ExecuteListeners utility, which abstracts over the ExecuteListener SPIs. Users can register such a listener and listen to query rendering, variable binding, query execution, and other lifecycle events. By default, users do not register any such ExecuteListener, but there’s always one internal ExecuteListener:

private static ExecuteListener[] listeners(ExecuteContext ctx) {
    List<ExecuteListener> result = new ArrayList<ExecuteListener>();

    for (ExecuteListenerProvider provider : ctx.configuration()
                                               .executeListenerProviders())
        if (provider != null)
            result.add(provider.provide());

    if (!FALSE.equals(ctx.settings().isExecuteLogging()))
        result.add(new LoggerListener());

    return result.toArray(EMPTY_EXECUTE_LISTENER);
}

The LoggerListener is added by default, unless users turn off that feature. Which means:

  • We’ll pretty much always get this ArrayList
  • We’ll pretty much always loop over this list
  • We’ll pretty much always call this LoggerListener

But what does it do? It logs stuff on DEBUG and TRACE level. For instance:

@Override
public void executeEnd(ExecuteContext ctx) {
    if (ctx.rows() >= 0)
        if (log.isDebugEnabled())
            log.debug("Affected row(s)", ctx.rows());
}

That’s what it does by definition. It’s a debug logger. So, the improved logic for initialising this thing is the following:

private static final ExecuteListener[] listeners(ExecuteContext ctx) {
    List<ExecuteListener> result = null;

    for (ExecuteListenerProvider provider : ctx.configuration()
                                               .executeListenerProviders())
        if (provider != null)
            (result = init(result)).add(provider.provide());

    if (!FALSE.equals(ctx.settings().isExecuteLogging())) {
        if (LOGGER_LISTENER_LOGGER.isDebugEnabled())
            (result = init(result)).add(new LoggerListener());
    }

    return result == null ? null : result.toArray(EMPTY_EXECUTE_LISTENER);
}

We’re no longer allocating the ArrayList (that might be premature, the JIT might have rewritten this allocation to not happen, but OK), and we’re only adding the LoggerListener if DEBUG or TRACE logging is enabled for it, i.e. if it would do any work at all.

That’s just a couple of CPU cycles we can save on every execution. Again, I don’t have the profiling measurements anymore, but trust me. It helped.

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6747

8. Eager allocation where lazy allocation works

Sometimes, we need two different representations of the same information. The “raw” representation, and a more useful, pre-processed representation for some purposes. This was done, for instance, in QualifiedField:

private final Name          name;
private final Table<Record> table;

QualifiedField(Name name, DataType<T> type) {
    super(name, type);

    this.name = name;
    this.table = name.qualified()
        ? DSL.table(name.qualifier())
        : null;
}

@Override
public final void accept(Context<?> ctx) {
    ctx.visit(name);
}

@Override
public final Table<Record> getTable() {
    return table;
}

As can be seen, the name is really the beef of this class. It’s a qualified name that generates itself on the SQL string. The Table representation is useful when navigating the meta model, but this is hardly ever done by jOOQ’s internals and/or user facing code.

However, this eager initialisation is costly:

Quite a few UnqualifiedName[] arrays are allocated by the call to Name.qualifier(). We can easily make that table reference non-final and calculate it lazily:

private final Name              name;
private Table<Record>           table;

QualifiedField(Name name, DataType<T> type) {
    super(name, type);

    this.name = name;
}

@Override
public final Table<Record> getTable() {
    if (table == null)
        table = name.qualified() ? DSL.table(name.qualifier()) : null;

    return table;
}

Because name is final, we could call table “effectively final” (in a different meaning than the Java language’s) – we won’t have any thread safety issues because these particular types are immutable inside of jOOQ.

This is done several times per “E-loop” iteration

One more issue down: https://github.com/jOOQ/jOOQ/issues/6755

Results

Now, thus far, we’ve “improved” many low hanging fruit based on a profiler session (that was run, ahem, from outside of Eclipse on a rather busy machine). This wasn’t very scientific. It was just tracking down “bottlenecks” which triggered my interest by having high enough numbers to even notice. This is called “micro optimisation”, and it is only worth the trouble if you’re in a “S-E-R-C loop”, meaning that the code you’re optimising is executed many many times. For me, developing jOOQ, this is almost always the case, because jOOQ is a library used by a lot of people who all profit from these optimisations. In many other cases, this might be called “premature optimisation”.

But once we’ve optimised, we shouldn’t stop. I’ve done a couple of individual JMH benchmarks for many of the above problems, to see if they were really an improvement. But sometimes, in a JMH benchmark, something that doesn’t look like an improvement will still be an improvement in the bigger picture. The JVM doesn’t inline all methods 100 levels deep. If your algorithm is complex, perhaps a micro optimisation will still have an effect in the bigger picture, even if it doesn’t show up in an isolated JMH benchmark.

Unfortunately this isn’t very exact science, but with enough intuition, you’ll find the right spots to optimise.

In my case, I verified progress over two patch releases: 3.10.0 -> 3.10.1 -> 3.10.2 (not yet released) by running a JMH benchmark over the entire query execution (including H2’s part). The results of applying roughly 15 of the above and similar optimisations (~2 days’ worth of effort) are:

JDK 9 (9+181)

jOOQ 3.10.0 Open Source Edition

Benchmark                          Mode   Cnt       Score      Error  Units
ExecutionBenchmark.testExecution   thrpt   21  101891.108 ± 7283.832  ops/s

jOOQ 3.10.2 Open Source Edition

Benchmark                          Mode   Cnt       Score      Error  Units
ExecutionBenchmark.testExecution   thrpt   21  110982.940 ± 2374.504  ops/s

JDK 8 (1.8.0_145)

jOOQ 3.10.0 Open Source Edition

Benchmark                          Mode   Cnt       Score      Error  Units
ExecutionBenchmark.testExecution   thrpt   21  110178.873 ± 2134.894  ops/s

jOOQ 3.10.2 Open Source Edition

Benchmark                          Mode   Cnt       Score      Error  Units
ExecutionBenchmark.testExecution   thrpt   21  118795.922 ± 2661.653  ops/s

As can be seen, in both JDK versions, we’ve gotten roughly a 10% speed increase. What’s also interesting is that JDK 8 seemed to be roughly 10% faster than JDK 9 in this benchmark, although this can be due to a variety of things that I haven’t considered yet, and which are out of scope for this discussion.

Conclusion

This iterative approach to tackling performance is definitely worth it for library authors:

  • run a representative benchmark (repeat a task millions of times)
  • profile it
  • track down “bottlenecks”
  • if they’re easy to fix without regression risk, do it
  • repeat
  • after a while, verify with JMH

Individual improvements are quite hard to measure, or measure correctly. But when you do 10-15 of them, they start adding up and become significant. 10% can make a difference.

Looking forward to your comments, alternative techniques, alternative tools, etc.!

If you liked this article, you will also like Top 10 Easy Performance Optimisations in Java