Top 5 Hidden jOOQ Features

jOOQ’s main value proposition is obvious: Type safe embedded SQL in Java.

People who actively look for such a SQL builder will inevitably stumble upon jOOQ and love it, of course. But a lot of people don’t really need a SQL builder – yet, jOOQ can still be immensely helpful in other situations, through its lesser known features.

Here’s a list of top 5 “hidden” jOOQ features.

1. Working with JDBC ResultSet

Even if you’re otherwise not using jOOQ but JDBC (or Spring JdbcTemplate, etc.) directly, one of the things that’s most annoying is working with ResultSet. A JDBC ResultSet models a database cursor, which is essentially a pointer to a collection on the server, which can be positioned anywhere, e.g. to the 50th record via ResultSet.absolute(50) (remember to start counting at 1).
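
For illustration, here's a minimal JDBC sketch of such cursor positioning (assuming a scrollable ResultSet; the connection and the BOOK table are placeholders):

try (Statement s = connection.createStatement(
         ResultSet.TYPE_SCROLL_INSENSITIVE,
         ResultSet.CONCUR_READ_ONLY);
     ResultSet rs = s.executeQuery("SELECT id FROM book")) {

    // Jump straight to the 50th record, if it exists
    if (rs.absolute(50))
        System.out.println(rs.getInt(1));
}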

The JDBC ResultSet is optimised for lazy data processing. This means that we don’t have to materialise the entire data set produced by the server in the client. This is a great feature for large (and even large-ish) data sets, but in many cases, it’s a pain. When we know we’re fetching only 10 rows and we know that we’re going to need them in memory anyway, a List<Record> type would be much more convenient.

jOOQ’s org.jooq.Result is such a List, and fortunately, you can import any JDBC ResultSet easily, using DSLContext.fetch(ResultSet):

try (ResultSet rs = stmt.executeQuery()) {
    Result<Record> result = DSL.using(connection).fetch(rs);
    System.out.println(result);
}

With that in mind, you can now access all the nice jOOQ utilities, such as formatting a result, e.g. as TEXT (see the second feature for more details).
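
For instance, a minimal sketch (Result.format() is the method that produces this TEXT representation):

System.out.println(result.format());

… which prints something like: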

+---+---------+-----------+
| ID|AUTHOR_ID|TITLE      |
+---+---------+-----------+
|  1|        1|1984       |
|  2|        1|Animal Farm|
+---+---------+-----------+

Of course, the inverse is always possible as well. Need a JDBC ResultSet from a jOOQ Result? Call Result.intoResultSet() and you can inject dummy results to any application that operates on JDBC ResultSet:

DSLContext ctx = DSL.using(connection);

// Get ready for Java 10 with var!
var result = ctx.newResult(FIRST_NAME, LAST_NAME);
result.add(ctx.newRecord(FIRST_NAME, LAST_NAME)
              .values("John", "Doe"));

// Pretend this is a real ResultSet
try (ResultSet rs = result.intoResultSet()) {
  while (rs.next())
    System.out.println(rs.getString(1) + " " + rs.getString(2));
}

2. Exporting a Result as XML, CSV, JSON, HTML, TEXT, ASCII Chart

As we’ve seen in the previous section, jOOQ Result types have nice formatting features. Instead of just text, you can also format as XML, CSV, JSON, HTML, and again TEXT.

The format can usually be adapted to your needs.
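
Here's a quick sketch of the corresponding Result methods (a non-exhaustive selection; each of these also has overloads taking formatting options):

String text = result.format();
String csv  = result.formatCSV();
String json = result.formatJSON();
String xml  = result.formatXML();
String html = result.formatHTML();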

For instance, this text format is possible as well:

ID AUTHOR_ID TITLE      
------------------------
 1         1 1984       
 2         1 Animal Farm

When formatting as CSV, you’ll get:

ID,AUTHOR_ID,TITLE
1,1,1984
2,1,Animal Farm

When formatting as JSON, you might get:

[{"ID":1,"AUTHOR_ID":1,"TITLE":"1984"},
 {"ID":2,"AUTHOR_ID":1,"TITLE":"Animal Farm"}]

Or, depending on your specified formatting options, perhaps you’ll prefer the more compact array of array style?

[[1,1,"1984"],[2,1,"Animal Farm"]]

Or XML, again with various common formatting styles, among which:

<result>
  <record>
    <ID>1</ID>
    <AUTHOR_ID>1</AUTHOR_ID>
    <TITLE>1984</TITLE>
  </record>
  <record>
    <ID>2</ID>
    <AUTHOR_ID>1</AUTHOR_ID>
    <TITLE>Animal Farm</TITLE>
  </record>
</result>

HTML seems kind of obvious. You’ll get:

ID AUTHOR_ID TITLE
 1         1 1984
 2         1 Animal Farm

Or, in code:

<table>
<tr><th>ID</th><th>AUTHOR_ID</th><th>TITLE</th></tr>
<tr><td>1</td><td>1</td><td>1984</td></tr>
<tr><td>2</td><td>1</td><td>Animal Farm</td></tr>
</table>

As a bonus, you could even export the Result as an ASCII chart (the chart output is not reproduced here).
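
A minimal sketch, assuming Result.formatChart(), which renders a simple ASCII chart with default settings:

System.out.println(result.formatChart());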

These features are obvious additions to ordinary jOOQ queries, but as I’ve shown in Section 1, you can get free exports from JDBC results as well!

3. Importing these text formats again

After the previous section’s export capabilities, it’s natural to think about how to import such data again back into a more usable format. For instance, when you write integration tests, you might expect a database query to return a result like this:

ID AUTHOR_ID TITLE      
-- --------- -----------
 1         1 1984       
 2         1 Animal Farm

Simply import the above textual representation of your result set into an actual jOOQ Result using DSLContext.fetchFromTXT(String) and you can continue operating on a jOOQ Result (or, as illustrated in Section 1, on a JDBC ResultSet!).

Most of the other export formats (except charts, of course) can be imported as well.
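
For example, here's a hedged sketch using the CSV and JSON counterparts (like fetchFromTXT, these live on DSLContext; the content is just our running example):

Result<Record> fromCSV  = ctx.fetchFromCSV("ID,TITLE\n1,1984\n2,Animal Farm");
Result<Record> fromJSON = ctx.fetchFromJSON(
    "[{\"ID\":1,\"TITLE\":\"1984\"},{\"ID\":2,\"TITLE\":\"Animal Farm\"}]");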

Now, don’t you wish for a second that Java had multi-line strings (in which case this would be much nicer looking):

Result<?> result = ctx.fetchFromTXT(
    "ID AUTHOR_ID TITLE      \n" +
    "-- --------- -----------\n" +
    " 1         1 1984       \n" +
    " 2         1 Animal Farm\n"
);
ResultSet rs = result.intoResultSet();

These types can now be injected anywhere a service or DAO produces a jOOQ Result or a JDBC ResultSet. The most obvious application for this is mocking. The second most obvious application is testing. You can easily test that a service produces an expected result of the above form.
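
A sketch of such a test, assuming JUnit and a hypothetical bookService (comparing the formatted representations is one simple way to assert result equality):

Result<?> expected = ctx.fetchFromTXT(
    "ID AUTHOR_ID TITLE      \n" +
    "-- --------- -----------\n" +
    " 1         1 1984       \n" +
    " 2         1 Animal Farm\n"
);
assertEquals(expected.format(), bookService.findBooks().format());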

Let’s talk about mocking:

4. Mocking JDBC

Sometimes, mocking is cool. With the above tools, it’s only natural for jOOQ to provide a full-fledged, JDBC-based mocking SPI. I’ve written about this feature before and again here.

Essentially, you can implement a single functional interface called MockDataProvider. The simplest way to create one is by using the Mock.of() methods, e.g.:

MockDataProvider provider = Mock.of(ctx.fetchFromTXT(
    "ID AUTHOR_ID TITLE      \n" +
    "-- --------- -----------\n" +
    " 1         1 1984       \n" +
    " 2         1 Animal Farm\n"
));

This provider simply ignores all the input (queries, bind variables, etc.) and always returns the same simple result set. You can now plug this provider into a MockConnection and use it like any ordinary JDBC connection:

try (Connection c = new MockConnection(provider);
     PreparedStatement s = c.prepareStatement("SELECT foo");
     ResultSet rs = s.executeQuery()) {

    while (rs.next()) {
        System.out.println("ID        : " + rs.getInt(1));
        System.out.println("First name: " + rs.getString(2));
        System.out.println("Last name : " + rs.getString(3));
    }
}

The output being (completely ignoring the SELECT foo statement):

ID        : 1
First name: 1
Last name : 1984
ID        : 2
First name: 1
Last name : Animal Farm

This client code doesn’t even use jOOQ (although it could)! Meaning, you can use jOOQ as a JDBC mocking framework on any JDBC-based application, including a Hibernate based one.

Of course, you don’t always want to return the exact same result. This is why a MockDataProvider offers you an argument with all the query information in it:

try (Connection c = new MockConnection(ctx -> {
    if (ctx.sql().toLowerCase().startsWith("select")) {
        // ...
    }
})) {
    // Do stuff with this connection
}

You can almost implement an entire JDBC driver with a single lambda expression. Read more here. Cool, eh?
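
For instance, here's a slightly more complete sketch (a MockDataProvider returns an array of MockResult; the TXT content is just our running example, and everything else is a made-up choice):

MockDataProvider provider = mctx -> {

    // Return the static result for any SELECT statement
    if (mctx.sql().toLowerCase().startsWith("select"))
        return new MockResult[] {
            new MockResult(2, DSL.using(SQLDialect.DEFAULT).fetchFromTXT(
                "ID AUTHOR_ID TITLE      \n" +
                "-- --------- -----------\n" +
                " 1         1 1984       \n" +
                " 2         1 Animal Farm\n"
            ))
        };

    // This sketch doesn't handle any other statement types
    throw new SQLException("Unsupported statement: " + mctx.sql());
};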

Side note: Don’t get me wrong: I don’t think you should mock your entire database layer just because you can. My thoughts on this are available in a tweet storm.

Speaking of “synthetic JDBC connections”:

5. Parsing Connections

jOOQ 3.9 introduced a SQL parser, whose main use case so far is to parse and reverse engineer DDL scripts for the code generator.

Another feature that has not been talked about often yet (because still a bit experimental) is the parsing connection, available through DSLContext.parsingConnection(). Again, this is a JDBC Connection implementation that wraps a physical JDBC connection but runs all SQL queries through the jOOQ parser before generating them again.

What’s the point?

Let’s assume for a moment that we’re using SQL Server, which supports the following SQL standard syntax:

SELECT * FROM (VALUES (1), (2), (3)) t(a)

The result is:

 a
---
 1
 2
 3

Now, let’s assume we are planning to migrate our application to Oracle and we have the following JDBC code that doesn’t work on Oracle, because Oracle doesn’t support the above syntax:

try (Connection c = DriverManager.getConnection("...");
     Statement s = c.createStatement();
     ResultSet rs = s.executeQuery(
         "SELECT * FROM (VALUES (1), (2), (3)) t(a)")) {

    while (rs.next())
        System.out.println(rs.getInt(1));
}

Now, we have three options (hint #1 sucks, #2 and #3 are cool):

  1. Tediously migrate all such manually written JDBC based SQL to Oracle syntax and hope we don’t have to migrate back again
  2. Upgrade our JDBC based application to use jOOQ instead (that’s the best option, of course, but it also takes some time)
  3. Simply use the jOOQ parsing connection as shown below, and a lot of code will work right out of the box! (and then, of course, gradually migrate to jOOQ, see option #2)

try (DSLContext ctx = DSL.using("...");
     Connection c = ctx.parsingConnection(); // Magic here
     Statement s = c.createStatement();
     ResultSet rs = s.executeQuery(
         "SELECT * FROM (VALUES (1), (2), (3)) t(a)")) {

    while (rs.next())
        System.out.println(rs.getInt(1));
}

We haven’t touched any of our JDBC based client logic. We’ve only introduced a proxy JDBC connection that runs every statement through the jOOQ parser prior to re-generating the statement on the wrapped, physical JDBC connection.

What’s really executed on Oracle is this emulation here:

select t.a from (
  (select null a from dual where 1 = 0) union all 
  (select * from (
    (select 1 from dual) union all 
    (select 2 from dual) union all 
    (select 3 from dual)
  ) t)
) t

Looks funky, eh? The rationale for this emulation is described here.

Every SQL feature that jOOQ can represent with its API and that it can emulate between databases will be supported! This includes far more trivial things, like parsing this query:

SELECT substring('abcdefg', 2, 4)

… and running this one on Oracle instead:

select substr('abcdefg', 2, 4) from dual
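
You can also invoke the parser directly, without any connection. A minimal sketch (DSLContext.parser() returns the parser; the parsed query renders in the context's dialect when printed):

System.out.println(
    DSL.using(SQLDialect.ORACLE)
       .parser()
       .parseQuery("SELECT substring('abcdefg', 2, 4)")
);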


Want to learn more about jOOQ?

There are many more such nice little things in the jOOQ API, which help make you super productive.

Do not GRANT ALL PRIVILEGES to your Production Users

Thanks to the generous contributions of Timur Shaidullin, jOOQ 3.11 will support GRANT and REVOKE statements through #6812. While implementing integration tests for these new features, I researched how these statements work on a variety of databases, and the good news is: they’re mostly quite standardised (in fact, they’re even part of the SQL standard).

The less good news is that a lot of people do not seem to care about security – at all!

Granted (great pun!), MySQL seems to be lacking a feature here. When I create a new user:

-- Obviously, you will choose a better password
CREATE USER 'NO_RIGHTS'@'%' IDENTIFIED BY 'NO_RIGHTS';

… then this user can connect to the server, but not to any databases yet. From JDBC, we most often use the connection string:

jdbc:mysql://host/database

After all, we don’t just want to connect to a server, but also to a database. This is not allowed, and that’s a reasonable default, of course:

Caused by: java.sql.SQLSyntaxErrorException: Access denied for user 'NO_RIGHTS'@'%' to database 'test'
	at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:112)
	at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:89)
	at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:116)
	at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:853)
	at com.mysql.cj.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:440)
	at com.mysql.cj.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:241)
	at com.mysql.cj.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:221)
	at org.jooq.test.jOOQAbstractTest.getConnection1(jOOQAbstractTest.java:1132)
	at org.jooq.test.jOOQAbstractTest.getConnection0(jOOQAbstractTest.java:1064)
	... 31 more

No doubt about that. But how can I grant the right to connect to this database? There is no such grant in the documentation:
https://dev.mysql.com/doc/refman/8.0/en/grant.html

That’s unfortunate, because in order to start working with the database, the first thing I’d like to do is something like the hypothetical:

GRANT CONNECT ON test TO 'NO_RIGHTS'@'%';

From then on, I could start working and add more and more grants, step by step, depending on what I would like to allow the user NO_RIGHTS to do on the database. I have created a feature request for this: https://bugs.mysql.com/bug.php?id=89030

A workaround

In fact, any grant to any object inside of the database implicitly grants the “CONNECT” privilege to this user. I could, for instance, do this:

GRANT SELECT ON test.bank_account_transactions TO 'NO_RIGHTS'@'%';

But I don’t want to do that! That’s already much too big of a GRANT for the fact that I don’t actually want this user to be able to do anything at this point.

Here’s a more viable (but ugly) workaround:

-- Create a dummy view that is never going to be used:
CREATE VIEW v_unused AS SELECT 1;

-- Now grant showing (not selecting) this view to the user:
GRANT SHOW VIEW ON test.v_unused TO 'NO_RIGHTS'@'%';

Among all the possible grants, that’s about the most harmless I could find that will now allow this user to connect to the test database (and to show this view):

SHOW TABLES;

Yielding

Tables_in_test
--------------
v_unused

Note, with my security background and being the pessimist I am, I don’t even grant the SELECT privilege on this view, but just the SHOW VIEW privilege.

I could possibly live with that. Or if I cannot create any view myself, perhaps I could grant “SHOW VIEW” on all views. I don’t like that thought, but it seems to be about the least intrusive privilege to get that implied “CONNECT” privilege.

What does the Internet recommend?

I was obviously googling for this topic. The best way to google for this is by googling the JDBC error message:

Access denied for user ‘NO_RIGHTS’@’%’ to database ‘test’

Because that’s what people do, right? Google error messages. I would have expected tons of advice on how to solve that particular problem: getting the “CONNECT” privilege, and the “CONNECT” privilege only.

Here are some of the first results, which all shocked me completely. In order to illustrate and emphasise that shock, I will use emojis:

1. The MySQL manual

At first, it recommends this:

mysql> CREATE USER 'finley'@'localhost' IDENTIFIED BY 'password';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'finley'@'localhost'
    ->     WITH GRANT OPTION;

“GRANT ALL” 😕

OK, perhaps not the best first example, I mean I really don’t trust this guy finley. But OK, it’s the manual, and it later proceeds to show more restrictive GRANT options.

2. Some random forum

Plenty of talk about:

CREATE DATABASE `zabbix_db`;
GRANT ALL PRIVILEGES ON `zabbix_db`.* TO `zabbix_user`@'localhost' IDENTIFIED BY 'XXXXXXXXX';
FLUSH PRIVILEGES;

“GRANT ALL” 😯

Great, so now we know that this particular user can do everything on this forum. Excellent. Let’s find a SQLi vulnerability somewhere.

3. A random ServerFault question

Two answers suggesting:

CREATE USER 'username'@'192.168.22.2' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON databasename.* TO 'username'@'192.168.22.2';

And also

GRANT ALL on database.* to 'user'@'localhost' identified by 'paswword';

“GRANT ALL” 😧

Great advice, folks!

4. Another random forum

… where the user asking the question had already followed one of the previous forums’ advice:

GRANT USAGE ON *.* TO 'someusr'@'localhost'
GRANT ALL PRIVILEGES ON `qorbit_store`.* TO 'someusr'@'localhost'

Great, so qorbit_store also has a user with total privileges.

“GRANT ALL” 😨

5. A random GitHub issue

So… this is the Travis CI server itself, isn’t it?

mysql -u root -e "CREATE DATABASE mydb;"
mysql -u root -e "GRANT ALL PRIVILEGES ON mydb.* TO 'travis'@'%';";

What are you folks thinking??

“GRANT ALL” 😬

6. A random MariaDB forum

I am trying to get some software installed but am running into problems. […] I used the command “grant all privileges on newdb.* to sam@localhost;” […]

True to the idea, let’s just hammer out commands until this dang thing works. The unix equivalent would be:

chmod -R 777 *

Solves all problems right?

“GRANT ALL” 😡

7. Another random forum

[…] you may need to give your MySQL user proper permissions. Log in to MySQL as the MySQL root user and issue these commands:

GRANT ALL PRIVILEGES ON database_name TO user@host IDENTIFIED BY 'password';
FLUSH PRIVILEGES;

“Proper permissions” – GRANT ALL is never proper!

“GRANT ALL” 🤬

To be fair, this particular forum then proceeds advising

If you want (or need) to be more restrictive with database permissions: You will need to at least grant the ALTER, CREATE, DELETE, INSERT, SELECT, and UPDATE permissions to your database user. ALTER and CREATE permissions are only needed for installing and upgrading Geeklog, as well as for installing plugins and other add-ons.

But as we’ve seen before, people don’t read all that advice, they just use the first thing that works.

8. Another random forum, on drupal

This is a true gem. Not only GRANT ALL, but also an unredacted password. I redacted it, but you can easily look this up if you want to do some data exchange business with swhisa_swhisa:

define('MYSQL_HOST','localhost');
define('MYSQL_USER','swhisa_swhisa');
define('MYSQL_PASS','12345678'); // Redaction mine
define('MYSQL_DB','swhisa_swhisadb');
if(!mysql_connect (MYSQL_HOST, MYSQL_USER, MYSQL_PASS)){die (mysql_error()); } mysql_select_db(MYSQL_DB);

$grt = "GRANT ALL ON *.* TO 'swhisa_swhisa'@'%'";
mysql_query($grt) or die(mysql_error());

“GRANT ALL” and public password 😭

In case you wondered, this emoji is me losing faith in humanity. Not a single post recommended first to grant the minimal rights necessary for the user. All of them (including the official documentation) just defaulted to granting everything.

Wrap up

For crying out loud. I googled for an error message about being unable to connect, and almost every single answer Google gave me was someone recommending to just GRANT ALL. That’s like googling for an alternative to tacos and being recommended to join the NASA / SpaceX missions to Mars because of the excellent astronaut food.

Of course, as developers, we like to GRANT ALL. That’s convenient. If we don’t ship to production but just want to play around with some features, we might not need security.

But the fact is that if the readers of such posts are never exposed to even the mere idea of thinking about a security model and user design where not every user can access (let alone modify, drop, shutdown) every resource, then these readers are simply going to ship GRANT ALL to production.

SQL injection is a serious threat on its own, but if the database user that exposes the vulnerability has complete privileges on your database, then you’re doubly doomed.

Always start with users that have no rights – then add rights as you need them, and add them sparingly!

Now go out there and start REVOKING (Perhaps wait until after everyone leaves for their Christmas Holiday, though 😉)

Creating a Microsoft Excel Style Pivot Table With Grand Totals in SQL

This answer I’ve recently given to a beautiful Stack Overflow question needs further explanation in a blog post. When working with Microsoft Excel, we can create beautiful and also very insightful pivot tables with grand totals. What are they? This is best explained visually.

Assume you have this normalised form for your raw data. As in the question, it’s an inventory table in a bike shop (created as a VALUES list below).

Now, in order to analyse our inventory, we’d love to pivot the above normalised representation into a non-normalised one (shown as the final result towards the end of this post), and we’d also like to display the grand totals to learn how many bikes of each type we have, how many bikes of each colour, and how many bikes in total.

There are tons of great tutorials out there explaining how to do this with Microsoft Excel. What we care about is:

How to do this with SQL

We’re using two SQL features for this:

  • GROUP BY CUBE(), a shorthand for the relevant GROUPING SETS, to compute all the totals and grand totals
  • The PIVOT clause (or a manual equivalent), to pivot the result into columns

Let’s create some data first. I’m going to use SQL Server syntax for most of this blog post. At the end, there will be a full solution for SQL Server, Oracle, and PostgreSQL:

WITH Bikes AS (
  SELECT * FROM (
    VALUES ('Mountain Bikes', 'Black'), 
           ('Mountain Bikes', 'Black'),
           ('Mountain Bikes', 'Silver'),
           ('Road Bikes', 'Red'),
           ('Road Bikes', 'Red'),
           ('Road Bikes', 'Red'),
           ('Road Bikes', 'Black'),
           ('Road Bikes', 'Yellow'),
           ('Touring Bikes', 'Blue'),
           ('Touring Bikes', 'Blue'),
           ('Touring Bikes', 'Yellow')
  ) AS Bikes (Name, Colour)
)
SELECT * FROM Bikes

This simply produces an in-memory table representation of our original, normalised data set.

Now, the first step is to create the following totals and grand totals:

  • Total bikes per name and colour
  • (Grand) total bikes per name
  • (Grand) total bikes per colour
  • (Grand) total bikes

In this particular case, we can use CUBE(), which forms all the possible GROUPING SETS combinations:

SELECT Name, Colour, COUNT(*) AS Total
FROM Bikes
GROUP BY CUBE (Name, Colour)
ORDER BY Name, Colour

The result looks like this:

Name            Colour  Total
-----------------------------
NULL            NULL    11
NULL            Black   3
NULL            Blue    2
NULL            Red     3
NULL            Silver  1
NULL            Yellow  2
Mountain Bikes  NULL    3
Mountain Bikes  Black   2
Mountain Bikes  Silver  1
Road Bikes      NULL    5
Road Bikes      Black   1
Road Bikes      Red     3
Road Bikes      Yellow  1
Touring Bikes   NULL    3
Touring Bikes   Blue    2
Touring Bikes   Yellow  1

Excellent! All the (grand) totals are now in the result set. Notice that we could have manually written this using the following, much more tedious syntax:

SELECT Name, Colour, COUNT(*) AS Total
FROM Bikes
GROUP BY Name, Colour
UNION ALL
SELECT Name, NULL, COUNT(*) AS Total
FROM Bikes
GROUP BY Name
UNION ALL
SELECT NULL, Colour, COUNT(*) AS Total
FROM Bikes
GROUP BY Colour
UNION ALL
SELECT NULL, NULL, COUNT(*) AS Total
FROM Bikes
ORDER BY Name, Colour

So, CUBE() (and ROLLUP() and GROUPING SETS()) is just syntactic sugar for the above, more verbose UNION ALL representation, with the additional important difference that you are very likely to get a much better execution plan using CUBE() than using the manual UNION ALL (execution plans not reproduced here). The result would be similar in Oracle and other databases.

This isn’t surprising. We can aggregate all the grand totals in one go with CUBE() (in fact, the “grand grand total” is calculated separately in this case), whereas it’s hard for the optimiser to prove that the UNION ALL version is really the same thing and the individual subqueries can be factored out.

Before we move on, just a slight improvement: let’s rename the grand totals from NULL to Total and wrap the thing in a derived table t:

SELECT *
FROM (
  SELECT 
    COALESCE(Name, 'Total') AS Name, 
    COALESCE(Colour, 'Total') AS Colour, 
    COUNT(*) AS Count
  FROM Bikes
  GROUP BY CUBE (Name, Colour)
) AS t

Now, pivot this representation into a more readable one

The data still looks normalised, with repeating names and colours in the result table. Let’s pivot it using … wait for it … the PIVOT clause (available in Oracle and SQL Server).

The PIVOT clause is a bit funky. It can be appended to any table expression (including derived tables) to pivot it. It will apply an implicit GROUP BY operation and generate a set of aggregated SELECT columns. When we pivot our previous derived table t:

SELECT *
FROM t
PIVOT (
  SUM(Count) FOR Colour IN (
    Red, Blue, Black, Silver, Yellow, 
    Grey, Multi, Uncoloured, Total
  )
) AS p

Then we’re getting the following, nice-looking result:

Name            Red     Blue    Black   Silver  Yellow  Grey    Multi   Uncoloured  Total
-----------------------------------------------------------------------------------------
Mountain Bikes  NULL    NULL    2       1       NULL    NULL    NULL    NULL        3
Road Bikes      3       NULL    1       NULL    1       NULL    NULL    NULL        5
Touring Bikes   NULL    2       NULL    NULL    1       NULL    NULL    NULL        3
Total           3       2       3       1       2       NULL    NULL    NULL        11

That’s almost the desired result – all that’s missing is some null handling. How does it work? We have the following syntax:

[ table ] PIVOT (
  [ aggregate function(s) ] FOR [ column(s) ] IN ( [ values ] )
)

Where

  • The [ table ] is the table being pivoted
  • The [ column(s) ] are the columns from the [ table ] being grouped, as in any ordinary GROUP BY clause
  • The [ values ] are the values of the [ column(s) ], for which filtered aggregations are made
  • The [ aggregate function(s) ] are the aggregations that are made per [ column(s) ] (group) and per [ value ] (filter)

This syntax is Oracle and SQL Server specific. Oracle can do a bit more than SQL Server. If this syntax is not available in your database, you can write it out manually again (just like the above CUBE() emulation) to get this:

SELECT
  Name,
  SUM(CASE WHEN Colour = 'Red'        THEN Count END) AS Red,
  SUM(CASE WHEN Colour = 'Blue'       THEN Count END) AS Blue,
  SUM(CASE WHEN Colour = 'Black'      THEN Count END) AS Black,
  SUM(CASE WHEN Colour = 'Silver'     THEN Count END) AS Silver,
  SUM(CASE WHEN Colour = 'Yellow'     THEN Count END) AS Yellow,
  SUM(CASE WHEN Colour = 'Grey'       THEN Count END) AS Grey,
  SUM(CASE WHEN Colour = 'Multi'      THEN Count END) AS Multi,
  SUM(CASE WHEN Colour = 'Uncoloured' THEN Count END) AS Uncoloured,
  SUM(CASE WHEN Colour = 'Total'      THEN Count END) AS Total
FROM t
GROUP BY Name

There should be no performance penalty in the manually written version (although, as always, do check). More details about this in a previous article.

Putting it all together

Here’s the complete query in SQL Server:

WITH Bikes(Name, Colour) AS (
  SELECT * FROM (
    VALUES ('Mountain Bikes', 'Black'), 
           ('Mountain Bikes', 'Black'),
           ('Mountain Bikes', 'Silver'),
           ('Road Bikes', 'Red'),
           ('Road Bikes', 'Red'),
           ('Road Bikes', 'Red'),
           ('Road Bikes', 'Black'),
           ('Road Bikes', 'Yellow'),
           ('Touring Bikes', 'Blue'),
           ('Touring Bikes', 'Blue'),
           ('Touring Bikes', 'Yellow')
  ) AS Bikes(Name, Colour)
)
SELECT
  Name, 
  COALESCE(Red, 0) AS Red, 
  COALESCE(Blue, 0) AS Blue, 
  COALESCE(Black, 0) AS Black, 
  COALESCE(Silver, 0) AS Silver, 
  COALESCE(Yellow, 0) AS Yellow, 
  COALESCE(Grey, 0) AS Grey, 
  COALESCE(Multi, 0) AS Multi, 
  COALESCE(Uncoloured, 0) AS Uncoloured, 
  Total
FROM (
  SELECT 
    COALESCE(Name, 'Total') AS Name, 
    COALESCE(Colour, 'Total') AS Colour, 
    COUNT(*) AS Count
  FROM Bikes
  GROUP BY CUBE (Name, Colour)
) AS t
PIVOT (
  SUM(Count) FOR Colour IN (
    Red, Blue, Black, Silver, Yellow, 
    Grey, Multi, Uncoloured, Total
  )
) AS p
ORDER BY CASE Name WHEN 'Total' THEN 1 ELSE 0 END, Name

Or Oracle:

WITH Bikes(Name, Colour) AS (
  SELECT 'Mountain Bikes', 'Black'  FROM dual UNION ALL
  SELECT 'Mountain Bikes', 'Black'  FROM dual UNION ALL
  SELECT 'Mountain Bikes', 'Silver' FROM dual UNION ALL
  SELECT 'Road Bikes',     'Red'    FROM dual UNION ALL
  SELECT 'Road Bikes',     'Red'    FROM dual UNION ALL
  SELECT 'Road Bikes',     'Red'    FROM dual UNION ALL
  SELECT 'Road Bikes',     'Black'  FROM dual UNION ALL
  SELECT 'Road Bikes',     'Yellow' FROM dual UNION ALL
  SELECT 'Touring Bikes',  'Blue'   FROM dual UNION ALL
  SELECT 'Touring Bikes',  'Blue'   FROM dual UNION ALL
  SELECT 'Touring Bikes',  'Yellow' FROM dual
)
SELECT
  Name, 
  COALESCE(Red, 0) AS Red, 
  COALESCE(Blue, 0) AS Blue, 
  COALESCE(Black, 0) AS Black, 
  COALESCE(Silver, 0) AS Silver, 
  COALESCE(Yellow, 0) AS Yellow, 
  COALESCE(Grey, 0) AS Grey, 
  COALESCE(Multi, 0) AS Multi, 
  COALESCE(Uncoloured, 0) AS Uncoloured, 
  Total
FROM (
  SELECT 
    COALESCE(Name, 'Total') AS Name, 
    COALESCE(Colour, 'Total') AS Colour, 
    COUNT(*) AS Count
  FROM Bikes
  GROUP BY CUBE (Name, Colour)
) t
PIVOT (
  SUM(Count) FOR Colour IN (
    'Red' AS Red, 
    'Blue' AS Blue, 
    'Black' AS Black, 
    'Silver' AS Silver, 
    'Yellow' AS Yellow, 
    'Grey' AS Grey, 
    'Multi' AS Multi, 
    'Uncoloured' AS Uncoloured, 
    'Total' AS Total
  )
) p
ORDER BY CASE Name WHEN 'Total' THEN 1 ELSE 0 END, Name

Or PostgreSQL:

WITH Bikes(Name, Colour) AS (
  SELECT * FROM (
    VALUES ('Mountain Bikes', 'Black'), 
           ('Mountain Bikes', 'Black'),
           ('Mountain Bikes', 'Silver'),
           ('Road Bikes', 'Red'),
           ('Road Bikes', 'Red'),
           ('Road Bikes', 'Red'),
           ('Road Bikes', 'Black'),
           ('Road Bikes', 'Yellow'),
           ('Touring Bikes', 'Blue'),
           ('Touring Bikes', 'Blue'),
           ('Touring Bikes', 'Yellow')
  ) AS Bikes(Name, Colour)
)
SELECT
  Name,
  COALESCE(SUM(Count) FILTER (WHERE Colour = 'Red'       ), 0) AS Red,
  COALESCE(SUM(Count) FILTER (WHERE Colour = 'Blue'      ), 0) AS Blue,
  COALESCE(SUM(Count) FILTER (WHERE Colour = 'Black'     ), 0) AS Black,
  COALESCE(SUM(Count) FILTER (WHERE Colour = 'Silver'    ), 0) AS Silver,
  COALESCE(SUM(Count) FILTER (WHERE Colour = 'Yellow'    ), 0) AS Yellow,
  COALESCE(SUM(Count) FILTER (WHERE Colour = 'Grey'      ), 0) AS Grey,
  COALESCE(SUM(Count) FILTER (WHERE Colour = 'Multi'     ), 0) AS Multi,
  COALESCE(SUM(Count) FILTER (WHERE Colour = 'Uncoloured'), 0) AS Uncoloured,
  COALESCE(SUM(Count) FILTER (WHERE Colour = 'Total'     ), 0) AS Total
FROM (
  SELECT 
    COALESCE(Name, 'Total') AS Name, 
    COALESCE(Colour, 'Total') AS Colour, 
    COUNT(*) AS Count
  FROM Bikes
  GROUP BY CUBE (Name, Colour)
) AS t
GROUP BY Name
ORDER BY CASE Name WHEN 'Total' THEN 1 ELSE 0 END, Name

Or MySQL (which doesn’t support CUBE, only ROLLUP, hence this slightly tweaked variant of the PostgreSQL query):

WITH Bikes(Name, Colour) AS (
  SELECT 'Mountain Bikes', 'Black'  UNION ALL
  SELECT 'Mountain Bikes', 'Black'  UNION ALL
  SELECT 'Mountain Bikes', 'Silver' UNION ALL
  SELECT 'Road Bikes',     'Red'    UNION ALL
  SELECT 'Road Bikes',     'Red'    UNION ALL
  SELECT 'Road Bikes',     'Red'    UNION ALL
  SELECT 'Road Bikes',     'Black'  UNION ALL
  SELECT 'Road Bikes',     'Yellow' UNION ALL
  SELECT 'Touring Bikes',  'Blue'   UNION ALL
  SELECT 'Touring Bikes',  'Blue'   UNION ALL
  SELECT 'Touring Bikes',  'Yellow'
)
SELECT
  Name,
  COALESCE(SUM(CASE WHEN Colour = 'Red'        THEN Count END), 0) AS Red,
  COALESCE(SUM(CASE WHEN Colour = 'Blue'       THEN Count END), 0) AS Blue,
  COALESCE(SUM(CASE WHEN Colour = 'Black'      THEN Count END), 0) AS Black,
  COALESCE(SUM(CASE WHEN Colour = 'Silver'     THEN Count END), 0) AS Silver,
  COALESCE(SUM(CASE WHEN Colour = 'Yellow'     THEN Count END), 0) AS Yellow,
  COALESCE(SUM(CASE WHEN Colour = 'Grey'       THEN Count END), 0) AS Grey,
  COALESCE(SUM(CASE WHEN Colour = 'Multi'      THEN Count END), 0) AS Multi,
  COALESCE(SUM(CASE WHEN Colour = 'Uncoloured' THEN Count END), 0) AS Uncoloured,
  COALESCE(SUM(CASE WHEN Name != 'Total' OR Colour != 'Total' THEN Count END), 0) AS Total
FROM (
  SELECT 
    COALESCE(Name, 'Total') AS Name, 
    COALESCE(Colour, 'Total') AS Colour, 
    COUNT(*) AS Count
  FROM Bikes
  GROUP BY Colour, Name WITH ROLLUP
) AS t
GROUP BY Name
ORDER BY CASE Name WHEN 'Total' THEN 1 ELSE 0 END, Name

Conclusion

Whenever you’re working with data and SQL, try finding an elegant solution with SQL. There are many tools for a variety of data processing and data presentation use-cases.

The Cost of JDBC Server Roundtrips

Or: Move That Loop into the Server Already!

This article will illustrate the significance of something that I always thought to be common sense, but I keep seeing people getting this (very) wrong in their production systems. Chances are, in fact, that most applications out there suffer from this performance problem – and the fix is really easy.

Transferring data in bulk vs. transferring it row by row

This blog post is inspired by a recent Stack Overflow question that was asking about how to call Oracle’s DBMS_OUTPUT.GET_LINES from JDBC. The answer to that specific question can also be seen in a previous blog post here.

This time, I don’t want to discuss that particular Oracle-specific technical problem, but the performance aspect of whether to call:

  • DBMS_OUTPUT.GET_LINES: Which allows for fetching a bulk of server output into an array
  • DBMS_OUTPUT.GET_LINE: Which fetches a single line of server output into a string

While I will continue to discuss this Oracle-specific functionality, this blog post is in no way strictly related to Oracle. It is generally applicable (and again, it should be common sense) for any database, and in fact, any client-server architecture, or even any distributed architecture. The solution to such performance problems will almost always be that transferring a bulk of data is better than transferring it row by row.

Let’s run a benchmark!

The full benchmark logic can be seen in this gist. It includes the boring parts of actually calling the GET_LINE[S] procedures. The beef of the benchmark is this:

int max = 50;
long[] getLines = new long[max];
long[] getLine = new long[max];

try (Connection c = DriverManager.getConnection(url, properties);
    Statement s = c.createStatement()) {

    for (int warmup = 0; warmup < 2; warmup++) {
        for (int i = 0; i < max; i++) {
            s.executeUpdate("begin dbms_output.enable(); end;");
            String sql =
                "begin "
              + "for i in 1 .. 100 loop "
              + "dbms_output.put_line('Message ' || i); "
              + "end loop; "
              + "end;";
            long t1 = System.nanoTime();
            logGetLines(c, 100, () -> s.executeUpdate(sql));
            long t2 = System.nanoTime();
            logGetLine(c, 100, () -> s.executeUpdate(sql));
            long t3 = System.nanoTime();
            s.executeUpdate("begin dbms_output.disable(); end;");

            if (warmup > 0) {
                getLines[i] = t2 - t1;
                getLine[i] = t3 - t2;
            }
        }
    }
}

System.out.println(LongStream.of(getLines).summaryStatistics());
System.out.println(LongStream.of(getLine).summaryStatistics());

What does it do in prose?

  • It contains a warmup loop whose first iteration doesn’t contribute to our measurements (this is always a good idea)
  • It runs the benchmarked logic 50 times
  • It generates 100 DBMS_OUTPUT.PUT_LINE messages for each run in an anonymous PL/SQL loop …
  • … and then fetches those 100 messages immediately with either 1 call to GET_LINES or 100 calls to GET_LINE
  • Finally, all the stored execution times are aggregated and printed out conveniently with Java 8 Stream’s summary statistics feature

So, in both cases, we’re generating and fetching 5000 messages.

Both methods, using GET_LINES and GET_LINE respectively, are functionally equivalent, i.e. the only difference is performance. Again, the full benchmark can be seen here (which also benchmarks the effect of the JDBC fetchSize, which has none in this case).

The results are devastating:

{count=50, sum=  69120455, min= 1067521, average= 1382409.100000, max= 2454614}
{count=50, sum=2088201423, min=33737827, average=41764028.460000, max=64498375}

We’re gaining a factor of 30x in this benchmark run on my machine. The actual results may vary depending on your hardware and software (e.g. I’m running Oracle 12cR2 in Docker), but regardless of the setup, the results are always very significant.

Does this mean that GET_LINE is slow?

When looking at benchmark results, we must always be very very careful not to draw the wrong conclusions. There can be many reasons why benchmarks show differences in performance. One simple (albeit improbable) reason could be that the GET_LINE implementation simply sucks.

So, I’ve tried to re-implement this benchmark in pure PL/SQL. The full benchmark can be seen here. The beef of it is this:

FOR r IN 1..5 LOOP
  v_ts := SYSTIMESTAMP;
      
  FOR i IN 1..v_repeat LOOP
    m();
     
    v_i := v_max;
    dbms_output.get_lines(v_array, v_i);
  END LOOP;
      
  INSERT INTO results VALUES (1, (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    m();
    
    FOR j IN 1 .. v_max LOOP
      dbms_output.get_line(v_string, v_i);
    END LOOP;
  END LOOP;
     
  INSERT INTO results VALUES (2, (SYSTIMESTAMP - v_ts));
END LOOP;

Where m() is:

PROCEDURE m IS BEGIN
  FOR i IN 1 .. v_max LOOP 
    dbms_output.put_line('Message ' || i);
  END LOOP;
END m;

The results are now rather different:

stmt    sum     avg      min     max
1       0.0609  0.01218  0.0073  0.0303
2       0.0333  0.00666  0.0063  0.007

This time, calling GET_LINE individually seems to have been 2x faster than the GET_LINES version. Again, it is important not to draw the wrong conclusions! This could be due to:

  • GET_LINES allocating an additional array copy of the original lines (which resides in the PGA) might be costly
  • GET_LINE might have profited from some additional optimisation because we’re never actually consuming the result in the benchmark

But the one thing we can conclude with certainty is: There’s no problem in GET_LINE, so calling it is not inherently worse than calling GET_LINES.

Which brings us back to the JDBC calls

While guesswork and hypotheses are usually dangerous, in this case I’m certain of the reason why the JDBC based approach shows such drastic differences. Just watch the excellent talk by Toon Koppelaars from Oracle, “NoPLSql and Thick Database Approaches with Toon Koppelaars”, where he explains this with some impressive flame graphs.

The obvious difference between the JDBC benchmark and the PL/SQL one is the fact that the JDBC call has to traverse a vast amount of logic, APIs, “barriers” between the JVM and the Oracle kernel before it can actually invoke the really interesting part. This includes:

  • JVM overhead
  • JDBC logic
  • Network overhead
  • Various “outer” layers inside the Oracle database
  • Oracle’s API layers to get into the SQL and PL/SQL execution engines
  • The actual code running in the PL/SQL engine

In Toon’s talk (which again, you should definitely watch), the examples are running SQL code, not PL/SQL code, but the results are the same. The actual logic is relatively cheap inside of the database (as we’ve seen in the PL/SQL only benchmark), but the overhead is significant when calling database logic from outside the database.

Thus: It is very important to minimise that overhead

There are two ways to minimise that overhead:

  • The super hard way: Change and / or tweak the API technology, e.g. in Oracle’s case, using the C/OCI bindings can be much faster than JDBC
  • The easy way: Just move some data collection logic into the database and fetch data in bulk

Let me repeat this one more time:

Fetch (or send) data in bulk

… it’s almost always faster than processing things row-by-row (or as the Oracle folks call it: “slow-by-slow”).

And it does not matter at all, if that database logic is written in SQL or in a procedural language. The point is that accessing objects over the network (any network) is expensive, so you should minimise the access calls if ever possible.

As I’ve tweeted recently:

When calling logic over the network (any network), we should move logic to the data, not data to the logic. When working with RDBMS, we’re doing this through SQL (preferably), or if SQL doesn’t suffice, we resort to using stored procedures.

When working with HTTP, we’re doing this with – well, it doesn’t have a name, but we should prefer making few physical HTTP calls that aggregate several logical API calls in order to transfer a lot of data in bulk.

When working with “map reduce” or “serverless” etc technology, we’re calling this “functions” or “lambdas”, which are just fancy, more modern names for stored procedures.

Conclusion

I’m not an expert in how to design complex distributed systems. This stuff is really hard. But I’ve spent many years working with RDBMS, which are also, in a very simple way, distributed systems (data on one server, client logic on another one).

A very significant amount of performance problems with RDBMS is related to the simple fact of clients making way too many calls to the database for what could be implemented in a single SQL query. Within the database, once your logic has reached the kernel, stuff gets executed really really fast. Adding more logic to a query is going to cause far less trouble than adding more queries.

Does your application take into account these things? Especially, if you’re using an ORM that generates the SQL for you, does it generate the right amount of queries, or are you suffering from “N+1 problems”? The main reason why ORM-based systems tend to be slow is because developers are not aware of the SQL their ORMs generate, e.g. due to excessive lazy loading, when a single SQL (or JPQL) query would have been a much better choice. It’s not the ORM’s fault. Most ORMs can get this right.

It’s the developer’s responsibility to think about where the logic should be executed. The rule of thumb is:

If you’re looping in the client to fetch individual things from the same server, you’re doing it wrong. Move the loop into the server.

And by doing so, you’ve written your first (ghasp) “stored procedure”. You’ll write many more, and you’ll love it, once you realise how much speed you’re gaining.
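
To make this rule of thumb concrete, here's a hedged JDBC sketch of the anti-pattern and its fix (the BOOK table and the ids are made up):

// Don't: one roundtrip per ID ("slow-by-slow")
for (int id : ids) {
    try (PreparedStatement s = c.prepareStatement(
            "SELECT title FROM book WHERE id = ?")) {
        s.setInt(1, id);
        try (ResultSet rs = s.executeQuery()) {
            while (rs.next())
                System.out.println(rs.getString(1));
        }
    }
}

// Do: a single roundtrip, moving the loop into the server
try (PreparedStatement s = c.prepareStatement(
        "SELECT title FROM book WHERE id IN (?, ?, ?)")) {
    s.setInt(1, 1);
    s.setInt(2, 2);
    s.setInt(3, 3);
    try (ResultSet rs = s.executeQuery()) {
        while (rs.next())
            System.out.println(rs.getString(1));
    }
}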


Update: Some criticism from the reddit discussion of this article

/u/cogman10 made good points in his comment warning about batching “too big” workloads, which is perfectly correct when batching write-heavy tasks. Large batches may increase the contention inside of your database. If you’re working in an MVCC environment (such as Oracle), having transactions that span millions of updates will put a lot of pressure on the UNDO/REDO log, and all other sessions reading from the same data will feel that pain.

Also, when working with HTTP, beware of the fact that batches are harder to cache than individual requests. This article made the assumption that HTTP requests are:

  • Authorised – i.e. caching doesn’t even make sense as the authorisation might be revoked or the session terminated
  • Operating on similar resources, e.g. fetching a bulk of IDs in one go might be more sensible than fetching each ID individually

… of course, as always, don’t follow advice you find on the internet blindly 🙂 This article illustrated a common mistake. The fix isn’t always as simple as illustrated here, but often it really is.

How to Fetch Oracle DBMS_OUTPUT from JDBC

When working with Oracle stored procedures, it is not uncommon to have debug log information available from DBMS_OUTPUT commands. For instance, if we have a procedure like this:

CREATE TABLE my_table (i INT);

CREATE OR REPLACE PROCEDURE my_procedure (i1 INT, i2 INT) IS
BEGIN
  INSERT INTO my_table 
  SELECT i1 FROM dual UNION ALL 
  SELECT i2 FROM dual;
  
  dbms_output.put_line(sql%rowcount || ' rows inserted');
END my_procedure;
/

The procedure works just the same, regardless of whether we’re reading the output from the DBMS_OUTPUT call. It is there purely for logging purposes. Now, if we call the above procedure from a tool like SQL Developer or sqlplus, we could write:

SET SERVEROUTPUT ON
BEGIN
  my_procedure(1, 2);
END;
/

To get a result like this:

PL/SQL-Prozedur erfolgreich abgeschlossen.
2 rows inserted

(pardon my German)

How to get this output from JDBC

By default, we don’t get such output from JDBC as the overhead of transferring all this output is usually not worth the trouble. If we still wanted to call the procedure AND get the server output, we cannot simply write SET SERVEROUTPUT ON, as that is a command specific to sqlplus. We have to wrap our procedure calls in two other calls:

try (Connection c = DriverManager.getConnection(url, properties);
     Statement s = c.createStatement()) {

    try {
        // First, we have to enable the DBMS_OUTPUT. Otherwise,
        // all calls to DBMS_OUTPUT made on our connection won't
        // have any effect.
        s.executeUpdate("begin dbms_output.enable(); end;");

        // Now, this is the actually interesting procedure call
        s.executeUpdate("begin my_procedure(1, 2); end;");

        // After we're done with our call(s), we can proceed to
        // fetch the SERVEROUTPUT explicitly, using
        // DBMS_OUTPUT.GET_LINES
        try (CallableStatement call = c.prepareCall(
            "declare "
          + "  num integer := 1000;"
          + "begin "
          + "  dbms_output.get_lines(?, num);"
          + "end;"
        )) {
            call.registerOutParameter(1, Types.ARRAY,
                "DBMSOUTPUT_LINESARRAY");
            call.execute();

            Array array = null;
            try {
                array = call.getArray(1);
                Stream.of((Object[]) array.getArray())
                      .forEach(System.out::println);
            }
            finally {
                if (array != null)
                    array.free();
            }
        }
    }

    // Don't forget to disable DBMS_OUTPUT for the remaining use
    // of the connection.
    finally {
        s.executeUpdate("begin dbms_output.disable(); end;");
    }
}

As can be seen above, this is rather simple:

  • Initialise a connection with DBMS_OUTPUT.ENABLE
  • Do the actually interesting work
  • Fetch the output and call DBMS_OUTPUT.DISABLE

This could also be refactored into a utility:

// Alternatively, just use https://github.com/jOOQ/jOOL
interface WhyUNoCheckedExceptionRunnable {
    void run() throws Exception;
}

static void logServerOutput(
    Connection connection, 
    WhyUNoCheckedExceptionRunnable runnable
) throws Exception {
    try (Statement s = connection.createStatement()) {
       try {
           s.executeUpdate("begin dbms_output.enable(); end;");
           runnable.run();

           try (CallableStatement call = connection.prepareCall(
               "declare "
             + "  num integer := 1000;"
             + "begin "
             + "  dbms_output.get_lines(?, num);"
             + "end;"
           )) {
               call.registerOutParameter(1, Types.ARRAY,
                   "DBMSOUTPUT_LINESARRAY");
               call.execute();

               Array array = null;
               try {
                   array = call.getArray(1);
                   Stream.of((Object[]) array.getArray())
                         .forEach(System.out::println);
               }
               finally {
                   if (array != null)
                       array.free();
               }
           }
       }
       finally {
           s.executeUpdate("begin dbms_output.disable(); end;");
       }
   }
}

This can now be called conveniently as such:

try (Connection c = DriverManager.getConnection(url, properties);
     Statement s = c.createStatement()) {

    logServerOutput(c, () -> 
        s.executeUpdate("begin my_procedure(1, 2); end;"));
}

How to do the same with jOOQ?

jOOQ 3.11 will have built-in support for fetching this server output through its ExecuteListener SPI with https://github.com/jOOQ/jOOQ/issues/6580

We can either use jOOQ’s plain SQL API as such:

try (Connection c = DriverManager.getConnection(url, properties)) {

    // Specify this setting to fetch server output explicitly
    DSLContext ctx = DSL.using(c, 
        new Settings().withFetchServerOutputSize(10));
    ctx.execute("begin my_procedure(1, 2); end;");
}

Or, use the code generator for even more type safe calls to the procedures:

try (Connection c = DriverManager.getConnection(url, properties)) {

    // Specify this setting to fetch server output explicitly
    DSLContext ctx = DSL.using(c, 
        new Settings().withFetchServerOutputSize(10));
    myProcedure(ctx.configuration(), 1, 2);
}

The log output will be:

DEBUG [org.jooq.tools.LoggerListener          ] - Executing query : begin my_procedure(1, 2); end;
DEBUG [org.jooq.impl.FetchServerOutputListener] - 2 rows inserted          

A Common Mistake Developers Make When Caching Nullable Values

Caching is hard in various ways. Whenever you’re caching things, you have to at least think of:

  • Memory consumption
  • Invalidation

In this article, I want to show a flaw that often sneaks into custom cache implementations, making them inefficient for some execution paths. I’ve encountered this flaw in Eclipse recently.

What did Eclipse do wrong?

I periodically profile Eclipse using Java Mission Control (JMC) when I discover a performance issue in the compiler (and I’ve discovered a few).

Just recently, I’ve found a new regression that must have been introduced with the new Java 9 module support in Eclipse 4.7.1a (profiling screenshot not reproduced here).

Luckily, the issue has already been fixed for 4.7.2 (https://bugs.eclipse.org/bugs/show_bug.cgi?id=526209). What happened?

In that profiling session, I’ve found an awful lot of accesses to java.util.zip.ZipFile whenever I used the “content assist” feature (auto completion). This was the top stack trace in the profiler:

int java.util.zip.ZipFile$Source.hashN(byte[], int, int)
void java.util.zip.ZipFile$Source.initCEN(int)
void java.util.zip.ZipFile$Source.<init>(ZipFile$Source$Key, boolean)
ZipFile$Source java.util.zip.ZipFile$Source.get(File, boolean)
void java.util.zip.ZipFile.<init>(File, int, Charset)
void java.util.zip.ZipFile.<init>(File, int)
void java.util.zip.ZipFile.<init>(File)
ZipFile org.eclipse.jdt.internal.core.JavaModelManager.getZipFile(IPath, boolean)
ZipFile org.eclipse.jdt.internal.core.JavaModelManager.getZipFile(IPath)
ZipFile org.eclipse.jdt.internal.core.JarPackageFragmentRoot.getJar()
byte[] org.eclipse.jdt.internal.core.AbstractClassFile.getClassFileContent(JarPackageFragmentRoot, String)
IBinaryModule org.eclipse.jdt.internal.core.ModularClassFile.getJarBinaryModuleInfo()
IBinaryModule org.eclipse.jdt.internal.core.ModularClassFile.getBinaryModuleInfo()
boolean org.eclipse.jdt.internal.core.ModularClassFile.buildStructure(...)
void org.eclipse.jdt.internal.core.Openable.generateInfos(Object, HashMap, IProgressMonitor)
Object org.eclipse.jdt.internal.core.JavaElement.openWhenClosed(Object, boolean, IProgressMonitor)
Object org.eclipse.jdt.internal.core.JavaElement.getElementInfo(IProgressMonitor)
Object org.eclipse.jdt.internal.core.JavaElement.getElementInfo()
boolean org.eclipse.jdt.internal.core.JavaElement.exists()
boolean org.eclipse.jdt.internal.core.Openable.exists()
IModuleDescription org.eclipse.jdt.internal.core.PackageFragmentRoot.getModuleDescription()
IModuleDescription org.eclipse.jdt.internal.core.NameLookup.getModuleDescription(IPackageFragmentRoot, Map, Function)
...

In fact, the profiling session doesn’t show the exact number of accesses, but the number of stack trace samples that contained the specific method(s), which corresponds to the time spent inside of a method, not the number of calls (which is less relevant). Clearly, accessing zip files shouldn’t be the thing that Eclipse should be doing most of the time when auto completing my code. So, why did it do it anyway?

It turns out, the problem was in the method getModuleDescription(), which can be summarised as follows:

static IModuleDescription getModuleDescription(
    IPackageFragmentRoot root, 
    Map<IPackageFragmentRoot,IModuleDescription> cache, 
    Function<IPackageFragmentRoot,IClasspathEntry> rootToEntry
) {
    IModuleDescription module = cache.get(root);
    if (module != null)
        return module;

    ...
    // Expensive call to open a Zip File in these calls:
    if (root.getKind() == IPackageFragmentRoot.K_SOURCE)
        module = root.getJavaProject().getModuleDescription();
    else
        module = root.getModuleDescription();

    if (module == null) {
        ...
    }

    if (module != null)
        cache.put(root, module);
    return module;
}

The ZipFile access is hidden inside the getModuleDescription() call. A debugger revealed that the JDK’s rt.jar file was opened quite a few times to look for a module-info.class file. Can you spot the mistake in the code?

The method gets an external cache that may already contain the method’s result. But the method may also return null in case there is no module description. Which there isn’t. jOOQ has not yet been modularised, and most libraries on which jOOQ depends haven’t been modularised either, nor has the JDK with which jOOQ is currently built (JDK 8). So, this method always returns null for non-modular stuff.

But if it returns null, it won’t put anything in the cache:

    if (module != null)
        cache.put(root, module);
    return module;
}

… which means the next time it is called, there’s a cache miss:

    IModuleDescription module = cache.get(root);
    if (module != null)
        return module;

… and the expensive logic involving the ZipFile call is invoked again. In other words, it is invoked all the time (for us).

Caching optional values

This is an important thing to always remember, and it is not easy to remember. Why? Because the developer who implemented this cache implemented it for the “happy path” (from the perspective of someone working with modules). They probably tried their code with a modular project, in which case the cache worked perfectly. But they didn’t check if the code still works for everyone else. And in fact, it does work. The logic isn’t wrong. It’s just not optimal.

The solution to these things is simple. If the value null encodes a cache miss, we need another “PSEUDO_NULL” to encode the actual null value, or in this case something like NO_MODULE. So, the method can be rewritten as:

static IModuleDescription getModuleDescription(
    IPackageFragmentRoot root, 
    Map<IPackageFragmentRoot,IModuleDescription> cache, 
    Function<IPackageFragmentRoot,IClasspathEntry> rootToEntry
) {
    IModuleDescription module = cache.get(root);

    // Decode encoded NO_MODULE value:
    if (module == NO_MODULE)
        return null;
    if (module != null)
        return module;

    module = ...

    if (module != null)
        cache.put(root, module);

    // Encode null value:
    else
        cache.put(root, NO_MODULE);

    return module;
}

… where this NO_MODULE can be a simple java.lang.Object if you don’t care about generics, or a dummy IModuleDescription in our case:

static final IModuleDescription NO_MODULE = 
  new IModuleDescription() { ... };

Since it will be a singleton instance, we can use identity comparisons in our method.

Conclusion

When caching method results, always check if null is a valid result for the method. If it is, and if your cache is a simple Map, then you have to encode the null value with some sort of NO_MODULE value for the cache to work properly. Otherwise, you won’t be able to distinguish Map.get(key) == null for the cases:

  • Cache miss and Map returns null
  • Cache hit and the value is null

Update after some useful reddit / DZone comments

As /u/RayFowler pointed out in this article’s reddit discussion, the concept illustrated here is called “negative caching”.

Something that is often forgotten when performing negative caching is the fact that exceptions are also a result, as pointed out by /u/zombifai in the same reddit discussion. The fix in Eclipse correctly took this into account as can be seen here: https://git.eclipse.org/c/jdt/eclipse.jdt.core.git/commit/?id=addfd789e17dbb99af0304912ef45e4ae72c0605

While a Map.containsKey() based solution would work in a similar way and would have the advantage of not needing a “dummy” / sentinel value, it is not a good approach in situations where performance really matters – remember that in this case, we’re talking about an Eclipse compiler optimisation where we really don’t want two Map lookups where one would suffice. This is a generally interesting thought for caches, which are introduced after all to improve performance!
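
To illustrate the difference, a quick sketch of the two approaches (using the NO_MODULE sentinel from above):

// Map.containsKey(): two Map lookups per cached null value
if (cache.containsKey(root))
    return cache.get(root);

// Sentinel: a single Map lookup
IModuleDescription module = cache.get(root);
if (module == NO_MODULE)
    return null;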

How to Ensure Your Code Works With Older JDKs

jOOQ is a very backwards compatible product. This doesn’t only mean that we keep our own API as backwards compatible as possible, but we also still support Java 6 in our commercial distributions.

In a previous blog post, I’ve shown how we manage to support Java 6 while at the same time not missing out on cool Java 8 language and API features, such as Stream and Optional support. For instance, you can do this with jOOQ’s ordinary distribution:

// Fetching 0 or 1 actors
Optional<Record2<String, String>> actor =
ctx.select(ACTOR.FIRST_NAME, ACTOR.LAST_NAME)
   .from(ACTOR)
   .where(ACTOR.ID.eq(1))
   .fetchOptional();

// Fetching a stream of actors
try (Stream<Record2<String, String>> actor = ctx
       .select(ACTOR.FIRST_NAME, ACTOR.LAST_NAME)
       .from(ACTOR)
       .fetchStream()) {
    ...
}

This API is present in jOOQ’s ordinary distribution and it is stripped from that distribution prior to building the Java 6 distribution.

But what about the JDK’s more subtle APIs?

It is relatively easy to remember not to use Streams, Optionals, lambdas, method references, or default methods lightheartedly in your library’s code. After all, those were all major changes in Java 8, and we can easily add our API removal markers around those parts. And even if we forgot, building the Java 6 distribution would quite probably fail, because Streams are very often used with lambdas, in which case a compiler that is configured for Java version 1.6 will not compile the code.

But recently, we’ve had a more subtle bug, #6860. The jOOQ API was calling java.lang.reflect.Method.getParameterCount(). Since we compile jOOQ’s Java 6 distribution with Java 8, this didn’t fail. The sources were kept Java 6 language compatible, but not JDK 6 API compatible, and unfortunately, there’s no option in javac, nor in the Maven compiler plugin, to do such a check.
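
The call itself has a trivial JDK 6 compatible replacement, sketched here (getParameterTypes() has been around since Java 1.1, at the cost of cloning an array):

// Available in JDK 8 only:
// int count = method.getParameterCount();

// Available in JDK 6 as well:
int count = method.getParameterTypes().length;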

Why not use Java 6 to compile the Java 6 distribution?

The reason why we’re using Java 8 to build jOOQ’s Java 6 distribution is the fact that Java 8 “fixed” a lot (and I mean a lot) of very old and weird edge cases related to generics, overloading, varargs, and all that stuff. While this might be irrelevant for ordinary APIs, for jOOQ it is not. We really push the limits of what’s possible with the Java language.

So, we’re paying a price for building jOOQ’s Java 6 distribution with Java 8. We’re flying in “stealth mode”, not 100% sure whether our JDK API usage is compliant.

Luckily, the JDK doesn’t change much between releases, so a lot of stuff from JDK 8 was already there in JDK 6. Also, our integration tests would fail, if we did accidentally use a method like the above. Unfortunately, that particular method call simply slipped by the integration tests (there will never be enough tests for every scenario).

The solution

Apart from fixing the trivial bug and avoiding that particular method, we’ve now added the cool “animal sniffer” Maven plugin to our Java 6 build, whose usage you can see here:
http://www.mojohaus.org/animal-sniffer/animal-sniffer-maven-plugin/usage.html

All we needed to add to our Java 6 distribution profile was this little snippet:

<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>animal-sniffer-maven-plugin</artifactId>
  <version>1.16</version>
  <executions>
    <execution>
      <phase>test</phase>
      <goals>
        <goal>check</goal>
      </goals>
      <configuration>
        <signature>
          <groupId>org.codehaus.mojo.signature</groupId>
          <artifactId>java16</artifactId>
          <version>1.1</version>
        </signature>
      </configuration>
    </execution>
  </executions>
</plugin>

This will then produce a validation error like the following:

[INFO] --- animal-sniffer-maven-plugin:1.16:check (default) @ jooq-codegen ---
[INFO] Checking unresolved references to org.codehaus.mojo.signature:java16:1.0
[ERROR] C:\..\JavaGenerator.java:232: Undefined reference: int java.lang.reflect.Method.getParameterCount()
[ERROR] C:\..\JavaGenerator.java:239: Undefined reference: int java.lang.reflect.Method.getParameterCount()

Perfect!