How to Simulate a Liquibase Migration using H2

This post is part of a new blog series about database migrations, which will cover a variety of database change management topics.

In the near future, we’ll look much more into these topics, hoping to add more value to our users’ existing Flyway, Liquibase, and other integrations where the migration tools can profit a lot from jOOQ’s most recent features, including the parser.

Embracing Liquibase in the next jOOQ versions

In jOOQ 3.13, we’re going to offer a better integration for those users who use Liquibase for database change management. Our existing DDLDatabase parses, translates, and simulates a DDL script based migration against an in-memory H2 database instance.

This is using our built-in SQL translation functionality (test it on our website here: https://www.jooq.org/translate) to translate this Oracle SQL:

create table t (v varchar2(100))

to this H2 SQL:

create table t (v varchar(100))

That’s just a trivial example. More sophisticated translations are possible too. The main purpose of doing this has been, historically, to offer an “offline” jOOQ source code generation step that does not require connecting to an actual Oracle database instance to reverse engineer your schema, which you can represent in form of SQL migration scripts, which you’d typically run with something like Flyway.

In the future, we’ll add more features around this. One thing we’ve been thinking about has been to allow for translating Flyway migrations to other dialects, to make them vendor agnostic.

Simulating Liquibase migrations

Flyway and Liquibase work in quite a similar fashion, with Liquibase offering an additional abstraction layer over the SQL language. While pure SQL migrations are also possible with Liquibase’s SQL change, Liquibase also offers a set of mostly low level DDL command abstractions, such as the ADD COLUMN change.

Using their XML based DSL, you don’t have to remember whether the command is called:

alter table t add i int;
alter table t add column i int;

Or whatever creative syntax your SQL vendor came up with. Notice that, again, you can achieve the same thing with jOOQ’s translator, which you can use as an API or command line interface

Now, assuming you have the following liquibase database change log:

<?xml version="1.0" encoding="UTF-8"?>
<databaseChangeLog
    xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog
    http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-3.8.xsd">
    <changeSet author="authorName" id="changelog-1.0">
        <createTable tableName="TAB">
            <column name="COL" type="VARCHAR(10)">
                <constraints nullable="true"
                    primaryKey="false" unique="false" />
            </column>
        </createTable>
    </changeSet>
</databaseChangeLog>

And now, you would like to check if it is correct, i.e. if you can ship it to production as it is. Just create an in-memory H2 connection, and run the file against it:

Properties info = new Properties();
info.put("user", "sa");
info.put("password", "");

try (Connection con = new org.h2.Driver()
        .connect("jdbc:h2:mem:db", info)) {
    Database database = DatabaseFactory.getInstance()
      .findCorrectDatabaseImplementation(new JdbcConnection(con));
    Liquibase liquibase = new Liquibase("/path/to/liquibase.xml", 
      new FileSystemResourceAccessor(), database);
    liquibase.update("");
}

To check, just repeat the table definition a second time

<createTable tableName="TAB">
    <column name="COL" type="VARCHAR(10)">
        <constraints nullable="true"
            primaryKey="false" unique="false" />
    </column>
</createTable>
<createTable tableName="TAB">
    <column name="COL" type="VARCHAR(10)">
        <constraints nullable="true"
            primaryKey="false" unique="false" />
    </column>
</createTable>

and see how H2 complains about the statement:

Table "TAB" already exists; SQL statement:
CREATE TABLE PUBLIC.TAB (COL CLOB) [42101-200] [Failed SQL: (42101) CREATE TABLE PUBLIC.TAB (COL CLOB)]
	at liquibase.changelog.ChangeSet.execute(ChangeSet.java:646)
	at liquibase.changelog.visitor.UpdateVisitor.visit(UpdateVisitor.java:53)
	at liquibase.changelog.ChangeLogIterator.run(ChangeLogIterator.java:83)
	at liquibase.Liquibase.update(Liquibase.java:202)
	at liquibase.Liquibase.update(Liquibase.java:179)
	at liquibase.Liquibase.update(Liquibase.java:175)
	at liquibase.Liquibase.update(Liquibase.java:168)
	at LB.main(LB.java:21)

Using jOOQ’s tool chain to create schema snapshots

Want to export the schema again as DDL? Very easy, just use jOOQ for the task. For example, you can write the following Java code right after the above Liquibase migration simulation:

DSLContext ctx = DSL.using(connection);
ctx.ddl(ctx.meta().getSchemas("PUBLIC").get(0))
    .forEach(System.out::println);

And you can get, in H2 syntax (omitting the liquibase tables):

create table "PUBLIC"."TAB"(
  "COL" varchar(10) null
)

Want to export this to Oracle, instead? And using a custom schema, instead of H2’s PUBLIC? Write this instead:

DSLContext h2 = DSL.using(connection);
DSLContext oracle = DSL.using(ORACLE, new Settings()
   .withRenderMapping(new RenderMapping()
       .withSchemata(new MappedSchema()
           .withInput("PUBLIC")
           .withOutput("MY_SCHEMA"))));
oracle.ddl(h2.meta().getSchemas("PUBLIC").get(0))
    .forEach(System.out::println);

And now, we’re getting:

create table "MY_SCHEMA"."TAB"(
  "COL" varchar2(10) null
)

(notice, this works once #9384 is fixed).

Prefer an XML representation of your schema snapshot? Easy with jOOQ, as well:

DSLContext ctx = DSL.using(connection);
JAXB.marshal(ctx.informationSchema(
   ctx.meta().getSchemas("PUBLIC").get(0)
), System.out);

The output being:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<information_schema xmlns="http://www.jooq.org/xsd/jooq-meta-3.12.0.xsd">
    <schemata>
        <schema>
            <schema_name>PUBLIC</schema_name>
            <comment></comment>
        </schema>
    </schemata>
    <tables>
        <table>
            <table_schema>PUBLIC</table_schema>
            <table_name>TAB</table_name>
            <comment></comment>
        </table>
    </tables>
    <columns>
        <column>
            <table_schema>PUBLIC</table_schema>
            <table_name>TAB</table_name>
            <column_name>COL</column_name>
            <data_type>varchar</data_type>
            <character_maximum_length>10</character_maximum_length>
            <ordinal_position>1</ordinal_position>
            <is_nullable>true</is_nullable>
            <column_default></column_default>
            <comment></comment>
        </column>
    </columns>
</information_schema>

A useful file that can be put under version control with your current commit set of database change set to validate your schema, or export to other tools that can reverse engineer DDL or XML.

Using Liquibase migrations as jOOQ code generation input

Just like the pre-existing DDLDatabase, starting from jOOQ 3.13, you can simulate your Liquibase migration in-memory against an H2 database to reverse engineer it again for jOOQ’s code generator. This is documented here.

The relevant code generation configuration looks like this (standalone or with maven):

<configuration>
  <generator>
    <database>
      <name>org.jooq.meta.extensions.liquibase.LiquibaseDatabase</name>
      <properties>
        <property>
          <key>scripts</key>
          <value>src/main/resources/database.xml</value>
        </property>
      </properties>
    </database>
  </generator>
</configuration>

That’s it! As a bonus, you won’t have to manually simulate your migration anymore, as jOOQ’s LiquibaseDatabase already does it for you, behind the scenes, using the same three lines of code:

Database database = DatabaseFactory.getInstance()
  .findCorrectDatabaseImplementation(new JdbcConnection(con));
Liquibase liquibase = new Liquibase("/path/to/liquibase.xml", 
  new FileSystemResourceAccessor(), database);
liquibase.update("");

Stay tuned for more goodies on the topic of SQL migrations and database change management

A Quick Trick to Make a Java Stream Construction Lazy

One of the Stream APIs greatest features is its laziness. The whole pipeline is constructed lazily, stored as a set of instructions, akin to a SQL execution plan. Only when we invoke a terminal operation, the pipeline is started. It is still lazy, meaning that some operations may be short circuited.

Some third party libraries produce streams that are not entirely lazy. For example, jOOQ until version 3.12 eagerly executed a SQL query when calling ResultQuery.stream(), regardless if the Stream is consumed afterwards:

try (var stream = ctx.select(T.A, T.B).from(T).stream()) {
    // Not consuming the stream here
}

While this is probably a bug in client code, not executing the statement in this case might still be a useful feature. The exception being, of course, if the query contains a FOR UPDATE clause, in case of which the user probably uses Query.execute() instead, if they don’t care about the result.

A more interesting example where laziness helps is the fact that we might not want this query to be executed right away, as are perhaps still on the wrong thread to execute it. Or we would like any possible exceptions to be thrown from wherever the result is consumed, i.e. where the terminal operation is called. For example:

try (var stream = ctx.select(T.A, T.B).from(T).stream()) {
    consumeElsewhere(stream);
}

And then:

public void consumeElsewhere(Stream<? extends Record> stream) {
    runOnSomeOtherThread(() -> {
        stream.map(r -> someMapping(r))
              .forEach(r -> someConsumer(r));
    });
}

While we’re fixing this in jOOQ 3.13 (https://github.com/jOOQ/jOOQ/issues/4934), you may be stuck to an older version of jOOQ, or have another library do the same thing. Luckily, there’s an easy trick to quickly make a third party stream “lazy”. Flatmap it! Just write this instead:

try (var stream = Stream.of(1).flatMap(
    i -> ctx.select(T.A, T.B).from(T).stream()
)) {
    consumeElsewhere(stream);
}

The following small test illustrates that the stream() is now being constructed lazily

public class LazyStream {

    @Test(expected = RuntimeException.class)
    public void testEager() {
        Stream<String> stream = stream();
    }

    @Test
    public void testLazyNoTerminalOp() {
        Stream<String> stream = Stream.of(1).flatMap(i -> stream());
    }

    @Test(expected = RuntimeException.class)
    public void testLazyTerminalOp() {
        Optional<String> result = stream().findAny();
    }

    public Stream<String> stream() {
        String[] array = { "heavy", "array", "creation" };

        // Some Resource Problem that might occur
        if (true)
            throw new RuntimeException();

        return Stream.of(array);
    }
}

Caveat

Depending on the JDK version you’re using, the above approach has its own significant problems. For example, in older versions of the JDK 8, flatMap() itself might not be lazy at all! More recent versions of the JDK have fixed that problem, including JDK 8u222: https://bugs.openjdk.java.net/browse/JDK-8225328

How to Map MySQL’s TINYINT(1) to Boolean in jOOQ

MySQL 8 does not yet support the BOOLEAN type as specified in the SQL standard. There is a DDL “type” called BOOL, which is just an alias for TINYINT:

create table t(b bool);

select 
  table_name, 
  column_name, 
  data_type, 
  column_type
from information_schema.columns
where table_name = 't';

The above produces:

TABLE_NAME|COLUMN_NAME|DATA_TYPE|COLUMN_TYPE|
----------|-----------|---------|-----------|
t         |b          |tinyint  |tinyint(1) |

Notice that BOOL translates to a specific “type” of TINYINT, a TINYINT(1), where we might be inclined to believe that the (1) corresponds to some sort of precision, as with NUMERIC types.

However, counter intuitively, that is not the case. It corresponds to the display width of the type, when fetching it, using some deprecated modes. Consider:

insert into t(b) values (0), (1), (10);
select * from t;

We’re getting:

b |
--|
 0|
 1|
10|

Notice also that MySQL can process non-boolean types as booleans. Running the following statement:

select * from t where b;

We’re getting:

b |
--|
 1|
10|

Using this column as a Boolean column in jOOQ

By default, jOOQ doesn’t recognise such TINYINT(1) columns as boolean columns, because it is totally possible that a user has created such a column without thinking of boolean types, as the above example has shown.

In previous versions of jOOQ, the data type rewriting feature could be used on arbitrary expressions that match the boolean column name, e.g. the below would treat all columns named "B" as BOOLEAN:

<forcedTypes>
  <forcedType>
    <name>BOOLEAN</name>
    <includeExpression>B</includeExpression>
  </forcedType>
</forcedTypes>

With jOOQ 3.12.0 (issue #7719), we can now match this display width as well for MySQL types. That way, you can write this single data type rewriting configuration to treat all integer types of display width 1 as booleans:

<forcedTypes>
  <forcedType>
    <name>BOOLEAN</name>
    <includeTypes>(?i:TINYINT\(1\))</includeTypes>
  </forcedType>
</forcedTypes>

Using this configuration in the code generator, the above query:

select * from t where b;

… can now be written as follows, in jOOQ

selectFrom(T).where(T.B).fetch();

What’s Faster? COUNT(*) or COUNT(1)?

One of the biggest and undead myths in SQL is that COUNT(*) is faster than COUNT(1). Or was it that COUNT(1) is faster than COUNT(*)? Impossible to remember, because there’s really no reason at all why one should be faster than the other. But is the myth justified?

Let’s measure!

How does COUNT(…) work?

But first, let’s look into some theory. The two ways to count things are not exactly the same thing. Why?

  • COUNT(*) counts all the tuples in a group
  • COUNT(<expr>) counts all the tuples in a group for which <expr> evaluates to something that IS NOT NULL

This distinction can be quite useful. Most of the time, we’ll simply COUNT(*) for convenience, but there are (at least) two cases where we don’t want that, for example:

When outer joining

Imagine that in the Sakila database, we have some actors that did not play in any films. Making sure such an actor actually exists:

INSERT INTO actor (actor_id, first_name, last_name)
VALUES (201, 'SUSAN', 'DAVIS');

When inner joining, we might write the following (using PostgreSQL syntax):

SELECT actor_id, a.first_name, a.last_name, count(*) AS c
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
GROUP BY actor_id
ORDER BY c ASC, actor_id ASC;

And we won’t get the newly added SUSAN DAVIS, because of the nature of inner join:

actor_id|first_name |last_name   |c |
--------|-----------|------------|--|
     148|EMILY      |DEE         |14|
      35|JUDY       |DEAN        |15|
     199|JULIA      |FAWCETT     |15|
     186|JULIA      |ZELLWEGER   |16|
      31|SISSY      |SOBIESKI    |18|
      71|ADAM       |GRANT       |18|
       1|PENELOPE   |GUINESS     |19|
      30|SANDRA     |PECK        |19|

So we might change our query to use LEFT JOIN instead

SELECT actor_id, a.first_name, a.last_name, count(*) AS c
FROM actor AS a
LEFT JOIN film_actor AS fa USING (actor_id)
LEFT JOIN film AS f USING (film_id)
GROUP BY actor_id
ORDER BY c ASC, actor_id ASC;

There she is now, but oops, wrong count! She doesn’t have any films, which we have proven before with the INNER JOIN query. Yet we get 1:

actor_id|first_name |last_name   |c |
--------|-----------|------------|--|
     201|SUSAN      |DAVIS       | 1|
     148|EMILY      |DEE         |14|
      35|JUDY       |DEAN        |15|
     199|JULIA      |FAWCETT     |15|
     186|JULIA      |ZELLWEGER   |16|
      31|SISSY      |SOBIESKI    |18|
      71|ADAM       |GRANT       |18|
       1|PENELOPE   |GUINESS     |19|
      30|SANDRA     |PECK        |19|

Her COUNT(*) value is 1, because we do get 1 film tuple for her in the group, with all columns being NULL. The solution is to count the FILM_ID instead, which cannot be NULL in the table (being a primary key), but only because of the LEFT JOIN:

SELECT actor_id, a.first_name, a.last_name, count(film_id) AS c
FROM actor AS a
LEFT JOIN film_actor AS fa USING (actor_id)
LEFT JOIN film AS f USING (film_id)
GROUP BY actor_id
ORDER BY c ASC, actor_id ASC;

Notice, we could count other things than the primary key, but with the primary key, we’re quite certain we don’t get any other “accidental” nulls in our groups, which we did not want to exclude from the count value.

Now, we’re getting the correct result:

actor_id|first_name |last_name   |c |
--------|-----------|------------|--|
     201|SUSAN      |DAVIS       | 0|
     148|EMILY      |DEE         |14|
      35|JUDY       |DEAN        |15|
     199|JULIA      |FAWCETT     |15|
     186|JULIA      |ZELLWEGER   |16|
      31|SISSY      |SOBIESKI    |18|
      71|ADAM       |GRANT       |18|
       1|PENELOPE   |GUINESS     |19|
      30|SANDRA     |PECK        |19|

When counting subsets of a group

An even more powerful application of counting only non-null evaluations of an expression is counting only subsets of a group. We’ve already blogged about this technique in our previous post about aggregating several expressions in one single query.

For example, counting in a single query:

  • All actors
  • Actors with their first_name starting with A
  • Actors with their first_name ending with A
  • Actors with their first_name containing A

In SQL:

SELECT 
  count(*),
  count(CASE WHEN first_name LIKE 'A%' THEN 1 END),
  count(CASE WHEN first_name LIKE '%A' THEN 1 END),
  count(CASE WHEN first_name LIKE '%A%' THEN 1 END)
FROM actor;

This yields:

count|count|count|count|
-----|-----|-----|-----|
  201|   13|   30|  105|

This is very useful when pivoting data sets (see also Oracle/SQL Server PIVOT clause).

Notice that PostgreSQL supports the SQL standard FILTER clause for this, which is more convenient and more readable. The above query can be written like this, in PostgreSQL:

SELECT 
  count(*),
  count(*) FILTER (WHERE first_name LIKE 'A%'),
  count(*) FILTER (WHERE first_name LIKE '%A'),
  count(*) FILTER (WHERE first_name LIKE '%A%')
FROM actor;

Back to COUNT(*) vs COUNT(1)

Now that we know the theory behind these COUNT expressions, what’s the difference between COUNT(*) and COUNT(1). There is none, effectively. The 1 expression in COUNT(1) evaluates a constant expression for each row in the group, and it can be proven that this constant expression will never evaluate to NULL, so effectively, we’re running COUNT(*), counting ALL the rows in the group again.

There should be no difference, and parsers / optimisers should be able to recognise this and not do the extra work of checking every expression evaluation for NULL-ness.

I recently saw this discussion on Twitter, though, where Vik Fearing looked up the PostgreSQL sources, showing that PostgreSQL does do the extra work instead of optimising this:

So, I was curious to see if it mattered. I ran a benchmark on the 4 most popular RDBMS, with these results:

  • MySQL: Doesn’t matter. Sometimes COUNT(1) was faster, sometimes COUNT(*) was faster, so all differences were only benchmark artifacts
  • Oracle: Doesn’t matter. Like MySQL
  • PostgreSQL: Does matter (!). COUNT(*) was consistently faster by around 10% on 1M rows, that’s much more than I had expected
  • SQL Server: Doesn’t matter. Like MySQL

The benchmark code can be found in the following gists:

The results are below. Each benchmark run repeated SELECT COUNT(*) FROM t or SELECT COUNT(1) FROM t 100 times on a 1M row table, and then the benchmark was repeated 5 times to mitigate any warmup penalties and be fair with respect to caching.

The times displayed are relative to the fastest run per database product. This removes any distraction that may be caused by interpreting actual execution times as we do not want to compare database products against each other.

The database versions I’ve used are:

  • MySQL 8.0.16 (in Docker)
  • Oracle 18c XE (in Docker)
  • PostgreSQL 11.3 (in Docker)
  • SQL Server 2017 Express (in Windows)

MySQL

No relevant difference, nor a clear winner:

RUN     STMT    RELATIVE_TIME
-----------------------------
0	1	1.0079
0	2	1.0212
1	1	1.0229
1	2	1.0256
2	1	1.0009
2	2	1.0031
3	1	1.0291
3	2	1.0256
4	1	1.0618
4	2	1.0000

Oracle

No relevant difference, nor a clear winner

Run 1, Statement 1 : 1.06874
Run 1, Statement 2 : 1.01982
Run 2, Statement 1 : 1.09175
Run 2, Statement 2 : 1.0301
Run 3, Statement 1 : 1.00308
Run 3, Statement 2 : 1.02499
Run 4, Statement 1 : 1.02503
Run 4, Statement 2 : 1
Run 5, Statement 1 : 1.01259
Run 5, Statement 2 : 1.05828

PostgreSQL

A significant, consistent difference of almost 10%:

RUN 1, Statement 1: 1.00134
RUN 1, Statement 2: 1.09538
RUN 2, Statement 1: 1.00190
RUN 2, Statement 2: 1.09115
RUN 3, Statement 1: 1.00000
RUN 3, Statement 2: 1.09858
RUN 4, Statement 1: 1.00266
RUN 4, Statement 2: 1.09260
RUN 5, Statement 1: 1.00454
RUN 5, Statement 2: 1.09694

Again, I’m surprised by the order of magnitude of this difference. I would have expected it to be less. Curious to hear about your own results in the comments, or further ideas why this is so significant in PostgreSQL.

SQL Server

No relevant difference, nor a clear winner

Run 1, Statement 1: 1.00442
Run 1, Statement 2: 1.00702
Run 2, Statement 1: 1.00468
Run 2, Statement 2: 1.00000
Run 3, Statement 1: 1.00208
Run 3, Statement 2: 1.00624
Run 4, Statement 1: 1.00780
Run 4, Statement 2: 1.00364
Run 5, Statement 1: 1.00468
Run 5, Statement 2: 1.00702

Conclusion

As it is now in 2019, given the database versions mentioned above, unfortunately, there is a significant difference between COUNT(*) and COUNT(1) in PostgreSQL. Luckily (and this is rare in SQL), all the other dialects don’t care and thus, consistently using COUNT(*), rather than COUNT(1) is a slightly better choice for ALL measured database products from this article.

Do note that the benchmark only tried a very simple query! The results might be different when using joins, unions, or any other SQL constructs, or in other edge cases, e.g. when using COUNT() in HAVING or ORDER BY or with window functions, etc.

In any case, there shouldn’t be any difference, and I’m sure that a future PostgreSQL version will optimise the constant expression in the COUNT(<expr>) aggregate function directly in the parser to avoid the extra work.

For other interesting optimisations that do not depend on the cost model, see this article here.

Oracle’s BINARY_DOUBLE Can Be Much Faster Than NUMBER

Using the right data type for some calculation sounds like some obvious advice.

There are many blogs about using temporal data types for temporal data, instead of strings. An obvious reason is data integrity and correctness. We don’t gain much in storing dates as 2019-09-10 in one record, and as Nov 10, 2019 in the next one. Performance is also important in this case – for example, because of statistics being more optimal (an often overlooked side effect of unfit data types).

But at a customer site, I’ve recently discovered a surprising (not surprising in hindsight) performance issue when using NUMBER instead of BINARY_DOUBLE in an Oracle database.

NUMBER for monetary amounts

The NUMBER type (or DECIMAL, NUMERIC in other RDBMS) is perfectly suited for all decimal numbers, which require the correct precision and rounding. I mean, if you ever encounter this kind of display in an invoice, such as when I purchase video games on steam:

… your trust level for the platform immediately goes down by factor 1.0e+1 (even if technically, it does not matter in this case). So, the default, e.g. in banking systems for monetary amounts is always to use NUMBER or an equivalent type, e.g. java.math.BigDecimal.

Calculations on numbers

But now, let’s assume we want to do some statistics on these amounts. Some aggregations of various numeric values, maybe even of their logarithms. We can establish that these two expressions are equal:

ln(a * b) = ln(a) + ln(b)

Or in other words, for positive numbers:

a * b = exp(ln(a) + ln(b))

We’ve already blogged about how this approach can be used to emulate a PRODUCT() aggregate function in SQL, which is very useful for some cases, but none of the SQL databases jOOQ supports has built-in support for this yet. Notice, the blog post also takes care of zero and negative numbers.

But which number type to choose?

Now, we might be tempted to just calculate the LN(SOME_NUMBER) value, and sum that up using SUM(LN(SOME_NUMBER)) for this use-case. This turns out to be terribly slow in Oracle. We were thinking of bad indexes, first, even doubted aggregation in general, until I suggested we try using BINARY_DOUBLE instead, in this case. In our case, we didn’t care about the exact numeric value of the amount. A IEEE 754 floating point number with double precision was going to be good enough.

The results were surprising. In a simple benchmark, we compared 10 approaches to calculating this sum of logarithms:

  1. Using NUMBER(20, 10) and SUM(LN(the_number))
  2. Using NUMBER(20, 10) and SUM(LN(CAST(the_number AS BINARY_DOUBLE)))
  3. Using NUMBER(20, 10) and SUM(LN(TO_BINARY_DOUBLE(the_number)))
  4. Using NUMBER(20, 10), containing a pre-calculated LN value
  5. Using DOUBLE PRECISION and SUM(LN(the_double))
  6. Using DOUBLE PRECISION and SUM(LN(CAST(the_double AS BINARY_DOUBLE)))
  7. Using DOUBLE PRECISION and SUM(LN(TO_BINARY_DOUBLE(the_double)))
  8. Using DOUBLE PRECISION, containing a pre-calculated LN value
  9. Using BINARY_DOUBLE and SUM(LN(the_binary))
  10. Using BINARY_DOUBLE, containing a pre-calculated LN value

These were the thoughts:

  • We tried the above 3 possible numeric data types, expecting BINARY_DOUBLE to be the fastest
  • We tried to pre-calculate the LN() value for the benchmark, to see how much effort goes into summing, and how much effort goes into the LN() calculation with each type. While in general, in this system, such precalculation is impractical, we still wanted to have a benchmark comparison, in case a materialised view or other technique would be feasible.
  • We tried casting and converting each type to BINARY_DOUBLE prior to passing the value to the LN() function. Instead of migrating all the data (with possible side effects), we wanted to see if we can solve this to a reasonable extent “on the fly”

The benchmark that I’m posting here uses this table and example data (full benchmark code at the end of the article):

CREATE TABLE data (
  n1 NUMBER(20, 10),
  n2 NUMBER(20, 10),
  d1 DOUBLE PRECISION,
  d2 DOUBLE PRECISION,
  b1 BINARY_DOUBLE,
  b2 BINARY_DOUBLE
);

INSERT INTO data
SELECT level, ln(level), level, ln(level), level, ln(level)
FROM dual
CONNECT BY level <= 100000;

So, we have 100000 records, whose SUM(LN(x)) we want to calculate in the above 10 different ways. N1 contains the raw numeric value, and N2 contains the pre-calculated LN(N1).

The benchmark technique is described here. Do note it has a lot of caveats and is only useful for a limited number of verifications. Please always be careful when running such benchmarks, they often do not test production-like situations – and never use such benchmarks to compare different RDBMS products directly. They are only useful to compare two approaches on the same RDBMS product.

The results as run on Oracle 18c XE in Docker on development hardware are below. Times are compared relative to the fastest run, as actual time spent in each execution is not interesting for comparison. We did have similar results on production-like hardware, though, and I’m sure you can produce similar results in other RDBMS:

NUMBER(20, 10)
-------------------------------------------------
Run 3, Statement 1 : 280.34143  (avg : 280.75347)
Run 3, Statement 2 : 7.99402    (avg : 8.03506)
Run 3, Statement 3 : 7.71383    (avg : 7.73017)
Run 3, Statement 4 : 1.05456    (avg : 1.11735)

DOUBLE PRECISION
------------------------------------------------
Run 3, Statement 5 : 278.89476 (avg : 279.72981)
Run 3, Statement 6 : 8.06512   (avg : 8.07033)
Run 3, Statement 7 : 7.81873   (avg : 7.80063)
Run 3, Statement 8 : 1.5315    (avg : 1.54347)

BINARY_DOUBLE
------------------------------------------------
Run 3, Statement 9 : 2.4963    (avg : 2.57184)
Run 3, Statement 10: 1         (avg : 1.02943)

How to read these results?

  • Statement 10 is the fastest one, unsurprisingly, as it aggregates pre-calculated LN(binary_double) values. The precalculation of the function means that all the work has been done already before the report, and the data type is the one we expected to perform best in general
  • Statements 4 and 8 are almost as fast (precalculated LN() values). Being only slight factors off, we can attribute the difference to the usual benchmark flaws, although it’s interesting to see that DOUBLE PRECISION seems 1.5x slower to sum than BINARY_DOUBLE and even NUMBER
  • Statements 1, 5, 9 are the ones where no data type conversion is applied and SUM(LN(the_value)) is being calculated. It is staggering how much slower both NUMBER and DOUBLE PRECISION are than BINARY_DOUBLE. Statements 1 and 5 are a factor of 112x slower than statement 9!
  • Statements 2-3, 6-7 prove that converting the NUMBER or DOUBLE PRECISION vales to BINARY_DOUBLE on the fly provides a sufficiently performant workaround, which made statements 2-3, 6-7 only 3x slower than statement 9
  • Statements 2-3, 6-7 show that casting and converting are about equivalent

Note, we found similar results with other functions, such as EXP()

Analysis

The order of magnitude may seem surprising at first, but thinking about it, it is not. We would never do CPU intensive computation with java.math.BigDecimal in Java. The BigDecimal type is there for numeric accuracy, e.g. when it really matters what the monetary amount is, exactly. When we run analytics on monetary amounts, using double is sufficient in Java as well.

If our data is BigDecimal, and we cannot reasonably change that, it might still be better to use the BigDecimal::doubleValue conversion prior to further processing using e.g. Math::log. So, this translates directly to SQL, whose LN() implementations are data type specific. IEEE 754 having been designed precisely for this purpose.

When doing CPU intensive computations both in Java, or in the database, we should always evaluate our various options of

  • Quick fixing our data sets for the report (ad-hoc conversion prior to calculation)
  • Thoroughly fixing our data sets in the schema (migration towards a better data type)
  • Preprocessing our data sets (precalculating some very commonly used computations)

Benchmark logic

Just run the below on your own hardware. I’m curious to see your results:

-- Copyright Data Geekery GmbH
--
-- Licensed under the Apache License, Version 2.0 (the "License");
-- you may not use this file except in compliance with the License.
-- You may obtain a copy of the License at
--
--     http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing, software
-- distributed under the License is distributed on an "AS IS" BASIS,
-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-- See the License for the specific language governing permissions and
-- limitations under the License.
--
-- This version displays relative execution times (fastest execution = 1)
SET SERVEROUTPUT ON

ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;

CREATE TABLE data (
  n1 NUMBER(20, 10),
  n2 NUMBER(20, 10),
  d1 DOUBLE PRECISION,
  d2 DOUBLE PRECISION,
  b1 BINARY_DOUBLE,
  b2 BINARY_DOUBLE
);

INSERT INTO data
SELECT level, ln(level), level, ln(level), level, ln(level)
FROM dual
CONNECT BY level <= 100000;

CREATE TABLE results (
  run     NUMBER(2),
  stmt    NUMBER(2),
  elapsed NUMBER
);

DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 10;
  v_stmt NUMBER;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
    v_stmt := 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(ln(n1)) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
  
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(ln(cast(n1 as binary_double))) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
  
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(ln(to_binary_double(n1))) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
  
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(n2) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
    
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(ln(d1)) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
    
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(ln(cast(d1 as binary_double))) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
    
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(ln(to_binary_double(d1))) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
  
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(d2) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(ln(b1)) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
  
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(b2) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
  END LOOP;
  
  FOR rec IN (
    SELECT 
      run, stmt, 
      CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio,
      CAST(AVG(elapsed) OVER (PARTITION BY stmt) / MIN(elapsed) OVER() AS NUMBER(10, 5)) avg_ratio
    FROM results
    ORDER BY run, stmt
  )
  LOOP
    dbms_output.put_line('Run ' || rec.run || 
      ', Statement ' || rec.stmt || 
      ' : ' || rec.ratio || ' (avg : ' || rec.avg_ratio || ')');
  END LOOP;
  
  dbms_output.put_line('');
  dbms_output.put_line('Copyright Data Geekery GmbH');
  dbms_output.put_line('https://www.jooq.org/benchmark');
END;
/

DROP TABLE data;
DROP TABLE results;

Using DISTINCT ON in Non-PostgreSQL Databases

A nice little gem in PostgreSQL’s SQL syntax is the DISTINCT ON clause, which is as powerful as it is esoteric.

In a previous post, we’ve blogged about some caveats to think of when DISTINCT and ORDER BY are used together. The bigger picture can be seen in our article about the logical order of operations in SQL SELECT.

The PostgreSQL documentation explains it well:

SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY. […] For example:

SELECT DISTINCT ON (location) location, time, report
    FROM weather_reports
    ORDER BY location, time DESC;

retrieves the most recent weather report for each location. […]

Again, this is quite esoteric as the distinct-ness is now decided only based on the columns listed separately in parentheses. For all the other columns, only the first row according to ORDER BY is projected. The SQL language is probably one of the only ones where the syntactic order of operations has absolutely nothing to do with the logical order of operations, and this DISTINCT ON syntax isn’t helping. In a more straightforward language design, the above statement could read, instead:

FROM weather_reports
WINDOW w AS (PARTITION BY location ORDER BY time DESC)
SELECT 
  location, 
  FIRST_VALUE (time) OVER w AS time,
  FIRST_VALUE (report) OVER w AS report
DISTINCT
ORDER BY location

In other words, we would execute these operations in the following logical order:

  1. FROM: Access the weather_reports table
  2. WINDOW: Specify “windows” or “groups” in our data set, grouping by location and ordering contents of each group by time, descendingly (using the term “WINDOW” is no accident as we’ll see afterwards)
  3. SELECT: Project the location (which is the grouping) and per group, the first values of time / report ordered by time
  4. DISTINCT: Remove all the duplicates, because the above operation will produce the same time and report value for each record that shares the same location
  5. ORDER BY: Finally, order the results per grouping

That’s what really happens, and incidentally, the above synthetic SQL syntax matches the actual logical order of operations in the SQL language, so translating it back to an actual SQL statement would yield:

SELECT DISTINCT
  location, 
  FIRST_VALUE (time) OVER w AS time,
  FIRST_VALUE (report) OVER w AS report
FROM weather_reports
WINDOW w AS (PARTITION BY location ORDER BY time DESC)
ORDER BY location

If your database doesn’t support the WINDOW clause, just expand it into the individual window functions. E.g. in Oracle, write:

SELECT DISTINCT
  location, 
  FIRST_VALUE (time) OVER (PARTITION BY location ORDER BY time DESC),
  FIRST_VALUE (report) OVER (PARTITION BY location ORDER BY time DESC)
FROM weather_reports
ORDER BY location

From a readability perspective, I would definitely prefer the standard SQL syntax over DISTINCT ON.

Want to play around with it? Here’s some sample data:

create table weather_reports (location text, time date, report text);
insert into weather_reports values ('X', DATE '2000-01-01', 'X1');
insert into weather_reports values ('X', DATE '2000-01-02', 'X2');
insert into weather_reports values ('X', DATE '2000-01-03', 'X3');
insert into weather_reports values ('Y', DATE '2000-01-03', 'Y1');
insert into weather_reports values ('Y', DATE '2000-01-05', 'Y2');
insert into weather_reports values ('Z', DATE '2000-01-04', 'Z1');

The result being:

|location|time      |report|
|--------|----------|------|
|X       |2000-01-03|X3    |
|Y       |2000-01-05|Y2    |
|Z       |2000-01-04|Z1    |

Notice that jOOQ already supports PostgreSQL DISTINCT ON and in the future, we might emulate it for other dialects using the above technique: https://github.com/jOOQ/jOOQ/issues/3564

Quantified LIKE ANY predicates in jOOQ 3.12

Quantified comparison predicates

One of SQL’s weirdes features are quantified comparison predicates. I’ve hardly ever seen these in the wild:

SELECT *
FROM t
WHERE id = ANY (1, 2, 3)

The above example is equivalent to using the much more readable IN predicate:

SELECT *
FROM t
WHERE id IN (1, 2, 3)

This equivalence is defined in the SQL standard. There are more esoteric cases that could be solved using such quantified comparison predicates more conveniently than otherwise, such as:

SELECT *
FROM t
WHERE (a, b) > ALL (
  SELECT x, y
  FROM u
)

This is the same thing as writing the more verbose, and in my opinion, a bit less readable:

SELECT *
FROM t
WHERE (a, b) > (
  SELECT x, y
  FROM u
  ORDER BY x, y
  FETCH FIRST ROW ONLY
)

Assuming, of course, that your RDBMS can compare row value expressions like that.

Quantified LIKE predicate

Unfortuantely, the SQL standard and most implementations support the above quantified comparison predicates only for the <, <=, >, >=, =, != comparison operators. Not for other predicate types. For example, the LIKE predicate would profit greatly from such a syntax:

SELECT *
FROM customers
WHERE last_name LIKE ANY ('A%', 'B%', 'C%')

The syntax is immediately understandable and translates to

SELECT *
FROM customers
WHERE last_name LIKE 'A%'
OR last_name LIKE 'B%'
OR last_name LIKE 'C%'

… which is much less convenient to write!

Furthermore, imagine producing such patterns from a subquery:

SELECT *
FROM customers
WHERE last_name LIKE ANY (
  SELECT pattern
  FROM patterns
  WHERE pattern.customer_type = customer.customer_type
)

This is a bit trickier to emulate in standard SQL. For example, in PostgreSQL, we could write:

SELECT *
FROM customers
WHERE true = ANY (
  SELECT last_name LIKE pattern
  FROM patterns
  WHERE pattern.customer_type = customer.customer_type
)

In this case, we can use boolean types. Oracle would make this a bit more difficult:

SELECT *
FROM customers
WHERE 1 = ANY (
  SELECT CASE
    WHEN last_name LIKE pattern THEN 1
    WHEN NOT(last_name LIKE pattern) THEN 0
    ELSE NULL
  END
  FROM patterns
  WHERE pattern.customer_type = customer.customer_type
)

Wouldn’t this a useful SQL feature to support?

jOOQ 3.12 support for this

jOOQ does support this syntax starting from jOOQ 3.12. You can now write

ctx.selectFrom(CUSTOMERS)
   .where(CUSTOMERS.LAST_NAME.like(any("A%", "B%", "C%")))
   .fetch();

ctx.selectFrom(CUSTOMERS)
   .where(CUSTOMERS.LAST_NAME.like(any(
      select(PATTERNS.PATTERN)
      .from(PATTERNS)
      .where(PATTERN.CUSTOMER_TYPE.eq(CUSTOMER.CUSTOMER_TYPE))
   )))
   .fetch();

All the previously mentioned emulations are available. You can play around with it downloading jOOQ:
https://www.jooq.org/download

Or directly on our website:
https://www.jooq.org/translate