Stop Mapping Stuff in Your Middleware. Use SQL’s XML or JSON Operators Instead

It’s been a while since I’ve ranted on this blog, but I was recently challenged by a Reddit thread to write about this topic, so here goes…

So, you’re writing a service that produces some JSON from your database model. What do you need? Let’s see:

  • Read a book on DDD
  • Read another book on DDD
  • Write some entities, DTOs, factories, and factory builders
  • Discuss whether your entities, DTOs, factories, and factory builders should be immutable, and use Lombok, Autovalue, or Immutables to ease the pain of construction of said objects
  • Discuss whether you want to use standard JPA, or Hibernate specific features for your mapping
  • Plug in Jackson, the XML and JSON mapper library, because you’ve read a nice blog post about it
  • Debug 1-2 problems arising from combining Jackson, JAXB, Lombok, and JPA annotations. Minor thing
  • Debug 1-2 N+1 cases

STOP IT

No, seriously. Just stop it right there!

What you needed was this kind of JSON structure, exported from your favourite Sakila database:

[{
  "first_name": "PENELOPE",
  "last_name": "GUINESS",
  "categories": [{
    "name": "Animation",
    "films": [{
      "title": "ANACONDA CONFESSIONS"
    }]
   }, {
    "name": "Family",
    "films": [{
      "title": "KING EVOLUTION"
    }, {
      "title": "SPLASH GUMP"
    }]
  }]
}, {
   ...

In English: We need a list of actors, and the film categories they played in, and grouped in each category, the individual films they played in.

Let me show you how easy this is with SQL Server SQL (all other database dialects can do it these days, I just happen to have a SQL Server example ready):

-- 1) Produce actors
SELECT
  a.first_name,
  a.last_name, (

    -- 2) Nest categories in each actor
    SELECT
      c.name, (

        -- 3) Nest films in each category
        SELECT title
        FROM film AS f
        JOIN film_category AS fc ON f.film_id = fc.film_id
        JOIN film_actor AS fa ON fc.film_id = fa.film_id
        WHERE fc.category_id = c.category_id
        AND a.actor_id = fa.actor_id
        FOR JSON PATH -- 4) Turn into JSON
      ) AS films
    FROM category AS c
    JOIN film_category AS fc ON c.category_id = fc.category_id
    JOIN film_actor AS fa ON fc.film_id = fa.film_id
    WHERE fa.actor_id = a.actor_id
    GROUP BY c.category_id, c.name
    FOR JSON PATH -- 4) Turn into JSON
  ) AS categories
FROM
  actor AS a 
FOR JSON PATH, ROOT ('actors') -- 4) Turn into JSON

That’s it. That’s all there is to it. Only basic SQL-92, enhanced with some vendor-specific JSON export syntax. (There are also SQL standard JSON APIs as implemented in other RDBMS). Let’s discuss it quickly:

  1. The outermost query produces a set of actors, as you would have expected
  2. For each actor, a correlated subquery produces a nested JSON array of categories
  3. For each category, another correlated subquery finds all the films per actor and category
  4. Finally, turn all the result structures into JSON

That’s it.

Want to change the result structure? Super easy. Just modify the query accordingly. No need to modify:

  • Whatever you thought your DDD “root aggregate” was
  • Your gazillion entities, DTOs, factories, and factory builders
  • Your gazillion Lombok, Autovalue, or Immutables annotations
  • Your hacks and workarounds to get this stuff through your standard JPA, or Hibernate specific features for your mapping
  • Your gazillion Jackson, the XML and JSON mapper library annotations
  • Debugging another 1-2 problems arising from combining Jackson, JAXB, Lombok, and JPA annotations
  • Debugging another 1-2 N+1 cases

No! No need! It’s so simple. Just stream the JSON directly from the database to the client using whatever SQL API of your preference: JDBC, jOOQ, JdbcTemplate, MyBatis, or even JPA native query. Just don’t go mapping that stuff in the middleware if you’re not consuming it in the middleware. Let me repeat that for emphasis:

Don’t go mapping that stuff in the middleware if you’re not consuming it in the middleware.
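
For what it’s worth, here is a minimal JDBC sketch of that pass-through. It assumes SQL Server, which may hand back a long FOR JSON result split across several rows of a single column, so the chunks are simply written out in order. The names JsonPassThrough and streamJson are made up for illustration, and the Connection and Writer come from wherever your service already gets them.

```java
import java.io.IOException;
import java.io.Writer;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, not part of any library
class JsonPassThrough {

    // Runs a query that already produces JSON (e.g. one ending in FOR JSON PATH)
    // and copies the database's output to the client without mapping anything.
    static void streamJson(Connection connection, String sql, Writer out)
            throws SQLException, IOException {
        try (Statement stmt = connection.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {

            // Assumption: a long document may arrive as several rows of one
            // column, so the first column of every row is appended in order.
            while (rs.next()) {
                String chunk = rs.getString(1);
                if (chunk != null)
                    out.write(chunk);
            }
        }
        out.flush();
    }
}
```

No entities, no DTOs, no mapper: the middleware only moves characters from the result set to the response.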

Oh, want to switch to XML? Easy. In SQL Server, this amounts to almost nothing but replacing JSON by XML:

SELECT
  a.first_name,
  a.last_name, (
    SELECT
      c.name, (
        SELECT title
        FROM film AS f
        JOIN film_category AS fc ON f.film_id = fc.film_id
        JOIN film_actor AS fa ON fc.film_id = fa.film_id
        WHERE fc.category_id = c.category_id
        AND a.actor_id = fa.actor_id
        FOR XML PATH ('film'), TYPE
      ) AS films
    FROM category AS c
    JOIN film_category AS fc ON c.category_id = fc.category_id
    JOIN film_actor AS fa ON fc.film_id = fa.film_id
    WHERE fa.actor_id = a.actor_id
    GROUP BY c.category_id, c.name
    FOR XML PATH ('category'), TYPE
  ) AS categories
FROM
  actor AS a 
FOR XML PATH ('actor'), ROOT ('actors')

And now, you’re getting:

<actors>
  <actor>
    <first_name>PENELOPE</first_name>
    <last_name>GUINESS</last_name>
    <categories>
      <category>
        <name>Animation</name>
        <films>
          <film>
            <title>ANACONDA CONFESSIONS</title>
          </film>
        </films>
      </category>
      <category>
        <name>Family</name>
        <films>
          <film>
            <title>KING EVOLUTION</title>
          </film>
          <film>
            <title>SPLASH GUMP</title>
          </film>
        </films>
      </category>
      ...

It’s so easy with SQL!

Want to support both without rewriting too much logic? Produce XML and use XSLT to automatically generate the JSON. Whatever.

FAQ, Q&A

But my favourite Java SQL API can’t handle it

So what. Write a view and query that instead.

But this doesn’t fit our architecture

Then fix the architecture.

But SQL is bad

No, it’s great. It’s based on relational algebra and augments it in many many useful ways. It’s a declarative 4GL, the optimiser produces way better execution plans than you could ever imagine (see my talk), and it’s way more fun than your gazillion 3GL mapping libraries.

But SQL is evil because of Oracle

Then use PostgreSQL. It can do JSON.

But what about testing

Just spin up a test database with https://www.testcontainers.org, install your schema with some migration framework like Flyway or Liquibase in it, fill in some sample data, and write your simple integration tests.

But mocking is better

It is not. The more you mock away the database, the more you’re writing your own database.

But I’m paid by the lines of code

Well, good riddance, then.

But what if we have to change the RDBMS

So what? Your management paid tens of millions for the new licensing. They can pay you tens of hundreds to spend 20 minutes rewriting your 5-10 SQL queries. You already wrote the integration tests above.

Anyway. It won’t happen. And if it will, then those few JSON queries will not be your biggest problem.

What was that talk of yours again?

Here, highly recommended:

But we’ve already spent so many person years implementing our middleware

It has a name.

But I’ve read this other blog post…

And now you’ve read mine.

But that’s like 90s style 2 tier architecture

So what? You’ve spent 5% the time to implement it. That’s 95% more time adding value to your customers, rather than bikeshedding mapping technology. I call that a feature.

What about ingestion? We need abstraction over ingestion

No, you don’t. You can send the JSON directly into your database, and transform / normalise it from there, using the same technique. You don’t need middleware abstraction and mapping, you just want middleware abstraction and mapping.

A Guide to SQL Naming Conventions

One of Java’s big strengths, in my opinion, is the fact that most naming conventions have been established by the creators of the language. For example:

  • Class names are in PascalCase
  • Member names are in camelCase
  • Constants are in SNAKE_CASE

If someone does not adhere to these conventions, the resulting code quickly looks non-idiomatic.

What about SQL?

SQL is different. While some people claim UPPER CASE IS FASTEST:

Others do not agree on the “correct” case:

There seems to be a tendency towards writing identifiers in lower case, with no agreement on the case of keywords. Also, in most dialects, people prefer snake_case for identifiers, although in SQL Server, people seem to prefer PascalCase or camelCase.

That’s for style. And I’d love to hear your opinion on style and naming conventions in the comments!

What about naming conventions?

In many languages, naming conventions (of identifiers) are not really relevant, because the way the language designs namespacing leaves relatively little risk of conflict. In SQL, this is a bit different. Most SQL databases support only a 3-4 layered set of namespaces:

  1. Catalog
  2. Schema
  3. Table (or procedure, type)
  4. Column (or parameter, attribute)

Some dialect dependent caveats:

  • While SQL Server supports both catalog AND schema, most dialects only support one of them
  • MySQL treats the catalog (“database”) as the schema
  • Oracle supports a package namespace for procedures, between schema and procedure

In any case, there is no such concept as package (“schema”) hierarchies as there is in languages like Java, which makes namespacing in SQL quite tricky. A problem that can easily happen when writing stored procedures:

FUNCTION get_name (id NUMBER) RETURN VARCHAR2 IS
  result customer.name%TYPE;
BEGIN
  SELECT name
  INTO result
  FROM customer
  WHERE id = id; -- Ehm...

  RETURN result;
END;

As can be seen above, both the CUSTOMER.ID column as well as the GET_NAME.ID parameter could be resolved by the unqualified ID expression. This is easy to work around, but a tedious problem to think of all the time.

Another example is when joining tables, which probably have duplicate column names:

SELECT *
FROM customer c
JOIN address a ON c.id = a.customer_id

This query might produce two ambiguous ID columns: CUSTOMER.ID and ADDRESS.ID. In the SQL language, it is mostly easy to distinguish between them by qualifying them. But in clients (e.g. Java), they are less easy to qualify properly. If we put the query in a view, it gets even trickier.

“Hungarian notation”

Hence, SQL and the procedural languages are a rare case where some type of Hungarian notation could be useful. Unlike with Hungarian notation itself, where the data type is encoded in the name, in this case, we might encode some other piece of information in the name. Here’s a list of rules I’ve found very useful in the past:

1. Prefixing objects by semantic type

Tables, views, and other “tabular things” may quickly conflict with each other. Especially in Oracle, where one does not simply create a schema because of all the security hassles this produces (schemas and users are kinda the same thing, which is nuts of course. A schema should exist completely independently from a user), it may be useful to encode a schema in the object name:

  • C_TABLE: The C_ prefix denotes “customer data”, e.g. as opposed to:
  • S_TABLE: The S_ prefix denotes “system data” or “master data”
  • L_TABLE: The L_ prefix denotes “log data”
  • V_VIEW: The V_ prefix denotes a view
  • P_PARAMETER: The P_ prefix denotes a procedure or function parameter
  • L_VARIABLE: The L_ prefix denotes a local variable

Besides, when using views for security and access control, one might have additional prefixes or suffixes to denote the style of view:

  • _R: The _R suffix denotes read only views
  • _W: The _W suffix denotes writeable (updatable) views

This list is obviously incomplete. I’m undecided whether this is necessarily a good thing in general. For example, should packages, procedures, sequences, constraints be prefixed as well? Often, they do not lead to ambiguities in namespace resolution. But sometimes they do. The important thing, as always, is to be consistent with a ruleset. So, once this practice is embraced, it should be applied everywhere.

2. Singular or plural table names

Who cares. Just pick one and use it consistently.

3. Establishing standard aliasing

Another technique that I’ve found very useful in the past is a standard approach to aliasing things. We need to alias tables all the time, e.g. in queries like this:

SELECT *
FROM customer c
JOIN address a ON c.id = a.customer_id

But what if we have to join ACCOUNT as well? We already used A for ADDRESS, so we cannot reuse A. But if we don’t re-use the same aliases in every query, the queries start to be a bit confusing to read.

We could just not use aliases and always fully qualify all identifiers:

SELECT *
FROM customer
JOIN address ON customer.id = address.customer_id

But that quickly turns out to be verbose, especially with longer table names, so also not very readable. The standard approach to aliasing things I’ve found very useful is to use this simple algorithm that produces 4 letter aliases for every table. Given the Sakila database, we could establish:

PREFIX  TABLE NAME
ACTO    ACTOR
ADDR    ADDRESS
CATE    CATEGORY
CITY    CITY
COUN    COUNTRY
CUST    CUSTOMER
FILM    FILM
FIAC    FILM_ACTOR
FICA    FILM_CATEGORY
FITE    FILM_TEXT
INVE    INVENTORY
LANG    LANGUAGE
PAYM    PAYMENT
RENT    RENTAL
STAF    STAFF
STOR    STORE

The algorithm to shorten a table name is simple:

  • If the name does not contain an underscore, take the first four letters, e.g. CUSTOMER becomes CUST
  • If the name contains 1 underscore, take the first two letters of each word, e.g. FILM_ACTOR becomes FIAC
  • If the name contains 2 underscores, take the first two letters of the first word, and the first letter of the other words, e.g. FILM_CATEGORY_DETAILS becomes FICD
  • If the name contains 3 or more underscores, take the first letter of each word
  • If a new abbreviation causes a conflict with the existing ones, make a pragmatic choice
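
The rules above are mechanical enough to sketch in a few lines of Java. This is only an illustration: the class name TableAliases and its methods are made up, and the last rule (the pragmatic choice on conflicts) is deliberately left to a human.

```java
import java.util.Locale;

// Hypothetical helper implementing the abbreviation rules listed above
class TableAliases {

    static String alias(String tableName) {
        String[] words = tableName.toUpperCase(Locale.ROOT).split("_");
        StringBuilder sb = new StringBuilder();

        switch (words.length) {
            case 1: // No underscore: first four letters
                sb.append(first(words[0], 4));
                break;
            case 2: // 1 underscore: first two letters of each word
                sb.append(first(words[0], 2)).append(first(words[1], 2));
                break;
            case 3: // 2 underscores: two letters of the first word, one of the others
                sb.append(first(words[0], 2))
                  .append(first(words[1], 1))
                  .append(first(words[2], 1));
                break;
            default: // 3 or more underscores: first letter of each word
                for (String word : words)
                    sb.append(first(word, 1));
        }

        return sb.toString();
    }

    // Takes at most n leading characters, tolerating short words
    private static String first(String word, int n) {
        return word.substring(0, Math.min(n, word.length()));
    }

    public static void main(String[] args) {
        for (String table : new String[] { "CUSTOMER", "FILM_ACTOR", "FILM_CATEGORY_DETAILS" })
            System.out.println(alias(table) + " " + table);
    }
}
```

Running this over a schema’s table list and checking the result for duplicates is a cheap way to find the few names that need that pragmatic choice.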

This technique worked well for large-ish schemas with 500+ tables. You’d think that abbreviations like FICD are meaningless, and indeed, they are, at first. But once you start writing a ton of SQL against this schema, you start “learning” the abbreviations, and they become meaningful.

What’s more, you can use these abbreviations everywhere, not just when writing joins:

SELECT 
  cust.first_name,
  cust.last_name,
  addr.city
FROM customer cust
JOIN address addr ON cust.id = addr.customer_id

But also when aliasing columns in views or derived tables:

SELECT 
  cust.first_name AS cust_first_name,
  cust.last_name AS cust_last_name,
  addr.city AS addr_city
FROM customer cust
JOIN address addr ON cust.id = addr.customer_id
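
If you generate such views rather than typing them, the convention is simple enough to script. A small sketch, where PrefixedProjection and project are hypothetical names and the column lists are passed in by hand rather than read from the catalog:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper that renders "alias.column AS alias_column" projections
class PrefixedProjection {

    static String project(String tableAlias, String... columns) {
        return Arrays.stream(columns)
            .map(column -> tableAlias + "." + column + " AS " + tableAlias + "_" + column)
            .collect(Collectors.joining(",\n  "));
    }

    public static void main(String[] args) {
        // Prints a query shaped like the one above
        System.out.println(
            "SELECT\n  " + project("cust", "first_name", "last_name")
          + ",\n  " + project("addr", "city")
          + "\nFROM customer cust"
          + "\nJOIN address addr ON cust.id = addr.customer_id");
    }
}
```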

This becomes invaluable when your queries become more complex (say, 20-30 joins) and you start projecting tons of columns in a library of views that select from other views that select from other views. It’s easy to keep consistent, and you can also easily recognise things like:

  • What table a given column originates from
  • If that column has an index you can use (on a query against the view!)
  • If two columns that look the same (e.g. FIRST_NAME) really are the same

I think that if you work with views extensively (I’ve worked with schemas of 1000+ views in the past), then such a naming convention is almost mandatory.

Conclusion

There isn’t really a “correct” way to name things in any language, including SQL. But given the limitations of some SQL dialects, or the fact that after joining, two names may easily conflict, I’ve found the above two tools very useful in the past: 1) Prefixing identifiers with a hint about their object types, 2) Establishing a standard for aliasing tables, and always aliasing column names accordingly.

When you’re using a code generator like jOOQ’s, the generated column names on views will already include the table name as a prefix, so you can easily “see” what you’re querying.

I’m curious about your own naming conventions, looking forward to your comments in the comment section!

Dogfooding in Product Development

Dogfooding, or eating your own dog food, is a practice that all product developers should implement all the time. According to Wikipedia:

Dogfooding, occurs when an organization uses its own product. This can be a way for an organization to test its products in real-world usage. Hence dogfooding can act as quality control, and eventually a kind of testimonial advertising. Once in the market, dogfooding demonstrates confidence in the developers’ own products

I’ve recently started delivering this talk about API design at conferences, where I mentioned dogfooding as an excellent approach to make sure the user experience of your API is great:

The more you use your own API, the better it gets from a UX and usability perspective.

Dogfooding via tests

We do a lot of dogfooding ourselves, inevitably, as we write tons and tons of tests for jOOQ, to make sure jOOQ works correctly on all the currently 26 RDBMS that we support.

Writing a test for new API means we have to immediately use the new API for the first time. This helps discover the first usability problems. But there are even better ways:

Dogfooding via significant new functionality

Another great way to dogfood is to create significant new features. For example, this new schema diff tool that jOOQ 3.13 will ship with (#9425). It is part of a bigger project that we call DDL interpretation, where we start implementing the DDL part of a database by maintaining an up to date database schema depending on a stream of DDL statements. This will be part of a variety of improvements and value propositions in the area of database change management / SQL migrations. Recently, we’ve published a post about an improved Liquibase integration, which goes in a similar direction.

The schema diff tool is something many database products are offering. For us, it is relatively simple to implement as we:

  • Support a variety of schema meta representations
  • Support a lot of DDL syntax
  • Support exporting schema meta representations as DDL

We’ve always had a type called org.jooq.Meta. That type represents your database. Historically, it just gave jOOQ-style access to JDBC’s java.sql.DatabaseMetaData. But over time, we also started supporting exposing generated jOOQ code, or XML files as org.jooq.Meta.

With jOOQ 3.13, we’ll support interpreting arbitrary DDL (parsed or manually created using the jOOQ API), or also Liquibase XML files to create such an org.jooq.Meta representation. Which can then be turned again into DDL using the new Meta.ddl() method:

System.out.println(
  ctx.meta("")
     .apply("create table t (i int)")
     .apply("alter table t add j int")
     .apply("alter table t alter i set not null")
     .apply("alter table t add primary key (i)")
);

When we run the above program, we’re getting the following output:

create table t(
  i int not null,
  j int null,
  primary key (i)
);

That’s already cool – we can create a snapshot of our schema at any point of a set of database migration scripts. Can we also do the inverse? We can, with the new diff tool:

System.out.println(
  ctx.meta(
    "create table t (i int)"
  ).diff(ctx.meta(
    "create table t ("
  + "i int not null, "
  + "j int null, "
  + "primary key (i))"
  ))
);

The output generated from this is:

alter table t alter i set not null;
alter table t add j integer null;
alter table t add constraint primary key (i);

This is part of an exciting new set of DDL / migration features we can hardly wait to publish with jOOQ 3.13 (around Q1 2020), as we believe it will greatly help with your existing database change management solution, perhaps as a Flyway or Liquibase plugin.

What does this have to do with dogfooding?

While developing this feature, we’ve discovered numerous missing features, which we also implemented for jOOQ 3.13:

  • #7752: Our current meta model for sequences does not remember flags like MAXVALUE
  • #9428: Meta.toString() could just call Meta.ddl(), the DDL export. This is what I’ve shown above, very useful when debugging with small schemas!
  • #9433: Meta.equals() and Meta.hashCode() were not yet implemented, as this didn’t make sense, historically.
  • #9434: Our current DDL export should support reordering objects in alphabetical order for the export. This is useful for equality checks, text-based diffs, etc.
  • #9437: We don’t support ALTER SEQUENCE statements that allow for modifying sequence properties like MAXVALUE yet.
  • #9438: The current emulation of ALTER SEQUENCE .. RESTART hard codes restarting the sequence at 1, when in fact, it could be some other START WITH value
  • #9440: We’ll need a synthetic syntax to drop unnamed foreign keys.

This is just a short extract of things we’ve implemented this week based on our discoveries of what’s missing when implementing the “big” feature: The Meta.diff(Meta) method. The diff() method will be the one highlighted in the release notes of 3.13. But the little things above are what ultimately makes jOOQ so useful – the many little things that were discovered while dogfooding and that help everyone, not just the users that use the diff() method.

Dogfooding via blogging and documentation

But perhaps even better than implementing new features is blogging about or documenting the API. When blogging or documenting, the vendor / maintainer has to put themselves into the position of the user, in particular, the first time user of a specific API.

Imagine, after writing all this code that I as an author love, I have to reset my experience, and pretend I don’t know what to expect. Then, sit down, and write a really good and simple example using the API that I wrote.

The first example I came up with was not like this (as that API didn’t exist yet):

System.out.println(
  ctx.meta("")
     .apply("create table t (i int)")
     .apply("alter table t add j int")
     .apply("alter table t alter i set not null")
     .apply("alter table t add primary key (i)")
);

Instead, it was like this (which works the same way):

System.out.println(
  ctx.meta("create table t (i int);\n"
         + "alter table t add j int;\n"
         + "alter table t alter i set not null;\n"
         + "alter table t add primary key (i);")
);

I thought to myself, do I really need all that string concatenation? I already have this Meta.apply(Queries) method, which would add even more value. But then, my code would look like this:

System.out.println(
  ctx.meta("")
     .apply(queries(createTable("t").column("i", INTEGER)))
     .apply(queries(alterTable("t").add("j", INTEGER)))
     .apply(queries(alterTable("t").alter("i").setNotNull()))
     .apply(queries(alterTable("t").add(primaryKey("i"))))
);

That would be cool, too. I could show that our DSL API and our parser can do the same things. But I didn’t want to show the DSL API in this blog post. It would only distract from my main point. I could, of course, parse the queries explicitly:

System.out.println(
  ctx.meta("")
     .apply(ctx.parse("create table t (i int)"))
     .apply(ctx.parse("alter table t add j int"))
     .apply(ctx.parse("alter table t alter i set not null"))
     .apply(ctx.parse("alter table t add primary key (i)"))
);

But why not just add convenience API that does this for me?

System.out.println(
  ctx.meta("")
     .apply("create table t (i int)")
     .apply("alter table t add j int")
     .apply("alter table t alter i set not null")
     .apply("alter table t add primary key (i)")
);

Where Meta.apply(String) looks like this:

public final Meta apply(String diff) {
  return apply(dsl().parser().parse(diff));
}

That’s mere convenience. It is not needed. But it is very useful:

  • For tests
  • For blog posts
  • For documentation
  • For new users
  • For simple applications

The more thorough, more powerful API that accepts Queries (which wraps a bunch of parsed or hand-constructed jOOQ Query objects) is the real feature here. But convenience helps the user very much!

I would not have discovered this requirement without dogfooding.

Convenience

A much underrated tweet by Brian Goetz is this one:

APIs need a ton of abstractions. The JDK Collector API is a very good example. In Java and in SQL, in order to write a custom aggregate function, you need these four operations:

  • Supplier<A>: A supplier that provides an empty, intermediary data structure to aggregate into
  • BiConsumer<A, T>: An accumulator that accumulates new values from the stream into our intermediary data structure.
  • BinaryOperator<A>: A combiner that combines two intermediary data structures. This is used for parallel streams only.
  • Function<A, R>: The finisher function that extracts the result from the intermediary data structure.
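
To make the four operations concrete, here is a sketch of an AVG-like aggregate assembled by hand with Collector.of(). The class AverageCollector, its Acc state holder, and the choice of returning NaN for an empty stream are all made up for this example.

```java
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collector;
import java.util.stream.Stream;

// Hypothetical example: a hand-written averaging aggregate
class AverageCollector {

    // The intermediary data structure A: a running sum and count
    static final class Acc {
        long sum;
        long count;
    }

    static Collector<Integer, Acc, Double> averaging() {
        // Supplier<A>: an empty intermediary data structure
        Supplier<Acc> supplier = Acc::new;

        // BiConsumer<A, T>: accumulate one stream value into it
        BiConsumer<Acc, Integer> accumulator = (acc, value) -> {
            acc.sum += value;
            acc.count++;
        };

        // BinaryOperator<A>: merge two partial results (parallel streams)
        BinaryOperator<Acc> combiner = (left, right) -> {
            left.sum += right.sum;
            left.count += right.count;
            return left;
        };

        // Function<A, R>: extract the result (NaN for empty input, an arbitrary choice)
        Function<Acc, Double> finisher = acc ->
            acc.count == 0 ? Double.NaN : (double) acc.sum / acc.count;

        return Collector.of(supplier, accumulator, combiner, finisher);
    }

    public static void main(String[] args) {
        System.out.println(Stream.of(1, 2, 3, 4).collect(averaging())); // prints 2.5
    }
}
```

That is a lot of ceremony for something Collectors.averagingInt() already ships.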

Writing a Collector from scratch is super annoying and “low level”. Using convenience API is much better, which is why the JDK has Collectors, jOOλ has Agg, and there are other Collector libraries that provide some basic building blocks. The Stream API does not have to know about all of these building blocks, it just knows a single, abstract type (serves the API). But we users, we don’t want only a single abstraction. We want those convenient building blocks (serves the user).

Convenience is never a requirement, but always a big plus. In APIs like in languages.

Have you noticed this new method on InputStream, which has been added in Java 9?

public long transferTo(OutputStream out) throws IOException {
    Objects.requireNonNull(out, "out");
    long transferred = 0;
    byte[] buffer = new byte[DEFAULT_BUFFER_SIZE];
    int read;
    while ((read = this.read(buffer, 0, DEFAULT_BUFFER_SIZE)) >= 0) {
        out.write(buffer, 0, read);
        transferred += read;
    }
    return transferred;
}

How many times have we written this kind of code? N times (and a big N, too)

How many times have we enjoyed writing this code? 1 time

How many times too many have we written this code? N + 1 times.
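
On the calling side, the whole loop shrinks to one line. A small sketch, assuming Java 9 or later; TransferToDemo and copyToBytes are made-up names, and in-memory streams stand in for real files or sockets:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of InputStream.transferTo(OutputStream)
class TransferToDemo {

    static byte[] copyToBytes(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // The buffer-and-loop code shown above, written once in the JDK
        in.transferTo(out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
            "dogfood".getBytes(StandardCharsets.UTF_8));

        System.out.println(new String(copyToBytes(in), StandardCharsets.UTF_8)); // prints dogfood
    }
}
```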

Dogfooding is a very good way to make sure this kind of convenience our users love us for will make it into products much earlier on. Because we, the vendors, are the first users, and we hate writing this silly code all the time. And who is a better first user to implement cool new features for, if not ourselves?

How to Simulate a Liquibase Migration using H2

This post is part of a new blog series about database migrations, which will cover a variety of database change management topics.

In the near future, we’ll look much more into these topics, hoping to add more value to our users’ existing Flyway, Liquibase, and other integrations where the migration tools can profit a lot from jOOQ’s most recent features, including the parser.

Embracing Liquibase in the next jOOQ versions

In jOOQ 3.13, we’re going to offer a better integration for those users who use Liquibase for database change management. Our existing DDLDatabase parses, translates, and simulates a DDL script based migration against an in-memory H2 database instance.

This is using our built-in SQL translation functionality (test it on our website here: https://www.jooq.org/translate) to translate this Oracle SQL:

create table t (v varchar2(100))

to this H2 SQL:

create table t (v varchar(100))

That’s just a trivial example. More sophisticated translations are possible too. The main purpose of doing this has been, historically, to offer an “offline” jOOQ source code generation step that does not require connecting to an actual Oracle database instance to reverse engineer your schema, which you can represent in form of SQL migration scripts, which you’d typically run with something like Flyway.

In the future, we’ll add more features around this. One thing we’ve been thinking about has been to allow for translating Flyway migrations to other dialects, to make them vendor agnostic.

Simulating Liquibase migrations

Flyway and Liquibase work in quite a similar fashion, with Liquibase offering an additional abstraction layer over the SQL language. While pure SQL migrations are also possible with Liquibase’s SQL change, Liquibase also offers a set of mostly low level DDL command abstractions, such as the ADD COLUMN change.

Using their XML based DSL, you don’t have to remember whether the command is called:

alter table t add i int;
alter table t add column i int;

Or whatever creative syntax your SQL vendor came up with. Notice that, again, you can achieve the same thing with jOOQ’s translator, which you can use as an API or command line interface.

Now, assuming you have the following Liquibase database change log:

<?xml version="1.0" encoding="UTF-8"?>
<databaseChangeLog
    xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog
    http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-3.8.xsd">
    <changeSet author="authorName" id="changelog-1.0">
        <createTable tableName="TAB">
            <column name="COL" type="VARCHAR(10)">
                <constraints nullable="true"
                    primaryKey="false" unique="false" />
            </column>
        </createTable>
    </changeSet>
</databaseChangeLog>

And now, you would like to check if it is correct, i.e. if you can ship it to production as it is. Just create an in-memory H2 connection, and run the file against it:

Properties info = new Properties();
info.put("user", "sa");
info.put("password", "");

try (Connection con = new org.h2.Driver()
        .connect("jdbc:h2:mem:db", info)) {
    Database database = DatabaseFactory.getInstance()
      .findCorrectDatabaseImplementation(new JdbcConnection(con));
    Liquibase liquibase = new Liquibase("/path/to/liquibase.xml", 
      new FileSystemResourceAccessor(), database);
    liquibase.update("");
}

To check, just repeat the table definition a second time

<createTable tableName="TAB">
    <column name="COL" type="VARCHAR(10)">
        <constraints nullable="true"
            primaryKey="false" unique="false" />
    </column>
</createTable>
<createTable tableName="TAB">
    <column name="COL" type="VARCHAR(10)">
        <constraints nullable="true"
            primaryKey="false" unique="false" />
    </column>
</createTable>

and see how H2 complains about the statement:

Table "TAB" already exists; SQL statement:
CREATE TABLE PUBLIC.TAB (COL CLOB) [42101-200] [Failed SQL: (42101) CREATE TABLE PUBLIC.TAB (COL CLOB)]
	at liquibase.changelog.ChangeSet.execute(ChangeSet.java:646)
	at liquibase.changelog.visitor.UpdateVisitor.visit(UpdateVisitor.java:53)
	at liquibase.changelog.ChangeLogIterator.run(ChangeLogIterator.java:83)
	at liquibase.Liquibase.update(Liquibase.java:202)
	at liquibase.Liquibase.update(Liquibase.java:179)
	at liquibase.Liquibase.update(Liquibase.java:175)
	at liquibase.Liquibase.update(Liquibase.java:168)
	at LB.main(LB.java:21)

Using jOOQ’s tool chain to create schema snapshots

Want to export the schema again as DDL? Very easy, just use jOOQ for the task. For example, you can write the following Java code right after the above Liquibase migration simulation:

DSLContext ctx = DSL.using(connection);
ctx.ddl(ctx.meta().getSchemas("PUBLIC").get(0))
    .forEach(System.out::println);

And you can get, in H2 syntax (omitting the liquibase tables):

create table "PUBLIC"."TAB"(
  "COL" varchar(10) null
)

Want to export this to Oracle, instead? And using a custom schema, instead of H2’s PUBLIC? Write this instead:

DSLContext h2 = DSL.using(connection);
DSLContext oracle = DSL.using(ORACLE, new Settings()
   .withRenderMapping(new RenderMapping()
       .withSchemata(new MappedSchema()
           .withInput("PUBLIC")
           .withOutput("MY_SCHEMA"))));
oracle.ddl(h2.meta().getSchemas("PUBLIC").get(0))
    .forEach(System.out::println);

And now, we’re getting:

create table "MY_SCHEMA"."TAB"(
  "COL" varchar2(10) null
)

(notice, this works once #9384 is fixed).

Prefer an XML representation of your schema snapshot? Easy with jOOQ, as well:

DSLContext ctx = DSL.using(connection);
JAXB.marshal(ctx.informationSchema(
   ctx.meta().getSchemas("PUBLIC").get(0)
), System.out);

The output being:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<information_schema xmlns="http://www.jooq.org/xsd/jooq-meta-3.12.0.xsd">
    <schemata>
        <schema>
            <schema_name>PUBLIC</schema_name>
            <comment></comment>
        </schema>
    </schemata>
    <tables>
        <table>
            <table_schema>PUBLIC</table_schema>
            <table_name>TAB</table_name>
            <comment></comment>
        </table>
    </tables>
    <columns>
        <column>
            <table_schema>PUBLIC</table_schema>
            <table_name>TAB</table_name>
            <column_name>COL</column_name>
            <data_type>varchar</data_type>
            <character_maximum_length>10</character_maximum_length>
            <ordinal_position>1</ordinal_position>
            <is_nullable>true</is_nullable>
            <column_default></column_default>
            <comment></comment>
        </column>
    </columns>
</information_schema>

This is a useful file that can be put under version control along with your current set of database change sets, in order to validate your schema, or to export it to other tools that can reverse engineer DDL or XML.

Using Liquibase migrations as jOOQ code generation input

Just like the pre-existing DDLDatabase, starting from jOOQ 3.13, you can simulate your Liquibase migration in-memory against an H2 database to reverse engineer it again for jOOQ’s code generator. This is documented here.

The relevant code generation configuration looks like this (standalone or with Maven):

<configuration>
  <generator>
    <database>
      <name>org.jooq.meta.extensions.liquibase.LiquibaseDatabase</name>
      <properties>
        <property>
          <key>scripts</key>
          <value>src/main/resources/database.xml</value>
        </property>
      </properties>
    </database>
  </generator>
</configuration>

That’s it! As a bonus, you won’t have to manually simulate your migration anymore, as jOOQ’s LiquibaseDatabase already does it for you, behind the scenes, using the same three lines of code:

Database database = DatabaseFactory.getInstance()
  .findCorrectDatabaseImplementation(new JdbcConnection(con));
Liquibase liquibase = new Liquibase("/path/to/liquibase.xml", 
  new FileSystemResourceAccessor(), database);
liquibase.update("");

Stay tuned for more goodies on the topic of SQL migrations and database change management.

A Quick Trick to Make a Java Stream Construction Lazy

One of the Stream API’s greatest features is its laziness. The whole pipeline is constructed lazily, stored as a set of instructions, akin to a SQL execution plan. The pipeline is started only when we invoke a terminal operation. Even then, it is still lazy, meaning that some operations may be short circuited.
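
To make that laziness concrete, here is a minimal, self-contained sketch using only the JDK. The class LazinessDemo and its run() method are made up for this illustration. The sketch records which elements actually travel through the pipeline: none before the terminal operation is called, and only as many as needed once findFirst() short circuits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical demo class, not part of any library
final class LazinessDemo {

    // Reports the elements seen by peek() before and after the terminal operation
    static String run() {
        List<Integer> seen = new ArrayList<>();

        // Building the pipeline only stores the instructions, nothing runs yet
        Stream<Integer> pipeline = Stream.of(1, 2, 3, 4, 5)
            .peek(seen::add)
            .filter(i -> i > 1);

        String before = seen.toString();

        // The terminal operation starts the pipeline and short circuits it
        int first = pipeline.findFirst().get();

        return before + " -> " + seen + ", first = " + first;
    }

    public static void main(String[] args) {
        // Prints: [] -> [1, 2], first = 2
        System.out.println(run());
    }
}
```

Elements 3, 4, and 5 are never pulled from the source, because findFirst() is satisfied as soon as 2 passes the filter.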

Some third party libraries produce streams that are not entirely lazy. For example, jOOQ until version 3.12 eagerly executed a SQL query when calling ResultQuery.stream(), regardless of whether the Stream is consumed afterwards:

try (var stream = ctx.select(T.A, T.B).from(T).stream()) {
    // Not consuming the stream here
}

While this is probably a bug in client code, not executing the statement in this case might still be a useful feature. The exception being, of course, if the query contains a FOR UPDATE clause, in which case the user probably uses Query.execute() instead, if they don’t care about the result.

A more interesting example where laziness helps is the fact that we might not want this query to be executed right away, as we are perhaps still on the wrong thread to execute it. Or we would like any possible exceptions to be thrown from wherever the result is consumed, i.e. where the terminal operation is called. For example:

try (var stream = ctx.select(T.A, T.B).from(T).stream()) {
    consumeElsewhere(stream);
}

And then:

public void consumeElsewhere(Stream<? extends Record> stream) {
    runOnSomeOtherThread(() -> {
        stream.map(r -> someMapping(r))
              .forEach(r -> someConsumer(r));
    });
}

While we’re fixing this in jOOQ 3.13 (https://github.com/jOOQ/jOOQ/issues/4934), you may be stuck on an older version of jOOQ, or have another library do the same thing. Luckily, there’s an easy trick to quickly make a third party stream “lazy”. Flatmap it! Just write this instead:

try (var stream = Stream.of(1).flatMap(
    i -> ctx.select(T.A, T.B).from(T).stream()
)) {
    consumeElsewhere(stream);
}

The following small test illustrates that the stream() is now being constructed lazily:

import java.util.Optional;
import java.util.stream.Stream;

import org.junit.Test;

public class LazyStream {

    @Test(expected = RuntimeException.class)
    public void testEager() {
        Stream<String> stream = stream();
    }

    @Test
    public void testLazyNoTerminalOp() {
        Stream<String> stream = Stream.of(1).flatMap(i -> stream());
    }

    @Test(expected = RuntimeException.class)
    public void testLazyTerminalOp() {
        // The lazily wrapped stream only fails once a terminal operation runs
        Optional<String> result = Stream.of(1).flatMap(i -> stream()).findAny();
    }

    public Stream<String> stream() {
        String[] array = { "heavy", "array", "creation" };

        // Some Resource Problem that might occur
        if (true)
            throw new RuntimeException();

        return Stream.of(array);
    }
}
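
If the trick is needed in several places, it could be wrapped in a small helper. The following is only a sketch: the class LazyStreams and its lazy() method are hypothetical names, not part of the JDK or of jOOQ. A counter shows that the supplier is not called when the stream is created, only when a terminal operation runs.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class, not part of the JDK or jOOQ
final class LazyStreams {

    // Defers the call to supplier.get() until a terminal operation runs
    static <T> Stream<T> lazy(Supplier<Stream<T>> supplier) {
        return Stream.of(1).flatMap(i -> supplier.get());
    }

    // Reports how often the supplier was called before and after consumption
    static String demo() {
        AtomicInteger calls = new AtomicInteger();

        Stream<String> stream = lazy(() -> {
            calls.incrementAndGet();
            return Stream.of("heavy", "array", "creation");
        });

        int before = calls.get();
        String content = stream.collect(Collectors.joining(" "));

        return before + " -> " + calls.get() + ": " + content;
    }

    public static void main(String[] args) {
        // Prints: 0 -> 1: heavy array creation
        System.out.println(demo());
    }
}
```

According to the Stream.flatMap() Javadoc, each mapped stream is closed after its contents have been placed into the outer stream, so resources registered on the inner stream via onClose() are still released.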

Caveat

Depending on the JDK version you’re using, the above approach has its own significant problems. For example, in older versions of JDK 8, flatMap() itself might not be lazy at all! More recent versions of the JDK have fixed that problem, including JDK 8u222: https://bugs.openjdk.java.net/browse/JDK-8225328

How to Map MySQL’s TINYINT(1) to Boolean in jOOQ

MySQL 8 does not yet support the BOOLEAN type as specified in the SQL standard. There is a DDL “type” called BOOL, which is just an alias for TINYINT:

create table t(b bool);

select 
  table_name, 
  column_name, 
  data_type, 
  column_type
from information_schema.columns
where table_name = 't';

The above produces:

TABLE_NAME|COLUMN_NAME|DATA_TYPE|COLUMN_TYPE|
----------|-----------|---------|-----------|
t         |b          |tinyint  |tinyint(1) |

Notice that BOOL translates to a specific “type” of TINYINT, a TINYINT(1), where we might be inclined to believe that the (1) corresponds to some sort of precision, as with NUMERIC types.

However, counterintuitively, that is not the case. The (1) corresponds to the display width of the type when fetching it using some deprecated modes. Consider:

insert into t(b) values (0), (1), (10);
select * from t;

We’re getting:

b |
--|
 0|
 1|
10|

Notice also that MySQL can process non-boolean types as booleans. Running the following statement:

select * from t where b;

We’re getting:

b |
--|
 1|
10|

Using this column as a Boolean column in jOOQ

By default, jOOQ doesn’t recognise such TINYINT(1) columns as boolean columns, because it is totally possible that a user has created such a column without thinking of boolean types, as the above example has shown.

In previous versions of jOOQ, the data type rewriting feature could be used on arbitrary expressions that match the boolean column name, e.g. the below would treat all columns named "B" as BOOLEAN:

<forcedTypes>
  <forcedType>
    <name>BOOLEAN</name>
    <includeExpression>B</includeExpression>
  </forcedType>
</forcedTypes>

With jOOQ 3.12.0 (issue #7719), we can now match this display width as well for MySQL types. That way, you can write this single data type rewriting configuration to treat all TINYINT types of display width 1 as booleans:

<forcedTypes>
  <forcedType>
    <name>BOOLEAN</name>
    <includeTypes>(?i:TINYINT\(1\))</includeTypes>
  </forcedType>
</forcedTypes>
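
To get a feeling for what this regular expression accepts, it can be tried out with plain java.util.regex. This is only a sketch: the class IncludeTypesDemo is made up, and using a full match via Matcher.matches() is an assumption of this example, not a statement about how the code generator applies the expression internally.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of jOOQ
final class IncludeTypesDemo {

    // The same regular expression as in the <includeTypes/> element above
    static final Pattern INCLUDE_TYPES = Pattern.compile("(?i:TINYINT\\(1\\))");

    // Assumption of this sketch: the whole type string has to match
    static boolean matches(String columnType) {
        return INCLUDE_TYPES.matcher(columnType).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("tinyint(1)")); // true, (?i:...) ignores case
        System.out.println(matches("TINYINT(1)")); // true
        System.out.println(matches("tinyint(4)")); // false, other display width
        System.out.println(matches("tinyint"));    // false, no display width
    }
}
```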

Using this configuration in the code generator, the above query:

select * from t where b;

… can now be written as follows, in jOOQ:

selectFrom(T).where(T.B).fetch();

What’s Faster? COUNT(*) or COUNT(1)?

One of the biggest and undead myths in SQL is that COUNT(*) is faster than COUNT(1). Or was it that COUNT(1) is faster than COUNT(*)? Impossible to remember, because there’s really no reason at all why one should be faster than the other. But is the myth justified?

Let’s measure!

How does COUNT(…) work?

But first, let’s look into some theory. The two ways to count things are not exactly the same thing. Why?

  • COUNT(*) counts all the tuples in a group
  • COUNT(<expr>) counts all the tuples in a group for which <expr> evaluates to something that IS NOT NULL

This distinction can be quite useful. Most of the time, we’ll simply COUNT(*) for convenience, but there are (at least) two cases where we don’t want that, for example:

When outer joining

Imagine that in the Sakila database, we have some actors that did not play in any films. Making sure such an actor actually exists:

INSERT INTO actor (actor_id, first_name, last_name)
VALUES (201, 'SUSAN', 'DAVIS');

When inner joining, we might write the following (using PostgreSQL syntax):

SELECT actor_id, a.first_name, a.last_name, count(*) AS c
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
GROUP BY actor_id
ORDER BY c ASC, actor_id ASC;

And we won’t get the newly added SUSAN DAVIS, because of the nature of inner join:

actor_id|first_name |last_name   |c |
--------|-----------|------------|--|
     148|EMILY      |DEE         |14|
      35|JUDY       |DEAN        |15|
     199|JULIA      |FAWCETT     |15|
     186|JULIA      |ZELLWEGER   |16|
      31|SISSY      |SOBIESKI    |18|
      71|ADAM       |GRANT       |18|
       1|PENELOPE   |GUINESS     |19|
      30|SANDRA     |PECK        |19|

So we might change our query to use LEFT JOIN instead

SELECT actor_id, a.first_name, a.last_name, count(*) AS c
FROM actor AS a
LEFT JOIN film_actor AS fa USING (actor_id)
LEFT JOIN film AS f USING (film_id)
GROUP BY actor_id
ORDER BY c ASC, actor_id ASC;

There she is now, but oops, wrong count! She doesn’t have any films, which we have proven before with the INNER JOIN query. Yet we get 1:

actor_id|first_name |last_name   |c |
--------|-----------|------------|--|
     201|SUSAN      |DAVIS       | 1|
     148|EMILY      |DEE         |14|
      35|JUDY       |DEAN        |15|
     199|JULIA      |FAWCETT     |15|
     186|JULIA      |ZELLWEGER   |16|
      31|SISSY      |SOBIESKI    |18|
      71|ADAM       |GRANT       |18|
       1|PENELOPE   |GUINESS     |19|
      30|SANDRA     |PECK        |19|

Her COUNT(*) value is 1, because we do get 1 film tuple for her in the group, with all columns being NULL. The solution is to count the FILM_ID instead, which cannot be NULL in the table (being a primary key), and which can thus be NULL only because of the LEFT JOIN:

SELECT actor_id, a.first_name, a.last_name, count(film_id) AS c
FROM actor AS a
LEFT JOIN film_actor AS fa USING (actor_id)
LEFT JOIN film AS f USING (film_id)
GROUP BY actor_id
ORDER BY c ASC, actor_id ASC;

Notice, we could count other things than the primary key, but with the primary key, we’re quite certain we don’t get any other “accidental” nulls in our groups, which we did not want to exclude from the count value.

Now, we’re getting the correct result:

actor_id|first_name |last_name   |c |
--------|-----------|------------|--|
     201|SUSAN      |DAVIS       | 0|
     148|EMILY      |DEE         |14|
      35|JUDY       |DEAN        |15|
     199|JULIA      |FAWCETT     |15|
     186|JULIA      |ZELLWEGER   |16|
      31|SISSY      |SOBIESKI    |18|
      71|ADAM       |GRANT       |18|
       1|PENELOPE   |GUINESS     |19|
      30|SANDRA     |PECK        |19|
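
The difference between the last two queries can be modelled in a few lines of Java. This is a sketch with made-up names (CountSemantics, countStar(), countExpr()), where a group is represented as the list of FILM_ID values it contains. After the LEFT JOIN, SUSAN DAVIS’s group consists of a single tuple whose FILM_ID is NULL.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Hypothetical demo class modelling the two COUNT flavours
final class CountSemantics {

    // COUNT(*): counts all the tuples in a group
    static long countStar(List<?> group) {
        return group.size();
    }

    // COUNT(<expr>): counts only the tuples where <expr> IS NOT NULL
    static long countExpr(List<?> values) {
        return values.stream().filter(Objects::nonNull).count();
    }

    public static void main(String[] args) {
        // The outer joined group of an actor without films: one NULL film_id
        List<Integer> susan = Collections.singletonList((Integer) null);
        System.out.println(countStar(susan)); // 1
        System.out.println(countExpr(susan)); // 0

        // A made-up group of three films: both counts agree
        List<Integer> other = Arrays.asList(1, 23, 25);
        System.out.println(countStar(other)); // 3
        System.out.println(countExpr(other)); // 3
    }
}
```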

When counting subsets of a group

An even more powerful application of counting only non-null evaluations of an expression is counting only subsets of a group. We’ve already blogged about this technique in our previous post about aggregating several expressions in one single query.

For example, counting in a single query:

  • All actors
  • Actors with their first_name starting with A
  • Actors with their first_name ending with A
  • Actors with their first_name containing A

In SQL:

SELECT 
  count(*),
  count(CASE WHEN first_name LIKE 'A%' THEN 1 END),
  count(CASE WHEN first_name LIKE '%A' THEN 1 END),
  count(CASE WHEN first_name LIKE '%A%' THEN 1 END)
FROM actor;

This yields:

count|count|count|count|
-----|-----|-----|-----|
  201|   13|   30|  105|

This is very useful when pivoting data sets (see also Oracle/SQL Server PIVOT clause).

Notice that PostgreSQL supports the SQL standard FILTER clause for this, which is more convenient and more readable. The above query can be written like this, in PostgreSQL:

SELECT 
  count(*),
  count(*) FILTER (WHERE first_name LIKE 'A%'),
  count(*) FILTER (WHERE first_name LIKE '%A'),
  count(*) FILTER (WHERE first_name LIKE '%A%')
FROM actor;
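
Why the CASE version and the FILTER version count the same thing can be sketched in Java as well. The class SubsetCounts and the four sample names are made up for this illustration (they are not the Sakila data). The CASE expression produces NULL for rows that don’t match, which COUNT then ignores, whereas FILTER removes those rows before counting.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.function.Predicate;

// Hypothetical demo class with made-up sample data
final class SubsetCounts {

    // Made-up first names, not the Sakila data
    static final List<String> NAMES = Arrays.asList("ADAM", "ANGELA", "JULIA", "BOB");

    // Models count(CASE WHEN <when> THEN 1 END): NULLs are not counted
    static long countCase(List<String> rows, Predicate<String> when) {
        return rows.stream()
            .map(r -> when.test(r) ? Integer.valueOf(1) : null)
            .filter(Objects::nonNull)
            .count();
    }

    // Models count(*) FILTER (WHERE <where>): rows are removed before counting
    static long countFilter(List<String> rows, Predicate<String> where) {
        return rows.stream().filter(where).count();
    }

    public static void main(String[] args) {
        Predicate<String> startsWithA = n -> n.startsWith("A");
        Predicate<String> endsWithA = n -> n.endsWith("A");
        Predicate<String> containsA = n -> n.contains("A");

        // Both columns print the same numbers: 2, 2, and 3
        System.out.println(countCase(NAMES, startsWithA) + " " + countFilter(NAMES, startsWithA));
        System.out.println(countCase(NAMES, endsWithA) + " " + countFilter(NAMES, endsWithA));
        System.out.println(countCase(NAMES, containsA) + " " + countFilter(NAMES, containsA));
    }
}
```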

Back to COUNT(*) vs COUNT(1)

Now that we know the theory behind these COUNT expressions, what’s the difference between COUNT(*) and COUNT(1)? There is none, effectively. The 1 expression in COUNT(1) evaluates a constant expression for each row in the group, and it can be proven that this constant expression will never evaluate to NULL, so effectively, we’re running COUNT(*), counting ALL the rows in the group again.

There should be no difference, and parsers / optimisers should be able to recognise this and not do the extra work of checking every expression evaluation for NULL-ness.

I recently saw this discussion on Twitter, though, where Vik Fearing looked up the PostgreSQL sources, showing that PostgreSQL does do the extra work instead of optimising this:

So, I was curious to see if it mattered. I ran a benchmark on the 4 most popular RDBMS, with these results:

  • MySQL: Doesn’t matter. Sometimes COUNT(1) was faster, sometimes COUNT(*) was faster, so all differences were only benchmark artifacts
  • Oracle: Doesn’t matter. Like MySQL
  • PostgreSQL: Does matter (!). COUNT(*) was consistently faster by around 10% on 1M rows, that’s much more than I had expected
  • SQL Server: Doesn’t matter. Like MySQL

The benchmark code can be found in the following gists:

The results are below. Each benchmark run repeated SELECT COUNT(*) FROM t or SELECT COUNT(1) FROM t 100 times on a 1M row table, and then the benchmark was repeated 5 times to mitigate any warmup penalties and be fair with respect to caching.

The times displayed are relative to the fastest run per database product. This removes any distraction that may be caused by interpreting actual execution times as we do not want to compare database products against each other.
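
Such a normalisation could look like the following sketch. The class RelativeTimes, its method, and the raw timings are made up for this illustration; this is not the original benchmark code. Every measurement is divided by the fastest one, so the fastest run gets the value 1.0.

```java
import java.util.Arrays;

// Hypothetical demo class, not the original benchmark code
final class RelativeTimes {

    // Divides every measured time by the fastest one
    static double[] relativeToFastest(long[] nanos) {
        long fastest = Arrays.stream(nanos)
            .min()
            .orElseThrow(() -> new IllegalArgumentException("No measurements"));

        return Arrays.stream(nanos)
            .mapToDouble(n -> (double) n / fastest)
            .toArray();
    }

    public static void main(String[] args) {
        // Made-up raw timings in nanoseconds
        long[] nanos = { 200, 100, 110 };

        // The fastest run (100 ns) maps to 1.0, the slowest (200 ns) to 2.0
        for (double relative : relativeToFastest(nanos))
            System.out.println(relative);
    }
}
```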

The database versions I’ve used are:

  • MySQL 8.0.16 (in Docker)
  • Oracle 18c XE (in Docker)
  • PostgreSQL 11.3 (in Docker)
  • SQL Server 2017 Express (in Windows)

MySQL

No relevant difference, nor a clear winner:

RUN  STMT  RELATIVE_TIME
------------------------
  0     1         1.0079
  0     2         1.0212
  1     1         1.0229
  1     2         1.0256
  2     1         1.0009
  2     2         1.0031
  3     1         1.0291
  3     2         1.0256
  4     1         1.0618
  4     2         1.0000

Oracle

No relevant difference, nor a clear winner:

Run 1, Statement 1 : 1.06874
Run 1, Statement 2 : 1.01982
Run 2, Statement 1 : 1.09175
Run 2, Statement 2 : 1.0301
Run 3, Statement 1 : 1.00308
Run 3, Statement 2 : 1.02499
Run 4, Statement 1 : 1.02503
Run 4, Statement 2 : 1
Run 5, Statement 1 : 1.01259
Run 5, Statement 2 : 1.05828

PostgreSQL

A significant, consistent difference of almost 10%:

RUN 1, Statement 1: 1.00134
RUN 1, Statement 2: 1.09538
RUN 2, Statement 1: 1.00190
RUN 2, Statement 2: 1.09115
RUN 3, Statement 1: 1.00000
RUN 3, Statement 2: 1.09858
RUN 4, Statement 1: 1.00266
RUN 4, Statement 2: 1.09260
RUN 5, Statement 1: 1.00454
RUN 5, Statement 2: 1.09694

Again, I’m surprised by the order of magnitude of this difference. I would have expected it to be less. Curious to hear about your own results in the comments, or further ideas why this is so significant in PostgreSQL.

SQL Server

No relevant difference, nor a clear winner:

Run 1, Statement 1: 1.00442
Run 1, Statement 2: 1.00702
Run 2, Statement 1: 1.00468
Run 2, Statement 2: 1.00000
Run 3, Statement 1: 1.00208
Run 3, Statement 2: 1.00624
Run 4, Statement 1: 1.00780
Run 4, Statement 2: 1.00364
Run 5, Statement 1: 1.00468
Run 5, Statement 2: 1.00702

Conclusion

As of 2019, and given the database versions mentioned above, there is unfortunately a significant difference between COUNT(*) and COUNT(1) in PostgreSQL. Luckily (and this is rare in SQL), none of the other dialects care, and thus, consistently using COUNT(*) rather than COUNT(1) is a slightly better choice for ALL measured database products from this article.

Do note that the benchmark only tried a very simple query! The results might be different when using joins, unions, or any other SQL constructs, or in other edge cases, e.g. when using COUNT() in HAVING or ORDER BY or with window functions, etc.

In any case, there shouldn’t be any difference, and I’m sure that a future PostgreSQL version will optimise the constant expression in the COUNT(<expr>) aggregate function directly in the parser to avoid the extra work.

For other interesting optimisations that do not depend on the cost model, see this article here.