Truth First, or Why You Should Mostly Implement Database First Designs

In this much overdue article, I will explain why I think that in almost all cases, you should implement a “database first” design in your application’s data models, rather than a “Java first” design (or whatever your client language is), the latter approach leading to a long road of pain and suffering, once your project grows.

This article is inspired by a recent Stack Overflow question.

Interesting reddit discussions on /r/java and /r/programming.

Code generation

To my surprise, a small group of first time jOOQ users seem to be appalled by the fact that jOOQ heavily relies on source code generation. No one keeps you from using jOOQ the way you want and you don’t have to use code generation, but the default way to use jOOQ according to the manual is to start with a (legacy) database schema, reverse engineer that using jOOQ’s code generator to get a bunch of classes representing your tables, and then to write type safe queries against those tables:

for (Record2<String, String> record : DSL.using(configuration)
//   ^^^^^^^^^^^^^^^^^^^^^^^ Type information derived from the 
//   generated code referenced from the below SELECT clause

       .select(ACTOR.FIRST_NAME, ACTOR.LAST_NAME)
//           vvvvv ^^^^^^^^^^^^  ^^^^^^^^^^^^^^^ Generated names
       .from(ACTOR)
       .orderBy(1, 2)) {
    // ...
}

The code is generated either manually outside of the build, or automatically with every build. For instance, such a re-generation could follow immediately after a Flyway database migration, which can also be run either manually or automatically.

Source code generation

There are different philosophies, advantages, and disadvantages regarding these manual/automatic approaches, which I don’t want to discuss in this article. But essentially, the point of generated code is that it provides a Java representation of something that we take for granted (a “truth”) either within or outside of our system. In a way, compilers do the same thing when they generate byte code, machine code, or some other type of source code from the original sources – we get a representation of our “truth” in a different language, for whatever reason.

There are many such code generators out there. For instance, XJC can generate Java code from XSD or WSDL files. The principle is always the same:

  • There is some truth (internal or external), like a specification, data model, etc.
  • We need a local representation of that truth in our programming language

And it almost always makes sense to generate that latter, to avoid redundancy.

Type providers and annotation processing

Noteworthy: Another, more modern approach to jOOQ’s particular code generation use-case would be Type Providers, as implemented by F#, in case of which the code is generated by the compiler while compiling. It never really exists in its source form. A similar (but less sophisticated) tool in Java are annotation processors, e.g. Lombok.

In a way, this does the same thing except:

  • You don’t see the generated code (perhaps that’s less appalling to some?)
  • You must ensure the types can be provided, i.e. the “truth” must always be available. Easy in the case of Lombok, which annotates the “truth”. A bit more difficult with database models, which rely on an always available live connection.

What’s the problem with code generation?

Apart from the tricky question whether to trigger code generation manually or automatically, some people seem to think that code must not be generated at all. The reason I hear the most is the idea that it is difficult to set up in a build pipeline. And yes, that is true. There is extra infrastructure overhead. Especially if you’re new to a certain product (like jOOQ, or JAXB, or Hibernate, etc.), setting up an environment takes time you would rather spend learning the API itself and getting value out of it.

If the overhead of learning how the code generator works is too high, then indeed, the API failed to make the code generator easy to use (and later on, to customise). That should be a high priority for any such API. But that’s the only argument against code generation. Other than that, it makes absolutely no sense at all to hand-write the local representation of the internal or external truth.

Many people argue that they don’t have time for that stuff. They need to ship their MVPs. They can finalise their build pipelines later. I say:

“But Hibernate / JPA makes coding Java first easy”

Yes that’s true. And it’s both a bliss and a curse for Hibernate and its users. In Hibernate, you can just write a couple of entities, such as:

@Entity
class Book {
  @Id
  int id;
  String title;
}

And you’re almost set. Let Hibernate generate the boring “details” of how to define this entity in your SQL dialect’s DDL:

CREATE TABLE book (
  id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  title VARCHAR(50),

  CONSTRAINT pk_book PRIMARY KEY (id)
);

CREATE INDEX i_book_title ON book (title);

… and start running the application. That’s really cool to get started quickly and to try out things.

But, huh, wait. I cheated.

  • Will Hibernate really apply that named primary key definition?
  • Will it create the index on TITLE, which I know we’ll need?
  • Will it add an identity specification?

Probably not. While you’re developing your greenfield project, it is convenient to always throw away your entire database and re-generate it from scratch, once you’ve added the additional annotations. So, the Book entity would eventually look like this:

@Entity
@Table(name = "book", indexes = {
  @Index(name = "i_book_title", columnList = "title")
})
class Book {
  @Id
  @GeneratedValue(strategy = IDENTITY)
  int id;
  String title;
}

Cool. Re-generate. Again, this makes it really easy to get started.

But you’ll pay the price later on

At some point, you go to production. And that’s when this model no longer works. Because

Once you go live, you can no longer throw away your database, as your database has become legacy.

From now on, you have to write DDL migration scripts, e.g. using Flyway. And then, what happens to your entities? You can either adapt them manually (so you double the work), or have Hibernate re-generate them for you (how big are your chances of the generation matching your expectations?) You can only lose.

Because once you go to production, you need hotfixes. And those have to go live fast. And since you didn’t prepare for pipelining your migrations to production smoothly, you’ll patch things wildly. And then you run out of time to do it right™. And you’ll blame Hibernate, because it’s always someone else’s fault…

Instead, you could have done things entirely differently from the beginning. Like using those round wheels.

Go Database First

The real “truth” of your database schema, and the “sovereignty” over it, resides with your database. The database is the only place where the schema is defined, and all clients have a copy of the database schema, not vice versa. The data is in your database, not in your client, so it makes perfect sense to enforce the schema and its integrity in the database, right where the data is.

This is old wisdom, nothing new. Primary and unique keys are good. Foreign keys are good. Check constraints are good. Assertions (when they’re finally implemented) are good.

And that’s not where it ends. For instance, if you’re using Oracle, you may want to specify:

  • In what tablespace your table resides
  • What PCTFREE value it has
  • What the cache size of your sequence (behind the identity) is

Maybe, all of this doesn’t matter in small systems, but you don’t have to go “big data” before you can profit from vendor-specific storage optimisations as the above. None of the ORMs I’ve ever seen (including jOOQ) will allow you to use the full set of DDL options that you may want to use on your database. ORMs offer some tools to help you write DDL.

But ultimately, a well-designed schema is hand written in DDL. All generated DDL is only an approximation of that.

What about the client model?

As mentioned before, you will need a copy of your database schema in your client, a client representation. Needless to say that this client representation needs to be in-sync with the real model. How to best do that? By using a code generator.

All databases expose their meta information through SQL. Here’s how to get all tables from your database in various SQL dialects:

-- H2, HSQLDB, MySQL, PostgreSQL, SQL Server
SELECT table_schema, table_name
FROM information_schema.tables

-- DB2
SELECT tabschema, tabname
FROM syscat.tables

-- Oracle
SELECT owner, table_name
FROM all_tables

-- SQLite
SELECT name
FROM sqlite_master

-- Teradata
SELECT databasename, tablename
FROM dbc.tables

These queries (or similar ones, e.g. depending on whether views, materialised views, table valued functions should also be considered) are also run by JDBC’s DatabaseMetaData.getTables() call, or by the jOOQ-meta module.

From the result of such queries, it’s relatively easy to generate any client representation of your database model, regardless what your client technology is.

  • If you’re using JDBC or Spring, you can create a bunch of String constants
  • If you’re using JPA, you can generate the entities themselves
  • If you’re using jOOQ, you can generate the jOOQ meta model

Depending on the amount of features your client API offers (e.g. jOOQ or JPA), the generated meta model can be really rich and complete. Consider, for instance, jOOQ 3.11’s implicit join feature, which relies on generated meta information about the foreign key relationships between your tables.

Now, any database increment will automatically lead to updated client code. For instance, imagine:

ALTER TABLE book RENAME COLUMN title TO book_title;

Would you really want to do this work twice? No way. Just commit the DDL, run it through your build pipeline, and have an updated entity:

@Entity
@Table(name = "book", indexes = {

  // Would you have thought of this?
  @Index(name = "i_book_title", columnList = "book_title")
})
class Book {
  @Id
  @GeneratedValue(strategy = IDENTITY)
  int id;

  @Column("book_title")
  String bookTitle;
}

Or an updated jOOQ class. Plus: Your client code might no longer compile, which can be a good thing! Most DDL changes are also semantic changes, not just syntactic ones. So, it’s great to be able to see in compiled client source code, what code is (or may be) affected by your database increment.

A single truth

Regardless what technology you’re using, there’s always one model that contains the single truth for a subsystem – or at least, we should aim for this goal and avoid the enterprisey mess where “truth” is everywhere and nowhere. It just makes everything much simpler. If you exchange XML files with some other system, you’re going to use XSD. Like jOOQ’s INFORMATION_SCHEMA meta model in XML form:
https://www.jooq.org/xsd/jooq-meta-3.10.0.xsd

  • XSD is well understood
  • XSD specifies XML content very well, and allows for validation in all client languages
  • XSD can be versioned easily, and evolved backwards compatibly
  • XSD can be translated to Java code using XJC

The last bullet is important. When communicating with an external system through XML messages, we want to be sure our messages are valid. That’s really really easy to do with JAXB, XJC, and XSD. It would be outright nuts to think that a Java-first approach where we design our messages as Java objects could somehow be reasonably mapped to XML for someone else to consume. That generated XML would be of very poor quality, undocumented, and hard to evolve. If there’s an SLA on such an interface, we’d be screwed.

Frankly, that’s what happens to JSON APIs all the time, but that’s another story, another rant…

Databases: Same thing

When you’re using databases, it’s the same thing. The database owns its data and it should be the master of the schema. All modifications to the schema should be implemented using DDL directly, to update the single truth.

Once that truth is updated, all clients need to update their copies of the model as well. Some clients may be written in Java, using either (or both of) jOOQ and Hibernate, or JDBC. Other clients may be written in Perl (good luck to them). Even other clients may be written in C#. It doesn’t matter. The main model is in the database. ORM-generated models are of poor quality, not well documented, and hard to evolve.

So, don’t do it. And, don’t do it from the very beginning. Instead, go database first. Build a deployment pipeline that can be automated. Include code generators to copy your database model back into the clients. And stop worrying about code generation. It’s a good thing. You’ll be productive. All it takes is a bit of initial effort to set it up, and you’ll get years of improved productivity for the rest of your project.

Thank me later.

Clarification

Just to be sure: This article in no way asserts that your database model should be imposed on your entire system (e.g. your domain, your business logic, etc. etc.). The claim I made here is that client code interacting with the database should act upon the database model, and not have its own first class model of the database instead. This logic typically resides in the data access layer of your client.

In 2-tier architectures, which still have their place sometimes, that may be the only model of your system. In most systems, however, I consider the data access layer a “subsystem” that encapsulates the database model. So, there.

Exceptions

There are always exceptions, and I promised that the database first and code generation approach may not always be the right choice. These exceptions are (probably not exhaustive):

  • When the schema is unknown and must be discovered. E.g. you’re a tool vendor helping users navigate any schema. Duh… No code generation. But still database first.
  • When the schema needs to be generated on the fly for some task. This sounds a lot like a more or less sophisticated version of the entity attribute value pattern, i.e. you don’t really have a well-defined schema. In that case, it’s often not even sure if an RDBMS will be the right choice.

The nature of exceptions is that they’re exceptional. In the majority of RDBMS usage, the schema is known in advance, placed inside the RDBMS as the single source of “truth”, and clients will have derived copies from it – ideally generated using a code generator.

Flyway and jOOQ for Unbeatable SQL Development Productivity

When performing database migrations, we at Data Geekery recommend using jOOQ with Flyway – Database Migrations Made Easy. In this post, we’re going to look into a simple way to get started with the two frameworks.

Philosophy

There are a variety of ways how jOOQ and Flyway could interact with each other in various development setups. In this tutorial we’re going to show just one variant of such framework team play – a variant that we find particularly compelling for most use cases.

The general philosophy and workflow behind the following approach can be summarised as this:

  • 1. Database increment
  • 2. Database migration
  • 3. Code re-generation
  • 4. Development

The four steps above can be repeated time and again, every time you need to modify something in your database. More concretely, let’s consider:

  • 1. Database increment – You need a new column in your database, so you write the necessary DDL in a Flyway script
  • 2. Database migration – This Flyway script is now part of your deliverable, which you can share with all developers who can migrate their databases with it, the next time they check out your change
  • 3. Code re-generation – Once the database is migrated, you regenerate all jOOQ artefacts (see code generation), locally
  • 4. Development – You continue developing your business logic, writing code against the udpated, generated database schema

0.1. Maven Project Configuration – Properties

The following properties are defined in our pom.xml, to be able to reuse them between plugin configurations:

<properties>
    <db.url>jdbc:h2:~/flyway-test</db.url>
    <db.username>sa</db.username>
</properties>

0.2. Maven Project Configuration – Dependencies

While jOOQ and Flyway could be used in standalone migration scripts, in this tutorial, we’ll be using Maven for the standard project setup. You will also find the source code of this tutorial on GitHub, and the full pom.xml file here.

These are the dependencies that we’re using in our Maven configuration:

<!-- We'll add the latest version of jOOQ 
     and our JDBC driver - in this case H2 -->
<dependency>
    <groupId>org.jooq</groupId>
    <artifactId>jooq</artifactId>
    <version>3.4.0</version>
</dependency>
<dependency>
    <groupId>com.h2database</groupId>
    <artifactId>h2</artifactId>
    <version>1.4.177</version>
</dependency>

<!-- For improved logging, we'll be using 
     log4j via slf4j to see what's going
     on during migration and code generation -->
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.16</version>
</dependency>
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.7.5</version>
</dependency>

<!-- To esnure our code is working, we're
     using JUnit -->
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.11</version>
    <scope>test</scope>
</dependency>

0.3. Maven Project Configuration – Plugins

After the dependencies, let’s simply add the Flyway and jOOQ Maven plugins like so. The Flyway plugin:

<plugin>
    <groupId>org.flywaydb</groupId>
    <artifactId>flyway-maven-plugin</artifactId>
    <version>3.0</version>

    <!-- Note that we're executing the Flyway
         plugin in the "generate-sources" phase -->
    <executions>
        <execution>
            <phase>generate-sources</phase>
            <goals>
                <goal>migrate</goal>
            </goals>
        </execution>
    </executions>

    <!-- Note that we need to prefix the db/migration
         path with filesystem: to prevent Flyway
         from looking for our migration scripts
         only on the classpath -->
    <configuration>
        <url>${db.url}</url>
        <user>${db.username}</user>
        <locations>
            <location>filesystem:src/main/resources/db/migration</location>
        </locations>
    </configuration>
</plugin>

The above Flyway Maven plugin configuration will read and execute all database migration scripts from src/main/resources/db/migrationprior to compiling Java source code. While the official Flyway documentation suggests that migrations be done in the compile phase, the jOOQ code generator relies on such migrations having been done prior to code generation.

After the Flyway plugin, we’ll add the jOOQ Maven Plugin. For more details, please refer to the manual’s section about the code generation configuration.

<plugin>
    <groupId>org.jooq</groupId>
    <artifactId>jooq-codegen-maven</artifactId>
    <version>${org.jooq.version}</version>

    <!-- The jOOQ code generation plugin is also 
         executed in the generate-sources phase,
         prior to compilation -->
    <executions>
        <execution>
            <phase>generate-sources</phase>
            <goals>
                <goal>generate</goal>
            </goals>
        </execution>
    </executions>

    <!-- This is a minimal working configuration. 
         See the manual's section about the code
         generator for more details -->
    <configuration>
        <jdbc>
            <url>${db.url}</url>
            <user>${db.username}</user>
        </jdbc>
        <generator>
            <database>
                <includes>.*</includes>
                <inputSchema>FLYWAY_TEST</inputSchema>
            </database>
            <target>
                <packageName>org.jooq.example.flyway.db.h2</packageName>
                <directory>target/generated-sources/jooq-h2</directory>
            </target>
        </generator>
    </configuration>
</plugin>

This configuration will now read the FLYWAY_TEST schema and reverse-engineer it into the target/generated-sources/jooq-h2 directory, and within that, into the org.jooq.example.flyway.db.h2 package.

1. Database increments

Now, when we start developing our database. For that, we’ll create database increment scripts, which we put into thesrc/main/resources/db/migration directory, as previously configured for the Flyway plugin. We’ll add these files:

  • V1__initialise_database.sql
  • V2__create_author_table.sql
  • V3__create_book_table_and_records.sql

These three scripts model our schema versions 1-3 (note the capital V!). Here are the scripts’ contents

-- V1__initialise_database.sql
DROP SCHEMA flyway_test IF EXISTS;

CREATE SCHEMA flyway_test;
-- V2__create_author_table.sql
CREATE SEQUENCE flyway_test.s_author_id START WITH 1;

CREATE TABLE flyway_test.author (
  id INT NOT NULL,
  first_name VARCHAR(50),
  last_name VARCHAR(50) NOT NULL,
  date_of_birth DATE,
  year_of_birth INT,
  address VARCHAR(50),

  CONSTRAINT pk_t_author PRIMARY KEY (ID)
);
-- V3__create_book_table_and_records.sql
CREATE TABLE flyway_test.book (
  id INT NOT NULL,
  author_id INT NOT NULL,
  title VARCHAR(400) NOT NULL,

  CONSTRAINT pk_t_book PRIMARY KEY (id),
  CONSTRAINT fk_t_book_author_id FOREIGN KEY (author_id) REFERENCES flyway_test.author(id)
);


INSERT INTO flyway_test.author VALUES (next value for flyway_test.s_author_id, 'George', 'Orwell', '1903-06-25', 1903, null);
INSERT INTO flyway_test.author VALUES (next value for flyway_test.s_author_id, 'Paulo', 'Coelho', '1947-08-24', 1947, null);

INSERT INTO flyway_test.book VALUES (1, 1, '1984');
INSERT INTO flyway_test.book VALUES (2, 1, 'Animal Farm');
INSERT INTO flyway_test.book VALUES (3, 2, 'O Alquimista');
INSERT INTO flyway_test.book VALUES (4, 2, 'Brida');

2. Database migration and 3. Code regeneration

The above three scripts are picked up by Flyway and executed in the order of the versions. This can be seen very simply by executing:

mvn clean install

And then observing the log output from Flyway…

[INFO] --- flyway-maven-plugin:3.0:migrate (default) @ jooq-flyway-example ---
[INFO] Database: jdbc:h2:~/flyway-test (H2 1.4)
[INFO] Validated 3 migrations (execution time 00:00.004s)
[INFO] Creating Metadata table: "PUBLIC"."schema_version"
[INFO] Current version of schema "PUBLIC": <>
[INFO] Migrating schema "PUBLIC" to version 1
[INFO] Migrating schema "PUBLIC" to version 2
[INFO] Migrating schema "PUBLIC" to version 3
[INFO] Successfully applied 3 migrations to schema "PUBLIC" (execution time 00:00.073s).

… and from jOOQ on the console:

[INFO] --- jooq-codegen-maven:3.5.0-SNAPSHOT:generate (default) @ jooq-flyway-example ---
[INFO] Using this configuration:
...
[INFO] Generating schemata      : Total: 1
[INFO] Generating schema        : FlywayTest.java
[INFO] ----------------------------------------------------------
[....]
[INFO] GENERATION FINISHED!     : Total: 337.576ms, +4.299ms

4. Development

Note that all of the previous steps are executed automatically, every time someone adds new migration scripts to the Maven module. For instance, a team member might have committed a new migration script, you check it out, rebuild and get the latest jOOQ-generated sources for your own development or integration-test database.

Now, that these steps are done, you can proceed writing your database queries. Imagine the following test case

import org.jooq.Result;
import org.jooq.impl.DSL;
import org.junit.Test;

import java.sql.DriverManager;

import static java.util.Arrays.asList;
import static org.jooq.example.flyway.db.h2.Tables.*;
import static org.junit.Assert.assertEquals;

public class AfterMigrationTest {

    @Test
    public void testQueryingAfterMigration() throws Exception {
        try (Connection c = DriverManager.getConnection("jdbc:h2:~/flyway-test", "sa", "")) {
            Result<?> result =
            DSL.using(c)
               .select(
                   AUTHOR.FIRST_NAME,
                   AUTHOR.LAST_NAME,
                   BOOK.ID,
                   BOOK.TITLE
               )
               .from(AUTHOR)
               .join(BOOK)
               .on(AUTHOR.ID.eq(BOOK.AUTHOR_ID))
               .orderBy(BOOK.ID.asc())
               .fetch();

            assertEquals(4, result.size());
            assertEquals(asList(1, 2, 3, 4), result.getValues(BOOK.ID));
        }
    }
}

If you run the mvn clean install again, the above integration test will now compile and pass!

Reiterate

The power of this approach becomes clear once you start performing database modifications this way. Let’s assume that the French guy on our team prefers to have things his way (no offense intended ;-) ):

-- V4__le_french.sql
ALTER TABLE flyway_test.book 
  ALTER COLUMN title RENAME TO le_titre;

They check it in, you check out the new database migration script, run

mvn clean install

And then observe the log output:

[INFO] --- flyway-maven-plugin:3.0:migrate (default) @ jooq-flyway-example ---
[INFO] --- flyway-maven-plugin:3.0:migrate (default) @ jooq-flyway-example ---
[INFO] Database: jdbc:h2:~/flyway-test (H2 1.4)
[INFO] Validated 4 migrations (execution time 00:00.005s)
[INFO] Current version of schema "PUBLIC": 3
[INFO] Migrating schema "PUBLIC" to version 4
[INFO] Successfully applied 1 migration to schema "PUBLIC" (execution time 00:00.016s).

So far so good, but later on:

[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] C:\...\AfterMigrationTest.java:[24,19] error: cannot find symbol
[INFO] 1 error

When we go back to our Java integration test, we can immediately see that the TITLE column is still being referenced, but it no longer exists:

public class AfterMigrationTest {

    @Test
    public void testQueryingAfterMigration() throws Exception {
        try (Connection c = DriverManager.getConnection("jdbc:h2:~/flyway-test", "sa", "")) {
            Result<?> result =
            DSL.using(c)
               .select(
                   AUTHOR.FIRST_NAME,
                   AUTHOR.LAST_NAME,
                   BOOK.ID,
                   BOOK.TITLE
                   //   ^^^^^ This column no longer exists. 
                   //   We'll have to rename it to LE_TITRE
               )
               .from(AUTHOR)
               .join(BOOK)
               .on(AUTHOR.ID.eq(BOOK.AUTHOR_ID))
               .orderBy(BOOK.ID.asc())
               .fetch();

            assertEquals(4, result.size());
            assertEquals(asList(1, 2, 3, 4), result.getValues(BOOK.ID));
        }
    }
}

Conclusion

This tutorial shows very easily how you can build a rock-solid development process using Flyway and jOOQ to prevent SQL-related errors very early in your development lifecycle – immediately at compile time, rather than in production!

Visit the Flyway website and the jOOQ website.

Liquibase for DB Migrations

I have just now discovered a very nice-looking tool for database migrations: Liquibase

http://www.liquibase.org/

With Liquibase, you can model your DB increments as XML files that will translate to as many as 13 different databases. A sample DB increment (taken from the Liquibase manual):

<!--
  ALTER TABLE PERSON MODIFY COLUMN firstname VARCHAR(5000);
  -->
<modifyColumn tableName="person">
    <column name="firstname" type="varchar(5000)"/>
</modifyColumn>

<!--
  ALTER TABLE ADDRESS ADD CONSTRAINT fk_address_person
  FOREIGN KEY (person_id) REFERENCES person (id);
  -->
<addForeignKeyConstraint constraintName="fk_address_person"
    baseTableName="address" baseColumnNames="person_id"
    referencedTableName="person" referencedColumnNames="id"
/>

<!--
  UPDATE ProductSettings SET property = 'vatCategory'
  WHERE property = 'vat';
  -->
<update tableName="ProductSettings">
    <column name="property" value="vatCategory"/>
    <where>property='vat'</where>
</update>

…and so on. I guess it’s about time to contact the Liquibase folks and ask for cooperation! A fully integrated solution involving database schema management, database schema migration, and jOOQ’s source code generation seem to be the perfect toolset for a Java database developer