Truth First, or Why You Should Mostly Implement Database First Designs

In this much overdue article, I will explain why I think that in almost all cases, you should implement a “database first” design in your application’s data models, rather than a “Java first” design (or whatever your client language is), the latter approach leading to a long road of pain and suffering, once your project grows.

This article is inspired by a recent Stack Overflow question.

Interesting reddit discussions on /r/java and /r/programming.

Code generation

To my surprise, a small group of first time jOOQ users seem to be appalled by the fact that jOOQ heavily relies on source code generation. No one keeps you from using jOOQ the way you want and you don’t have to use code generation, but the default way to use jOOQ according to the manual is to start with a (legacy) database schema, reverse engineer that using jOOQ’s code generator to get a bunch of classes representing your tables, and then to write type safe queries against those tables:

for (Record2<String, String> record : DSL.using(configuration)
//   ^^^^^^^^^^^^^^^^^^^^^^^ Type information derived from the 
//   generated code referenced from the below SELECT clause

       .select(ACTOR.FIRST_NAME, ACTOR.LAST_NAME)
//           vvvvv ^^^^^^^^^^^^  ^^^^^^^^^^^^^^^ Generated names
       .from(ACTOR)
       .orderBy(1, 2)) {
    // ...
}

The code is generated either manually outside of the build, or automatically with every build. For instance, such a re-generation could follow immediately after a Flyway database migration, which can also be run either manually or automatically.

Source code generation

There are different philosophies, advantages, and disadvantages regarding these manual/automatic approaches, which I don’t want to discuss in this article. But essentially, the point of generated code is that it provides a Java representation of something that we take for granted (a “truth”) either within or outside of our system. In a way, compilers do the same thing when they generate byte code, machine code, or some other type of source code from the original sources – we get a representation of our “truth” in a different language, for whatever reason.

There are many such code generators out there. For instance, XJC can generate Java code from XSD or WSDL files. The principle is always the same:

  • There is some truth (internal or external), like a specification, data model, etc.
  • We need a local representation of that truth in our programming language

And it almost always makes sense to generate that latter, to avoid redundancy.

Type providers and annotation processing

Noteworthy: Another, more modern approach to jOOQ’s particular code generation use-case would be Type Providers, as implemented by F#, in case of which the code is generated by the compiler while compiling. It never really exists in its source form. A similar (but less sophisticated) tool in Java are annotation processors, e.g. Lombok.

In a way, this does the same thing except:

  • You don’t see the generated code (perhaps that’s less appalling to some?)
  • You must ensure the types can be provided, i.e. the “truth” must always be available. Easy in the case of Lombok, which annotates the “truth”. A bit more difficult with database models, which rely on an always available live connection.

What’s the problem with code generation?

Apart from the tricky question whether to trigger code generation manually or automatically, some people seem to think that code must not be generated at all. The reason I hear the most is the idea that it is difficult to set up in a build pipeline. And yes, that is true. There is extra infrastructure overhead. Especially if you’re new to a certain product (like jOOQ, or JAXB, or Hibernate, etc.), setting up an environment takes time you would rather spend learning the API itself and getting value out of it.

If the overhead of learning how the code generator works is too high, then indeed, the API failed to make the code generator easy to use (and later on, to customise). That should be a high priority for any such API. But that’s the only argument against code generation. Other than that, it makes absolutely no sense at all to hand-write the local representation of the internal or external truth.

Many people argue that they don’t have time for that stuff. They need to ship their MVPs. They can finalise their build pipelines later. I say:

“But Hibernate / JPA makes coding Java first easy”

Yes that’s true. And it’s both a bliss and a curse for Hibernate and its users. In Hibernate, you can just write a couple of entities, such as:

@Entity
class Book {
  @Id
  int id;
  String title;
}

And you’re almost set. Let Hibernate generate the boring “details” of how to define this entity in your SQL dialect’s DDL:

CREATE TABLE book (
  id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  title VARCHAR(50),

  CONSTRAINT pk_book PRIMARY KEY (id)
);

CREATE INDEX i_book_title ON book (title);

… and start running the application. That’s really cool to get started quickly and to try out things.

But, huh, wait. I cheated.

  • Will Hibernate really apply that named primary key definition?
  • Will it create the index on TITLE, which I know we’ll need?
  • Will it add an identity specification?

Probably not. While you’re developing your greenfield project, it is convenient to always throw away your entire database and re-generate it from scratch, once you’ve added the additional annotations. So, the Book entity would eventually look like this:

@Entity
@Table(name = "book", indexes = {
  @Index(name = "i_book_title", columnList = "title")
})
class Book {
  @Id
  @GeneratedValue(strategy = IDENTITY)
  int id;
  String title;
}

Cool. Re-generate. Again, this makes it really easy to get started.

But you’ll pay the price later on

At some point, you go to production. And that’s when this model no longer works. Because

Once you go live, you can no longer throw away your database, as your database has become legacy.

From now on, you have to write DDL migration scripts, e.g. using Flyway. And then, what happens to your entities? You can either adapt them manually (so you double the work), or have Hibernate re-generate them for you (how big are your chances of the generation matching your expectations?) You can only lose.

Because once you go to production, you need hotfixes. And those have to go live fast. And since you didn’t prepare for pipelining your migrations to production smoothly, you’ll patch things wildly. And then you run out of time to do it right™. And you’ll blame Hibernate, because it’s always someone else’s fault…

Instead, you could have done things entirely differently from the beginning. Like using those round wheels.

Go Database First

The real “truth” of your database schema, and the “sovereignty” over it, resides with your database. The database is the only place where the schema is defined, and all clients have a copy of the database schema, not vice versa. The data is in your database, not in your client, so it makes perfect sense to enforce the schema and its integrity in the database, right where the data is.

This is old wisdom, nothing new. Primary and unique keys are good. Foreign keys are good. Check constraints are good. Assertions (when they’re finally implemented) are good.

And that’s not where it ends. For instance, if you’re using Oracle, you may want to specify:

  • In what tablespace your table resides
  • What PCTFREE value it has
  • What the cache size of your sequence (behind the identity) is

Maybe, all of this doesn’t matter in small systems, but you don’t have to go “big data” before you can profit from vendor-specific storage optimisations as the above. None of the ORMs I’ve ever seen (including jOOQ) will allow you to use the full set of DDL options that you may want to use on your database. ORMs offer some tools to help you write DDL.

But ultimately, a well-designed schema is hand written in DDL. All generated DDL is only an approximation of that.

What about the client model?

As mentioned before, you will need a copy of your database schema in your client, a client representation. Needless to say that this client representation needs to be in-sync with the real model. How to best do that? By using a code generator.

All databases expose their meta information through SQL. Here’s how to get all tables from your database in various SQL dialects:

-- H2, HSQLDB, MySQL, PostgreSQL, SQL Server
SELECT table_schema, table_name
FROM information_schema.tables

-- DB2
SELECT tabschema, tabname
FROM syscat.tables

-- Oracle
SELECT owner, table_name
FROM all_tables

-- SQLite
SELECT name
FROM sqlite_master

-- Teradata
SELECT databasename, tablename
FROM dbc.tables

These queries (or similar ones, e.g. depending on whether views, materialised views, table valued functions should also be considered) are also run by JDBC’s DatabaseMetaData.getTables() call, or by the jOOQ-meta module.

From the result of such queries, it’s relatively easy to generate any client representation of your database model, regardless what your client technology is.

  • If you’re using JDBC or Spring, you can create a bunch of String constants
  • If you’re using JPA, you can generate the entities themselves
  • If you’re using jOOQ, you can generate the jOOQ meta model

Depending on the amount of features your client API offers (e.g. jOOQ or JPA), the generated meta model can be really rich and complete. Consider, for instance, jOOQ 3.11’s implicit join feature, which relies on generated meta information about the foreign key relationships between your tables.

Now, any database increment will automatically lead to updated client code. For instance, imagine:

ALTER TABLE book RENAME COLUMN title TO book_title;

Would you really want to do this work twice? No way. Just commit the DDL, run it through your build pipeline, and have an updated entity:

@Entity
@Table(name = "book", indexes = {

  // Would you have thought of this?
  @Index(name = "i_book_title", columnList = "book_title")
})
class Book {
  @Id
  @GeneratedValue(strategy = IDENTITY)
  int id;

  @Column("book_title")
  String bookTitle;
}

Or an updated jOOQ class. Plus: Your client code might no longer compile, which can be a good thing! Most DDL changes are also semantic changes, not just syntactic ones. So, it’s great to be able to see in compiled client source code, what code is (or may be) affected by your database increment.

A single truth

Regardless what technology you’re using, there’s always one model that contains the single truth for a subsystem – or at least, we should aim for this goal and avoid the enterprisey mess where “truth” is everywhere and nowhere. It just makes everything much simpler. If you exchange XML files with some other system, you’re going to use XSD. Like jOOQ’s INFORMATION_SCHEMA meta model in XML form:
https://www.jooq.org/xsd/jooq-meta-3.10.0.xsd

  • XSD is well understood
  • XSD specifies XML content very well, and allows for validation in all client languages
  • XSD can be versioned easily, and evolved backwards compatibly
  • XSD can be translated to Java code using XJC

The last bullet is important. When communicating with an external system through XML messages, we want to be sure our messages are valid. That’s really really easy to do with JAXB, XJC, and XSD. It would be outright nuts to think that a Java-first approach where we design our messages as Java objects could somehow be reasonably mapped to XML for someone else to consume. That generated XML would be of very poor quality, undocumented, and hard to evolve. If there’s an SLA on such an interface, we’d be screwed.

Frankly, that’s what happens to JSON APIs all the time, but that’s another story, another rant…

Databases: Same thing

When you’re using databases, it’s the same thing. The database owns its data and it should be the master of the schema. All modifications to the schema should be implemented using DDL directly, to update the single truth.

Once that truth is updated, all clients need to update their copies of the model as well. Some clients may be written in Java, using either (or both of) jOOQ and Hibernate, or JDBC. Other clients may be written in Perl (good luck to them). Even other clients may be written in C#. It doesn’t matter. The main model is in the database. ORM-generated models are of poor quality, not well documented, and hard to evolve.

So, don’t do it. And, don’t do it from the very beginning. Instead, go database first. Build a deployment pipeline that can be automated. Include code generators to copy your database model back into the clients. And stop worrying about code generation. It’s a good thing. You’ll be productive. All it takes is a bit of initial effort to set it up, and you’ll get years of improved productivity for the rest of your project.

Thank me later.

Clarification

Just to be sure: This article in no way asserts that your database model should be imposed on your entire system (e.g. your domain, your business logic, etc. etc.). The claim I made here is that client code interacting with the database should act upon the database model, and not have its own first class model of the database instead. This logic typically resides in the data access layer of your client.

In 2-tier architectures, which still have their place sometimes, that may be the only model of your system. In most systems, however, I consider the data access layer a “subsystem” that encapsulates the database model. So, there.

Exceptions

There are always exceptions, and I promised that the database first and code generation approach may not always be the right choice. These exceptions are (probably not exhaustive):

  • When the schema is unknown and must be discovered. E.g. you’re a tool vendor helping users navigate any schema. Duh… No code generation. But still database first.
  • When the schema needs to be generated on the fly for some task. This sounds a lot like a more or less sophisticated version of the entity attribute value pattern, i.e. you don’t really have a well-defined schema. In that case, it’s often not even sure if an RDBMS will be the right choice.

The nature of exceptions is that they’re exceptional. In the majority of RDBMS usage, the schema is known in advance, placed inside the RDBMS as the single source of “truth”, and clients will have derived copies from it – ideally generated using a code generator.

16 thoughts on “Truth First, or Why You Should Mostly Implement Database First Designs

  1. Nice article like my professor in school would say: Database is in the heart of every information system.

    I can say from personal experience working with a bad DB design that it is a nightmare.

    About entity attribute value pattern I would say think 5x or better yet 10x if you rely need it and then DON’T DO IT. You will regret it in a year or more. I had to work with a system like that query’s that should run for 100ms max were running more than 1s and joining in code.

    RDBMS is a powerful tool in right hands and a horrible mess in wrong ones.

  2. The problem is that java is a oop language and the database is relational. They are not compatible. By putting database first you can’t use oop design for your business logic. That can’t be right in an oop language.
    If we ignore the object-relational impedance mismatch and only talk about the database api than i actually agree with the article.

    • The article only talks about the database API to the client, i.e. the client’s data access layer. There isn’t really an impedance mismatch at the access layer. You can map database tuples to OO objects and vice versa, see an earlier article here:
      https://blog.jooq.org/2015/08/26/there-is-no-such-thing-as-object-relational-impedance-mismatch

      If you want to extend this discussion to any kind of modelling, then yes, relational isn’t necessarily a good choice, outside of the database. But there isn’t necessarily an impedance mismatch, because your domain and business logic may be implemented in an entirely different “subsystem”, which doesn’t know anything about databases.

    • I don’t know what “not compatible” means here, but it can’t have the standard meaning of the words, because there’s millions of programs in Java (and other object-oriented languages) that use relational databases.

      I don’t see anything in this article that would preclude “using oop design for your business logic”, either. Can you point to a specific problem that would occur? There’s nothing here that says you can’t use classes or objects in your Java program. In fact, designing the tables and classes separately should make it easier to have classes that work well for both business logic and database storage.

      If you mean simply that it will take a little foresight to ensure your program will work well with your database in the future, that’s true — I think that’s the main point of the article. It’s a completely different claim than “not compatible”, though.

  3. I’ve been using a radically different design: A parameterized element model, where everything is kept in memory and the DB only persists the data for retrieval on system restart. Data is stored as JSON or XML in the DB, so very few columns. Sure, this isn’t good if you’re using millions of records, but that thankfully is never our case. All queries are in-memory and are blazing fast without us even needing indices. Clustering hasn’t been necessary either as the load is low.

    • The relational model solves a lot of problems, including that of normalisation and data integrity. Your model doesn’t seem to need such a solution, so you’re fine with hierarchical (or even networked) data models. But this is not the rule – the rule is that normalisation is a good thing and pays off in the long run for larger systems.

  4. At ToroDB we created very non standard things with jOOQ to store MongoDB documents in Posgtres, including table objects at runtime. It was a funny experience.

  5. Unfortunately not. @ahachete frequently request the dev team to blog about it (in fact, the idea was to write the post on your blog), but we never did it and… it was at 2015… I’m looking at the code right now and barely understand it :(.

    Here you can see how we translated from our *QueryCriteria* model (where we modeled abstract queries, including MongoDB queries) to SQL using jOOQ conditions: https://github.com/torodb/server/blob/94d04bd4c1ca7b88fb07195489f346c06101199f/torod/db-wrappers/postgresql/src/main/java/com/torodb/torod/db/postgresql/query/QueryCriteriaToSQLTranslator.java

    But as I said, it is not easy to understand (and I’m sure there are tons of errors and/or bad practices there

  6. A database first design is definitely a solid approach. I’m not sure why some folks like to hibernate to update the schema when new changes are pushed to production. In my situation, I was just out of school, the tech bubble crashed, and I was responsible for all layers. Keeping 4-5 six figure guys around was not sustainable. My DB skills weren’t exactly great, and my primary goal was to get us making money first, cause you can’t improve anything when you are homeless. Good to say, the company is now a multi-million dollar one and growing like fire. Bad to say, corporate ain’t gonna pause to correct my database sins of the past, and there is a ton of em. I reckon if you’re a mom and pop shop, the DB issues are the lab rat’s problem. When you’re a a million dollar ATM for corporate, the DB issues are corporate’s problem. Good luck to the new guy. 😉

    • Sure, I don’t deny that the option of going Java-first is helpful when bootstrapping new software. As I said:

      […] it’s both a bliss and a curse for Hibernate and its users

      The importance is to know when to stop doing Java-first and move to SQL-first in any product. Your multi million dollar company could be a 1.5 x multi million dollar company if it hadn’t amassed as much technical debt, but as you correctly put it, perhaps it would have never been possible to reach its current stage in the first place.

  7. I’m sorry, but this approach leads to an anemic model in its high-level language. Maybe it will work with simple CRUD.

    • You’re still assuming that the database model and the “high level language” model should be the same, and that’s simply not generally true.

      The first thing that’s not true in your claim is that the language you’re referring to is “high level”. SQL is a 4GL and thus a more advanced, “higher level” language. You’re referring to a “higher layer” in the architecture, not a higher level of sophistication in the language. And the higher layer, if properly designed and layered, doesn’t know too much about the database subsystem anyway, so I’m not quite sure what your criticism is aiming at, here.

  8. I’m definitely in favour of the “database first” design and I believe that for sufficiently complex/scalable applications that are intended to live for years it is the right way of doing things. In my opinion, Hibernate-like approach works for simple, short-lived applications where time to market is more important that long-term maintenance.

    I’m not in peace however with the code generation approach and I wonder how one could apply it in real life. What bothers me is not the price of learning (it’s not that difficult) but the hassle of living with it in dev environment. Subconsciously I feel there’s a chicken and egg problem here. With code generation, I have to have an updated schema before I can even begin to compile. In order to generate/update the schema I need to run flyway/liquibase scripts. One of the approaches with DropWizard is to execute migrations with the same executable as the server itself (different command line arguments). But then we have to compile first. Oops…
    And in addition there’s a challenge to support it the dev environment without accidentally causing a full rebuild here and there. Besides, IDEs never really “like” the generated code in general, there are always little glitches around it.

    When I started with jOOQ I first ran the code generator by hand, studied the result and eventually said “hey, it’s not a rocket science, I can do it myself”. So I do manually the part of defining the table classes (“extends TableImpl”) with their fields. I prefer not to create general purpose “record” classes because I believe every flow should design them for itself. So I agree, it’s a bit of manual work but my personal take that it’s less hassle than maintaining the integration of the code generator with the build/dev process.

    Again, what’s important is “DB first”! The rest is technicalities 🙂

    • Hi Victor,

      Certainly a lot of valid input here. The optimal way to handle this with code generation is to generate manually and check in versioned generated sources. There’s still a pending feature request somewhere to make versioning of generated sources easier. This way, you don’t have the problem of code not compiling.

      Managing the meta model manually is brittle and I don’t think this will work well with more than maybe 10-15 tables. Some jOOQ customers use jOOQ with schemas of 1000 tables or so. Code generation is the only option, here.

      Also, just be aware that TableImpl is internal API and may change incompatibly between minor releases.

      Having said so, there definitely is room for improvement.

      Cheers,
      Lukas

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.