Say NO to Excessive Use of Surrogate Keys if Performance Really Matters to You

We programmers keep cargo culting these wrong ideas. Recently, we said “NO” to Venn diagrams. Today we’re going to say no to surrogate keys.

The surrogate keys vs. natural keys non-debate is one of the most overheated debates in data architecture, and I don’t get why everyone is so emotional. Both sides claim to hold the ultimate truth (just like in the tabs vs. spaces non-debate), and all this dogmatism prevents adding value where it really matters.

This article sheds some light on why you shouldn’t be so dogmatic all the time. By the end of the article, you’ll agree, promised.

What’s so great about surrogate keys?

In case your SQL-speak is a bit rusty, a surrogate key is an artificial identifier. Or how Wikipedia puts it:

A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data, unlike a natural (or business) key which is derived from application data.

There are three clear advantages that surrogate keys have:

  1. You know every table has one, and it probably has the same name everywhere: ID.
  2. It is relatively small (e.g. a BIGINT)
  3. You don’t have to go through the “hassle” of designing your table for 30 seconds longer, trying to find a natural key

There are more advantages (check out the Wikipedia article), and of course, there are disadvantages. Today, I’d like to talk about architectures where EVERY table has a surrogate key, even when it makes absolutely no sense.

When do surrogate keys make absolutely no sense?

I’m currently helping a customer improve performance on their queries that run against a log database. The log database essentially contains relevant information about sessions, and some transactions related to those sessions. (In case you’re jumping to the conclusion “hey, don’t use SQL for logs”: yes, they also use Splunk for unstructured logs. But these logs are structured, and the customer runs a lot of SQL-style analytics against them.)

Here’s what you can imagine is stored in that sessions table (using Oracle):

CREATE TABLE sessions (

  -- Useful values
  sess        VARCHAR2(50 CHAR) NOT NULL PRIMARY KEY,
  ip_address  VARCHAR2(15 CHAR) NOT NULL,
  tracking    VARCHAR2(50 CHAR) NOT NULL,
  
  -- and much more info here
  ...
)

Except that’s not how the schema was designed. It was designed like this:

CREATE TABLE sessions (

  -- Meaningless surrogate key
  id NUMBER(18) NOT NULL PRIMARY KEY,

  -- Useful values
  sess        VARCHAR2(50 CHAR) NOT NULL,
  
  -- Meaningless foreign keys
  ip_id       NUMBER(18) NOT NULL,
  tracking_id NUMBER(18) NOT NULL,
  ...
  
  FOREIGN KEY (ip_id) 
    REFERENCES ip_addresses (ip_id),
  FOREIGN KEY (tracking_id) 
    REFERENCES tracking (tracking_id),
  ...
  
  -- Non-primary UNIQUE key
  UNIQUE (sess)
)

So, this so-called “fact table” (it’s a star schema after all) contains nothing useful, only a set of surrogate keys referencing the interesting values, which are all located in other tables. For instance, if you want to count all sessions for a given IP address, you already need to run a JOIN:

SELECT ip_address, count(*)
FROM ip_addresses
JOIN sessions USING (ip_id)
GROUP BY ip_address

After all, we need the join because what the user really wants to see is the IP address, not the surrogate key. Duh, right? Here’s the execution plan:

------------------------------------------------------
| Operation           | Name         | Rows  | Cost  |
------------------------------------------------------
| SELECT STATEMENT    |              |  9999 |   108 |
|  HASH GROUP BY      |              |  9999 |   108 |
|   HASH JOIN         |              | 99999 |   104 |
|    TABLE ACCESS FULL| IP_ADDRESSES |  9999 |     9 |
|    TABLE ACCESS FULL| SESSIONS     | 99999 |    95 |
------------------------------------------------------

Perfect hash join with two full table scans. In my example database, I have 100k sessions and 10k IP addresses.

Obviously, there’s an index on the IP_ADDRESS column, because I want to be able to filter by it. Something meaningful, like:

SELECT ip_address, count(*)
FROM ip_addresses
JOIN sessions USING (ip_id)
WHERE ip_address LIKE '192.168.%'
GROUP BY ip_address
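
(For the record, the index assumed here would look something like the following. Only the I_IP name is taken from the execution plan below; the exact definition is an assumption.)

CREATE INDEX i_ip ON ip_addresses (ip_address);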

Obviously, the plan is a bit better, because we’re returning less data. Here’s the result:

----------------------------------------------------------------
| Operation                     | Name         | Rows  | Cost  |
----------------------------------------------------------------
| SELECT STATEMENT              |              |     1 |    99 |
|  HASH GROUP BY                |              |     1 |    99 |
|   HASH JOIN                   |              |    25 |    98 |
|    TABLE ACCESS BY INDEX ROWID| IP_ADDRESSES |     1 |     3 |
|     INDEX RANGE SCAN          | I_IP         |     1 |     2 |
|    TABLE ACCESS FULL          | SESSIONS     | 99999 |    95 |
----------------------------------------------------------------

Interesting. We can now use our index statistics to estimate that our predicate will return only one row from the ip_addresses table. Yet, we still get a hash join with significant cost, for what now appears to be a rather trivial query.

What would the world look like without surrogate keys?

Easy: we no longer need a join every time we need something IP-address-related from the sessions table.
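
For reference, here is a minimal sketch of the sessions2 table used in the rest of this article. It simply follows the natural-key design shown at the beginning; the exact column list and the index definition are assumptions, only the I_IP2 name is taken from the execution plan below:

CREATE TABLE sessions2 (

  -- Useful values, no surrogate key
  sess        VARCHAR2(50 CHAR) NOT NULL PRIMARY KEY,
  ip_address  VARCHAR2(15 CHAR) NOT NULL,
  tracking    VARCHAR2(50 CHAR) NOT NULL

  -- and much more info here
);

-- The index backing the filtered query
CREATE INDEX i_ip2 ON sessions2 (ip_address);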

Our two queries become, trivially:

-- All counts
SELECT ip_address, count(*)
FROM sessions2
GROUP BY ip_address

-- Filtered counts
SELECT ip_address, count(*)
FROM sessions2
WHERE ip_address LIKE '192.168.%'
GROUP BY ip_address

The first query yields a simpler execution plan, with around the same cost estimate.

--------------------------------------------------
| Operation          | Name      | Rows  | Cost  |
--------------------------------------------------
| SELECT STATEMENT   |           |   256 |   119 |
|  HASH GROUP BY     |           |   256 |   119 |
|   TABLE ACCESS FULL| SESSIONS2 | 99999 |   116 |
--------------------------------------------------

We don’t seem to gain that much here, but what happens with the filtered query?

------------------------------------------------
| Operation            | Name  | Rows  | Cost  |
------------------------------------------------
| SELECT STATEMENT     |       |     1 |     4 |
|  SORT GROUP BY NOSORT|       |     1 |     4 |
|   INDEX RANGE SCAN   | I_IP2 |   391 |     4 |
------------------------------------------------

OMG! Where did our costs go? Huh, this seems to be extremely fast! Let’s benchmark!

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100000;
BEGIN
  v_ts := SYSTIMESTAMP;
     
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT ip_address, count(*)
      FROM ip_addresses
      JOIN sessions USING (ip_id)
      WHERE ip_address LIKE '192.168.%'
      GROUP BY ip_address
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
     
  dbms_output.put_line('Surrogate: '
    || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
     
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT ip_address, count(*)
      FROM sessions2
      WHERE ip_address LIKE '192.168.%'
      GROUP BY ip_address
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
     
  dbms_output.put_line('Natural  : '
    || (SYSTIMESTAMP - v_ts));
END;
/
Surrogate: +000000000 00:00:03.453000000
Natural  : +000000000 00:00:01.227000000

The improvement is significant, and we don’t have a lot of data here. Now, when you think about it, it is kind of obvious, no? For the semantically exact same query, we either run a JOIN, or we don’t. And this is using very very little data. The actual customer database has hundreds of millions of sessions, where all these JOINs waste valuable resources for nothing but an artificial surrogate key which was introduced… because that’s how things were always done.

And, as always, don’t trust your execution plan costs. Measure. Benchmarking 100 iterations of the unfiltered query (the one that produced a hash join) yields:

Surrogate: +000000000 00:00:06.053000000
Natural  : +000000000 00:00:02.408000000

Still obvious, when you think of it.

Do note that we’re not denormalising yet. There’s still an IP_ADDRESSES table, but now it contains the business key as the primary key (the address itself), not a surrogate key. In rarer querying use cases, we’ll still join that table in order to get IP-address-related info (such as the country).
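
To make that concrete, here’s a rough sketch of what that occasional lookup might look like. The country column and the ip_addresses2 name are made up for this illustration; they’re not taken from the customer schema:

CREATE TABLE ip_addresses2 (
  ip_address VARCHAR2(15 CHAR) NOT NULL PRIMARY KEY,
  country    VARCHAR2(2 CHAR)
  -- more IP-address related info here
);

-- The occasional join, needed only when IP-address metadata is wanted
SELECT s.ip_address, i.country, count(*)
FROM sessions2 s
JOIN ip_addresses2 i ON i.ip_address = s.ip_address
GROUP BY s.ip_address, i.country;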

Taking it to the extreme

The database at this customer site was designed by someone who appreciates purity, and I can certainly relate to that sometimes. But in this case, it clearly went wrong, because there were many of these queries that were actually filtering for a “natural key” (i.e. a business value), but needed to add yet another JOIN just to do that.

There were more of these cases, for instance:

  • ISO 639 language codes
  • ISIN security numbers
  • Account numbers, which had a formal identifier from an external system

and many more. Each and every time, there were dozens of JOINs required just to get that additional mapping between surrogate key and natural key.

Sometimes, surrogate keys are nice. They do use a little less disk space. Consider the following query (credit goes to Stack Overflow user WW):

SELECT 
  owner,
  table_name,
  TRUNC(SUM(bytes)/1024) kb,
  ROUND(ratio_to_report(SUM(bytes)) OVER() * 100) Percent
FROM (
  SELECT 
    segment_name table_name,
    owner,
    bytes
  FROM dba_segments
  WHERE segment_type IN (
    'TABLE',
    'TABLE PARTITION',
    'TABLE SUBPARTITION'
  )
  UNION ALL
  SELECT 
    i.table_name,
    i.owner,
    s.bytes
  FROM dba_indexes i,
    dba_segments s
  WHERE s.segment_name = i.index_name
  AND s.owner          = i.owner
  AND s.segment_type  IN (
    'INDEX',
    'INDEX PARTITION',
    'INDEX SUBPARTITION'
  )
)
WHERE owner = 'TEST'
AND table_name IN (
  'SESSIONS',
  'IP_ADDRESSES',
  'SESSIONS2'
)
GROUP BY table_name, owner
ORDER BY SUM(bytes) DESC;

This will show us the disk space used by each object:

TABLE_NAME                   KB    PERCENT
-------------------- ---------- ----------
SESSIONS2                 12288         58
SESSIONS                   8192         39
IP_ADDRESSES                768          4

Yes. We use a little more disk space, because now our primary key in the sessions table is a VARCHAR2(50) rather than a NUMBER(18). But disk space is extremely cheap, whereas wall-clock performance is essential. By removing just a little complexity, we’ve already greatly increased performance for a simple query.

Conclusion

Surrogate keys or natural keys? I say both. Surrogate keys do have advantages. You generally don’t want a 5-column composite primary key, and you want a 5-column composite foreign key even less. In these cases, using a surrogate key can add value by removing complexity (and probably increasing performance again). But choose wisely. Every so often, blindly using surrogate keys will go very wrong, and you’ll pay for it dearly when it comes to querying your data.
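
To illustrate that trade-off with a quick sketch (all table and column names below are made up for this example, loosely inspired by the ISIN case above): a child table referencing a five-column business key has to repeat all five columns in its foreign key, whereas a surrogate key collapses the reference to a single NUMBER.

-- The parent table, identified by a five-column business key
CREATE TABLE listing (
  isin        VARCHAR2(12 CHAR) NOT NULL,
  exchange    VARCHAR2(10 CHAR) NOT NULL,
  currency    VARCHAR2(3 CHAR)  NOT NULL,
  price_type  VARCHAR2(10 CHAR) NOT NULL,
  valid_from  DATE              NOT NULL,
  PRIMARY KEY (isin, exchange, currency, price_type, valid_from)
);

-- Every child table has to repeat the full composite key
CREATE TABLE price_history (
  isin        VARCHAR2(12 CHAR) NOT NULL,
  exchange    VARCHAR2(10 CHAR) NOT NULL,
  currency    VARCHAR2(3 CHAR)  NOT NULL,
  price_type  VARCHAR2(10 CHAR) NOT NULL,
  valid_from  DATE              NOT NULL,
  price       NUMBER            NOT NULL,
  FOREIGN KEY (isin, exchange, currency, price_type, valid_from)
    REFERENCES listing (isin, exchange, currency, price_type, valid_from)
);

-- With a surrogate key on the parent, the reference collapses to one column
CREATE TABLE listing2 (
  listing_id  NUMBER(18) NOT NULL PRIMARY KEY,
  isin        VARCHAR2(12 CHAR) NOT NULL,
  exchange    VARCHAR2(10 CHAR) NOT NULL,
  currency    VARCHAR2(3 CHAR)  NOT NULL,
  price_type  VARCHAR2(10 CHAR) NOT NULL,
  valid_from  DATE              NOT NULL,
  UNIQUE (isin, exchange, currency, price_type, valid_from)
);

CREATE TABLE price_history2 (
  listing_id  NUMBER(18) NOT NULL REFERENCES listing2 (listing_id),
  price       NUMBER     NOT NULL
);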

Further reading:

There is no Such Thing as Object-Relational Impedance Mismatch

Much of the ORM criticism of the last decade has missed the point and been inaccurate. By the end of this article, we will conclude with the following:

There is no significant difference between the relational (data) model and object oriented models

How to come to this conclusion? Read on!

How we came to believe in this fallacy

Many popular bloggers and opinion leaders have missed no chance to bash ORMs for their “obvious” impedance mismatch with the relational world. N+1, inefficient queries, library complexity, leaky abstractions, all sorts of buzzwords have been employed to dismiss ORMs – often containing a lot of truth, albeit without providing a viable alternative.

But are these articles really criticising the right thing?

Few of these articles recognise a central fact, which has been articulated eloquently and humorously by Erik Meijer and Gavin Bierman in their very interesting paper “A co-Relational Model of Data for Large Shared Data Banks”, subtitled:

Contrary to popular belief, SQL and noSQL are really just two sides of the same coin.

Or in other words: The “hierarchical” object world and the “relational” database world model the exact same thing. The only difference is the direction of the arrows that you draw in your diagrams.

Let this sink in.

  • In the relational model, children point to their parent.
  • In the hierarchical model, parents point to their children.

That’s all there is to it.

(Figure: hierarchical vs. relational, the same model with the arrows reversed)

What is an ORM?

ORMs bridge the gap between the two worlds. They’re the inverters of arrows, if you will. They make sure that every “relation” in your RDBMS can be materialised as an “aggregation” or “composition” in your “hierarchical” world (this works for objects, XML, JSON, and any other format). They make sure that such materialisation is properly transacted, and that changes to individual attributes or to relational (aggregational, compositional) attributes are properly tracked and flushed back into the master model, the database, where the model is persisted. Individual ORMs differ in terms of offered features and in how much mapping logic they offer in addition to mapping individual entities to individual types.

  • Some ORMs may help you implement locking
  • Some may help you to patch model mismatches
  • Some may focus merely on a 1:1 mapping between these classes and tables

But all ORMs do one very simple thing. Ultimately, they take rows from your tables and materialise them as objects in your class model and vice-versa.

By the way, a very nice overview of different ORMs has recently been compiled on the Vertabelo blog.

Tables and classes are the same thing

Give or take 1-2 implementation details, an RDBMS’s table and an OO language’s class are the same thing: a specification of a set of grouped attributes, each with their associated type. Consider the following example, using SQL and Java:

SQL

CREATE TABLE author (
  first_name VARCHAR(50),
  last_name VARCHAR(50)
);

Java

class Author {
  String firstName;
  String lastName;
}

There is absolutely no conceptual difference between the two – the mapping is straightforward. It remains straightforward even when you consider “relations” / “compositions” between different entities / types:

SQL (let’s leave out constraints for simplicity)

CREATE TABLE author (
  id BIGINT,
  first_name VARCHAR(50),
  last_name VARCHAR(50)
);

CREATE TABLE book (
  id BIGINT,
  author_id BIGINT,
  title VARCHAR(50)
);

Java

class Author {
  Long id;
  String firstName;
  String lastName;
  Set<Book> books;
}

class Book {
  Long id;
  Author author;
  String title;
}

The implementation details are omitted (and probably account for half of the criticism). But omitting further details allows for straightforward 1:1 mapping of individual rows from your database to your Java model, without any surprises. Most ORMs – in the Java ecosystem Hibernate in particular – have managed to implement the above idea very well, hiding away all the technical details of actually doing such a model transfer between the RDBMS and Java.

In other words:

There is absolutely nothing wrong with this mapping approach!

Yet: There *IS* an impedance mismatch, somewhere

The “problems” that many bloggers criticise arise not from the non-existing mismatch between the two model representations (“relational” vs. “hierarchical”). The problems arise from SQL, which is a decent implementation of relational algebra.

In fact, the very same mismatch that everyone criticises is also present between these entity models (relational or hierarchical) and relational algebra itself.

Relational algebra has been defined in order to be able to query relations and to form new ad-hoc relations as an output of such queries. Depending on the operations and transformations that are applied, the resulting tuples may have absolutely nothing to do with the individual entities involved in a query. In other, ORM-y words: the product of relational algebra, and in particular of SQL, often cannot be further processed by the ORM, let alone persisted back into the database.
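
To make the point concrete, consider this sketch, reusing the author / book tables from above. The result is a perfectly reasonable ad-hoc relation, but it corresponds to no entity at all, so there is nothing for an ORM to track or persist back:

SELECT a.last_name, count(b.id) AS books_written
FROM author a
LEFT JOIN book b ON b.author_id = a.id
GROUP BY a.last_name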

To make things “worse”, SQL today is a large super-set of the features offered by relational algebra. It has gotten much more useful than when it was conceived.

Why this mismatch still affects modern ORMs

The previous paragraphs outlined the single main reason why ORMs are really criticised, even if such criticism often doesn’t mention this exact reason:

SQL / relational algebra is not really appropriate to partially materialise relations into a client / store changes back into the database. Yet, most RDBMS offer only SQL for that job.

Back to the author / book example. When you want to load and display an author and their books to a web application’s user, you’d like to simply fetch that author and their books, call simple methods like author.add(book) as well as author.remove(book) and let some magic flush your data back into the storage system.

Thinking about the amount of SQL code to be written for such a simple CRUD task makes everyone squeal.

Life’s too short to spend time on CRUD

Perhaps QUEL might have been a better language for CRUD, but that ship has sailed. And unfortunately, because of SQL being an inappropriate language for this job, you cannot ignore that “magic” but have to know well what happens behind the scenes, e.g. by tweaking Hibernate’s fetching strategies.

Translated to SQL, this may be implemented in several ways:

1. Fetching with JOIN

Using outer joins, all the involved entities can be queried in one go:

SELECT author.*, book.*
FROM author
LEFT JOIN book ON author.id = book.author_id
WHERE author.id = ?

Advantages:

  • A single query can be issued and all the data can be transferred at once

Disadvantages:

  • The author attributes are repeated in every tuple, so the client (ORM) has to de-duplicate authors first, before populating the author-book relationship. This can be particularly bad when many nested relations should be fetched at once (see the sketch below).
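
For illustration, here’s roughly what the joined result looks like for a single author (the data is borrowed from the Leonard Cohen example further down; the id values are made up):

id  first_name  last_name   id  title
--  ----------  ----------  --  ---------------
1   Leonard     Cohen       10  Book of Mercy
1   Leonard     Cohen       11  Stranger Music
1   Leonard     Cohen       12  Book of Longing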

2. Fetching with SELECT

A single query is issued for each entity:

SELECT *
FROM author
WHERE id = ?

SELECT *
FROM book
WHERE author_id = ?

Advantages:

  • The amount of data to be transferred is minimal: Each row is transferred exactly once.

Disadvantages:

  • The number of queries issued may explode into the well-known N+1 problem.

Hibernate in particular knows other types of fetch strategies, although they are essentially variants / optimisations of one of the above.

Why not use SQL MULTISET?

The ideal way to fetch all data in this case using advanced SQL would be by using MULTISET:

SELECT author.*, MULTISET (
  SELECT book.*
  FROM book
  WHERE book.author_id = author.id
) AS books
FROM author
WHERE id = ?

The above will essentially create a nested collection for each author:

first_name  last_name   books (nested collection)
--------------------------------------------------

Leonard     Cohen       title
                        --------------------------
                        Book of Mercy
                        Stranger Music
                        Book of Longing

Ernest      Hemingway   title
                        --------------------------
                        For Whom the Bell Tolls
                        The Old Man and the Sea

If you add another nested entity, it is easy to see how another MULTISET could allow for additionally nested data:

SELECT author.*, MULTISET (
  SELECT book.*, MULTISET (
    SELECT l.*
    FROM language AS l
    JOIN book_language AS bl
    ON l.id = bl.language_id
    AND book.id = bl.book_id
  ) AS languages
  FROM book
  WHERE book.author_id = author.id
) AS books
FROM author
WHERE id = ?

The outcome would now be along the lines of:

first_name  last_name   books
-----------------------------------------------------

Leonard     Cohen       title            languages
                        -----------------------------
                        Book of Mercy    language
                                         ------------
                                         en

                        Stranger Music   language
                                         ------------
                                         en
                                         de

                        Book of Longing  language
                                         ------------
                                         en
                                         fr
                                         es

Advantages:

  • A single query can materialise all eager-loaded rows with minimal bandwidth usage.

Disadvantages:

  • None.

Unfortunately, MULTISET is poorly supported by most RDBMS.

MULTISET (along with arrays and other collection types) was formally introduced into the SQL standard as of SQL:2003, as part of an initiative to embed OO features into the SQL language. Oracle, for instance, has implemented much of it, much like Informix did, or the lesser-known CUBRID (although using vendor-specific syntax).
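
As a rough sketch of Oracle’s vendor-specific flavour (the book_t / book_tt collection types are made up for this example):

-- Oracle needs explicitly declared collection types;
-- the MULTISET subquery is then CAST to such a type.
CREATE TYPE book_t  AS OBJECT (title VARCHAR2(50));
CREATE TYPE book_tt AS TABLE OF book_t;

SELECT
  author.first_name,
  author.last_name,
  CAST(MULTISET(
    SELECT book_t(book.title)
    FROM book
    WHERE book.author_id = author.id
  ) AS book_tt) AS books
FROM author;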

Other databases like PostgreSQL allow for aggregating nested rows into typed arrays, which works the same way although with a bit more syntactic effort.
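
In PostgreSQL, for instance, a rough equivalent of the first MULTISET query above could look like this sketch, aggregating each author’s books into an array of book row values:

SELECT
  a.first_name,
  a.last_name,
  (
    SELECT array_agg(b)   -- an array of book rows (the table's composite type)
    FROM book AS b
    WHERE b.author_id = a.id
  ) AS books
FROM author AS a
WHERE a.id = ?;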

MULTISET and other ORDBMS SQL features are the perfect compromise, combining the best of the “relational” model with the best of the “hierarchical” model. They allow for combining CRUD operations with querying in one go, removing the need for sophisticated ORMs, as the SQL language can be used directly to map all your data from your (relational) database to your (hierarchical) client representation with no friction.

Conclusion and call to action!

We’re living through exciting times in our industry. The elephant (SQL) in the room is still here, learning new tricks all the time. The relational model has served us well, and has been enriched with hierarchical models in various implementations. Functional programming is gaining traction, complementing object orientation in very useful ways.

Think of the glue, putting all these great technological concepts together, allowing for:

  • Storing data in the relational model
  • Materialising data in the hierarchical model
  • Processing data using functional programming

That awesome combination of techniques is hard to beat – we’ve shown how SQL and functional programming can work with jOOQ. All that’s missing – in our opinion – is better support for MULTISET and other ORDBMS features from RDBMS vendors.

Thus, we urge you, PostgreSQL developers: you’re creating one of the most innovative databases out there, yet you’re missing out on one of the most awesome SQL feature sets: the ability to construct nested collections (not just arrays), and to query them efficiently. Oracle is ahead of you in this area, but their implementation is too strongly tied to PL/SQL, which makes it clumsy. If you lead the way, other RDBMS will follow.

And we can finally stop wasting time talking about the object-relational impedance non-mismatch.

jOOQ Newsletter December 13, 2013

subscribe to the newsletter here

A jOOQ Runtime Only Distribution

Several of our customers have made us aware of the fact that they don’t really need the jOOQ code generator, only the SQL builder / query DSL API and possibly the SQL execution functionality. In the next weeks, we’ll be working towards a new “jOOQ Runtime Only Distribution” that will ship at a lower price. Concretely, we’ll be giving those customers a 25% discount on the regular distributions for any of jOOQ Express, jOOQ Professional, and jOOQ Enterprise.

Obviously, we recommend using jOOQ’s code generator to profit from the full feature scope that we offer.

License Amendments

We have updated our commercial license to improve your legal relationship with Data Geekery. Essentially, these things have been amended:

  • Definitions: We have added the missing definition of what is a “Minor Defect”
  • 6.2 Distribution Right: The distribution right section now grants you a perpetual license to distribute the software, instead of a time-limited one. This will allow you to continue to distribute, embed and use jOOQ in your End-user Application even if you terminate your Developer Workstation License Agreement with us. You will, however, still need to license jOOQ in order to maintain your End-user Application.
  • 7.1.1 Remedial Services: The remedial services section now formally defines how Minor Defects are remedied

As this updated license grants new rights to our customers, it shall be in effect immediately also for existing jOOQ 3.2 customers.

Upcoming Events

We have recently presented jOOQ at the Java2Days conference in Sofia, Bulgaria as well as at the Java User Group Berlin-Brandenburg. Both talks enjoyed high attendance, with lots of SQL developers challenging us with interesting questions. Of course, jOOQ responds to most questions already, as Manuel Bernhardt, another speaker, has observed.

If you want us to talk about jOOQ or SQL near you, do not hesitate to contact us.

For an overview of other upcoming events, and to stay informed about 2014 events, visit www.jooq.org/news. If you’re looking for the slides, they’re available for free under a CC BY-SA 3.0 license on SlideShare.

Using jOOQ with Hibernate

While jOOQ can be seen as a popular alternative to Hibernate, there is nothing stopping you from using both frameworks in the same application. Hibernate heavily improves everyday CRUD operations, while jOOQ heavily improves writing SQL in Java. The two goals are orthogonal, so it may make perfect sense to combine the two.

Vlad Mihalcea, an enthusiastic jOOQ user, is writing a series of posts on his blog about how to use jOOQ with Hibernate.

Gavin King himself chimed in on Google+ and confirmed once again that it was never his intention for Hibernate to be used for everything.

If you want to share your own experience of your jOOQ / Spring, jOOQ / Hibernate, jOOQ / Anything integration, let us know!

SQL Zone – “Lightning Fast SQL with Proper Indexing”

Understanding SQL and the history of SQL is of the essence when you want to get the best out of your relational database. This has recently been nicely explained by Markus Winand at the Oredev conference 2013. Luckily, his presentation has been recorded for free review.

More about the history of SQL can be seen in this interesting article about Codd’s Relational Vision – Has NoSQL Come Full Circle? It proves what we ourselves have been evangelising for a while: NoSQL is not a new technology. It is more of a return to pre-Codd times, when people did not have a powerful relational model to abstract their storage implementation away from their application. The article claims that Codd’s greatest achievement was that he overcame the deficiencies of early databases:

  • Access dependencies
  • Order dependencies
  • Index dependencies

Read the full article for a very interesting historic insight into the relational model, and why NoSQL might not be a good answer for most general problems.

Use ModelMapper and jOOQ to Regain Control of your Domain Model

One of the things that Hibernate is quite good at is CRUD, i.e. persisting object graphs to the database. This is particularly true if your application runs in a Java domain-model-driven context. Your models are required to adhere to the standards set by JPA/Hibernate, of course. The same applies to mapping relational-model-driven data onto complex object graphs in memory. Again, you’ll have to adhere to the standards set by JPA/Hibernate.

If you’re operating on rather complex relational models, mapping data onto rather complex domain models, then you might want to get back in control of the mapping process, as auto-mapping will cause more headaches than the problems it solves. An interesting approach has recently been shown on the ModelMapper website in an example integration with jOOQ. (Note: there is also an example integration with JDBI.) With the permission of the author Jonathan Halterman, I’m citing this interesting example:

jOOQ Integration

ModelMapper’s jOOQ integration allows you to map a jOOQ Record to a JavaBean.

Setup

To get started, add the modelmapper-jooq Maven dependency to your project:

<dependency>
  <groupId>org.modelmapper</groupId>
  <artifactId>modelmapper</artifactId>
  <version>0.6.1</version>
</dependency>
<dependency>
  <groupId>org.modelmapper</groupId>
  <artifactId>modelmapper-jooq</artifactId>
  <version>0.6.1</version>
</dependency>

Next, configure ModelMapper to support the RecordValueReader, which allows for values to be read and mapped from a jOOQ Record:

modelMapper.getConfiguration()
           .addValueReader(new RecordValueReader());

Example Mapping

Now let’s see an example mapping of a jOOQ record to a JavaBean. Consider the following record representing an order:

order_id  customer_id  customer_address_street  customer_address_city
--------  -----------  -----------------------  ---------------------
345       678          123 Main Street          SF

We may need to map this to a more complex object model:

// Assume getters and setters are present

public class Order {
  private int id;
  private Customer customer;
}

public class Customer {
  private int id;
  private Address address;
}

public class Address {
  private String street;
  private String city;
}

Since the source Record’s fields in this example use an underscore naming convention, we’ll need to configure ModelMapper to tokenize source property names by underscore:

modelMapper
  .getConfiguration()
  .setSourceNameTokenizer(NameTokenizers.UNDERSCORE);

With that set, mapping an order Record to an Order object is simple:

Order order =
  modelMapper.map(orderRecord, Order.class);

And we can assert that values are mapped as expected:

assertEquals(345, order.getId());
assertEquals(678, order.getCustomer().getId());
assertEquals("123 Main Street",
             order.getCustomer()
                  .getAddress()
                  .getStreet());
assertEquals("SF",
             order.getCustomer()
                  .getAddress()
                  .getCity());

Explicit Mapping

While ModelMapper will do its best to implicitly match Record values to destination properties, sometimes you may need to explicitly define mappings between properties.

Let’s map our Record’s customer_address_street to Order.customer.address.street:

PropertyMap<Record, Order> orderMap =
  new PropertyMap<Record, Order>() {
    protected void configure() {
      map(source("customer_address_street"))
        .getCustomer()
        .getAddress()
        .setStreet(null);
    }
  };

Then we can add the mapping to our ModelMapper instance for the orderRecord:

modelMapper.createTypeMap(orderRecord, Order.class)
           .addMappings(orderMap);

(see the ModelMapper manual pages for more details about property mapping)

Things to Note

ModelMapper maintains a TypeMap for each source and destination type, containing the mappings between the two types. For “generic” types such as Record this can be problematic, since the structure of a Record can vary. In order to distinguish structurally different Records that map to the same destination type, we can provide a type map name to ModelMapper.

Continuing with the example above, let’s map another order Record, this one with a different structure, to the same Order class:

order_id  order_customer_id  order_customer_address_street  order_customer_address_city
--------  -----------------  -----------------------------  ---------------------------
444       777                123 Main Street                LA

Mapping this Record to an order is simple, but we’ll need to provide a type map name to distinguish this Record-to-Order mapping from the previous, unnamed mapping:

Order order = modelMapper.map(
    longOrderRecord, Order.class, "long");

Example taken from:
http://modelmapper.org/user-manual/jooq-integration/

More Examples

When choosing ModelMapper, you’re not just choosing an API to map relational data to your domain model. ModelMapper is designed for arbitrary model transformation, which can make it a strategic choice for your stack.

Check out this marvelous Open Source gem on the ModelMapper website.