ORMs Should Update “Changed” Values, Not Just “Modified” Ones

In this article, I will establish how the SQL language and its implementations distinguish between changed values and modified values, where a changed value is a value that has been “touched”, but not necessarily modified, i.e. the value might be the same before and after the change.

Many ORMs, unfortunately, either update all of a record’s values, or only the modified ones. The first can be inefficient, and the latter can be wrong. Updating the changed values would be correct.

Note that you may have a different definition of changed and modified. For this article, let’s just assume that the above definition is as valid as it is useful.

Introduction

A very interesting discussion was triggered recently by Vlad Mihalcea who was looking for an answer to this interesting question:

What’s the overhead of updating all columns, even the ones that haven’t changed?

Apart from the question being very interesting from a performance perspective, the tweet also inspired functional aspects of a distinction between updating all columns vs. updating some columns, which I’ll summarise in this article.

What’s the Problem?

The problem is one that all ORM vendors need to solve: ORMs have a client side representation of the relational model, and that representation is cached (or “out of sync”) for a user to change and then persist again. The problem is now how to re-synchronise the client side representation with the server side representation in a consistent and correct way.

Sidenote: By ORM I understand any tool that maps from a client side representation of your database schema to the database schema itself, regardless if the product supports full-fledged JPA-style object graph persistence, or “merely” implements an “active record” pattern, such as jOOQ 3.x (I find that distinction a bit academic).

All such ORMs have a client side representation of a database record, for instance given the following table (I’m going to be using PostgreSQL syntax):

CREATE TABLE customer (
  customer_id SERIAL8     NOT NULL PRIMARY KEY,
  first_name  VARCHAR(50) NOT NULL,
  last_name   VARCHAR(50) NOT NULL
)

You’re going to have a client side representation as the following (using Java, e.g. jOOQ or JPA):

// jOOQ generated UpdatableRecord
public class CustomerRecord 
extends UpdatableRecordImpl<CustomerRecord> {

  public CustomerRecord setCustomerId(Long customerId) { ... }
  public Long getCustomerId() { ... }
  public CustomerRecord setFirstName(String firstName) { ... }
  public String getFirstName() { ... }

  ...
}

// JPA annotated entity
@Entity
public class Customer {

  @Id
  @GeneratedValue(strategy = IDENITITY)
  public long customerId;

  @Column
  public String firstName;

  ...
}

In principle, these two approaches are the same thing with the distinction that jOOQ explicitly governs all UpdatableRecord interactions through type inheritance, whereas JPA makes this dependency more implicit through annotations:

  • jOOQ – explicit behavioural dependency between entity and jOOQ logic
  • JPA – implicit behavioural dependency between entity and JPA entity manager

In principle, the distinction is just a matter of taste, a programming style: Explicit vs. declarative.

But from a practical perspective, the JPA implementation lacks an important feature when it comes to synching the state back to the database. It cannot reflect change, only modification.

How to synch the state back to the database?

Let’s assume we have a customer called John Doe:

INSERT INTO customer (first_name, last_name)
VALUES ('John', 'Doe');

And that customer now changes their names to John Smith. We have several options of sending that update to the database, through “PATCH” or “PUT” semantics – terminology used by Morgan Tocker in another tweet in that discussion:

-- PATCH
UPDATE customer SET last_name = 'Smith' WHERE id = ? 

-- PUT
UPDATE customer 
SET first_name = 'John',
    last_name = 'Smith'
WHERE customer_id = ? 

A “PATCH” operation sends only the changed values back to the server, whereas a “PUT” operation sends the entire entity back to the server.

Discussion – Semantics.

In favour of PUT

The two operations are semantically very different. If another session attempts to rename this customer to Jane Doe concurrently (and without optimistic locking being in place), then the PATCH operation might result in an inconsistent outcome (Jane Smith), whereas the PUT operation would still produce one of the expected results, depending on what write is executed first:

-- PATCH result: Jane Smith
-- PATCH 1
UPDATE customer SET last_name = 'Smith' WHERE customer_id = ? 

-- PATCH 2
UPDATE customer SET first_name = 'Jane' WHERE customer_id = ? 

-- PUT result: Jane Doe
-- PUT 1
UPDATE customer 
SET first_name = 'John',
    last_name = 'Smith'
WHERE customer_id = ? 

-- PUT 2
UPDATE customer 
SET first_name = 'Jane',
    last_name = 'Doe'
WHERE customer_id = ? 

This is one of the reasons why Hibernate, as a JPA implementation, always implements PUT semantics by default, sending all the columns at once. You can opt out of this by using the @DynamicUpdate, which will only update modified values (not “changed” values, I’ll explain this distinction later).

This makes perfect sense in such a trivial setup, but it is a short-sighted solution, when the table has many more columns. We’ll see right away why:

In favour of PATCH

One size doesn’t fit all. Sometimes, you do want concurrent updates to happen, and you do want to implement PATCH semantics, because sometimes, two concurrent updates do not work against each other. Take the following example using an enhancement of the customer table.

Business is asking us to collect some aggregate metrics for each customer. The number of clicks they made on our website, as well as the number of purchases they made:

CREATE TABLE customer (
  customer_id SERIAL8     NOT NULL PRIMARY KEY,
  first_name  VARCHAR(50) NOT NULL,
  last_name   VARCHAR(50) NOT NULL,

  clicks      BIGINT      NOT NULL DEFAULT 0,
  purchases   BIGINT      NOT NULL DEFAULT 0
)

And, of course, once you agree that the above design is a suitable one, you’ll immediately agree that here, PATCH semantics is more desirable than PUT semantics:

-- Updating clicks
UPDATE customer SET clicks = clicks+1 WHERE customer_id = ? 

-- Updating purchases
UPDATE customer SET purchases = purchases+1 WHERE customer_id = ? 

Not only do we update only an individual column, we’re doing it entirely in SQL, including the calculation. With this approach, we do not even need optimistic locking to guarantee update correctness, as we’re not using any client side cached version of the customer record, which could be out of date and would need optimistic (or worse: pessimistic) locking.

If we implemented this differently, using client side calculation of the updated clicks / purchases counters…

-- Updating clicks
UPDATE customer 
SET clicks = ? 
WHERE customer_id = ? 

-- Updating purchases
UPDATE customer 
SET purchases = ? 
WHERE customer_id = ? 

… then we’d need one of these techniques:

  • Pessimistic locking: Nope, won’t work. We could still get incorrect updates
  • Optimistic locking: Indeed, any update would need to be done on a versioned customer record, so if there are two concurrent updates, one of them will fail and could try again. This guarantees data integrity, but will probably make this functionality very painful, because a lot of click updates are probably done in a short amount of time, and they would need to be repeated until they work!
  • Client side synchronisation: Of course, we could prevent concurrency for these updates on the client side, making sure that only one concurrent process ever updates click counts (for a given customer). We could implement a click count update queue for this.

All of the above options have significant drawbacks, the easiest solution is really to just increment the counter directly in the database.

And don’t forget, if you choose a bind-variable based solution, and opt for updating ALL the columns, rather than just the changed one, your first_name / last_name updates might conflict with these counter updates as well, making things even more complicated.

Partial PUT (or compound PATCH)

In fact, from a semantics perspective, if you do want to use an ORM to update an entity, you should think about a “partial PUT” semantics, which separates the different entity elements in “sub entities”. From a relational perspective, of course, no such thing as a subentity exists. The above example should be normalised into this, and we would have much less concurrency issues:

CREATE TABLE customer (
  customer_id SERIAL8     NOT NULL PRIMARY KEY,
  first_name  VARCHAR(50) NOT NULL,
  last_name   VARCHAR(50) NOT NULL
);

CREATE TABLE customer_clicks
  customer_id BIGINT NOT NULL PRIMARY KEY REFERENCES customer,
  clicks      BIGINT NOT NULL DEFAULT 0
);

CREATE TABLE customer_purchases
  customer_id BIGINT NOT NULL PRIMARY KEY REFERENCES customer,
  purchases   BIGINT NOT NULL DEFAULT 0
);

This way, the previously mentioned PUT semantics would not create situations where individual, semantically unrelated updates (updates to names, updates to clicks) would interfere with each other. We would only need to make sure that e.g. two competing updates to clicks are correctly serialised.

Practically, we often don’t design our databases this way, either for convenience reasons, for optimised storage, for optimised querying (see also our article when normalisation and surrogate keys hurt performance).

jOOQ’s “changed” value semantics

So that “sub entity” is really just a logical thing, which can be represented either as a logically separate entity in JPA, or we can use jOOQ, which works a bit differently here. In jOOQ, we can change an UpdatableRecord only partially, and that partial change is sent to the server:

CustomerRecord customer = ctx
    .selectFrom(CUSTOMER)
    .where(CUSTOMER.CUSTOMER_ID.eq(customerId))
    .fetchOne();

customer.setFirstName("John");
customer.setLastName("Smith");

assertTrue(customer.changed(CUSTOMER.FIRST_NAME));
assertTrue(customer.changed(CUSTOMER.LAST_NAME));
assertFalse(customer.changed(CUSTOMER.CLICKS));
assertFalse(customer.changed(CUSTOMER.PURCHASES));

customer.store();

assertFalse(customer.changed(CUSTOMER.FIRST_NAME));
assertFalse(customer.changed(CUSTOMER.LAST_NAME));
assertFalse(customer.changed(CUSTOMER.CLICKS));
assertFalse(customer.changed(CUSTOMER.PURCHASES));

This will send the following statement to the server:

UPDATE customer
SET first_name = ?,
    last_name = ?
WHERE customer_id = ?

Optionally, just as with JPA, you can turn on optimistic locking on this statement. The important thing here is that the clicks and purchases columns are left untouched, because they were not changed by the client code. This is different from JPA, which either sends all the values by default, or if you specify @DynamicUpdate in Hibernate, it would send only the last_name column, because while first_name was changed it was not modified.

My definition:

  • changed: The value is “touched”, its state is “dirty” and the state needs to be synched to the database, regardless of modification.
  • modified: The value is different from its previously known value. By necessity, a modified value is always changed.

As you can see, these are different things, and it is quite hard for a JPA-based API like Hibernate to implement changed semantics because of the annotation-based declarative nature of how entities are defined. We’d need some sophisticated instrumentation to intercept all data changes even when the values have not been modified (I didn’t make those attributes public by accident).

Without this distinction, however, it is unreasonable to use @DynamicUpdate in Hibernate, as we might run into that situation we didn’t want to run into, where we get a customer called “Jane Smith” – or we use optimistic locking, in case of which there’s not much point in using @DynamicUpdate.

The database perspective

From a database perspective, it is also important to distinguish between change and modification semantics. In the answer I gave on Stack Exchange, I’ve illustrated two situations:

INSERTs and DEFAULT values

Thus far, we’ve discussed only UPDATE statements, but similar reasoning may be made for INSERT as well. These two statements are the same:

INSERT INTO t (a, b)    VALUES (?, ?);
INSERT INTO t (a, b, c) VALUES (?, ?, DEFAULT);

This one, however, is different:

INSERT INTO t (a, b, c) VALUES (?, ?, ?);

In the first case, a DEFAULT clause (e.g. timestamp generation, identity generation, trigger value generation, etc.) may apply to the column c. In the second case, the value c is provided explicitly by the client.

Languages like Java do not have any way to represent this distinction between

  • NULL (which is usually, but not always, the DEFAULT) in SQL
  • an actual DEFAULT

This can only be achieved when an ORM implements changed semantics, like jOOQ does. When you create a customer with jOOQ, then clicks and purchases will have their DEFAULT applied:

CustomerRecord c1 = ctx.newRecord(CUSTOMER);
c1.setFirstName("John");
c1.setLastName("Doe");
c1.store();

CustomerRecord c2 = ctx.newRecord(CUSTOMER);
c2.setFirstName("Jane");
c2.setLastName("Smith");
c2.setClicks(1);
c2.setPurchases(1);
c2.store();

Resulting SQL:

-- c1.store();
INSERT INTO customer (first_name, last_name)
VALUES (?, ?);

-- c2.store();
INSERT INTO customer (first_name, last_name, clicks, purchases)
VALUES (?, ?, ?, ?);

In both cases, that’s what the user tells jOOQ to do, so jOOQ will generate a query accordingly.

Back to UPDATE statements

Consider the following example using Oracle triggers:

CREATE TABLE x (a INT PRIMARY KEY, b INT, c INT, d INT);

INSERT INTO x VALUES (1, 1, 1, 1);

CREATE OR REPLACE TRIGGER t
  BEFORE UPDATE OF c, d -- Doesn't fire on UPDATE OF b!
  ON x
BEGIN
  IF updating('c') THEN
    dbms_output.put_line('Updating c');
  END IF;
  IF updating('d') THEN
    dbms_output.put_line('Updating d');
  END IF;
END;
/

SET SERVEROUTPUT ON
UPDATE x SET b = 1 WHERE a = 1;
UPDATE x SET c = 1 WHERE a = 1;
UPDATE x SET d = 1 WHERE a = 1;
UPDATE x SET b = 1, c = 1, d = 1 WHERE a = 1;

It results in the following output:

table X created.
1 rows inserted.
TRIGGER T compiled
1 rows updated.
1 rows updated.
Updating c

1 rows updated.
Updating d

1 rows updated.
Updating c
Updating d

As you can see, the trigger doesn’t fire when we update only column b, which it is not interested in. Again, this goes in the direction of distinguishing between changed and modified values, where a trigger fires only when a value is changed (but not necessarily modified).

Now, if an ORM will always update all the columns, this trigger will not work correctly. Sure, we can compare :OLD.b and :NEW.b, but that would check for modification, not change, and it might be costly to do so for large strings!

Speaking of costs…

Performance

Statement caching: Weakly in favour of PUT

While one of the reasons the Hibernate team mentioned in favour of updating all the columns is improved cursor cache performance (fewer distinct SQL statements need to be parsed by the database as there are fewer distinct update configurations), I suggest that this “premature optimisation” is negligible. If a client application runs dynamic updates (in the jOOQ sense, where changed values are updated, not just modified values), then chances that the possible SQL statements that need to be parsed will explode are slim to non-existent.

I would definitely like to see real-world benchmarks on this topic!

Batching: Weakly in favour of PUT

When you want to batch tons of update statements from JDBC, then indeed, you will need to ensure that they all have the exact same SQL string. However, this is not a good argument in favour of using PUT semantics and updating all columns.

I’m saying “not good”, because such a batched update should still only consider a subset of the columns for update, not all the columns. And that subset should be determined on aggregated changed flags, not data modification.

Index updates: In favour of PATCH (depending on the database)

Most databases optimise index updates to ignore indexes whose columns have not been changed. Oracle also doesn’t update indexes whose columns have not been modified, in case of which PUT and PATCH semantics both work the same way from an indexing perspective. Other databases may not work this way, where PATCH semantics is favourable.

But even if the optimisation is in place, the old and the new values have to be compared for equality (i.e. to see if a modification took place). You don’t want to compare millions of strings per second if there’s no need to do so! Check out Morgan Tocker’s interesting answer on Stack Exchange, from a MySQL perspective

So, why not just prevent expensive modification checks by telling the database what has changed, instead?

UNDO overhead: In favour of PATCH

Every statement has a footprint on the UNDO / REDO logs. As I’ve shown above, the statements are semantically different in many ways, so if your statement is bigger (more columns are updated), then the impact on the UNDO / REDO log is bigger as well. This can have drastic effects depending on the size of your table / columns:

Don’t forget that this can also affect backup performance!

More performance related information in this blog post:

https://jonathanlewis.wordpress.com/2007/01/02/superfluous-updates

Note: While these bits of information were mostly Oracle-specific, common sense dictates that other RDBMS will behave in similar ways.

Conclusion

With all these negative aspects to including unnecessary columns for update through an ORM compared to the almost negligible benefits, I’d say that users should move forward and completely avoid this mess. Here’s how:

  • jOOQ optimises this out of the box, if users set the changed values explicitly. Beware that when you “load” a POJO into a Record, it will set all the columns to changed, which may or may not be the desired effect!
  • Hibernate allows for @DynamicUpdate, which may work incorrectly as we have minimal “PATCH” semantics based on modified values, not on changed values. However, JPA allows for declaring more than one entity per table, which might certainly be a valid option for this kind of problem
  • Normalisation is always an option, with its own trade offs. The clicks and purchases columns could be externalised in separate tables, if this benefits the overall design.
  • More often than not, writing an UPDATE with SQL directly is the best choice. As we’ve seen in this article, the counters should be updated with expressions of the form clicks = clicks + 1, which circumvents most problems exposed in this article.

In short, as Michael Simons said:

And we all do feel very dirty when we write SELECT *, right? So we should at least be wary of updating all the columns as well.

When to Use Bind Values, and When to Use Inline Values in SQL

Users of jOOQ, PL/SQL, T-SQL are spoiled as they hardly ever need to worry about bind values. Consider the following statements:

Using jOOQ

public int countActors(String firstName, String lastName) {
    return ctx.selectCount()
              .from(ACTOR)
              .where(ACTOR.FIRST_NAME.eq(firstName))
              .and(ACTOR.LAST_NAME.eq(lastName))
              .fetchOneInto(int.class);
    );
}

The method parameters firstName and lastName will be automatically mapped to bind values in the generated SQL statement. Here’s the debug log output when running the above, where the first statement is sent to the JDBC driver and then to the database, wheas the second statement is generated for debugging purposes only:

Executing query          : 
    select count(*)
    from "SAKILA"."ACTOR"
    where (
      "SAKILA"."ACTOR"."FIRST_NAME" = ?
      and "SAKILA"."ACTOR"."LAST_NAME" = ?
    )
-> with bind values      : 
    select count(*)
    from "SAKILA"."ACTOR"
    where (
      "SAKILA"."ACTOR"."FIRST_NAME" = 'SUSAN'
      and "SAKILA"."ACTOR"."LAST_NAME" = 'DAVIS'
    )
Fetched result           : +-----+
                         : |count|
                         : +-----+
                         : |    2|
                         : +-----+

With this API design, users don’t have to worry about binding variables explicitly at all, nor about remembering bind variable indexes or the data type of a bind value. This works because the overloaded Field<T>.eq(T) method (as well as pretty much every other method that works in a similar way) internally delegates to the more generic Field<T>.eq(Field<T>) method by wrapping the argument value in an explicit bind variable expression DSL.val(T) in the jOOQ SQL expression tree.

Using PL/SQL

The same is true when you’re using PL/SQL (or any other stored procedure language of another database), for instance:

SET SERVEROUTPUT ON
DECLARE
  
  FUNCTION count_actors (
    p_first_name VARCHAR2, 
    p_last_name VARCHAR2
  ) RETURN NUMBER IS 
    v_result NUMBER(10);
  BEGIN
    SELECT count(*)
    INTO v_result
    FROM actor
    WHERE first_name = p_first_name
    AND last_name = p_last_name;
    
    RETURN v_result;
  END count_actors;
  
BEGIN
  dbms_output.put_line(count_actors('SUSAN', 'DAVIS'));
END;
/

To be sure what happened, let’s consider the execution plan:

SELECT p.*
FROM (
  SELECT *
  FROM v$sql
  WHERE upper(sql_text) LIKE 'SELECT COUNT(*) FROM ACTOR%'
  ORDER BY last_active_time DESC
  FETCH NEXT 1 ROW ONLY
) s 
CROSS APPLY TABLE(dbms_xplan.display_cursor(sql_id => s.sql_id)) p;

As you can see below, bind variables were generated for the SQL query that was embedded in the PL/SQL function:

SQL_ID  9dgammbskx5tx, child number 0
-------------------------------------
SELECT COUNT(*) FROM ACTOR WHERE FIRST_NAME = :B2 AND LAST_NAME = :B1
 
Plan hash value: 3384208144
 
----------------------------------------------------
| Id  | Operation         | Name           | Rows  |
----------------------------------------------------
|   0 | SELECT STATEMENT  |                |       |
|   1 |  SORT AGGREGATE   |                |     1 |
|*  2 |   INDEX RANGE SCAN| IDX_ACTOR_NAME |     1 |
----------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("LAST_NAME"=:B1 AND "FIRST_NAME"=:B2)

Why is it important? 1: SQL injection

I’ve noticed this time and again when talking to Java developers: Many developers are aware of the risk of SQL injection when not using bind variables. This can happen when we would write dynamic SQL like this, e.g. using JDBC or jOOQ’s plain SQL API:

String sql = 
  "SELECT count(*) "
+ "FROM actor "
+ "WHERE 1 = 1 "
+ (firstName != null ? "AND first_name = " + firstName + " " : "")
+ (lastName != null ? "AND last_name = " + lastName + " " : "");

// JDBC
try (Statement s = connection.createStatement();
     ResultSet rs = s.executeQuery(sql)) {

    ...
}

// jOOQ
Result<?> result = ctx.fetch(sql);

Seriously. Don’t do this. Ever! Always use bind values for user input. There’s hardly any reason at all why you should inline the values. I mean, of course, you could if you always remember to manually escape all strings, e.g.:

public static String escape(String string) {
    // TODO: Handle MySQL's non-standard backslash escaping, too
    return string == null ? null : string.replace("'", "''");
}

And then:

String sql = 
  "SELECT count(*) "
+ "FROM actor "
+ "WHERE 1 = 1 "
+ (firstName != null ? "AND first_name = " + escape(firstName) + " " : "")
+ (lastName != null ? "AND last_name = " + escape(lastName) + " " : "");

Notice that there’s still a vulnerability risk in MySQL, which doesn’t necessarily conform to standard SQL string literal escaping. A very unfortunate MySQL “feature”, which is handled correctly by jOOQ:
https://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sqlmode_no_backslash_escapes

But why run the risk? It’s so much easier with bind values.

Why is it important? 2: Performance!

So, most Java developers, luckily, are aware of SQL injection vulnerabilities and mostly get this right. But there’s another thing that few Java developers are aware of, unfortunately. And that’s the performance aspect of using bind values. Let’s assume we’re not using bind values for the above query. Check this out:

SELECT count(*)
FROM actor
WHERE first_name = 'SUSAN'
AND last_name = 'DAVIS';

SELECT count(*)
FROM actor
WHERE first_name = 'NICK'
AND last_name = 'WAHLBERG';

And now, let’s find execution plans:

SELECT p.*
FROM (
  SELECT *
  FROM v$sql
  WHERE upper(sql_text) LIKE 'SELECT COUNT(*) FROM ACTOR%'
  ORDER BY last_active_time DESC
  FETCH NEXT 2 ROWS ONLY
) s 
CROSS APPLY TABLE(dbms_xplan.display_cursor(sql_id => s.sql_id)) p;

Result:

SQL_ID  12r8afykqkzbd, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE first_name = 'NICK' AND last_name = 'WAHLBERG'
 
Plan hash value: 3384208144
 
----------------------------------------------------
| Id  | Operation         | Name           | Rows  |
----------------------------------------------------
|   0 | SELECT STATEMENT  |                |       |
|   1 |  SORT AGGREGATE   |                |     1 |
|*  2 |   INDEX RANGE SCAN| IDX_ACTOR_NAME |     1 |
----------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("LAST_NAME"='WAHLBERG' AND "FIRST_NAME"='NICK')
 
SQL_ID  gfppqr9gpjqws, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE first_name = 'SUSAN' AND last_name = 'DAVIS'
 
Plan hash value: 3384208144
 
----------------------------------------------------
| Id  | Operation         | Name           | Rows  |
----------------------------------------------------
|   0 | SELECT STATEMENT  |                |       |
|   1 |  SORT AGGREGATE   |                |     1 |
|*  2 |   INDEX RANGE SCAN| IDX_ACTOR_NAME |     1 |
----------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("LAST_NAME"='DAVIS' AND "FIRST_NAME"='SUSAN')

As you can see, the two almost identical queries produced two times the exact same execution plan (same plan hash value), but they are stored under two distinct SQL_ID values. Yes, they’re distinct SQL statements!

Why?

Most sophisticated databases (including Oracle, SQL Server, DB2 and others) implement an execution plan cache. The reason for this is simple. Calculating an execution plan with cost based optimisation is expensive. Not for these trivial statements, but imagine you have dozens of JOINs, semi joins, unions, and what not. There are thousands of candidate execution plans, and the database needs to find the best one for you. That can be tons of work and we don’t want to let the database do that work every time.

So, the database will run a soft-parse (Oracle speak) to identify a SQL query and translate its SQL string to a SQL_ID. It will then check if there’s already a suitable plan available for that particular SQL_ID, and if so, it will avoid the so-called hard-parse (Oracle speak) to calculate a new plan.

Let me repeat this one more time: Making best use of this plan cache (Oracle speak: Cursor cache) is extremely important. You can severely overload a system up to the point of bringing it down, if you’re not using bind variables.

There’s a workaround to use CURSOR_SHARING=FORCE, which will transform inline values to bind values, but I’m not even going to explain how it works, because most Oracle experts advise you not to use that feature due to the significant drawbacks it will bring.

What if I want to inline values?

As we’ve seen above, when using jOOQ or PL/SQL, the above problems are really non-discussions, because it is quite unlikely that you run into a situation of accidentally inlining your bind values:

  • In jOOQ, you’d have to use plain SQL
  • In PL/SQL, you’d have to use dynamic SQL using DBMS_SQL

Anyway, in rare cases, users want to inline their bind variables for them to appear directly in the resulting SQL statement. This is never the case for user input, for ID values, or for ordinary search values. But it can be the case when you query for discriminators or “enum” values (enforced through a CHECK constraint, for instance), or when you run a report only once a year (plan is never available from the cache).

In these cases, it can be of advantage to not use bind values specifically to prevent the database from re-using a cached execution plan, because you know that the cached plan, which works well for 90% of the queries, won’t work well for this particular bind value (or in the case of the once-per-year report, you might get a slightly better plan by giving the database more information).

An example, let’s insert 1×1 and 99999×0 into a table:

CREATE TABLE skewed (
  v NUMBER(1)
);

INSERT INTO skewed
SELECT decode(level, 1, 1, 0)
FROM dual
CONNECT BY level <= 100000;

CREATE INDEX skewed_i ON skewed(v);

EXEC dbms_stats.gather_table_stats('TEST', 'SKEWED');

Now, clearly we see that when querying the table SKEWED for values V = 0, the index is useless, whereas it is very useful for values V = 1. Let’s run this statement here:

SET SERVEROUTPUT ON
DECLARE
  v_bind NUMBER(1);
  v_result NUMBER(10);
BEGIN
  v_bind := 0;
  
  SELECT count(*)
  INTO v_result
  FROM skewed
  WHERE v = v_bind;
  
  FOR rec IN (
    SELECT * FROM TABLE(dbms_xplan.display_cursor)
  ) LOOP
    dbms_output.put_line(rec.plan_table_output);
  END LOOP;

  v_bind := 1;
  
  SELECT count(*)
  INTO v_result
  FROM skewed
  WHERE v = v_bind;
  
  FOR rec IN (
    SELECT * FROM TABLE(dbms_xplan.display_cursor)
  ) LOOP
    dbms_output.put_line(rec.plan_table_output);
  END LOOP;
  
  SELECT count(*)
  INTO v_result
  FROM skewed
  WHERE v = 1;
  
  FOR rec IN (
    SELECT * FROM TABLE(dbms_xplan.display_cursor)
  ) LOOP
    dbms_output.put_line(rec.plan_table_output);
  END LOOP;
END;

The above block runs the exact same statement three times:

  1. With a bind variable of 0 (can’t really use the index)
  2. With a bind variable of 1 (should be using the index)
  3. With an inline value of 1 (should be using the index)

Here’s the result:

SQL_ID  1q0qjm8za06w3, child number 0
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = :B1
 
Plan hash value: 4055318479
 
--------------------------------------------------
| Id  | Operation             | Name     | Rows  |
--------------------------------------------------
|   0 | SELECT STATEMENT      |          |       |
|   1 |  SORT AGGREGATE       |          |     1 |
|*  2 |   INDEX FAST FULL SCAN| SKEWED_I | 99999 |
--------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("V"=:B1)
 
SQL_ID  1q0qjm8za06w3, child number 0
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = :B1
 
Plan hash value: 4055318479
 
--------------------------------------------------
| Id  | Operation             | Name     | Rows  |
--------------------------------------------------
|   0 | SELECT STATEMENT      |          |       |
|   1 |  SORT AGGREGATE       |          |     1 |
|*  2 |   INDEX FAST FULL SCAN| SKEWED_I | 99999 |
--------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("V"=:B1)
 
SQL_ID  bdpjxzqpg2416, child number 0
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = 1
 
Plan hash value: 276090370
 
----------------------------------------------
| Id  | Operation         | Name     | Rows  |
----------------------------------------------
|   0 | SELECT STATEMENT  |          |       |
|   1 |  SORT AGGREGATE   |          |     1 |
|*  2 |   INDEX RANGE SCAN| SKEWED_I |     1 |
----------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("V"=1)

Observe the estimated numbers of rows above (in red):

  1. In the first case, the estimate is correct. With V = 0, we’ll get 99999 rows, so we might just as well scan the entire index to calculate the count value
  2. In the second case, we’ve inherited the cached execution plan from the first run, including the estimate of 99999 rows, which is clearly wrong in this case. The database should’ve estimated 1 row here
  3. In the third case, the estimate is again correct (note: different SQL_ID, and we get an optimal plan

We’re just out of luck. If we had inversed the order of queries, we would have gotten the right estimate for V = 1 but a wrong estimate for V = 0

Adaptive Cursor Sharing

Oracle knows some features to remedy the above problems. Oracle 11g introduced adaptive cursor sharing, which means that if we re-execute the above statements, the database will have already figured out that in this particular case, the plans should depend on the actual bind variable, because in hindsight, the second plan was wrong for V = 1

On a second execution of the previous block, we’ll see:

SQL_ID  1q0qjm8za06w3, child number 1
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = :B1
 
Plan hash value: 4055318479
 
--------------------------------------------------
| Id  | Operation             | Name     | Rows  |
--------------------------------------------------
|   0 | SELECT STATEMENT      |          |       |
|   1 |  SORT AGGREGATE       |          |     1 |
|*  2 |   INDEX FAST FULL SCAN| SKEWED_I | 99999 |
--------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("V"=:B1)
 
SQL_ID  1q0qjm8za06w3, child number 2
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = :B1
 
Plan hash value: 276090370
 
----------------------------------------------
| Id  | Operation         | Name     | Rows  |
----------------------------------------------
|   0 | SELECT STATEMENT  |          |       |
|   1 |  SORT AGGREGATE   |          |     1 |
|*  2 |   INDEX RANGE SCAN| SKEWED_I |     1 |
----------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("V"=:B1)
 
SQL_ID  bdpjxzqpg2416, child number 0
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = 1
 
Plan hash value: 276090370
 
----------------------------------------------
| Id  | Operation         | Name     | Rows  |
----------------------------------------------
|   0 | SELECT STATEMENT  |          |       |
|   1 |  SORT AGGREGATE   |          |     1 |
|*  2 |   INDEX RANGE SCAN| SKEWED_I |     1 |
----------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("V"=1)

As you can see above, the first two executions are still using the same plan, but they now have a new “child number” value, indicating that for any given SQL_ID, there are now different candidate plan hash values, depending on the bind variable input that we’re getting.

This often works well with the following caveats:

  • It only works after the plan was wrong at least once. This translates to at least one poor user who suffers from a slow query
  • It only works if the plans remain in the cache (which isn’t the case for rarely executed queries). Once the plans are purged from the cache, we have to run the query again at least twice for the alternative “child numbers” to appear
  • It stops working well if your query is much more complex, i.e. when complicated correlations exist between the bind variables that should each produce different plans

So, if in doubt, in this particular case, it’s not a bad idea to simply use an inline value / constant literal in the query directly, to help the optimiser make the right choice.

Another interesting article about this topic is here, where the optimiser will always be fooled by bind variables, and where you should always use inline values: When querying discriminators from views.

Adaptive Execution Plans

Oracle 12c took adaptive cursor sharing one step further and now supports adaptive execution plans, meaning that certain execution plans are known to be “shaky” in advance and the optimiser already pre-calculated a fallback plan that applies if the estimates are wrong for a given execution. In that case, the plan can be changed “in flight” and an alternative plan, which is better for a particular execution, is applied.

This feature is still not too stable in Oracle 12cR1, which is why some Oracle experts generally recommend turning it off.

Conclusion

You should always use bind values by default. In 99% of all cases, they’re the right choice for two reasons:

  • SQL injection prevention (obvious)
  • Statement caching optimisation (less obvious)

The latter reason is one that is not really well known among developers, because it’s not a problem that appears on development environments. It’s a production-only problem that happens under heavy load. Yet, you should be aware of this problem, and always remember to avoid generating too many distinct SQL strings (see also this related article about IN lists)

In rare cases, it is better to use inline values / literals, as this will help the optimiser make a much better choice in a predictable manner. These cases include:

  • Querying skewed data (unless adaptive features can be expected to kick in)
  • Querying discriminators (in this case, it’s always advisable to use inline values)

Using languages like PL/SQL, T-SQL, pgplsql, or APIs like jOOQ definitely helps you get this right, because you don’t have to think about this problem anymore. You’ll get it right automatically.

Side-note: Hibernate’s Criteria Query

Like jOOQ, Hibernate supports a type safe DSL for constructing JPQL queries, which to some extent cover basic SQL functionality when querying entities. Hibernate historically chose quite interesting defaults (as of version 5.2.10):

  • String values are always transformed to bind values, regardless if you’re using implicit values, explicit parameters, or explicit literals. The goal of this is to prevent SQL injection (because currently, Hibernate doesn’t auto-escape inline string literals)
  • Implicit integer values are always inlined
  • Explicit parameters are always sent as parameters
  • Explicit literals are usually sent as literals (unless they’re strings)

The above implementation is unfortunate as we’ve seen before for these reasons:

  • Bind values should always be the default, especially when using ID values, which are integers, and as such, often inlined in the current implementation. Luckily, this has been recognised as a bug and will be fixed, soon: https://hibernate.atlassian.net/browse/HHH-9576
  • Literals should be sent to the server as literals, because when users need literals (e.g. in the above edge cases), they don’t want the API to override this behaviour through bind values. This might also be solved soon: https://hibernate.atlassian.net/browse/HHH-11778

I’m currently working with the team to remedy these problems, such that the criteria API won’t inhibit users from a performance perspective:

Applying Queueing Theory to Dynamic Connection Pool Sizing with FlexyPool

I’m very happy to have another interesting blog post by Vlad Mihalcea on the jOOQ blog, this time about his Open Source library flexypool. Read his previous jOOQ Tuesdays post on Hibernate here.

Vlad is a Hibernate developer advocate and he’s the author of the popular book High Performance Java Persistence, and he knows 1-2 things about connection pooling.

vladmihalcea

Introduction

Back in 2014, I was working as a software architect, and our team was building a real-estate platform which was composed of multiple nodes, as depicted in the following diagram:

databaseintegrationpoint

This is a classic enterprise architecture layout. The database is replicated to provide better throughout and availability in case of node failures. There are front-end nodes that deliver the website content. There are also many back-end nodes as well, like email schedulers or data import batch processors.

All these nodes require database connectivity, either to a Master node, for read-write transactions or to the Slave nodes, for read-only transactions.

Because acquiring database connections is an expensive process, each system node uses its own connection pool. By reusing physical database connections, the connection acquisition is very fast, therefore reducing the overall transaction response time.

Not only that a connection pool can reduce transaction response time, but it can level up traffic spikes as well. Without a connection pool, during a traffic spike, a front-end nodes might acquire all database connections, leaving the back-end processors with no database connectivity.

The connection pool, having a maximum number of database connections, allows the connections to queue whenever a traffic spike is happening. Therefore, during a traffic spike, the transaction response time will increase due to the queuing mechanism, but this is way better than taking down the whole system.

For these two reasons, the connection pool is a very good choice in many enterprise systems.

Based on the underlying hardware resources, a relational database can only offer a limited number of connections. For this reason, we must be very careful when choosing the pool size for each particular system node.

Connection pool sizing

I was the lucky person to get the task of figuring out how many connections should we allocate for each system node in our real-estate platform. Since I graduated Electronics and Telecommunications, I remembered that we learned about a similar problem when having to provision telecommunications networks. Agner Krarup Erlang invented Queuing theory for solving this problem, and I was curious if we could also find the right pool size by applying Erlang queuing models.

I was not the only one trying to apply the Queuing theory principles to software systems. Percona has a very interesting study: Forecasting MySQL Scalability with the actual service time in a system that is affected by a myriad of variables.

In the end, I realized that the best way to tackle this problem is to constant measuring and adjustments. For this reason, I needed a tool to capture database connection metrics, as well as a way to adjust a given connection pool while the enterprise system is running.

And, that’s how FlexyPool was born.

Basically, FlexyPool is a DataSource Proxy that stands in front of the actual JDBC DataSource or other proxies (e.g. statement logging).

datasourceproxyarchitecture

FlexyPool supports a great variety of stand-alone connection pools:

And it collects the following metrics:

  • concurrent connections histogram
  • concurrent connection requests histogram
  • data source connection acquiring time histogram
  • connection lease time histogram
  • maximum pool size histogram
  • total connection acquiring time histogram
  • overflow pool size histogram
  • retries attempts histogram

For instance, the concurrent connection count metric gives you an insight into how many connections are required by a certain application under a given traffic load:

concurrentconnectioncount

The connection acquisition metric tells you how much time it takes to obtain a database connection from the pool:

connectionacquire

The connection lease time allows you to spot long-running transactions, which are undesirable in high-performance OLTP applications:

connectionlease

For the stand-alone connection pools, FlexyPool can increment the pool size beyond the maximum capacity, as it offers an overflow buffer. The benefit of this overflow buffer is that it allows you to increase the pool size only when the incoming traffic causes a certain connection acquisition timeout.

Although FlexyPool can also monitor Java EE connection pools, it cannot increase the pool size in Java EE environments since the DataSource is an application server managed resource.

Conclusion

Because enterprise systems evolve, so does the underlying data access patterns. For this reasons, monitoring the underlying database connection usage is a very important metric, which needs to be monitored on a regular basis. FlexyPool builds on top of CodaHale and Dropwizard Metrics, so you can easily integrate it with well-known Application Performance Monitoring tools, such as Graphite or Grafana.

FlexyPool is open-source, and it uses an Apache license 2.0. You can find it the project repository on GitHub, and all the released dependencies are available on Maven Central, so it’s very easy to integrate it in your own project.

FkexyPool is powering many enterprise systems, like Etuovi, Mitch&Mates, and ScentBird. If you decide to use it in your current enterprise system, and you are willing to provide a testimonial, you can win a free copy of my High-Performance Java Persistence book.

jOOQ Tuesdays: Thorben Janssen Shares his Hibernate Performance Secrets

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

thorben-janssen

I’m very excited to feature today Thorben Janssen who has spent most of his professional life with Hibernate.

Thorben, with your blog and training, you are one of the few daring “annotatioficionados” as we like to call them, who risks diving deep into JPA’s more sophisticated annotations – like @SqlResultSetMapping. What is your experience with JPA’s advanced, declarative programming style?

From my point of view, the declarative style of JPA is great and a huge problem at the same time.

If you know what you’re doing, you just add an annotation, set a few properties and your JPA implementation takes care of the rest. That makes it very easy to use complex features and avoids a lot of boilerplate code.

But it can also become a huge issue, when someone is not that familiar with JPA and just copies a few annotations from stack overflow and hopes that it works.

It will work in most of the cases. JPA and Hibernate are highly optimized and handle suboptimal code and annotations quite well. At least as long as it is tested with one user on a local machine. But that changes quickly when the code gets deployed to production and several hundred or thousand users use it in parallel. These issues get then often posted on stack overflow or other forums together with a complaint about the bad performance of Hibernate…

Your training goes far beyond these rather esoteric use-cases and focuses on JPA / Hibernate performance. What are three things every ORM user should know about JPA / SQL performance?

Only three things? I could talk about a lot more things related to JPA and Hibernate performance.

The by far most important one is to remember that your ORM framework is using SQL to store your data in a relational database. That seems to be pretty obvious, but you can avoid the most common performance issues by analyzing and optimizing the executed SQL statements. One example for that is the popular n+1 select issue which you can easily find and fix as I show in my free, 3-part video course.

Another important thing is that no framework or specification provides a good solution for every problem. JPA and Hibernate make it very easy to insert and update data into a relational database. And they provide a set of advanced features for performance optimizations, like caching or the ordering of statements to improve the efficiency of JDBC batches.

But Hibernate and JPA are not a good fit for applications that have to perform a lot of very complex queries for reporting or data mining use cases. The feature set of JPQL is too limited for these use cases. You can, of course, use native queries to execute plain SQL, but you should have a look at other frameworks if you need a lot of these queries.

So, always make sure that your preferred framework is a good fit for your project.

The third thing you should keep in mind is that you should prefer lazy fetching for the relationships between your entities. This prevents Hibernate from executing additional SQL queries to initialize the relationships to other entities when it gets an entity from the database. Most use cases don’t need the related entities, and the additional queries slow down the application. And if one of your use cases uses the relationships, you can use FETCH JOIN statements or entity graphs to initialize them with the initial query.

This approach avoids the overhead of unnecessary SQL queries for most of your use cases and allows you to initialize the relationships if you need them.

These are the 3 most important things you should keep in mind, if you want to avoid performance problems with Hibernate. If you want to dive deeper into this topic, have a look at my Hibernate Performance Tuning Online Training. The next one starts on 23th July.

What made you focus your training mostly on Hibernate, rather than also on EclipseLink / OpenJPA, or just plain SQL / jOOQ? Do you have plans to extend to those topics?

To be honest, that decision was quite easy for me. I’m working with Hibernate for about 15 years now and used it in a lot of different projects with very different requirements. That gives me the experience and knowledge about the framework, which you need if you want to optimize its performance. I also tried EclipseLink but not to the same extent as Hibernate.

And I also asked my readers which JPA implementation they use, and most of them told me that they either use plain JPA or Hibernate. That made it pretty easy to focus on Hibernate.

I might integrate jOOQ into one of my future trainings. Because as I said before, Hibernate and JPA are a good solution if you want to create or update data or if your queries are not too complex. As soon as your queries get complex, you have to use native queries with plain SQL. In these cases, jOOQ can provide some nice benefits.

What’s the advantage of your online training over a more classic training format, where people meet physically – both for you and for your participants?

The good thing about a classroom training is that you can discuss your questions with other students and the instructor. But it also requires you to be in a certain place at a certain time which creates additional costs, requires you to get out of your current projects and keeps you away from home.

With the Hibernate Performance Tuning Online Training, I want to provide a similar experience to a classroom training in which you study with other students and ask your questions but without having to travel somewhere. You can watch my training videos and do the exercises from your office or home and meet with me, and other students in the forum or group coaching calls to discuss your questions.

So you get the best of both worlds without declaring any travel expenses 😉

Your blog also includes a weekly digest of all things happening in the Java ecosystem called Java Weekly. What are the biggest insights into our ecosystem that you’ve gotten out of this work, yourself?

The Java ecosystem is always changing and improving, and you need to learn constantly if you want to stay up to date. One way to do that is to read good blog posts. And there are A LOT of great, small blogs out there written by very experienced Java developers who like to share their knowledge. You just have to find them. That’s probably the biggest insight I got.

I read a lot about Java and Java EE each week (that’s probably the only advantage of a 1.5-hour commute with public transportation) and present the most interesting ones every Monday in a new issue of Java Weekly.

It is all about the JDBC Basics

We’re very happy to announce a guest post by Marco Behler, who has been blogging about jOOQ in the past.

img31Marco started out in programming (reverse-engineering, actually) and now mainly programmes on the JVM in his day-to-day work. He also always had a sweet tooth for strategy and marketing. Marco Behler GmbH is the result of that hybrid role.

It is all about the JDBC Basics

It is one of the days.

You are reading the Spring documentation’s @Transactional section and still don’t understand the difference between logical and physical transaction scopes. Simultaneously your app throws an
LazyInitializationException and you have no idea why. To top it off you see spontaneous database deadlocks in production and you suspect your connection pool is leaking connections..somehow.

Know what most likely would have helped instead of banging your head against the wall? Spending a couple (literally) of hours on learning the JDBC basics. Let’s find out why:

What are the JDBC basics?

The basics are opening up/closing database connections and then working with transactions. Also understanding how deadlocks, pessimistic and optimistic locking work on a plain JDBC level. A bit of isolation levels and savepoints and then directly on to connection pools and jdbc driver logging. That’s it. Seriously.

Why are the basics so important?

Everything you will encounter in frameworks like Spring, Hibernate, jOOQ etc. builds up on these basics. For example, there are a gazillion topics on the internet regarding Hibernate’s LazyInitializationException and I was scared of that particular exception myself many years ago. But what else would you expect trying to query the database without having a connection to the database open (which is basically all that this exeception is) ?

The same with Spring’s “transaction framework”. There is so much content, or shall we say (F)ear/(U)ncertainty/(D)oubt, out there on how to open up transactions with spring, be it programmatically, with annotations or xml. But what if you knew that under the hood, there is only one way (and actually one line of code) to open up transactions in the JDBC world?

Let me not even get started on the various (mis)configurations of connection pools you see in production in the wild. Or the unawareness of JDBC (driver) logging, which usually leads to debugging in the wild. All basics, which you can master in a couple of hours and which will help you for a lifetime!

Why do people not just learn the basics?

In every middle-sized project there is a ton of technologies involved and there usually is no clear-cut path on how to learn all of them or how they all work together. It simply takes a lot of time and effort to dig through everything.

There’s JPA sessions and JDBC connections and then Spring somehow provides those transactional proxies in 5 different ways and then some other colleague just put jOOQ into the mix, but then somehow my session doesn’t flush and my objects don’t get persisted and the HibernateTransactionManager is not working as expected.

With all of this, I would also hope for my database transactions just to commit – god forbid what happens on rollback 🙂

But in the end, everything technology mentioned is just a layer on top of JDBC. If you understand transactions or deadlocks or savepoints on the basic level, then Spring or Hibernate or jOOQ will not throw you off.

So what do you recommend ?

If you want to get miles ahead in your day-to-day database programming, you have to start with the basics. Step-by-Step. And then you will see most of your problems automatically evaporate.

Out of my extensive database programming experience, I created an ebook with a ton of ready-to-run exercises, which will take you from Java database novice to expert. At your own pace. You can literally copy the source code of every chapter into your IDE, run it and (hopefully) learn from it. It covers plain JDBC, Spring, Hibernate, jOOQ (soon) and also distributed transactions.

You can read the whole book for free online here, and I would love to get your feedback! I would really like to let the community feedback flow back into future editions of the book. In addition, If you like what you see and the exercises help you, you can also show your support by getting a paid digital version (pdf, epub, mobi).

In any case…

…learn your JDBC basics – and you will profit from them for a lifetime!

Type Safe Queries for JPA’s Native Query API

When you’re using JPA – sometimes – JPQL won’t do the trick and you’ll have to resort to native SQL. From the very beginning, ORMs like Hibernate kept an open “backdoor” for these cases and offered a similar API to Spring’s JdbcTemplate, to Apache DbUtils, or to jOOQ for plain SQL. This is useful as you can continue using your ORM as your single point of entry for database interaction.

However, writing complex, dynamic SQL using string concatenation is tedious and error-prone, and an open door for SQL injection vulnerabilities. Using a type safe API like jOOQ would be very useful, but you may find it hard to maintain two different connection, transaction, session models within the same application just for 10-15 native queries.

But the truth is:

You an use jOOQ for your JPA native queries!

That’s true! There are several ways to achieve this.

Fetching tuples (i.e. Object[])

The simplest way will not make use of any of JPA’s advanced features and simply fetch tuples in JPA’s native Object[] form for you. Assuming this simple utility method:

public static List<Object[]> nativeQuery(
    EntityManager em, 
    org.jooq.Query query
) {

    // Extract the SQL statement from the jOOQ query:
    Query result = em.createNativeQuery(query.getSQL());

    // Extract the bind values from the jOOQ query:
    List<Object> values = query.getBindValues();
    for (int i = 0; i < values.size(); i++) {
        result.setParameter(i + 1, values.get(i));
    }

    return result.getResultList();
}

Using the API

This is all you need to bridge the two APIs in their simplest form to run “complex” queries via an EntityManager:

List<Object[]> books =
nativeQuery(em, DSL.using(configuration)
    .select(
        AUTHOR.FIRST_NAME, 
        AUTHOR.LAST_NAME, 
        BOOK.TITLE
    )
    .from(AUTHOR)
    .join(BOOK)
        .on(AUTHOR.ID.eq(BOOK.AUTHOR_ID))
    .orderBy(BOOK.ID));

books.forEach((Object[] book) -> 
    System.out.println(book[0] + " " + 
                       book[1] + " wrote " + 
                       book[2]));

Agreed, not a lot of type safety in the results – as we’re only getting an Object[]. We’re looking forward to a future Java that supports tuple (or even record) types like Scala or Ceylon.

So a better solution might be the following:

Fetching entities

Let’s assume you have the following, very simple entities:

@Entity
@Table(name = "book")
public class Book {

    @Id
    public int id;

    @Column(name = "title")
    public String title;

    @ManyToOne
    public Author author;
}

@Entity
@Table(name = "author")
public class Author {

    @Id
    public int id;

    @Column(name = "first_name")
    public String firstName;

    @Column(name = "last_name")
    public String lastName;

    @OneToMany(mappedBy = "author")
    public Set<Book> books;
}

And let’s assume, we’ll add an additional utility method that also passes a Class reference to the EntityManager:

public static <E> List<E> nativeQuery(
    EntityManager em, 
    org.jooq.Query query,
    Class<E> type
) {

    // Extract the SQL statement from the jOOQ query:
    Query result = em.createNativeQuery(
        query.getSQL(), type);

    // Extract the bind values from the jOOQ query:
    List<Object> values = query.getBindValues();
    for (int i = 0; i < values.size(); i++) {
        result.setParameter(i + 1, values.get(i));
    }

    // There's an unsafe cast here, but we can be sure
    // that we'll get the right type from JPA
    return result.getResultList();
}

Using the API

This is now rather slick, just put your jOOQ query into that API and get JPA entities back from it – the best of both worlds, as you can easily add/remove nested collections from the fetched entities as if you had fetched them via JPQL:

List<Author> authors =
nativeQuery(em,
    DSL.using(configuration)
       .select()
       .from(AUTHOR)
       .orderBy(AUTHOR.ID)
, Author.class); // This is our entity class here

authors.forEach(author -> {
    System.out.println(author.firstName + " " + 
                       author.lastName + " wrote");
    
    books.forEach(book -> {
        System.out.println("  " + book.title);

        // Manipulate the entities here. Your
        // changes will be persistent!
    });
});

Fetching EntityResults

If you’re extra-daring and have a strange affection for annotations, or you just want to crack a joke for your coworkers just before you leave on vacation, you can also resort to using JPA’s javax.persistence.SqlResultSetMapping. Imagine the following mapping declaration:

@SqlResultSetMapping(
    name = "bookmapping",
    entities = {
        @EntityResult(
            entityClass = Book.class,
            fields = {
                @FieldResult(name = "id", column = "b_id"),
                @FieldResult(name = "title", column = "b_title"),
                @FieldResult(name = "author", column = "b_author_id")
            }
        ),
        @EntityResult(
            entityClass = Author.class,
            fields = {
                @FieldResult(name = "id", column = "a_id"),
                @FieldResult(name = "firstName", column = "a_first_name"),
                @FieldResult(name = "lastName", column = "a_last_name")
            }
        )
    }
)

Essentially, the above declaration maps database columns (@SqlResultSetMapping -> entities -> @EntityResult -> fields -> @FieldResult -> column) onto entities and their corresponding attributes. With this powerful technique, you can generate entity results from any sort of SQL query result.

Again, we’ll be creating a small little utility method:

public static <E> List<E> nativeQuery(
    EntityManager em, 
    org.jooq.Query query,
    String resultSetMapping
) {

    // Extract the SQL statement from the jOOQ query:
    Query result = em.createNativeQuery(
        query.getSQL(), resultSetMapping);

    // Extract the bind values from the jOOQ query:
    List<Object> values = query.getBindValues();
    for (int i = 0; i < values.size(); i++) {
        result.setParameter(i + 1, values.get(i));
    }

    // This implicit cast is a lie, but let's risk it
    return result.getResultList();
}

Note that the above API makes use of an anti-pattern, which is OK in this case, because JPA is not a type safe API in the first place.

Using the API

Now, again, you can pass your type safe jOOQ query to the EntityManager via the above API, passing the name of the SqlResultSetMapping along like so:

List<Object[]> result =
nativeQuery(em,
    DSL.using(configuration
       .select(
           AUTHOR.ID.as("a_id"),
           AUTHOR.FIRST_NAME.as("a_first_name"),
           AUTHOR.LAST_NAME.as("a_last_name"),
           BOOK.ID.as("b_id"),
           BOOK.AUTHOR_ID.as("b_author_id"),
           BOOK.TITLE.as("b_title")
       )
       .from(AUTHOR)
       .join(BOOK).on(BOOK.AUTHOR_ID.eq(AUTHOR.ID))
       .orderBy(BOOK.ID)), 
    "bookmapping" // The name of the SqlResultSetMapping
);

result.forEach((Object[] entities) -> {
    JPAAuthor author = (JPAAuthor) entities[1];
    JPABook book = (JPABook) entities[0];

    System.out.println(author.firstName + " " + 
                       author.lastName + " wrote " + 
                       book.title);
});

The result in this case is again an Object[], but this time, the Object[] doesn’t represent a tuple with individual columns, but it represents the entities as declared by the SqlResultSetMapping annotation.

This approach is intriguing and probably has its use when you need to map arbitrary results from queries, but still want managed entities. We can only recommend Thorben Janssen‘s interesting blog series about these advanced JPA features, if you want to know more:

Conclusion

Choosing between an ORM and SQL (or between Hibernate and jOOQ, in particular) isn’t always easy.

  • ORMs shine when it comes to applying object graph persistence, i.e. when you have a lot of complex CRUD, involving complex locking and transaction strategies.
  • SQL shines when it comes to running bulk SQL, both for read and write operations, when running analytics, reporting.

When you’re “lucky” (as in – the job is easy), your application is only on one side of the fence, and you can make a choice between ORM and SQL. When you’re “lucky” (as in – ooooh, this is an interesting problem), you will have to use both. (See also Mike Hadlow’s interesting article on the subject)

The message here is: You can! Using JPA’s native query API, you can run complex queries leveraging the full power of your RDBMS, and still map results to JPA entities. You’re not restricted to using JPQL.

Side-note

While we’ve been critical with some aspects of JPA in the past (read How JPA 2.1 has become the new EJB 2.0 for details), our criticism has been mainly focused on JPA’s (ab-)use of annotations. When you’re using a type safe API like jOOQ, you can provide the compiler with all the required type information easily to construct results. We’re convinced that a future version of JPA will engage more heavily in using Java’s type system, allowing a more fluent integration of SQL, JPQL, and entity persistence.

jOOQ Tuesdays: Vlad Mihalcea Gives Deep Insight into SQL and Hibernate

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

vlad_mihalcea

We have the pleasure of talking to Vlad Mihalcea in this third edition who will be telling us about the skills developers need to acquire when working with Java, SQL, and Hibernate.

Hi Vlad – You’re blog explodes with excellent posts about Hibernate. It looks like you love digging deep into the most popular persistence API in the market, right?

I really mean when saying that “teaching is my way of learning” and to master a certain technology, you have to go beyond the reference documentation. Hibernate has been around for 10 years now and there’s a plethora of projects built on top of it. The Hibernate Master Class focuses on some proven ORM design patterns, like concurrency control, caching and batching.

You’ve recently told me about your realisation of the lack of SQL insight in our industry. How did that come to be?

The Object-Relational mismatch is only the tip of the iceberg, when it comes to accessing data. The biggest problem we face in enterprise systems, is the Enterprise-Database developer mismatch.

A developer knows about the programming languages, design patterns and application architecturing, but database skills are always attributed to the Database Administrator role. This is a very dangerous assumption.

It’s as if we developed on Linux without ever wanting to learn how the operating system works, relying solely on the System Administrator knowledge. If you develop enterprise applications, you have no escape but learning how a database works. Reading the excellent “SQL Performance Explained” book, made me realize how little I knew about the inner-workings of relational database systems. This book is meant for developers and it’s a must-read for every enterprise developer professional.

What can we do to improve the situation for our industry? Is there a chance for a tighter integration of JPA and SQL? Or specifically, of Hibernate and jOOQ?

First, it’s the mindset that needs to change. We need to acknowledge that there’s no such thing as a one-size-fits-all framework, and that applies to database access as well. When I write unit tests, I don’t limit myself to JUnit. I also use Mockito and Hamcrest, a testing stack being a better alternative.

JPA excels when writing data, because you can the INSERT/UPDATE statements are automatically updated, whenever the persistence model changes. The implicit and explicit locking allow us to protect against lost updates, especially in long conversation workflows.

But while abstracting the SQL write statements is a doable task, when it comes to reading data, nothing can beat native SQL. The most commonly-used RDBMS have implemented non-standard data access techniques (window functions, Common Table Expressions, PIVOT) and the SQL-92 JPA abstraction layer can only focus on common functionalities. That’s why native querying is unavoidable on almost any enterprise system.

jOOQ has done a very good job promoting SQL knowledge into the Java ecosystem. Java is ruling the enterprise software development and SQL skills have always been the Achilles heel of most enterprise development teams.

While you can fire native queries from JPA, there’s no support for dynamic native query building. jOOQ allows you to build type-safe dynamic native queries, strengthening your application against SQL-injection attacks. jOOQ can be integrated with JPA, as I already proven on my blog, and the JPA-jOOQ combo can provide a solid data access stack.

Tell us a little bit about your Hibernate Master Class, and your personal blogging strategy.

The Hibernate Master Class blog series is actually a book in the making. Because I work a full-time job, it’s difficult to commit to a fixed writing schedule, so I can only write as much as my spare times allows me.

Once all topics are covered, I’ll turn all this info into a book, that I’m going to self-publish, following the “SQL Performance Explained” example.

[ Edit ] The book has been finished and is available here:

https://leanpub.com/high-performance-java-persistence

Where will you be in 5 years?

I enjoy both software architecture, as well as writing about it. I will continue on this journey and see where the wind will carry me.