When to Use Bind Values, and When to Use Inline Values in SQL

Users of jOOQ, PL/SQL, or T-SQL are spoiled, as they hardly ever need to worry about bind values. Consider the following statements:

Using jOOQ

public int countActors(String firstName, String lastName) {
    return ctx.selectCount()
              .from(ACTOR)
              .where(ACTOR.FIRST_NAME.eq(firstName))
              .and(ACTOR.LAST_NAME.eq(lastName))
              .fetchOneInto(int.class);
}

The method parameters firstName and lastName will be automatically mapped to bind values in the generated SQL statement. Here’s the debug log output when running the above, where the first statement is sent to the JDBC driver and then to the database, whereas the second statement is generated for debugging purposes only:

Executing query          : 
    select count(*)
    from "SAKILA"."ACTOR"
    where (
      "SAKILA"."ACTOR"."FIRST_NAME" = ?
      and "SAKILA"."ACTOR"."LAST_NAME" = ?
    )
-> with bind values      : 
    select count(*)
    from "SAKILA"."ACTOR"
    where (
      "SAKILA"."ACTOR"."FIRST_NAME" = 'SUSAN'
      and "SAKILA"."ACTOR"."LAST_NAME" = 'DAVIS'
    )
Fetched result           : +-----+
                         : |count|
                         : +-----+
                         : |    2|
                         : +-----+

With this API design, users don’t have to worry about binding variables explicitly at all, nor about remembering bind variable indexes or the data type of a bind value. This works because the overloaded Field<T>.eq(T) method (as well as pretty much every other method that works in a similar way) internally delegates to the more generic Field<T>.eq(Field<T>) method by wrapping the argument value in an explicit bind variable expression DSL.val(T) in the jOOQ SQL expression tree.
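To see the delegation for yourself, here’s a minimal sketch contrasting the convenience overload with the explicit DSL.val(T) call; both predicates render identical SQL with a bind variable:

// Both conditions are equivalent: the convenience overload wraps the
// argument value in an explicit bind variable expression internally.
Condition c1 = ACTOR.FIRST_NAME.eq("SUSAN");
Condition c2 = ACTOR.FIRST_NAME.eq(DSL.val("SUSAN"));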

Using PL/SQL

The same is true when you’re using PL/SQL (or any other stored procedure language of another database), for instance:

SET SERVEROUTPUT ON
DECLARE
  
  FUNCTION count_actors (
    p_first_name VARCHAR2, 
    p_last_name VARCHAR2
  ) RETURN NUMBER IS 
    v_result NUMBER(10);
  BEGIN
    SELECT count(*)
    INTO v_result
    FROM actor
    WHERE first_name = p_first_name
    AND last_name = p_last_name;
    
    RETURN v_result;
  END count_actors;
  
BEGIN
  dbms_output.put_line(count_actors('SUSAN', 'DAVIS'));
END;
/

To be sure what happened, let’s consider the execution plan:

SELECT p.*
FROM (
  SELECT *
  FROM v$sql
  WHERE upper(sql_text) LIKE 'SELECT COUNT(*) FROM ACTOR%'
  ORDER BY last_active_time DESC
  FETCH NEXT 1 ROW ONLY
) s 
CROSS APPLY TABLE(dbms_xplan.display_cursor(sql_id => s.sql_id)) p;

As you can see below, bind variables were generated for the SQL query that was embedded in the PL/SQL function:

SQL_ID  9dgammbskx5tx, child number 0
-------------------------------------
SELECT COUNT(*) FROM ACTOR WHERE FIRST_NAME = :B2 AND LAST_NAME = :B1
 
Plan hash value: 3384208144
 
----------------------------------------------------
| Id  | Operation         | Name           | Rows  |
----------------------------------------------------
|   0 | SELECT STATEMENT  |                |       |
|   1 |  SORT AGGREGATE   |                |     1 |
|*  2 |   INDEX RANGE SCAN| IDX_ACTOR_NAME |     1 |
----------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("LAST_NAME"=:B1 AND "FIRST_NAME"=:B2)

Why is it important? 1: SQL injection

I’ve noticed this time and again when talking to Java developers: Many developers are aware of the risk of SQL injection when not using bind variables. This can happen when we write dynamic SQL like this, e.g. using JDBC or jOOQ’s plain SQL API:

String sql = 
  "SELECT count(*) "
+ "FROM actor "
+ "WHERE 1 = 1 "
+ (firstName != null ? "AND first_name = '" + firstName + "' " : "")
+ (lastName != null ? "AND last_name = '" + lastName + "' " : "");

// JDBC
try (Statement s = connection.createStatement();
     ResultSet rs = s.executeQuery(sql)) {

    ...
}

// jOOQ
Result<?> result = ctx.fetch(sql);

Seriously. Don’t do this. Ever! Always use bind values for user input. There’s hardly any reason at all why you should inline the values. I mean, of course, you could if you always remember to manually escape all strings, e.g.:

public static String escape(String string) {
    // TODO: Handle MySQL's non-standard backslash escaping, too
    return string == null ? null : string.replace("'", "''");
}

And then:

String sql = 
  "SELECT count(*) "
+ "FROM actor "
+ "WHERE 1 = 1 "
+ (firstName != null ? "AND first_name = '" + escape(firstName) + "' " : "")
+ (lastName != null ? "AND last_name = '" + escape(lastName) + "' " : "");

Notice that there’s still a vulnerability risk in MySQL, which doesn’t necessarily conform to standard SQL string literal escaping. A very unfortunate MySQL “feature”, which is handled correctly by jOOQ:
https://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sqlmode_no_backslash_escapes

But why run the risk? It’s so much easier with bind values.
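For reference, here’s a sketch of the same dynamic query rewritten with a JDBC PreparedStatement; the only extra effort is keeping track of the parameter index:

String sql = 
  "SELECT count(*) "
+ "FROM actor "
+ "WHERE 1 = 1 "
+ (firstName != null ? "AND first_name = ? " : "")
+ (lastName != null ? "AND last_name = ? " : "");

try (PreparedStatement s = connection.prepareStatement(sql)) {
    int i = 1;

    // Bind only the parameters whose placeholders were rendered:
    if (firstName != null)
        s.setString(i++, firstName);
    if (lastName != null)
        s.setString(i++, lastName);

    try (ResultSet rs = s.executeQuery()) {
        ...
    }
}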

Why is it important? 2: Performance!

So, most Java developers, luckily, are aware of SQL injection vulnerabilities and mostly get this right. But there’s another thing that few Java developers are aware of, unfortunately. And that’s the performance aspect of using bind values. Let’s assume we’re not using bind values for the above query. Check this out:

SELECT count(*)
FROM actor
WHERE first_name = 'SUSAN'
AND last_name = 'DAVIS';

SELECT count(*)
FROM actor
WHERE first_name = 'NICK'
AND last_name = 'WAHLBERG';

And now, let’s find execution plans:

SELECT p.*
FROM (
  SELECT *
  FROM v$sql
  WHERE upper(sql_text) LIKE 'SELECT COUNT(*) FROM ACTOR%'
  ORDER BY last_active_time DESC
  FETCH NEXT 2 ROWS ONLY
) s 
CROSS APPLY TABLE(dbms_xplan.display_cursor(sql_id => s.sql_id)) p;

Result:

SQL_ID  12r8afykqkzbd, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE first_name = 'NICK' AND last_name = 'WAHLBERG'
 
Plan hash value: 3384208144
 
----------------------------------------------------
| Id  | Operation         | Name           | Rows  |
----------------------------------------------------
|   0 | SELECT STATEMENT  |                |       |
|   1 |  SORT AGGREGATE   |                |     1 |
|*  2 |   INDEX RANGE SCAN| IDX_ACTOR_NAME |     1 |
----------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("LAST_NAME"='WAHLBERG' AND "FIRST_NAME"='NICK')
 
SQL_ID  gfppqr9gpjqws, child number 0
-------------------------------------
SELECT count(*) FROM actor WHERE first_name = 'SUSAN' AND last_name = 'DAVIS'
 
Plan hash value: 3384208144
 
----------------------------------------------------
| Id  | Operation         | Name           | Rows  |
----------------------------------------------------
|   0 | SELECT STATEMENT  |                |       |
|   1 |  SORT AGGREGATE   |                |     1 |
|*  2 |   INDEX RANGE SCAN| IDX_ACTOR_NAME |     1 |
----------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("LAST_NAME"='DAVIS' AND "FIRST_NAME"='SUSAN')

As you can see, the two almost identical queries produced the exact same execution plan twice (same plan hash value), but they are stored under two distinct SQL_ID values. Yes, they’re two distinct SQL statements!

Why?

Most sophisticated databases (including Oracle, SQL Server, DB2 and others) implement an execution plan cache. The reason for this is simple. Calculating an execution plan with cost based optimisation is expensive. Not for these trivial statements, but imagine you have dozens of JOINs, semi joins, unions, and whatnot. There are thousands of candidate execution plans, and the database needs to find the best one for you. That can be tons of work, and we don’t want the database to do that work every time.

So, the database will run a soft-parse (Oracle speak) to identify a SQL query and translate its SQL string to a SQL_ID. It will then check if there’s already a suitable plan available for that particular SQL_ID, and if so, it will avoid the so-called hard-parse (Oracle speak) to calculate a new plan.

Let me repeat this one more time: Making best use of this plan cache (Oracle speak: Cursor cache) is extremely important. You can severely overload a system up to the point of bringing it down, if you’re not using bind variables.
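To make this concrete on the client side, here’s a minimal sketch (the actor_id column is assumed from the Sakila schema): the first loop produces a single SQL string, and thus a single plan cache entry, whereas the second floods the cache with a thousand distinct statements, each requiring its own hard-parse:

// One SQL string, one soft-parse per execution after the first:
try (PreparedStatement s = connection.prepareStatement(
    "SELECT count(*) FROM actor WHERE actor_id = ?")) {

    for (int id = 1; id <= 1000; id++) {
        s.setInt(1, id);
        try (ResultSet rs = s.executeQuery()) {
            ...
        }
    }
}

// A thousand distinct SQL strings, a thousand hard-parses. Don't do this:
for (int id = 1; id <= 1000; id++) {
    try (Statement s = connection.createStatement();
         ResultSet rs = s.executeQuery(
             "SELECT count(*) FROM actor WHERE actor_id = " + id)) {
        ...
    }
}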

There’s a workaround to use CURSOR_SHARING=FORCE, which will transform inline values to bind values, but I’m not even going to explain how it works, because most Oracle experts advise you not to use that feature due to the significant drawbacks it will bring.

What if I want to inline values?

As we’ve seen above, when using jOOQ or PL/SQL, the above problems are really non-issues, because it is quite unlikely that you’ll accidentally inline your bind values:

  • In jOOQ, you’d have to use plain SQL
  • In PL/SQL, you’d have to use dynamic SQL using DBMS_SQL

Anyway, in rare cases, users want to inline their bind variables for them to appear directly in the resulting SQL statement. This is never the case for user input, for ID values, or for ordinary search values. But it can be the case when you query for discriminators or “enum” values (enforced through a CHECK constraint, for instance), or when you run a report only once a year (plan is never available from the cache).

In these cases, it can be advantageous not to use bind values, specifically to prevent the database from re-using a cached execution plan, because you know that the cached plan, which works well for 90% of the queries, won’t work well for this particular bind value (or in the case of the once-per-year report, you might get a slightly better plan by giving the database more information).

As an example, let’s insert one row with value 1 and 99999 rows with value 0 into a table:

CREATE TABLE skewed (
  v NUMBER(1)
);

INSERT INTO skewed
SELECT decode(level, 1, 1, 0)
FROM dual
CONNECT BY level <= 100000;

CREATE INDEX skewed_i ON skewed(v);

EXEC dbms_stats.gather_table_stats('TEST', 'SKEWED');

Now, clearly we see that when querying the table SKEWED for values V = 0, the index is useless, whereas it is very useful for values V = 1. Let’s run this statement here:

SET SERVEROUTPUT ON
DECLARE
  v_bind NUMBER(1);
  v_result NUMBER(10);
BEGIN
  v_bind := 0;
  
  SELECT count(*)
  INTO v_result
  FROM skewed
  WHERE v = v_bind;
  
  FOR rec IN (
    SELECT * FROM TABLE(dbms_xplan.display_cursor)
  ) LOOP
    dbms_output.put_line(rec.plan_table_output);
  END LOOP;

  v_bind := 1;
  
  SELECT count(*)
  INTO v_result
  FROM skewed
  WHERE v = v_bind;
  
  FOR rec IN (
    SELECT * FROM TABLE(dbms_xplan.display_cursor)
  ) LOOP
    dbms_output.put_line(rec.plan_table_output);
  END LOOP;
  
  SELECT count(*)
  INTO v_result
  FROM skewed
  WHERE v = 1;
  
  FOR rec IN (
    SELECT * FROM TABLE(dbms_xplan.display_cursor)
  ) LOOP
    dbms_output.put_line(rec.plan_table_output);
  END LOOP;
END;
/

The above block runs the exact same statement three times:

  1. With a bind variable of 0 (can’t really use the index)
  2. With a bind variable of 1 (should be using the index)
  3. With an inline value of 1 (should be using the index)

Here’s the result:

SQL_ID  1q0qjm8za06w3, child number 0
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = :B1
 
Plan hash value: 4055318479
 
--------------------------------------------------
| Id  | Operation             | Name     | Rows  |
--------------------------------------------------
|   0 | SELECT STATEMENT      |          |       |
|   1 |  SORT AGGREGATE       |          |     1 |
|*  2 |   INDEX FAST FULL SCAN| SKEWED_I | 99999 |
--------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("V"=:B1)
 
SQL_ID  1q0qjm8za06w3, child number 0
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = :B1
 
Plan hash value: 4055318479
 
--------------------------------------------------
| Id  | Operation             | Name     | Rows  |
--------------------------------------------------
|   0 | SELECT STATEMENT      |          |       |
|   1 |  SORT AGGREGATE       |          |     1 |
|*  2 |   INDEX FAST FULL SCAN| SKEWED_I | 99999 |
--------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("V"=:B1)
 
SQL_ID  bdpjxzqpg2416, child number 0
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = 1
 
Plan hash value: 276090370
 
----------------------------------------------
| Id  | Operation         | Name     | Rows  |
----------------------------------------------
|   0 | SELECT STATEMENT  |          |       |
|   1 |  SORT AGGREGATE   |          |     1 |
|*  2 |   INDEX RANGE SCAN| SKEWED_I |     1 |
----------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("V"=1)

Observe the estimated numbers of rows above (in the Rows column):

  1. In the first case, the estimate is correct. With V = 0, we’ll get 99999 rows, so we might just as well scan the entire index to calculate the count value
  2. In the second case, we’ve inherited the cached execution plan from the first run, including the estimate of 99999 rows, which is clearly wrong in this case. The database should’ve estimated 1 row here
  3. In the third case, the estimate is again correct (note the different SQL_ID), and we get an optimal plan

We’re just out of luck. If we had inverted the order of the queries, we would have gotten the right estimate for V = 1, but a wrong estimate for V = 0.

Adaptive Cursor Sharing

Oracle offers some features to remedy the above problems. Oracle 11g introduced adaptive cursor sharing, which means that if we re-execute the above statements, the database will have already figured out that in this particular case, the plans should depend on the actual bind variable, because in hindsight, the second plan was wrong for V = 1.

On a second execution of the previous block, we’ll see:

SQL_ID  1q0qjm8za06w3, child number 1
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = :B1
 
Plan hash value: 4055318479
 
--------------------------------------------------
| Id  | Operation             | Name     | Rows  |
--------------------------------------------------
|   0 | SELECT STATEMENT      |          |       |
|   1 |  SORT AGGREGATE       |          |     1 |
|*  2 |   INDEX FAST FULL SCAN| SKEWED_I | 99999 |
--------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("V"=:B1)
 
SQL_ID  1q0qjm8za06w3, child number 2
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = :B1
 
Plan hash value: 276090370
 
----------------------------------------------
| Id  | Operation         | Name     | Rows  |
----------------------------------------------
|   0 | SELECT STATEMENT  |          |       |
|   1 |  SORT AGGREGATE   |          |     1 |
|*  2 |   INDEX RANGE SCAN| SKEWED_I |     1 |
----------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("V"=:B1)
 
SQL_ID  bdpjxzqpg2416, child number 0
-------------------------------------
SELECT COUNT(*) FROM SKEWED WHERE V = 1
 
Plan hash value: 276090370
 
----------------------------------------------
| Id  | Operation         | Name     | Rows  |
----------------------------------------------
|   0 | SELECT STATEMENT  |          |       |
|   1 |  SORT AGGREGATE   |          |     1 |
|*  2 |   INDEX RANGE SCAN| SKEWED_I |     1 |
----------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("V"=1)

As you can see above, the first two executions are still using the same plan, but they now have a new “child number” value, indicating that for any given SQL_ID, there are now different candidate plan hash values, depending on the bind variable input that we’re getting.

This often works well with the following caveats:

  • It only works after the plan was wrong at least once. This translates to at least one poor user who suffers from a slow query
  • It only works if the plans remain in the cache (which isn’t the case for rarely executed queries). Once the plans are purged from the cache, we have to run the query again at least twice for the alternative “child numbers” to appear
  • It stops working well if your query is much more complex, i.e. when complicated correlations exist between the bind variables that should each produce different plans

So, if in doubt, in this particular case, it’s not a bad idea to simply use an inline value / constant literal in the query directly, to help the optimiser make the right choice.
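In jOOQ, such deliberate inlining can be expressed with DSL.inline() instead of the default bind value behaviour. A minimal sketch, assuming generated code for the SKEWED table from above:

// Default behaviour: renders "v = ?" and shares a single cached plan
ctx.selectCount().from(SKEWED).where(SKEWED.V.eq(0)).fetchOne();

// Deliberate inlining: renders "v = 1", a distinct SQL string with its own plan
ctx.selectCount().from(SKEWED).where(SKEWED.V.eq(DSL.inline(1))).fetchOne();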

Another interesting article about this topic is here, where the optimiser will always be fooled by bind variables, and where you should always use inline values: When querying discriminators from views.

Adaptive Execution Plans

Oracle 12c took adaptive cursor sharing one step further and now supports adaptive execution plans, meaning that certain execution plans are known to be “shaky” in advance and the optimiser already pre-calculated a fallback plan that applies if the estimates are wrong for a given execution. In that case, the plan can be changed “in flight” and an alternative plan, which is better for a particular execution, is applied.

This feature is still not too stable in Oracle 12cR1, which is why some Oracle experts generally recommend turning it off.

Conclusion

You should always use bind values by default. In 99% of all cases, they’re the right choice for two reasons:

  • SQL injection prevention (obvious)
  • Statement caching optimisation (less obvious)

The latter reason is one that is not really well known among developers, because it’s not a problem that appears on development environments. It’s a production-only problem that happens under heavy load. Yet, you should be aware of this problem, and always remember to avoid generating too many distinct SQL strings (see also this related article about IN lists).
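The IN list problem deserves a quick illustration: every distinct number of IN list elements produces a distinct SQL string, and thus a distinct plan cache entry. jOOQ can mitigate this by padding IN lists, assuming the inListPadding setting available in recent jOOQ versions (a sketch, not a definitive configuration):

// IN (?, ?) and IN (?, ?, ?) are two distinct SQL strings to the plan cache.
// With padding enabled, both may render as IN (?, ?, ?, ?), reusing one entry:
Settings settings = new Settings().withInListPadding(true);
DSLContext padded = DSL.using(connection, SQLDialect.ORACLE, settings);

padded.selectFrom(ACTOR)
      .where(ACTOR.LAST_NAME.in("DAVIS", "WAHLBERG", "HOPKINS"))
      .fetch();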

In rare cases, it is better to use inline values / literals, as this will help the optimiser make a much better choice in a predictable manner. These cases include:

  • Querying skewed data (unless adaptive features can be expected to kick in)
  • Querying discriminators (in this case, it’s always advisable to use inline values)

Using languages like PL/SQL, T-SQL, PL/pgSQL, or APIs like jOOQ definitely helps you get this right, because you don’t have to think about this problem anymore. You’ll get it right automatically.

Side-note: Hibernate’s Criteria Query

Like jOOQ, Hibernate supports a type safe DSL for constructing JPQL queries, which to some extent covers basic SQL functionality when querying entities. Hibernate historically chose quite interesting defaults (as of version 5.2.10):

  • String values are always transformed to bind values, regardless of whether you’re using implicit values, explicit parameters, or explicit literals. The goal of this is to prevent SQL injection (because currently, Hibernate doesn’t auto-escape inline string literals)
  • Implicit integer values are always inlined
  • Explicit parameters are always sent as parameters
  • Explicit literals are usually sent as literals (unless they’re strings)
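To make these defaults concrete, here’s a minimal JPA Criteria sketch (the Actor entity and its attribute names are assumptions for illustration), contrasting an explicit parameter with an explicit literal:

CriteriaBuilder cb = em.getCriteriaBuilder();
CriteriaQuery<Actor> q = cb.createQuery(Actor.class);
Root<Actor> actor = q.from(Actor.class);

// Explicit parameter: always rendered as a bind value
ParameterExpression<String> fn = cb.parameter(String.class, "fn");
q.where(cb.equal(actor.get("firstName"), fn));
List<Actor> actors = em.createQuery(q)
                       .setParameter("fn", "SUSAN")
                       .getResultList();

// Explicit literal: usually rendered as an inline literal (unless it's a String)
q.where(cb.equal(actor.get("actorId"), cb.literal(1)));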

These defaults are unfortunate, as we’ve seen before, for the following reasons:

  • Bind values should always be the default, especially when using ID values, which are integers and, as such, often inlined in the current implementation. Luckily, this has been recognised as a bug and will be fixed soon: https://hibernate.atlassian.net/browse/HHH-9576
  • Literals should be sent to the server as literals, because when users need literals (e.g. in the above edge cases), they don’t want the API to override this behaviour through bind values. This might also be solved soon: https://hibernate.atlassian.net/browse/HHH-11778

I’m currently working with the team to remedy these problems, such that the criteria API won’t inhibit users from a performance perspective.

Applying Queueing Theory to Dynamic Connection Pool Sizing with FlexyPool

I’m very happy to have another interesting blog post by Vlad Mihalcea on the jOOQ blog, this time about his open source library FlexyPool. Read his previous jOOQ Tuesdays post on Hibernate here.

Vlad is a Hibernate developer advocate, and he’s the author of the popular book High-Performance Java Persistence, so he knows a thing or two about connection pooling.


Introduction

Back in 2014, I was working as a software architect, and our team was building a real-estate platform which was composed of multiple nodes, as depicted in the following diagram:

[Diagram: front-end and back-end nodes connecting to replicated database nodes]

This is a classic enterprise architecture layout. The database is replicated to provide better throughput and availability in case of node failures. There are front-end nodes that deliver the website content. There are also many back-end nodes, like email schedulers or data import batch processors.

All these nodes require database connectivity, either to the Master node, for read-write transactions, or to the Slave nodes, for read-only transactions.

Because acquiring database connections is an expensive process, each system node uses its own connection pool. By reusing physical database connections, connection acquisition becomes very fast, thereby reducing the overall transaction response time.

Not only can a connection pool reduce transaction response time, but it can also level out traffic spikes. Without a connection pool, during a traffic spike, the front-end nodes might acquire all database connections, leaving the back-end processors with no database connectivity.

The connection pool, having a maximum number of database connections, allows excess connection requests to queue whenever a traffic spike happens. Therefore, during a traffic spike, the transaction response time will increase due to the queuing mechanism, but this is way better than taking down the whole system.

For these two reasons, the connection pool is a very good choice in many enterprise systems.

Based on the underlying hardware resources, a relational database can only offer a limited number of connections. For this reason, we must be very careful when choosing the pool size for each particular system node.
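To make the trade-off concrete, here’s a sketch of how such limits are configured in a pool like HikariCP (the property names are HikariCP’s public API; the values are illustrative only):

HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://localhost/realestate"); // illustrative URL
config.setMaximumPoolSize(10);      // hard cap, protecting the database
config.setConnectionTimeout(5000);  // ms a request may queue before failing

DataSource dataSource = new HikariDataSource(config);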

Connection pool sizing

I was the lucky person who got the task of figuring out how many connections we should allocate for each system node in our real-estate platform. Since I graduated in Electronics and Telecommunications, I remembered that we had learned about a similar problem when having to provision telecommunications networks. Agner Krarup Erlang invented queuing theory to solve this problem, and I was curious whether we could also find the right pool size by applying Erlang’s queuing models.

I was not the only one trying to apply queuing theory principles to software systems. Percona has a very interesting study, Forecasting MySQL Scalability, on estimating the actual service time in a system that is affected by a myriad of variables.

In the end, I realized that the best way to tackle this problem is constant measuring and adjusting. For this reason, I needed a tool to capture database connection metrics, as well as a way to adjust a given connection pool while the enterprise system is running.

And, that’s how FlexyPool was born.

Basically, FlexyPool is a DataSource Proxy that stands in front of the actual JDBC DataSource or other proxies (e.g. statement logging).

[Diagram: FlexyPool as a DataSource proxy in front of the actual connection pool]
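FlexyPool’s actual implementation is far more sophisticated, but the proxy idea itself is simple. A minimal sketch (not FlexyPool code) of a DataSource proxy that measures connection acquisition time might look like this:

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.logging.Logger;
import javax.sql.DataSource;

// A minimal sketch of the DataSource proxy idea, timing how long it
// takes to acquire a connection from the wrapped pool.
public class MetricsDataSource implements DataSource {

    private final DataSource target;

    public MetricsDataSource(DataSource target) {
        this.target = target;
    }

    @Override
    public Connection getConnection() throws SQLException {
        long start = System.nanoTime();
        Connection connection = target.getConnection();
        long micros = (System.nanoTime() - start) / 1000;
        System.out.println("Connection acquired in " + micros + " µs");
        return connection;
    }

    @Override
    public Connection getConnection(String username, String password)
    throws SQLException {
        return target.getConnection(username, password);
    }

    // The remaining methods simply delegate to the wrapped data source:
    @Override public PrintWriter getLogWriter() throws SQLException { return target.getLogWriter(); }
    @Override public void setLogWriter(PrintWriter out) throws SQLException { target.setLogWriter(out); }
    @Override public void setLoginTimeout(int seconds) throws SQLException { target.setLoginTimeout(seconds); }
    @Override public int getLoginTimeout() throws SQLException { return target.getLoginTimeout(); }
    @Override public Logger getParentLogger() throws SQLFeatureNotSupportedException { return target.getParentLogger(); }
    @Override public <T> T unwrap(Class<T> iface) throws SQLException { return target.unwrap(iface); }
    @Override public boolean isWrapperFor(Class<?> iface) throws SQLException { return target.isWrapperFor(iface); }
}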

FlexyPool supports a great variety of stand-alone connection pools, and it collects the following metrics:

  • concurrent connections histogram
  • concurrent connection requests histogram
  • data source connection acquiring time histogram
  • connection lease time histogram
  • maximum pool size histogram
  • total connection acquiring time histogram
  • overflow pool size histogram
  • retry attempts histogram

For instance, the concurrent connection count metric gives you an insight into how many connections are required by a certain application under a given traffic load:

[Chart: concurrent connection count over time]

The connection acquisition metric tells you how much time it takes to obtain a database connection from the pool:

[Chart: connection acquisition time over time]

The connection lease time allows you to spot long-running transactions, which are undesirable in high-performance OLTP applications:

[Chart: connection lease time over time]

For stand-alone connection pools, FlexyPool can increment the pool size beyond the maximum capacity, as it offers an overflow buffer. The benefit of this overflow buffer is that it allows you to increase the pool size only when the incoming traffic causes connection acquisition timeouts.

Although FlexyPool can also monitor Java EE connection pools, it cannot increase the pool size in Java EE environments since the DataSource is an application server managed resource.

Conclusion

As enterprise systems evolve, so do the underlying data access patterns. For this reason, database connection usage is a very important metric, which needs to be monitored on a regular basis. FlexyPool builds on top of Dropwizard (formerly Coda Hale) Metrics, so you can easily integrate it with well-known Application Performance Monitoring tools, such as Graphite or Grafana.

FlexyPool is open source, and it is licensed under the Apache License 2.0. You can find the project repository on GitHub, and all the released dependencies are available on Maven Central, so it’s very easy to integrate into your own project.

FlexyPool is powering many enterprise systems, like Etuovi, Mitch&Mates, and ScentBird. If you decide to use it in your current enterprise system, and you are willing to provide a testimonial, you can win a free copy of my High-Performance Java Persistence book.

jOOQ Tuesdays: Thorben Janssen Shares his Hibernate Performance Secrets

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.


I’m very excited to feature Thorben Janssen today, who has spent most of his professional life with Hibernate.

Thorben, with your blog and training, you are one of the few daring “annotatioficionados”, as we like to call them, who risk diving deep into JPA’s more sophisticated annotations – like @SqlResultSetMapping. What is your experience with JPA’s advanced, declarative programming style?

From my point of view, the declarative style of JPA is great and a huge problem at the same time.

If you know what you’re doing, you just add an annotation, set a few properties and your JPA implementation takes care of the rest. That makes it very easy to use complex features and avoids a lot of boilerplate code.

But it can also become a huge issue when someone is not that familiar with JPA and just copies a few annotations from Stack Overflow and hopes that they work.

It will work in most cases. JPA and Hibernate are highly optimized and handle suboptimal code and annotations quite well. At least as long as it is tested with one user on a local machine. But that changes quickly when the code gets deployed to production and several hundred or thousand users use it in parallel. These issues then often get posted on Stack Overflow or other forums, together with a complaint about the bad performance of Hibernate…

Your training goes far beyond these rather esoteric use-cases and focuses on JPA / Hibernate performance. What are three things every ORM user should know about JPA / SQL performance?

Only three things? I could talk about a lot more things related to JPA and Hibernate performance.

By far the most important one is to remember that your ORM framework is using SQL to store your data in a relational database. That seems pretty obvious, but you can avoid the most common performance issues by analyzing and optimizing the executed SQL statements. One example of that is the popular n+1 select issue, which you can easily find and fix as I show in my free, 3-part video course.

Another important thing is that no framework or specification provides a good solution for every problem. JPA and Hibernate make it very easy to insert and update data in a relational database. And they provide a set of advanced features for performance optimizations, like caching or the ordering of statements to improve the efficiency of JDBC batches.

But Hibernate and JPA are not a good fit for applications that have to perform a lot of very complex queries for reporting or data mining use cases. The feature set of JPQL is too limited for these use cases. You can, of course, use native queries to execute plain SQL, but you should have a look at other frameworks if you need a lot of these queries.

So, always make sure that your preferred framework is a good fit for your project.

The third thing you should keep in mind is that you should prefer lazy fetching for the relationships between your entities. This prevents Hibernate from executing additional SQL queries to initialize the relationships to other entities when it gets an entity from the database. Most use cases don’t need the related entities, and the additional queries slow down the application. And if one of your use cases uses the relationships, you can use FETCH JOIN statements or entity graphs to initialize them with the initial query.

This approach avoids the overhead of unnecessary SQL queries for most of your use cases and allows you to initialize the relationships if you need them.
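A minimal sketch of this approach, reusing the Author / Book entities that appear later in this post (a lazy books association is assumed):

// Lazy everywhere else; initialized eagerly only for this use case.
// DISTINCT avoids duplicate authors caused by the join:
List<Author> authors = em.createQuery(
        "SELECT DISTINCT a FROM Author a JOIN FETCH a.books", Author.class)
    .getResultList();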

These are the 3 most important things you should keep in mind if you want to avoid performance problems with Hibernate. If you want to dive deeper into this topic, have a look at my Hibernate Performance Tuning Online Training. The next one starts on July 23rd.

What made you focus your training mostly on Hibernate, rather than also on EclipseLink / OpenJPA, or just plain SQL / jOOQ? Do you have plans to extend to those topics?

To be honest, that decision was quite easy for me. I’ve been working with Hibernate for about 15 years now and have used it in a lot of different projects with very different requirements. That gives me the experience and knowledge about the framework which you need if you want to optimize its performance. I also tried EclipseLink, but not to the same extent as Hibernate.

And I also asked my readers which JPA implementation they use, and most of them told me that they either use plain JPA or Hibernate. That made it pretty easy to focus on Hibernate.

I might integrate jOOQ into one of my future trainings. Because as I said before, Hibernate and JPA are a good solution if you want to create or update data or if your queries are not too complex. As soon as your queries get complex, you have to use native queries with plain SQL. In these cases, jOOQ can provide some nice benefits.

What’s the advantage of your online training over a more classic training format, where people meet physically – both for you and for your participants?

The good thing about a classroom training is that you can discuss your questions with other students and the instructor. But it also requires you to be in a certain place at a certain time which creates additional costs, requires you to get out of your current projects and keeps you away from home.

With the Hibernate Performance Tuning Online Training, I want to provide a similar experience to a classroom training, in which you study with other students and ask your questions, but without having to travel somewhere. You can watch my training videos and do the exercises from your office or home, and meet with me and other students in the forum or group coaching calls to discuss your questions.

So you get the best of both worlds without declaring any travel expenses 😉

Your blog also includes a weekly digest of all things happening in the Java ecosystem called Java Weekly. What are the biggest insights into our ecosystem that you’ve gotten out of this work, yourself?

The Java ecosystem is always changing and improving, and you need to learn constantly if you want to stay up to date. One way to do that is to read good blog posts. And there are A LOT of great, small blogs out there written by very experienced Java developers who like to share their knowledge. You just have to find them. That’s probably the biggest insight I got.

I read a lot about Java and Java EE each week (that’s probably the only advantage of a 1.5-hour commute with public transportation) and present the most interesting ones every Monday in a new issue of Java Weekly.

It is all about the JDBC Basics

We’re very happy to announce a guest post by Marco Behler, who has been blogging about jOOQ in the past.

Marco started out in programming (reverse-engineering, actually) and now mainly programmes on the JVM in his day-to-day work. He also always had a sweet tooth for strategy and marketing. Marco Behler GmbH is the result of that hybrid role.

It is all about the JDBC Basics

It is one of those days.

You are reading the Spring documentation’s @Transactional section and still don’t understand the difference between logical and physical transaction scopes. Simultaneously, your app throws a LazyInitializationException and you have no idea why. To top it off, you see spontaneous database deadlocks in production and you suspect your connection pool is leaking connections… somehow.

Know what most likely would have helped instead of banging your head against the wall? Spending a couple (literally) of hours on learning the JDBC basics. Let’s find out why:

What are the JDBC basics?

The basics are opening up/closing database connections and then working with transactions. Also understanding how deadlocks, pessimistic and optimistic locking work on a plain JDBC level. A bit of isolation levels and savepoints, and then directly on to connection pools and JDBC driver logging. That’s it. Seriously.

Why are the basics so important?

Everything you will encounter in frameworks like Spring, Hibernate, jOOQ etc. builds on these basics. For example, there are a gazillion topics on the internet regarding Hibernate’s LazyInitializationException, and I was scared of that particular exception myself many years ago. But what else would you expect when trying to query the database without having an open database connection (which is basically all that this exception is)?

The same goes for Spring’s “transaction framework”. There is so much content, or shall we say (F)ear/(U)ncertainty/(D)oubt, out there on how to open up transactions with Spring, be it programmatically, with annotations, or with XML. But what if you knew that under the hood, there is only one way (and actually one line of code) to open up transactions in the JDBC world?
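That one line, sketched here with plain JDBC, is simply turning off auto-commit:

connection.setAutoCommit(false); // this "opens" the transaction

try {
    // ... execute your statements ...
    connection.commit();
}
catch (SQLException e) {
    connection.rollback();
    throw e;
}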

Let me not even get started on the various (mis)configurations of connection pools you see in production in the wild. Or the unawareness of JDBC (driver) logging, which usually leads to debugging in the wild. All basics, which you can master in a couple of hours and which will help you for a lifetime!

Why do people not just learn the basics?

In every medium-sized project, there is a ton of technologies involved, and there usually is no clear-cut path on how to learn all of them or how they all work together. It simply takes a lot of time and effort to dig through everything.

There are JPA sessions and JDBC connections, and then Spring somehow provides those transactional proxies in 5 different ways, and then some other colleague just put jOOQ into the mix, but then somehow my session doesn’t flush and my objects don’t get persisted and the HibernateTransactionManager is not working as expected.

With all of this, I would also hope for my database transactions just to commit – god forbid what happens on rollback 🙂

But in the end, every technology mentioned is just a layer on top of JDBC. If you understand transactions or deadlocks or savepoints on the basic level, then Spring or Hibernate or jOOQ will not throw you off.

So what do you recommend?

If you want to get miles ahead in your day-to-day database programming, you have to start with the basics. Step by step. And then you will see most of your problems automatically evaporate.

Out of my extensive database programming experience, I created an ebook with a ton of ready-to-run exercises, which will take you from Java database novice to expert. At your own pace. You can literally copy the source code of every chapter into your IDE, run it and (hopefully) learn from it. It covers plain JDBC, Spring, Hibernate, jOOQ (soon) and also distributed transactions.

You can read the whole book for free online here, and I would love to get your feedback! I would really like to let the community feedback flow back into future editions of the book. In addition, if you like what you see and the exercises help you, you can also show your support by getting a paid digital version (pdf, epub, mobi).

In any case…

…learn your JDBC basics – and you will profit from them for a lifetime!

Type Safe Queries for JPA’s Native Query API

When you’re using JPA – sometimes – JPQL won’t do the trick and you’ll have to resort to native SQL. From the very beginning, ORMs like Hibernate kept an open “backdoor” for these cases and offered a similar API to Spring’s JdbcTemplate, to Apache DbUtils, or to jOOQ for plain SQL. This is useful as you can continue using your ORM as your single point of entry for database interaction.

However, writing complex, dynamic SQL using string concatenation is tedious and error-prone, and an open door for SQL injection vulnerabilities. Using a type safe API like jOOQ would be very useful, but you may find it hard to maintain two different connection, transaction, and session models within the same application, just for 10-15 native queries.

But the truth is:

You can use jOOQ for your JPA native queries!

That’s true! There are several ways to achieve this.

Fetching tuples (i.e. Object[])

The simplest way will not make use of any of JPA’s advanced features and simply fetch tuples in JPA’s native Object[] form for you. Assuming this simple utility method:

public static List<Object[]> nativeQuery(
    EntityManager em, 
    org.jooq.Query query
) {

    // Extract the SQL statement from the jOOQ query:
    Query result = em.createNativeQuery(query.getSQL());

    // Extract the bind values from the jOOQ query:
    List<Object> values = query.getBindValues();
    for (int i = 0; i < values.size(); i++) {
        result.setParameter(i + 1, values.get(i));
    }

    return result.getResultList();
}

Using the API

This is all you need to bridge the two APIs in their simplest form to run “complex” queries via an EntityManager:

List<Object[]> books =
nativeQuery(em, DSL.using(configuration)
    .select(
        AUTHOR.FIRST_NAME, 
        AUTHOR.LAST_NAME, 
        BOOK.TITLE
    )
    .from(AUTHOR)
    .join(BOOK)
        .on(AUTHOR.ID.eq(BOOK.AUTHOR_ID))
    .orderBy(BOOK.ID));

books.forEach((Object[] book) -> 
    System.out.println(book[0] + " " + 
                       book[1] + " wrote " + 
                       book[2]));

Agreed, not a lot of type safety in the results – as we’re only getting an Object[]. We’re looking forward to a future Java that supports tuple (or even record) types like Scala or Ceylon.

So a better solution might be the following:

Fetching entities

Let’s assume you have the following, very simple entities:

@Entity
@Table(name = "book")
public class Book {

    @Id
    public int id;

    @Column(name = "title")
    public String title;

    @ManyToOne
    public Author author;
}

@Entity
@Table(name = "author")
public class Author {

    @Id
    public int id;

    @Column(name = "first_name")
    public String firstName;

    @Column(name = "last_name")
    public String lastName;

    @OneToMany(mappedBy = "author")
    public Set<Book> books;
}

And let’s assume, we’ll add an additional utility method that also passes a Class reference to the EntityManager:

public static <E> List<E> nativeQuery(
    EntityManager em, 
    org.jooq.Query query,
    Class<E> type
) {

    // Extract the SQL statement from the jOOQ query:
    Query result = em.createNativeQuery(
        query.getSQL(), type);

    // Extract the bind values from the jOOQ query:
    List<Object> values = query.getBindValues();
    for (int i = 0; i < values.size(); i++) {
        result.setParameter(i + 1, values.get(i));
    }

    // There's an unsafe cast here, but we can be sure
    // that we'll get the right type from JPA
    return result.getResultList();
}

Using the API

This is now rather slick, just put your jOOQ query into that API and get JPA entities back from it – the best of both worlds, as you can easily add/remove nested collections from the fetched entities as if you had fetched them via JPQL:

List<Author> authors =
nativeQuery(em,
    DSL.using(configuration)
       .select()
       .from(AUTHOR)
       .orderBy(AUTHOR.ID)
, Author.class); // This is our entity class here

authors.forEach(author -> {
    System.out.println(author.firstName + " " + 
                       author.lastName + " wrote");
    
    author.books.forEach(book -> {
        System.out.println("  " + book.title);

        // Manipulate the entities here. Your
        // changes will be persistent!
    });
});

Fetching EntityResults

If you’re extra-daring and have a strange affection for annotations, or you just want to crack a joke for your coworkers just before you leave on vacation, you can also resort to using JPA’s javax.persistence.SqlResultSetMapping. Imagine the following mapping declaration:

@SqlResultSetMapping(
    name = "bookmapping",
    entities = {
        @EntityResult(
            entityClass = Book.class,
            fields = {
                @FieldResult(name = "id", column = "b_id"),
                @FieldResult(name = "title", column = "b_title"),
                @FieldResult(name = "author", column = "b_author_id")
            }
        ),
        @EntityResult(
            entityClass = Author.class,
            fields = {
                @FieldResult(name = "id", column = "a_id"),
                @FieldResult(name = "firstName", column = "a_first_name"),
                @FieldResult(name = "lastName", column = "a_last_name")
            }
        )
    }
)

Essentially, the above declaration maps database columns (@SqlResultSetMapping -> entities -> @EntityResult -> fields -> @FieldResult -> column) onto entities and their corresponding attributes. With this powerful technique, you can generate entity results from any sort of SQL query result.

Again, we’ll be creating a small utility method:

public static <E> List<E> nativeQuery(
    EntityManager em, 
    org.jooq.Query query,
    String resultSetMapping
) {

    // Extract the SQL statement from the jOOQ query:
    Query result = em.createNativeQuery(
        query.getSQL(), resultSetMapping);

    // Extract the bind values from the jOOQ query:
    List<Object> values = query.getBindValues();
    for (int i = 0; i < values.size(); i++) {
        result.setParameter(i + 1, values.get(i));
    }

    // This implicit cast is a lie, but let's risk it
    return result.getResultList();
}

Note that the above API makes use of an anti-pattern, which is OK in this case, because JPA is not a type safe API in the first place.

Using the API

Now, again, you can pass your type safe jOOQ query to the EntityManager via the above API, passing the name of the SqlResultSetMapping along like so:

List<Object[]> result =
nativeQuery(em,
    DSL.using(configuration)
       .select(
           AUTHOR.ID.as("a_id"),
           AUTHOR.FIRST_NAME.as("a_first_name"),
           AUTHOR.LAST_NAME.as("a_last_name"),
           BOOK.ID.as("b_id"),
           BOOK.AUTHOR_ID.as("b_author_id"),
           BOOK.TITLE.as("b_title")
       )
       .from(AUTHOR)
       .join(BOOK).on(BOOK.AUTHOR_ID.eq(AUTHOR.ID))
       .orderBy(BOOK.ID)), 
    "bookmapping" // The name of the SqlResultSetMapping
);

result.forEach((Object[] entities) -> {
    Author author = (Author) entities[1];
    Book book = (Book) entities[0];

    System.out.println(author.firstName + " " + 
                       author.lastName + " wrote " + 
                       book.title);
});

The result in this case is again an Object[], but this time, the Object[] doesn’t represent a tuple with individual columns, but it represents the entities as declared by the SqlResultSetMapping annotation.

This approach is intriguing and probably has its use when you need to map arbitrary results from queries but still want managed entities. We can only recommend Thorben Janssen‘s interesting blog series about these advanced JPA features if you want to know more.

Conclusion

Choosing between an ORM and SQL (or between Hibernate and jOOQ, in particular) isn’t always easy.

  • ORMs shine when it comes to applying object graph persistence, i.e. when you have a lot of complex CRUD, involving complex locking and transaction strategies.
  • SQL shines when it comes to running bulk SQL, both for read and write operations, when running analytics and reporting.

When you’re “lucky” (as in – the job is easy), your application is only on one side of the fence, and you can make a choice between ORM and SQL. When you’re “lucky” (as in – ooooh, this is an interesting problem), you will have to use both. (See also Mike Hadlow’s interesting article on the subject)

The message here is: You can! Using JPA’s native query API, you can run complex queries leveraging the full power of your RDBMS, and still map results to JPA entities. You’re not restricted to using JPQL.

Side-note

While we’ve been critical with some aspects of JPA in the past (read How JPA 2.1 has become the new EJB 2.0 for details), our criticism has been mainly focused on JPA’s (ab-)use of annotations. When you’re using a type safe API like jOOQ, you can provide the compiler with all the required type information easily to construct results. We’re convinced that a future version of JPA will engage more heavily in using Java’s type system, allowing a more fluent integration of SQL, JPQL, and entity persistence.

jOOQ Tuesdays: Vlad Mihalcea Gives Deep Insight into SQL and Hibernate

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.


We have the pleasure of talking to Vlad Mihalcea in this third edition, who will tell us about the skills developers need to acquire when working with Java, SQL, and Hibernate.

Hi Vlad – Your blog explodes with excellent posts about Hibernate. It looks like you love digging deep into the most popular persistence API in the market, right?

I really mean it when I say that “teaching is my way of learning”, and to master a certain technology, you have to go beyond the reference documentation. Hibernate has been around for 10 years now, and there’s a plethora of projects built on top of it. The Hibernate Master Class focuses on some proven ORM design patterns, like concurrency control, caching, and batching.

You’ve recently told me about your realisation of the lack of SQL insight in our industry. How did that come to be?

The Object-Relational mismatch is only the tip of the iceberg, when it comes to accessing data. The biggest problem we face in enterprise systems, is the Enterprise-Database developer mismatch.

A developer knows about programming languages, design patterns, and application architecture, but database skills are always attributed to the Database Administrator role. This is a very dangerous assumption.

It’s as if we developed on Linux without ever wanting to learn how the operating system works, relying solely on the System Administrator’s knowledge. If you develop enterprise applications, you have no escape but to learn how a database works. Reading the excellent “SQL Performance Explained” book made me realize how little I knew about the inner workings of relational database systems. This book is meant for developers, and it’s a must-read for every enterprise developer.

What can we do to improve the situation for our industry? Is there a chance for a tighter integration of JPA and SQL? Or specifically, of Hibernate and jOOQ?

First, it’s the mindset that needs to change. We need to acknowledge that there’s no such thing as a one-size-fits-all framework, and that applies to database access as well. When I write unit tests, I don’t limit myself to JUnit. I also use Mockito and Hamcrest; a testing stack is a better alternative.

JPA excels when writing data, because the INSERT/UPDATE statements are automatically adjusted whenever the persistence model changes. The implicit and explicit locking allow us to protect against lost updates, especially in long conversation workflows.

But while abstracting the SQL write statements is a doable task, when it comes to reading data, nothing can beat native SQL. The most commonly used RDBMS have implemented non-standard data access techniques (window functions, Common Table Expressions, PIVOT), and the SQL-92 JPA abstraction layer can only focus on common functionality. That’s why native querying is unavoidable on almost any enterprise system.
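For example, a ranking query using a window function, which has no JPQL equivalent, is a one-liner in jOOQ (a sketch, assuming the generated ACTOR table from earlier in this post):

// Rank actors by last name, partitioned by first name - not expressible in JPQL:
ctx.select(
       ACTOR.FIRST_NAME,
       ACTOR.LAST_NAME,
       rank().over()
             .partitionBy(ACTOR.FIRST_NAME)
             .orderBy(ACTOR.LAST_NAME).as("rnk"))
   .from(ACTOR)
   .fetch();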

jOOQ has done a very good job promoting SQL knowledge into the Java ecosystem. Java is ruling the enterprise software development and SQL skills have always been the Achilles heel of most enterprise development teams.

While you can fire native queries from JPA, there’s no support for dynamic native query building. jOOQ allows you to build type-safe dynamic native queries, strengthening your application against SQL injection attacks. jOOQ can be integrated with JPA, as I have already proven on my blog, and the JPA-jOOQ combo can provide a solid data access stack.

Tell us a little bit about your Hibernate Master Class, and your personal blogging strategy.

The Hibernate Master Class blog series is actually a book in the making. Because I work a full-time job, it’s difficult to commit to a fixed writing schedule, so I can only write as much as my spare time allows.

Once all topics are covered, I’ll turn all this info into a book that I’m going to self-publish, following the “SQL Performance Explained” example.

[ Edit ] The book has been finished and is available here:

https://leanpub.com/high-performance-java-persistence

Where will you be in 5 years?

I enjoy software architecture, as well as writing about it. I will continue on this journey and see where the wind will carry me.

jOOQ vs. Hibernate: When to Choose Which

Hibernate has become a de-facto standard in the Java ecosystem, and after the fact, also an actual JavaEE standard implementation, if standards matter to you, and if you put the JCP on the same level as ISO, ANSI, IEEE, etc.

This article is not intended to discuss standards, but visions. Hibernate shares JPA’s vision of ORM. jOOQ shares SQL’s vision of powerful querying, so for the sake of the argument, let’s use Hibernate / JPA / ORM interchangeably, much like jOOQ / JDBC / SQL.

The question why anyone should not use Hibernate these days shows up frequently – precisely because Hibernate is a de-facto standard, and the first framework choice in many other frameworks such as Grails (which uses GORM, which again uses Hibernate).

However, even Gavin King, the creator of Hibernate, doesn’t believe that Hibernate should be used for everything:

[Screenshot of Gavin King’s comment]

If that’s the case, are there any objective decision helping points that you could consider, when to use an ORM and when to use SQL?

Discussing on a high level

First off, let’s bring this discussion to a higher level. Instead of deciding between Hibernate and jOOQ as concrete implementations of their own domains, let’s think about ORM vs. SQL, and their different use-cases.

When deciding between an ORM (e.g. Hibernate) and SQL (e.g. jOOQ), the driving question that you should ask yourself is not the question of project complexity. Some of our most demanding customers are using jOOQ on medium-sized schemas with thousands of tables / views. Often, those schemas are extremely normalised and sometimes even deployed on as many as six different RDBMS. jOOQ was specifically designed to work in these scenarios, while keeping the simple use-case in mind as well.

So, instead of thinking about project complexity, ask yourself the following questions:

  • 1. Will your data model drive your application design, or will your application design drive your data model(s)?

    A main aspect here is the question whether you “care” about your database in the sense of whether it might survive your application. Very often, applications come and go. They may be re-written in Python / JavaScript, etc. 5 years down the line. Or you have multiple applications accessing the same database: Your Java application, some Perl scripts, stored procedures, etc. If this is the case, database design is a priority in your project, and jOOQ works extremely well in these setups.

    If you don’t necessarily “care” about your database in the sense that you just want to “persist” your Java domain somewhere, and this happens to be a relational database, then Hibernate might be a better choice – at least in early stages of your project, because you can easily generate your database schema from your Entity model.

  • 2. Will you do mostly complex reading and simple writing, or will you engage in complex writing?

    SQL really shines when reading is complex. When you join many tables, when you aggregate data in your database, when you do reporting, when you do bulk reading and writing. You think of your data in terms of set theory, e.g. your data as a whole. Writing CRUD with SQL is boring, though. This is why jOOQ also provides you with an ActiveRecord-style API that handles the boring parts, when you’re operating on single tables.

    If, however, your writing becomes complex, i.e. you have to load a complex object graph with 20 entities involved into memory, perform optimistic locking on it, modify it in many different ways and then persist it again in one go, then SQL / jOOQ will not help you. This is what Hibernate has originally been created for.

Opinion

I believe that data is forever. You should *always* assume that your database survives your application. It is much easier to rewrite (parts of) an application than to migrate a database. Having a clean and well-designed database schema will always pay off down the line of a project, specifically of a complex project. See also our previous article about the fallacy of “schemaless” databases.

Also, most projects really do 90% reading and 10% writing, writing often not being complex (2-3 tables modified within a transaction). This means that most of the time, the complexity solved by Hibernate / JPA’s first and second level caches is not needed. People often misunderstand these features and simply turn off caching, flushing Hibernate’s cache to the server all the time, and thus using Hibernate in the wrong way.

If, however, you’re undecided about the above two axes of decision, you can go the middle way and use jOOQ only for reporting, batch processing, etc. and use Hibernate for your CRUD – in a CQRS (Command Query Responsibility Segregation: http://martinfowler.com/bliki/CQRS.html) style. There are also quite a few jOOQ users who have chosen this path.
