Turn Around. Don’t Use JPA’s loadgraph and fetchgraph Hints. Use SQL Instead.


Thorben Janssen (also known from our jOOQ Tuesdays series) recently published an interesting wrap-up of what’s possible with Hibernate / JPA query hints. The full article can be seen here:
http://www.thoughts-on-java.org/11-jpa-hibernate-query-hints-every-developer-know

Some JPA hints aren’t really hints, they’re really full-blown query specifications, just like JPQL queries, or SQL queries. They tell JPA how to fetch your entities. Let’s look at javax.persistence.loadgraph and javax.persistence.fetchgraph.

The example given in Oracle’s Java EE 7 tutorial is this:

You have a default entity graph, which is hard-wired to your entity class using annotations (or XML in the old days):

@Entity
public class EmailMessage implements Serializable {
    @Id
    String messageId;
    @Basic(fetch=EAGER)
    String subject;
    String body;
    @Basic(fetch=EAGER)
    String sender;
    @OneToMany(mappedBy="message", fetch=LAZY)
    Set<EmailAttachment> attachments;
    ...
}

Notice how the above entity graph mixes formal graph meta information (@Entity, @Id, @OneToMany, …) with query default information (fetch=EAGER, fetch=LAZY).

EAGER or LAZY?

Now, the problem with the above is that these defaults are hard-wired and cannot be changed for ad-hoc usage (thank you annotations). Remember, SQL is an ad-hoc query language with all of benefits that derive from this ad-hoc-ness. You can materialise new result sets whose type was not previously known on the fly. Excellent tool for reporting, but also for ordinary data processing, because it is so easy to change a SQL query if you have new requirements, and if you’re using languages like PL/SQL or libraries like jOOQ, you can even do that in a type safe, precompiled way.

Unlike in JPA, whose annotations are not “ad-hoc”, just like SQL’s DDL is not “ad-hoc”. Can you ever switch from EAGER to LAZY? Or from LAZY to EAGER? Without breaking half of your application? Truth is: You don’t know!

The problem is: choosing EAGER will prematurely materialise your entire entity graph (even if you needed only an E-Mail message’s subject and body), resulting in too much database traffic (see also “EAGER fetching is a code smell” by Vlad Mihalcea). Choosing LAZY will result in N+1 problems in case you really do need to materialise the relationship, because for each parent (“1”), you have to individually fetch each child (“N”) lazily, later on.

Do SQL people suffer from N+1?

As a SQL person, this sounds ridiculous to me. Imagine specifying in your foreign key constraint whether you always want to auto-fetch your relationship…

ALTER TABLE email_attachment
ADD CONSTRAINT fk_email_attachment_email
FOREIGN KEY (message_id)
REFERENCES email_message(message_id)
WITH FETCH OPTION LAZY -- meh...

Of course you don’t do that. The point of normalising your schema is to have the data sit there without duplicating it. That’s it. It is the query language’s responsibility to help you decide whether you want to materialise the relationship or not. For instance, trivially:

-- Materialise the relationship
SELECT *
FROM email_message m
JOIN email_attachment a
USING (message_id)

-- Don't materialise the relationship
SELECT *
FROM email_message m

Duh, right?

Are JOINs really that hard to type?

Now, obviously, typing all these joins all the time can be tedious, and that’s where JPA seems to offer help. Unfortunately, it doesn’t help, because otherwise, we wouldn’t have tons of performance problems due to the eternal EAGER vs LAZY discussion. It is a GOOD THING to think about your individual joins every time because if you don’t, you will structurally neglect your performance (as if SQL performance wasn’t hard enough already) and you’ll notice this only in production, because on your developer machine, you don’t have the problem. Why?

Works on my machine ಠ_ಠ

One way of solving this with JPA is to use the JOIN FETCH syntax in JPQL (which is essentially the same thing as what you would be doing in SQL, so you don’t win anything over SQL except for automatic mapping. See also this example where the query is run with jOOQ and the mapping is done with JPA).

Another way of solving this with JPA is to use these javax.persistence.fetchgraph or javax.persistence.loadgraph hints, but that’s even worse. Check out the code that is needed in Oracle’s Java EE 7 tutorial just to indicate that you want this and that column / attribute from a given entity:

EntityGraph<EmailMessage> eg = em.createEntityGraph(EmailMessage.class);
eg.addAttributeNodes("body");
...
Properties props = new Properties();
props.put("javax.persistence.fetchgraph", eg);
EmailMessage message = em.find(EmailMessage.class, id, props);

With this graph, you can now indicate to your JPA implementation that in fact you don’t really want to get just a single E-Mail message, you also want all the specified JOINs to be materialised (interestingly, the example doesn’t do that, though).

You can pass this graph specification also to a JPA Query that does a bit more complex stuff than just fetching a single tuple by ID – but then, why not just use JPA’s query language to express that explicitly? Why use a hint?

Let me ask you again, why not just specify a sophisticated query language? Let’s call that language… SQL? The above example is solved trivially as such:

SELECT body
FROM email_message
WHERE message_id = :id

That’s not too much typing, is it? You know exactly what’s going on, everyone can read this, it’s simple to debug, and you don’t need to wrestle with first and second level caches because all the caching that is really needed is buffer caching in your database (i.e. caching frequent data in database memory to prevent excessive I/O).

The cognitive overhead of getting everything right and tuning stuff in JPA is so big compared to writing just simple SQL statements (and don’t forget, you may know why you put that hint, but your coworker may so easily overlook it!), let me ask you: Are you very sure you actually profit from JPA (you really need entity graph persistence, including caching)? Or are you wrestling the above just because JPA is your default choice?

When JPA is good

JPA (and its implementations) is excellent when you have the object graph persistence problem. This means: When you do need to load a big graph, modify it in your client application, possibly in a distributed and cached and long-conversational manner, and then store the whole graph back into the database without having to wrestle with locking, caching, lost updates, and all sorts of other problems, then JPA does help you. A lot. You don’t want to do that with SQL.

Do note that the key aspect here is storing the graph back into the database. 80% of JPA’s value is in writing stuff, not reading stuff.

But frankly, you probably don’t have this problem. You’re doing mostly simple CRUD and probably complex querying. SQL is the best language for that. And Java 8 functional programming idioms help you do the mapping, easily.

Conclusion

Don’t use loadgraph and fetchgraph hints. Chances are very low that you’re really on a good track. Chances are very high that migrating off to SQL will greatly simplify your application.

8 thoughts on “Turn Around. Don’t Use JPA’s loadgraph and fetchgraph Hints. Use SQL Instead.

  1. Let me ask you again, why not just specify a sophisticated query language?

    Why to bother the user? Nearly always, the same query gets repeated over and over and nearly always, the same fields get used. You usually don’t really care about the speed of the first execution, what matters is the sustained speed. So execute the first query lazily or eagerly at will, but remember what fields are needed and fetch them the next time. Do it for joins, too, so if

    Song song = queryFactory().get(Song.class, songId);
    for (Artist a : song.album().artists()) result.add(a.surname());
    

    gets executed the first time, use dynamic proxies and let them keep track of all the accesses. The next time, do all the joins needed.

    Things are a bit more complicated with query reuse, but the is solvable adding the context (e.g., the path info in case of a HTTP server) in order to allow different plans. Sometimes fields get used conditionally, but then either the common superset is small enough or the condition should be made to a part of the context.

    It’s surely non-trivial, but not much when compared to JPA. What do you think?

    • Hmm, JIT for SQL backends, then, right? I think we’re very far away from that – if it’s even possible. But I like the idea. You might like to delve into JINQ, perhaps? http://www.jinq.org. It was covered here on this blog before:
      https://blog.jooq.org/2014/03/21/java-8-friday-java-8-will-revolutionize-database-access

      The interesting thing in this context is that Ming (the author) first implemented JINQ to translate imperative code structures to SQL, in a similar way as you proposed. Obviously, translating the Java 8 Stream is more natural, though. And suddenly, you’re back with a query language (Java 8 Streams), albeit with a very trivial one.

      The SQL query language is so powerful, once you do engage in using it to the max, you will not look back to the manual implementation of algorithms.

      Of course, each of us expresses their opinion with the background / context they’re in. In my contracting work, I’m helping banks write performant, yet complex online analytics over billions of rows. I can’t imagine this to be done with any general-purpose, non-query language, not even with Java 8 streams.

      • Hmm, JIT for SQL backends, then, right?

        I was thinking mostly about ORM. Looking at the monstrosity called EntityGraph, I concluded that it’s hard to specify what columns are needed in a concise way. So let’s pretend we always fetch everything.

        I think we’re very far away from that – if it’s even possible.

        Really? Then I must have been either unclear or missing some big problem. I believe that Hibernate already does the harder part: mapping to objects and proxying. All what’s missing is

        – using the proxies to keep track what columns are needed
        – using them to fetch transparently what’s missing (*)
        – using the tracked information for the the future SQL generation

        This sounds like something doable in a few weeks. In the most trivial form, it’d decide between lazy and eager fetching based on whether the data were actually used when the same query was executed before.

        (*) If my idea works, there should be hardy ever be anything missing. The
        1+N problem should disappear, except for the first execution.

        Of course, each of us expresses their opinion with the background / context they’re in.

        Sure. I guess most people deal with smaller data sets than you. For such people, code performance is less important than coding performance and my idea could bring more performance to ORM without bothering them too much. Sure, many claim that ORM belongs to the problem set rather than to the solution set, but many disagree.

        I’ll surely look into JINQ, thanks for the link.

        The SQL query language is so powerful, once you do engage in using it to the max, you will not look back to the manual implementation of algorithms.

        Maybe with jOOQ, I could start liking it, but reading SQL makes me feel dizzy. Algorithms are rarely a problem in a typical application; if I could ignore persistency, everything would get much simpler.

        • (btw, it’s called <blockquote> 🙂 )

          “Doable in a few weeks” – Everytime I hear this, I’m thinking of Dilbert. Good luck! 🙂 It would’ve been done if it were that simple (including all possible caveats), no? But it’s interesting. Why don’t you suggest it to the Hibernate team? I’m curious about their answer.

          if I could ignore persistency, everything would get much simpler.

          The big question is: What part of “persistency” would you like to ignore? Persistent state? Or a language that gives you access to sophisticated predicate logic? I don’t see what’s objectively wrong with the latter, apart from a matter of taste for syntax / programming paradigms. And the great thing is: With a sophisticated predicate logic language (let’s call it SQL), algorithms are of your least concern. The database will figure them out for you.

          • It would’ve been done if it were that simple (including all possible caveats), no?

            No.

            If everyone followed this line of thought, we’d never get anything new. Btw., did you read this before you started jOOQ? 😀

            Why don’t you suggest it to the Hibernate team?

            Because the idea came just yesterday. I guess, I will.

            What part of “persistency” would you like to ignore? Persistent state?

            Yes, this is something, a framework is needed for (bridging to SQL, NoSQL, whatever). If I could make the customer believe, the data in RAM would last forever, I could save a lot of time.

            Or a language that gives you access to sophisticated predicate logic?

            A simple project could live without it. I guess, that I could replace all SQL an average project needs by some Java without much effort: As all data of an average project can be kept in memory, all you need is a bunch of maps simulating the indexes. Keeping them all in sync is non-trivial, but doable. Multithreading would be hard, but probably unnecessary (with all the data in memory, things can get damn fast: http://martinfowler.com/articles/lmax.html). Atomicity would be deal with “persistent” (in this sense: http://blog.higher-order.net/2009/09/08/understanding-clojures-persistenthashmap-deftwice.html) data structures (which could possibly support multithreading, too).

            Sure, there’re drawbacks: 1. Without ad-hoc queries, you could hardly inspect the data. 2. The project might grow too big for this.

            apart from a matter of taste for syntax / programming paradigms

            Yes, the taste. And also a mismatch to other parts of the code, although I agree that nearly all data manipulation could be done in SQL.

          • Btw., did you read this before you started jOOQ?😀

            Touché 😛

            If I could make the customer believe, the data in RAM would last forever, I could save a lot of time.

            That is a realistic scenario since the last couple of years, now that RAM has become so incredibly cheap – and people start working on “persistent” (i.e. “fail-safe”) RAM. But you will not get rid of the fact that you have state. And state is always difficult, no matter how you “serialise” it.

            But cheap RAM also benefits RDBMS. The best example can be seen here: http://stackexchange.com/performance. All of Stack Overflow runs on a single SQL Server instance (plus caching, of course). The whole tool chain is very useful for mid-size to large projects. You couldn’t possibly run operations without ad-hoc SQL.

            In my opinion, all these RAM arguments are not arguments against SQL. YAGNI might be reasonable in a lot of projects, though. But I personally believe that YAGNI (with respect to SQL) really applies only to very trivial projects.

            Yes, the taste. And also a mismatch to other parts of the code, although I agree that nearly all data manipulation could be done in SQL.

            But where’s the mismatch? Syntax? JDBC?

          • Fail-Safe RAM – sound cool!

            In my opinion, all these RAM arguments are not arguments against SQL.

            Agreed, but I’m not really arguing againt it. It’s just that it seems to open alternatives. But I might be just reinventing some in-memory object database.

            But where’s the mismatch? Syntax? JDBC?

            I guess, the biggest problem is just embedding another language, so that’s not SQL’s fault.

            I really hate the syntax, but that’s just subjective. And pure JDBC is a pure pain. It could look like

            .execute("UPDATE Employees SET age = ? WHERE id = ?", age, id);
            

            and do the binding and cache the PreparedStatement. For more complicated queries,

            .start("UPDATE Employees SET age =", age).add("WHERE id =", id).execute();
            

            would be less error-prone (even less than named parameters). But there are still field names as strings there, which is especially bad in a SELECT, which gets pretty redundant (and using “SELECT *” is bad for performance and often inapplicable).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s