How Adding a UNIQUE Constraint on a OneToOne Relationship Helps Performance


A lot of people use SQL constraints mainly to enforce data integrity, and that’s already a very good thing. A UNIQUE constraint, for instance, makes sure that any given value (or tuple, in the case of a composite constraint) appears at most once in a table. For instance:

CREATE TABLE x (
  a NUMBER(10),
  UNIQUE (a)
);

-- This works:
INSERT INTO x VALUES (1);

-- This fails:
INSERT INTO x VALUES (1);
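
The same applies to composite UNIQUE constraints on tuples of columns. A quick sketch (table and values are hypothetical):

CREATE TABLE t (
  a NUMBER(10),
  b NUMBER(10),
  UNIQUE (a, b)
);

-- These work, as the (a, b) tuples are distinct:
INSERT INTO t VALUES (1, 1);
INSERT INTO t VALUES (1, 2);

-- This fails:
INSERT INTO t VALUES (1, 1);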

Constraints are also good for (SELECT) performance

One thing that people often do not think about, though, is the fact that constraints can also be used as very valuable meta information by the database optimiser to make a better decision when finding the optimal execution plan. It makes a big difference for an optimiser…

  • To be unaware of how many different values there are in any given column (worst case)
  • To be able to estimate the number of distinct values in any given column (statistics are present)
  • To know that each value can appear at most once

In the last case, a UNIQUE (or PRIMARY KEY) constraint tells the optimiser exactly that. Once the optimiser knows that a certain operation returns at most one row (and not 10, 100, or even 10M rows), a nested loop suddenly becomes extremely cheap, as it can abort as soon as one row has been found.

A Java analogy

Compare this with the following Java code:

// This is much "harder" ...
List<Object> objects = unknownSize();
for (Object object : objects) {
    doSomethingWith(object);
}

// ... than this, where we "know" the list
// only contains one value
for (Object object : Collections.singletonList("abc")) {
    doSomethingWith(object);
}

If Java had such optimisation capabilities (bonus question: can the JIT do it?), then the second loop could be optimised as follows:

// The loop can be avoided:
doSomethingWith("abc");

Let’s look at Oracle

Let’s look at a concrete example with Oracle:

CREATE TABLE x1 (
  a NUMBER(10) NOT NULL,
  data VARCHAR2(100),
  CONSTRAINT pk_x1 PRIMARY KEY (a)
);

CREATE TABLE y1 (
  a NUMBER(10) NOT NULL, 
  optional_data VARCHAR2(100),
  CONSTRAINT fk_y1 FOREIGN KEY (a) REFERENCES x1(a)
);

CREATE INDEX i1 ON y1(a);
CREATE INDEX i1data ON x1(data);

INSERT INTO x1
SELECT level, dbms_random.string('a', 100)
FROM dual
CONNECT BY level <= 10000;

INSERT INTO y1
SELECT level, dbms_random.string('a', 100)
FROM dual
CONNECT BY level <= 5000;

This is a typical one-to-many relationship between x1 and y1. With the constraints in place, there can be between 0 and N rows in y1 for each row in x1. As users, we know that there is at most one row in y1 for any given row in x1, but we don’t enforce this knowledge with a constraint.
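
For the record, we could verify this knowledge ourselves with a quick (hypothetical) sanity check, even though the optimiser cannot rely on it:

-- Returns no rows, as no value of a appears more than once in y1:
SELECT a, count(*)
FROM y1
GROUP BY a
HAVING count(*) > 1;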

Let’s look at the following query:

SELECT count(*)
FROM x1
JOIN y1 USING (a)
WHERE data LIKE 'a%';

What we’re doing here is counting all the values in x1 whose data starts with the letter ‘a’, and for which we also have any optional_data in y1. The execution plan is:

-----------------------------------------------------
| Operation                | Name   | Rows  | Cost  |
-----------------------------------------------------
| SELECT STATEMENT         |        |       |    29 |
|  SORT AGGREGATE          |        |     1 |       |
|   HASH JOIN              |        |   176 |    29 |
|    VIEW                  |        |   176 |    24 |
|     HASH JOIN            |        |       |       |
|      INDEX RANGE SCAN    | I1DATA |   176 |     4 |
|      INDEX FAST FULL SCAN| PK_X1  |   176 |    24 |
|    INDEX FAST FULL SCAN  | I1     |  5000 |     5 |
-----------------------------------------------------

As you can see, Oracle chooses to run a hash join operation, which means that all the values from x1 starting with ‘a’ are fetched (around 176), and joined in a hashmap with the entire set of values in y1, fetched from the index i1 (5000 values).

How does this compare with using a UNIQUE constraint?

We’ll create almost the exact same schema as follows:

CREATE TABLE x2 (
  a NUMBER(10) NOT NULL,
  data VARCHAR2(100),
  CONSTRAINT pk_x2 PRIMARY KEY (a)
);

CREATE TABLE y2 (
  a NUMBER(10) NOT NULL, 
  optional_data VARCHAR2(100),
  CONSTRAINT uk_y2 UNIQUE (a),
  CONSTRAINT fk_y2 FOREIGN KEY (a) REFERENCES x2(a)
);

CREATE INDEX i2data ON x2(data);

INSERT INTO x2
SELECT * FROM x1;

INSERT INTO y2
SELECT * FROM y1;

BEGIN
  dbms_stats.gather_table_stats('TEST', 'X2');
  dbms_stats.gather_table_stats('TEST', 'Y2');
END;
/

The data is exactly the same, but now we enforce a UNIQUE constraint on y2’s foreign key, making this effectively a one-to-one relationship. Check out what happens when we run the exact same query…

SELECT count(*)
FROM x2
JOIN y2 USING (a)
WHERE data LIKE 'a%';

Its execution plan is now:

-----------------------------------------------------
| Operation                | Name   | Rows  | Cost  |
-----------------------------------------------------
| SELECT STATEMENT         |        |       |    25 |
|  SORT AGGREGATE          |        |     1 |       |
|   NESTED LOOPS           |        |   176 |    25 |
|    VIEW                  |        |   176 |    25 |
|     HASH JOIN            |        |       |       |
|      INDEX RANGE SCAN    | I2DATA |   176 |     5 |
|      INDEX FAST FULL SCAN| PK_X2  |   176 |    24 |
|    INDEX UNIQUE SCAN     | UK_Y2  |     1 |     0 |
-----------------------------------------------------

As you can see, the overall cost has decreased from 29 to 25 as we’re no longer using a hash join, but a nested loop join operation, which is probably faster if our statistics are not way off, as we only have to look up the single value in y2 corresponding to x2 for each of x2’s estimated 176 rows that start with the letter ‘a’.

But let’s not trust the execution plan, let’s benchmark things (as discussed in a previous article). Run the following code in SQL Developer, for instance:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 1000;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM x1
      JOIN y1 USING (a)
      WHERE data LIKE 'a%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Without UNIQUE constraint: '
    || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM x2
      JOIN y2 USING (a)
      WHERE data LIKE 'a%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('With    UNIQUE constraint: '
    || (SYSTIMESTAMP - v_ts));
END;
/

The above benchmark repeats each individual query 1000 times. The results speak for themselves:

Without UNIQUE constraint: +000000000 00:00:04.250000000
With    UNIQUE constraint: +000000000 00:00:02.847000000

Remark

The lack of a UNIQUE constraint may arise in situations where you prefer using a surrogate primary key in the referencing table (which I omitted in the examples, for brevity). If you’re “sharing” the primary key or using natural keys, this issue won’t affect you, of course.

Conclusion

The execution planner can make a more informed decision if it has formal knowledge about your data via an additional UNIQUE constraint. This formal knowledge is far more powerful than any statistics that might indicate the same thing. In the absence of a formal UNIQUE constraint, the database will always have to make sure there is not another row once it has found one. With a formal UNIQUE constraint, it can stop looking as soon as that unique row has been found. This can drastically speed up queries. As we’ve seen in the above example, this improves things by a factor of 1.5, i.e. the second query is 50% faster!

Always tell your database as much as you can. Your SELECT performance will greatly increase, at the small cost of a little overhead when inserting data.

Quantified Comparison Predicates – Some of SQL’s Rarest Species


A recent Tweet by Aaron Bertrand (whom you’ve certainly encountered on Stack Overflow) has triggered my interest.

Indeed, few people I’ve met, or who’ve visited my SQL masterclass, have heard about the ANY and ALL quantifiers that you can use in SQL, let alone used them on a regular basis (or ever used them at all).

What are these things?

The formal definition is rather simple. Let

R <comp op> <quantifier> S

… be a quantified comparison predicate where

  • R is a row
  • <comp op> is a comparison operator (like =, !=, etc.)
  • <quantifier> is ANY or ALL
  • S is a subquery

You can now write things like:

SELECT *
FROM author AS a
WHERE a.author_id = ANY (
  SELECT b.author_id
  FROM book AS b
)

The above reads relatively intuitively (depending on your understanding of “intuitive”):

I want all authors whose author_id is equal to “ANY” of the author_id values contained in book

In fact, the SQL standard defines the IN predicate as being just syntactic sugar for the = ANY() quantified comparison predicate:

8.4 <in predicate>

Let RVC be the <row value predicand> and 
let IPV be the <in predicate value>.

...

The expression

  RVC IN IPV

is equivalent to

  RVC = ANY IPV
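
So, the earlier query could equivalently have been written with IN:

SELECT *
FROM author AS a
WHERE a.author_id IN (
  SELECT b.author_id
  FROM book AS b
)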

OK, so why even use these weird operators?

The interesting case for these operators is not where the comparison operator is an equality (“=”) or non-equality (“!=” or “<>”). The interesting case is where we’re comparing things with less-than or greater-than operators, such as:

SELECT *
FROM author AS a
WHERE a.age > ALL (
  SELECT c.age
  FROM customer AS c
)

Or, in plain English:

I want all authors whose age is greater than the age of all customers

That’s neat, isn’t it? Of course, there are (at least) two alternative ways of tackling the same query, namely:

Comparing only with the MAX()

SELECT *
FROM author AS a
WHERE a.age > (
  SELECT MAX(c.age)
  FROM customer AS c
)

This solution compares the author’s age only with the maximum customer age, which obviously yields the same result. Once we’ve found an author older than the oldest customer, then that author is older than ALL the customers.

Instead of using MAX(), we can use the following, semantically equivalent query:

Comparing only with the “first”

SELECT *
FROM author AS a
WHERE a.age > (
  SELECT c.age
  FROM customer AS c
  ORDER BY c.age DESC
  LIMIT 1
)

I’m using PostgreSQL syntax (LIMIT 1), although most databases have some way of returning only one row from a query (see the jOOQ manual for details). This is the same as using MAX(). Some databases recognise the fact that we’re only returning a single row, and they won’t sort the entire result set first in O(N log N), but yield the maximum value only in O(N), and only once for all authors, instead of once per author. Whether this optimisation is done in your database, I can’t say. You should measure this yourself, and check out execution plans.
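
For instance, in PostgreSQL, you could inspect the plan of either variant simply by prepending EXPLAIN to the query:

EXPLAIN
SELECT *
FROM author AS a
WHERE a.age > (
  SELECT c.age
  FROM customer AS c
  ORDER BY c.age DESC
  LIMIT 1
)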

So, again, why use quantifiers?

Aaron, in the Twitter conversation, would advise against using this “unintuitive” syntax.

Perhaps. Ultimately, you should always choose performance first, and then – most certainly – intuitiveness second (because some poor soul might need to maintain your query). But personally, I find these quantifiers quite elegant for three reasons:

  1. They express the quantification right where it belongs: with the comparison operator. Compare this with the solution using LIMIT, which may be far away, visually, from the greater-than operator. Quantifiers are much more concise, even more so than MAX() (in my opinion)
  2. They’re very set oriented. I like thinking in terms of sets when I work with SQL. Whenever I can omit the ORDER BY clause, I will. If only to avoid potentially slow operations (in case the database doesn’t optimise this, and a full O(N log N) sort operation is invoked)
  3. Quantified comparison predicates work on rows too, not just on single values.

Check this out:

SELECT (c1.last_name, c1.first_name) >= ALL (
  SELECT c2.last_name, c2.first_name
  FROM customer AS c2
)
FROM customer AS c1
WHERE c1.id = 1

The above query will yield a single column containing a “TRUE” or “FALSE” value (e.g. in PostgreSQL, which supports this exact syntax). The idea is that we’re running a query for customer with id = 1 and we want to see if that customer’s (last_name, first_name) tuple is “after” all the other customers’. Or in plain English:

Am I at the end of the phone book?

Again, you could do this with LIMIT:

SELECT (c1.last_name, c1.first_name) >= (
  SELECT c2.last_name, c2.first_name
  FROM customer AS c2
  ORDER BY c2.last_name DESC, c2.first_name DESC
  LIMIT 1
)
FROM customer AS c1
WHERE c1.id = 1

… but I find this much less elegant.

Unfortunately, this time there’s no way to solve this problem with MAX(). No database (that I’m aware of) supports using row value expressions (tuples) for aggregate functions like MAX(). This would be cool:

SELECT (c1.last_name, c1.first_name) >= (
  SELECT MAX((c2.last_name, c2.first_name))
  FROM customer AS c2
)
FROM customer AS c1
WHERE c1.id = 1

Or this:

SELECT (c1.last_name, c1.first_name) >= (
  SELECT MAX(ROW(c2.last_name, c2.first_name))
  FROM customer AS c2
)
FROM customer AS c1
WHERE c1.id = 1

Of course, the above is just a constructed example. Yes, you could also use window functions instead (e.g. LEAD()).
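For instance, a sketch using LEAD() in PostgreSQL syntax: the customer who is “at the end of the phone book” is the one whose row has no successor in the (last_name, first_name) ordering:

SELECT at_the_end
FROM (
  SELECT
    c1.id,
    LEAD(c1.id) OVER (
      ORDER BY c1.last_name, c1.first_name
    ) IS NULL AS at_the_end
  FROM customer AS c1
) t
WHERE id = 1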

Conclusion

Quantified comparison predicates are very rarely seen in the wild. One reason is that few people know about them, or think of them when solving a problem. Another reason might be that they’re not well optimised in some databases, so it may actually be a bad idea to use them.

Nonetheless, I find them to be very interesting SQL trivia to know about. And who knows, perhaps one day you do run into a problem that is indeed best solved with quantified comparison predicates. Then you’ll shine with the optimal solution.

“What Java ORM do You Prefer, and Why?” – SQL of Course!


Catchy headline, yes. But check out this Stack Overflow question by user Mike:

(I’m duplicating it here on the blog, as it might be deleted soon)

It’s a pretty open ended question. I’ll be starting out a new project and am looking at different ORMs to integrate with database access.

Do you have any favorites? Are there any you would advise staying clear of?

And the top voted answer (164 points, by David Crawshaw) is: “Just use SQL”:

I have stopped using ORMs.

The reason is not any great flaw in the concept. Hibernate works well. Instead, I have found that queries have low overhead and I can fit lots of complex logic into large SQL queries, and shift a lot of my processing into the database.

So consider just using the JDBC package.

The second answer (66 points by user simon) is, again: “Just use SQL”:

None, because having an ORM takes too much control away with small benefits. The time savings gained are easily blown away when you have to debug abnormalities resulting from the use of the ORM. Furthermore, ORMs discourage developers from learning SQL and how relational databases work and using this for their benefit.

The third answer (51 points by myself) is saying, once more: “Use SQL” (and use it with jOOQ).


Only the fourth answer (46 points, by Abdullah Jibaly) mentions Hibernate, the most popular ORM in the Java ecosystem.

The truth is, as we’ve shown numerous times on this blog: Hibernate/JPA/ORMs are good tools to get rid of boring (and complex) CRUD. But that’s just boilerplate logic with little value to your business logic. The interesting stuff – the queries, the batch and bulk processing, the analytics, the reporting – is all best done with SQL.

Stay tuned as we’re entering an era of programming where object orientation fades, and functional / declarative programming makes data processing extremely easy and lean again.

Say NO to Venn Diagrams When Explaining JOINs


In recent times, there have been a couple of tremendously popular blog posts explaining JOINs using Venn Diagrams. After all, relational algebra and SQL are set oriented theories and languages, so it only makes sense to illustrate set operations like JOINs using Venn Diagrams. Right?

Google seems to say so:

[Screenshot: Google image search for “sql joins”, showing Venn Diagrams everywhere]

Everyone uses Venn Diagrams to explain JOINs. But that’s…

PLAIN WRONG!

Venn Diagrams are perfect to illustrate … actual set operations! SQL knows three of them:

  • UNION
  • INTERSECT
  • EXCEPT

And they can be explained as such:

[Venn Diagram: UNION]

[Venn Diagram: INTERSECT]

[Venn Diagram: EXCEPT]

(all of these slides are taken from our Data Geekery SQL Training, do check it out!)

Most of you use UNION occasionally. INTERSECT and EXCEPT are more exotic, but do come in handy every now and then.
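
For instance (hypothetical queries, assuming both tables have first_name and last_name columns):

-- All people who are authors or customers (or both):
SELECT first_name, last_name FROM author
UNION
SELECT first_name, last_name FROM customer;

-- All people who are both authors and customers:
SELECT first_name, last_name FROM author
INTERSECT
SELECT first_name, last_name FROM customer;

-- All authors who are not also customers:
SELECT first_name, last_name FROM author
EXCEPT
SELECT first_name, last_name FROM customer;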

The point here is: these set operations operate on sets of elements (tuples), which are all of the same type. As in the examples above, all elements are people with first and last names. This is also why INTERSECT and EXCEPT are more exotic, because they’re usually not very useful. JOIN is much more useful. For instance, you want to combine the set of actors with their corresponding set of films.

A JOIN is really a cartesian product (also cross product) with a filter. Here’s a nice illustration of a cartesian product:

[Illustration: cartesian product]

So, what’s a better way to illustrate JOIN operations?

JOIN diagrams! Let’s look at CROSS JOIN first, because all other JOIN types can be derived from CROSS JOIN:

[JOIN diagram: CROSS JOIN]

Remember: a cross join (in SQL, historically, also written with a comma-separated table list) just takes every item on the left side and combines it with every item on the right side. When you CROSS JOIN a table of 3 rows with a table of 4 rows, you will get 3x4=12 result rows. See, I’m using an “x” character to write the multiplication. I.e. a “cross”.
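
In SQL, a sketch of this (using the hypothetical actor and film tables from before):

-- If actor contains 3 rows and film contains 4 rows,
-- this produces 3 x 4 = 12 result rows:
SELECT *
FROM actor
CROSS JOIN film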

INNER JOIN

All other joins are still based on cross joins, but with additional filters, and perhaps unions. Here’s an explanation of each individual JOIN type.

[JOIN diagram: INNER JOIN]

In plain text, an INNER JOIN is a CROSS JOIN in which only those combinations are retained which fulfil a given predicate. For instance:

-- "Classic" ANSI JOIN syntax
SELECT *
FROM author a
JOIN book b ON a.author_id = b.author_id

-- "Nice" ANSI JOIN syntax
SELECT *
FROM author a
JOIN book b USING (author_id)

-- "Old" syntax using a "CROSS JOIN"
SELECT *
FROM author a, book b
WHERE a.author_id = b.author_id

OUTER JOIN

OUTER JOIN types help where we want to retain rows from the LEFT side, the RIGHT side, or both (FULL) sides, even if there was no matching row for which the predicate yielded true.

A LEFT OUTER JOIN in relational algebra is defined as such:

R ⟕ S = (R ⋈ S) ∪ ((R − π_R(R ⋈ S)) × {(NULL, …, NULL)})

Or more verbosely in SQL:

SELECT *
FROM author a
LEFT JOIN book b USING (author_id)

This will produce all the authors and their books, but if an author doesn’t have any book, we still want to get the author with NULL as their only book value. So, it’s the same as writing:

SELECT *
FROM author a
JOIN book b USING (author_id)

UNION

SELECT a.*, NULL, NULL, NULL, ..., NULL
FROM (
  SELECT a.*
  FROM author a
  
  EXCEPT
  
  SELECT a.*
  FROM author a
  JOIN book b USING (author_id)
) a

But no one wants to write that much SQL, so OUTER JOIN was implemented.
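
RIGHT OUTER JOIN works the same way, retaining the right side’s unmatched rows instead, and FULL OUTER JOIN retains the unmatched rows from both sides. For instance:

SELECT *
FROM author a
FULL JOIN book b USING (author_id)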

Conclusion: Say NO to Venn Diagrams

JOINs are relatively easy to understand intuitively, and they’re relatively easy to explain using Venn Diagrams. But whenever you do that, remember that you’re making a wrong analogy. A JOIN is not strictly a set operation that can be described with Venn Diagrams. A JOIN is always a cross product with a predicate, and possibly a UNION to add additional rows to the OUTER JOIN result.

So, if in doubt, please use JOIN diagrams rather than Venn Diagrams. They’re more accurate and visually more useful.

[Screenshot: the same Google image search, with the Venn Diagrams crossed out]

(Remember, all of these slides are taken from our Data Geekery SQL Training, do get in touch, if you’re interested)

How Functional Programming will (Finally) do Away With the GoF Patterns


A recent article about various ways to implement structural pattern matching in Java has triggered my interest:
http://blog.higher-order.com/blog/2009/08/21/structural-pattern-matching-in-java

The article mentions a Scala example where a tree data structure can be traversed very easily and neatly using Scala’s match keyword, along with using algebraic data types (more specifically, a sum type):

def depth(t: Tree): Int = t match {
  case Empty => 0
  case Leaf(n) => 1
  case Node(l, r) => 1 + max(depth(l), depth(r))
}

Even if you’re not used to the syntax, it is relatively easy to understand what it does:

  • There’s a function depth that calculates the (maximum) depth of a tree structure
  • It does so by checking if the input argument is empty, a leaf node, or any other node
  • If it is any other node, then it adds 1 to the maximum of the remaining tree, recursively

The elegant thing here is that the Scala type system helps the author of the above code get this right from a formal point of view, by offering formal type checking. The closest we can get in Java, as illustrated by the article, is this:

// Assuming a static import: import static java.lang.Math.max;
public static int depth(Tree t) {
  if (t instanceof Empty)
    return 0;
  if (t instanceof Leaf)
    return 1;
  if (t instanceof Node)
    return 1 + max(depth(((Node) t).left), depth(((Node) t).right));
  throw new RuntimeException("Inexhaustive pattern match on Tree.");
}

But these instanceof checks do smell kind of fishy…

For more details, read the full article here, highly recommended:
http://blog.higher-order.com/blog/2009/08/21/structural-pattern-matching-in-java

How does this compare to the GoF design patterns?

In our object-orientation-brainwashed Java ecosystem (which inherited the OO brainwash from C++), the above instanceof logic would most likely be refactored into an implementation using the visitor pattern from the GoF design patterns book. This refactoring would be done by The Team Architect™ themselves, as they are supervising the object oriented quality of your software. The 7 lines of code using instanceof would quickly bloat up to roughly 200 lines of weird interfaces, abstract classes, and cryptic accept() and visit() methods. When in fact, the functional programming approach was so much leaner, even in its imperfect Java instanceof form!

A lot of the GoF design patterns stem from a time when EVERYTHING needed to be an object. Object orientation was the new holy grail, and people even wanted to push objects down into databases. Object databases were invented (luckily, they’re all dead) and the SQL standard was enhanced with ORDBMS features (only really implemented in Oracle, PostgreSQL, and Informix, and maybe some other minor DBs), most of which – also luckily – were never widely adopted.

Since Java 8, finally, we’re starting to recover from the damage that was done in the early days of object orientation in the 90s, and we can move back to a more data-centric, functional, immutable programming model where data processing languages like SQL are appreciated rather than avoided, and Java will see more and more of these patterns, hopefully.

If you’re not convinced by the above visitor pattern vs pattern matching example, do read this very interesting series of articles by Mario Fusco.

You will see that with functional programming, many patterns lose their meaning as you’re just starting to pass around functions, making code very simple and easy to understand.

As a wrap-up, Mario also presented this content at Voxxed Days Ticino.

Happy functional programming!

With Commercial Licensing, Invest in Innovation, not Protection


When people start creating commercially licensed software (like we did, in 2013 with jOOQ), there is always the big looming question:

What do I do about piracy?

I’ve had numerous discussions with fellow entrepreneurs about this topic, and this fear is omnipresent. There has also been a recent discussion on reddit, titled “prevent sharing of a Java library”. I felt reminded of the early commercial jOOQ days, when I discussed the various options / modalities of the new commercial jOOQ license with the Data Geekery legal counsel – which was clearly the biggest financial investment in early company days.

Build licensing around trust, not fear

One thing that bothered me most about jOOQ’s dual license in its early days was that our paying customers would have fewer rights than our continued Open Source users. Obviously. If you’re using jOOQ with an Open Source database, you can use the jOOQ Open Source Edition for free under the terms of the very permissive Apache License 2.0. You can do pretty much anything with jOOQ, including forking it, rewriting it, creating and distributing a new jOOQ (just don’t name it jOOQ, we own the trademark). The only limitation is: it doesn’t work with commercial databases, but you don’t care about that if you’re using MySQL or PostgreSQL for the next 10 years.

If you’re using jOOQ with a commercial database, however, you need to purchase a jOOQ Professional Edition or jOOQ Enterprise Edition license. Of course, there are costs, but that’s not the problem, because jOOQ is awesome and delivers 50x the value it costs.

The problem is that:

  • Interested developers, architects, etc. now have to go through the hassle of convincing their employer’s legal / purchasing / compliance / … departments and do all sorts of paperwork.
  • Paying customers (at the beginning) could no longer patch jOOQ if they found a bug. They had to wait for us, the vendor, to deliver a fix.

These things were remedied rather quickly in the commercial license text. The commercial jOOQ license now grants access to the commercial source code and allows users to modify jOOQ themselves (of course, warranty is then disclaimed), without needing to wait for the vendor to deliver the fix. It is still not allowed to distribute the modifications’ source code, although we’re looking into possible legal solutions to allow that as well, such that commercial customers can share patches for commercial parts of jOOQ, too.

In other words: We want our commercial customers to feel as if jOOQ were Open Source for their everyday job.

Isn’t that crazy?

So, people get the entire source code and can build jOOQ. Aren’t we afraid that our commercial, “cracked” jOOQ distributions wind up on warez sites? Of course we are. And it happens. And we’re maintaining a list of companies that “obviously” don’t comply with our terms (e.g. they’ve been using the free trial in production for 2 years). But they’re only very few. And even if they weren’t few, should we introduce tracking logic in jOOQ to check if customers / trial users are compliant? Should we collect IP addresses? User E-Mails? Count workstations? Shut down jOOQ, if non compliant? Shut it down where, on production servers, perhaps?

The truth is, most companies are honest. We’ve had many customers frequently upgrade their contracts. E.g. every couple of months, they’ve purchased new licenses. We’ve had other customers reduce their contracts. The team started with 5 licenses and now consists only of 1 person doing maintenance work. We’ve had customers not touching their contracts, using jOOQ with e.g. 10 licenses for a long time. Some of these are overlicensed, yes. Some of these are underlicensed. It happens. It’s not good, but it’s also not the end of the world. We’re in constant touch with them to see if their license count is still up to date. In the end, we trust them. They trust us.

The worst thing we could do now is introduce some sort of license checker that might be buggy and accidentally shuts down our honest customers’ production system! For the slight chance that we might be catching someone who “cracked” our software (and who probably manages to “crack” also our license checker anyway).

Even if we’re not shutting down the software but logging messages somewhere, the end user (our customers’ customer) might get a very weird feeling when they see it. We would be indirectly damaging our customers’ reputation for what was probably just an oversight. Or, again, a bug in our license checker.

Is that really worth the trouble? No way!

(an exception is the free trial version, which must be deleted after one month)

Different types of software

The problem is: There are different types of software. Essentially, there are:

  • Tools
  • Libraries
  • Servers
  • SaaS

The difference is:

  • Tools run on premise but are independent of the end user application. The customer doesn’t depend on the tool.
    For instance: IntelliJ IDEA
  • Libraries run on premise and are embedded in the end user application. The customer completely depends on the library.
    For instance: jOOQ
  • Servers run on premise and are linked to the end user application but run independently. The customer can often replace the server, but cannot “open” it. It’s a black box.
    For instance: Oracle Database
  • SaaS runs in the cloud and is completely independent of the end user application. The customer cannot influence it in any way.
    For instance: Salesforce.com

As you can see, tools are “not critical” for the customer (being “just” UIs giving access to “the real system”), and while servers and SaaS are critical, all three of them run independently. It is easy for a vendor to implement license checkers inside of those because they’re in control of running the software. With libraries, it’s different. You cannot make any assumptions about how your library is “run”. Is it part of a single JVM? Or multiple JVMs? Run for a couple of seconds only, or for a long time? Used frequently or only very rarely? Is it allowed to spawn its own threads, collect its own data?

Too many open questions.

Long story short: Invest in innovation, not protection

Unless protection is really important in your area, because you’re using extremely complex algorithms that no one should know about, you shouldn’t worry about piracy too much from a technical perspective. In our case, jOOQ isn’t super secret. Anyone can build their own jOOQ, and there are many (much simpler) competitor frameworks. Our business is maintaining by far the best Java SQL DSL, and apparently, no one else wants to compete with us in this niche, so why be afraid of theft?

If protection is important, well then run a SaaS, because then you don’t need to ship any software. For instance: Google, one of the biggest SaaS vendors out there, doesn’t share its search engine algorithms with you.

Once you stop worrying about who is going to steal from you, you can start investing all that time in awesome new features and quality to make those loyal and honest customers happy who happily pay for your software. And who knows. Perhaps some of your “pirate customers” will eventually switch jobs and work for someone who takes compliance more seriously. They have been happy “customers” too, and will also recommend your software to their new peers.

Turn Around. Don’t Use JPA’s loadgraph and fetchgraph Hints. Use SQL Instead.


Thorben Janssen (also known from our jOOQ Tuesdays series) recently published an interesting wrap-up of what’s possible with Hibernate / JPA query hints. The full article can be seen here:
http://www.thoughts-on-java.org/11-jpa-hibernate-query-hints-every-developer-know

Some JPA hints aren’t really hints. They’re full-blown query specifications, just like JPQL queries, or SQL queries. They tell JPA how to fetch your entities. Let’s look at javax.persistence.loadgraph and javax.persistence.fetchgraph.

The example given in Oracle’s Java EE 7 tutorial is this:

You have a default entity graph, which is hard-wired to your entity class using annotations (or XML in the old days):

@Entity
public class EmailMessage implements Serializable {
    @Id
    String messageId;
    @Basic(fetch=EAGER)
    String subject;
    String body;
    @Basic(fetch=EAGER)
    String sender;
    @OneToMany(mappedBy="message", fetch=LAZY)
    Set<EmailAttachment> attachments;
    ...
}

Notice how the above entity graph mixes formal graph meta information (@Entity, @Id, @OneToMany, …) with query default information (fetch=EAGER, fetch=LAZY).

EAGER or LAZY?

Now, the problem with the above is that these defaults are hard-wired and cannot be changed for ad-hoc usage (thank you, annotations). Remember, SQL is an ad-hoc query language with all of the benefits that derive from this ad-hoc-ness. You can materialise, on the fly, new result sets whose type was not previously known. It’s an excellent tool for reporting, but also for ordinary data processing, because it is so easy to change a SQL query if you have new requirements, and if you’re using languages like PL/SQL or libraries like jOOQ, you can even do that in a type safe, precompiled way.
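
For instance, an ad-hoc reporting query against the schema above (using the table names that appear later in this article) might simply be:

-- A result type that exists nowhere in the schema, materialised on the fly:
SELECT m.sender, count(a.message_id) AS attachment_count
FROM email_message m
LEFT JOIN email_attachment a USING (message_id)
GROUP BY m.sender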

Not so in JPA, whose annotations are not “ad-hoc”, just as SQL’s DDL is not “ad-hoc”. Can you ever switch from EAGER to LAZY? Or from LAZY to EAGER? Without breaking half of your application? Truth is: You don’t know!

The problem is: choosing EAGER will prematurely materialise your entire entity graph (even if you needed only an E-Mail message’s subject and body), resulting in too much database traffic (see also “EAGER fetching is a code smell” by Vlad Mihalcea). Choosing LAZY will result in N+1 problems in case you really do need to materialise the relationship, because for each parent (“1”), you have to individually fetch each child (“N”) lazily, later on.
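
Conceptually, the N+1 problem translates to SQL like this (a sketch; the actual statements generated by a JPA implementation will look different):

SELECT * FROM email_message;                         -- 1 query for the parents
SELECT * FROM email_attachment WHERE message_id = 1; -- N queries,
SELECT * FROM email_attachment WHERE message_id = 2; -- one per parent
-- etc.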

Do SQL people suffer from N+1?

As a SQL person, this sounds ridiculous to me. Imagine specifying in your foreign key constraint whether you always want to auto-fetch your relationship…

ALTER TABLE email_attachment
ADD CONSTRAINT fk_email_attachment_email
FOREIGN KEY (message_id)
REFERENCES email_message(message_id)
WITH FETCH OPTION LAZY -- meh...

Of course you don’t do that. The point of normalising your schema is to have the data sit there without duplicating it. That’s it. It is the query language’s responsibility to help you decide whether you want to materialise the relationship or not. For instance, trivially:

-- Materialise the relationship
SELECT *
FROM email_message m
JOIN email_attachment a
USING (message_id)

-- Don't materialise the relationship
SELECT *
FROM email_message m

Duh, right?

Are JOINs really that hard to type?

Now, obviously, typing all these joins all the time can be tedious, and that’s where JPA seems to offer help. Unfortunately, it doesn’t help, because otherwise, we wouldn’t have tons of performance problems due to the eternal EAGER vs LAZY discussion. It is a GOOD THING to think about your individual joins every time because if you don’t, you will structurally neglect your performance (as if SQL performance wasn’t hard enough already) and you’ll notice this only in production, because on your developer machine, you don’t have the problem. Why?

Works on my machine ಠ_ಠ

One way of solving this with JPA is to use the JOIN FETCH syntax in JPQL (which is essentially the same thing as what you would be doing in SQL, so you don’t win anything over SQL except for automatic mapping. See also this example where the query is run with jOOQ and the mapping is done with JPA).

Another way of solving this with JPA is to use these javax.persistence.fetchgraph or javax.persistence.loadgraph hints, but that’s even worse. Check out the code that is needed in Oracle’s Java EE 7 tutorial just to indicate that you want this and that column / attribute from a given entity:

EntityGraph<EmailMessage> eg = em.createEntityGraph(EmailMessage.class);
eg.addAttributeNodes("body");
...
Properties props = new Properties();
props.put("javax.persistence.fetchgraph", eg);
EmailMessage message = em.find(EmailMessage.class, id, props);

With this graph, you can now indicate to your JPA implementation that in fact you don’t really want to get just a single E-Mail message, you also want all the specified JOINs to be materialised (interestingly, the example doesn’t do that, though).

You can pass this graph specification also to a JPA Query that does a bit more complex stuff than just fetching a single tuple by ID – but then, why not just use JPA’s query language to express that explicitly? Why use a hint?

Let me ask you again, why not just specify a sophisticated query language? Let’s call that language… SQL? The above example is solved trivially as such:

SELECT body
FROM email_message
WHERE message_id = :id

That’s not too much typing, is it? You know exactly what’s going on, everyone can read this, it’s simple to debug, and you don’t need to wrestle with first and second level caches because all the caching that is really needed is buffer caching in your database (i.e. caching frequent data in database memory to prevent excessive I/O).

The cognitive overhead of getting everything right and tuning stuff in JPA is so big compared to writing just simple SQL statements (and don’t forget, you may know why you put that hint, but your coworker may so easily overlook it!), let me ask you: Are you very sure you actually profit from JPA (you really need entity graph persistence, including caching)? Or are you wrestling the above just because JPA is your default choice?

When JPA is good

JPA (and its implementations) is excellent when you have the object graph persistence problem. This means: When you do need to load a big graph, modify it in your client application, possibly in a distributed and cached and long-conversational manner, and then store the whole graph back into the database without having to wrestle with locking, caching, lost updates, and all sorts of other problems, then JPA does help you. A lot. You don’t want to do that with SQL.

Do note that the key aspect here is storing the graph back into the database. 80% of JPA’s value is in writing stuff, not reading stuff.

But frankly, you probably don’t have this problem. You’re doing mostly simple CRUD and probably complex querying. SQL is the best language for that. And Java 8 functional programming idioms help you do the mapping, easily.

Conclusion

Don’t use loadgraph and fetchgraph hints. Chances are very low that you’re really on a good track. Chances are very high that migrating off to SQL will greatly simplify your application.