10 Subtle Best Practices when Coding Java

This is a list of 10 best practices that are more subtle than your average Josh Bloch Effective Java rule. While Josh Bloch’s list is very easy to learn and concerns everyday situations, this list here covers less common situations involving API / SPI design that may have a big effect nonetheless.

I have encountered these things while writing and maintaining jOOQ, an internal DSL modelling SQL in Java. Being an internal DSL, jOOQ challenges Java compilers and generics to the max, combining generics, varargs and overloading in a way that Josh Bloch probably wouldn’t recommend for the “average API”.


Let me share with you 10 Subtle Best Practices When Coding Java:

1. Remember C++ destructors

Remember C++ destructors? No? Then you might be lucky, as you never had to debug through code that leaked memory because allocations were not freed after an object was removed. Thanks, Sun/Oracle, for implementing garbage collection!

But nonetheless, destructors have an interesting trait to them. It often makes sense to free memory in the inverse order of allocation. Keep this in mind in Java as well, when you’re operating with destructor-like semantics:

  • When using @Before and @After JUnit annotations
  • When allocating, freeing JDBC resources
  • When calling super methods

There are various other use cases. Here’s a concrete example showing how you might implement some event listener SPI:

@Override
public void beforeEvent(EventContext e) {
    super.beforeEvent(e);
    // Super code before my code
}

@Override
public void afterEvent(EventContext e) {
    // Super code after my code
    super.afterEvent(e);
}
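
The same applies to JDBC resources. Here’s a minimal sketch (assuming a dataSource and a trivial query) that relies on Java 7’s try-with-resources, which conveniently closes resources in the inverse order of their allocation:

try (Connection c = dataSource.getConnection();
     PreparedStatement s = c.prepareStatement("SELECT 1");
     ResultSet rs = s.executeQuery()) {

    while (rs.next()) {
        // ... consume the row
    }
}
// Closed in inverse order of allocation: rs first, then s, then c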

Another good example showing why this can be important is the infamous Dining Philosophers problem. More info about the dining philosophers can be seen in this awesome post:
http://adit.io/posts/2013-05-11-The-Dining-Philosophers-Problem-With-Ron-Swanson.html

The Rule: Whenever you implement logic using before/after, allocate/free, take/return semantics, think about whether the after/free/return operation should perform stuff in the inverse order.

2. Don’t trust your early SPI evolution judgement

Providing an SPI to your consumers is an easy way to allow them to inject custom behaviour into your library / code. Beware, though, that your SPI evolution judgement may trick you into thinking that you’re (not) going to need that additional parameter. True, no functionality should be added early. But once you’ve published your SPI and decided to follow semantic versioning, you’ll really regret having added a silly, one-argument method to your SPI when you realise that you might need another argument in some cases:

interface EventListener {
    // Bad
    void message(String message);
}

What if you also need a message ID and a message source? API evolution will prevent you from easily adding that parameter to the above type. Granted, with Java 8, you could add a defender method to “defend” your bad early design decision:

interface EventListener {
    // Bad
    default void message(String message) {
        message(message, null, null);
    }
    // Better?
    void message(
        String message,
        Integer id,
        MessageSource source
    );
}

Note that, unfortunately, the defender method cannot be made final.

But rather than polluting your SPI with dozens of methods, use a context object (or argument object) just for this purpose:

interface MessageContext {
    String message();
    Integer id();
    MessageSource source();
}

interface EventListener {
    // Awesome!
    void message(MessageContext context);
}

You can evolve the MessageContext API much more easily than the EventListener SPI as fewer users will have implemented it.

The Rule: Whenever you specify an SPI, consider using context / parameter objects instead of writing methods with a fixed amount of parameters.

Remark: It is often a good idea to also communicate results through a dedicated MessageResult type, which can be constructed through a builder API. This will add even more evolution flexibility to your SPI.
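
For illustration, here’s a minimal sketch of what such a result type with a builder might look like. All names are made up:

final class MessageResult {
    private final boolean consumed;
    private final String reply;

    private MessageResult(boolean consumed, String reply) {
        this.consumed = consumed;
        this.reply = reply;
    }

    static final class Builder {
        private boolean consumed;
        private String reply;

        Builder consumed(boolean consumed) { this.consumed = consumed; return this; }
        Builder reply(String reply)        { this.reply = reply;       return this; }

        // New optional properties can be added here later,
        // without breaking existing SPI implementations
        MessageResult build() { return new MessageResult(consumed, reply); }
    }
}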

3. Avoid returning anonymous, local, or inner classes

Swing programmers probably have a couple of keyboard shortcuts to generate the code for their hundreds of anonymous classes. In many cases, creating them is nice as you can locally adhere to an interface, without going through the “hassle” of thinking about a full SPI subtype lifecycle.

But you should not use anonymous, local, or inner classes too often, for a simple reason: they keep a reference to the outer instance. And they will drag that outer instance wherever they go, e.g. to some scope outside of your local class if you’re not careful. This can be a major source of memory leaks, as your whole object graph will suddenly become entangled in subtle ways.

The Rule: Whenever you write an anonymous, local or inner class, check if you can make it static or even a regular top-level class. Avoid returning anonymous, local or inner class instances from methods to the outside scope.

Remark: There has been some clever practice around double-curly braces for simple object instantiation:

new HashMap<String, String>() {{
    put("1", "a");
    put("2", "b");
}}

This leverages Java’s instance initializer as specified by the JLS §8.6. Looks nice (maybe a bit weird), but is really a bad idea. What would otherwise be a completely independent HashMap instance now keeps a reference to the outer instance, whatever that just happens to be. Besides, you’ll create an additional class for the class loader to manage.
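
If you want the concise initialisation without the hidden outer-instance reference, a static factory method does the trick. A sketch, using java.util.Collections:

static Map<String, String> initialMap() {
    // A plain HashMap, with no reference to any enclosing instance
    Map<String, String> map = new HashMap<String, String>();
    map.put("1", "a");
    map.put("2", "b");
    return Collections.unmodifiableMap(map);
}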

4. Start writing SAMs now!

Java 8 is knocking on the door. And with Java 8 come lambdas, whether you like them or not. Your API consumers may like them, though, and you had better make sure that they can make use of them as often as possible. Hence, unless your API only accepts simple “scalar” types such as int, long, String, or Date, let your API accept SAMs as often as possible.

What’s a SAM? A SAM is a Single Abstract Method [Type]. Also known as a functional interface, soon to be annotated with the @FunctionalInterface annotation. This goes well with rule number 2, where EventListener is in fact a SAM. The best SAMs are those with single arguments, as they further simplify writing a lambda. Imagine writing

listeners.add(c -> System.out.println(c.message()));

Instead of

listeners.add(new EventListener() {
    @Override
    public void message(MessageContext c) {
        System.out.println(c.message());
    }
});

Imagine XML processing through jOOX, which features a couple of SAMs:

$(document)
    // Find elements with an ID
    .find(c -> $(c).id() != null)
    // Find their child elements
    .children(c -> $(c).tag().equals("order"))
    // Print all matches
    .each(c -> System.out.println($(c)))

The Rule: Be nice to your API consumers and write SAMs / functional interfaces now.

Remarks: A couple of interesting blog posts about Java 8 Lambdas and improved Collections API can be seen here:

5. Avoid returning null from API methods

I’ve blogged about Java’s NULLs once or twice. I’ve also blogged about Java 8’s introduction of Optional. These are interesting topics both from an academic and from a practical point of view.

While NULLs and NullPointerExceptions will probably stay a major pain in Java for a while, you can still design your API in a way that users will not run into any issues. Try to avoid returning null from API methods whenever possible. Your API consumers should be able to chain methods whenever applicable:

initialise(someArgument).calculate(data).dispatch();

In the above snippet, none of the methods should ever return null. In fact, using null’s semantics (the absence of a value) should be rather exceptional in general. In libraries like jQuery (or jOOX, a Java port thereof), nulls are completely avoided as you’re always operating on iterable objects. Whether you match something or not is irrelevant to the next method call.

Nulls also often arise from lazy initialisation. In many cases, lazy initialisation can be avoided, too, without any significant performance impact. In fact, lazy initialisation should be used carefully, and only if large data structures are involved.
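
For instance, a lazily initialised collection field doesn’t have to leak its internal null to API consumers. A sketch with hypothetical names:

private List<Order> orders; // may be null internally, lazily initialised

public List<Order> getOrders() {
    // Translate the internal "uninitialised" state into an
    // empty collection at the API boundary:
    return orders == null
        ? Collections.<Order>emptyList()
        : orders;
}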

The Rule: Avoid returning nulls from methods whenever possible. Use null only for the “uninitialised” or “absent” semantics.

6. Never return null arrays or lists from API methods

While there are some cases when returning nulls from methods is OK, there is absolutely no use case for returning null arrays or null collections! Let’s consider the hideous java.io.File.list() method. It returns:

An array of strings naming the files and directories in the directory denoted by this abstract pathname. The array will be empty if the directory is empty. Returns null if this abstract pathname does not denote a directory, or if an I/O error occurs.

Hence, the correct way to deal with this method is

File directory = // ...

if (directory.isDirectory()) {
    String[] list = directory.list();

    if (list != null) {
        for (String file : list) {
            // ...
        }
    }
}

Was that null check really necessary? Most I/O operations produce IOExceptions, but this one returns null. Null cannot hold any error message indicating why the I/O error occurred. So this is wrong in three ways:

  • Null does not help in finding the error
  • Null does not allow you to distinguish I/O errors from the File instance not being a directory
  • Everyone will keep forgetting about null here

In collection contexts, the notion of “absence” is best implemented by empty arrays or collections. Having an “absent” array or collection is hardly ever useful, except again, for lazy initialisation.
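
Until the JDK gives us a better API, a small utility can normalise this behaviour at the boundary. A sketch:

static String[] listOrEmpty(File directory) {
    String[] list = directory.list();

    // Normalise the JDK's null into an empty array, so that
    // callers can iterate without a null check:
    return list == null ? new String[0] : list;
}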

The Rule: Arrays or Collections should never be null.

7. Avoid state, be functional

What’s nice about HTTP is the fact that it is stateless. All relevant state is transferred in each request and in each response. This is essential to the naming of REST: Representational State Transfer. This is awesome when done in Java as well. Think of it in terms of rule number 2 when methods receive stateful parameter objects. Things can be so much simpler if state is transferred in such objects, rather than manipulated from the outside. Take JDBC, for instance. The following example fetches a cursor from a stored procedure:

CallableStatement s =
  connection.prepareCall("{ ? = ... }");

// Verbose manipulation of statement state:
s.registerOutParameter(1, cursor);
s.setString(2, "abc");
s.execute();
ResultSet rs = (ResultSet) s.getObject(1);

// Verbose manipulation of result set state:
rs.next();
rs.next();

These are the things that make JDBC such an awkward API to deal with. Each object is incredibly stateful and hard to manipulate. Concretely, there are two major issues:

  • It is very hard to correctly deal with stateful APIs in multi-threaded environments
  • It is very hard to make stateful resources globally available, as the state is not documented
(Image: “State is like a box of chocolates”. Theatrical poster for Forrest Gump, Copyright © 1994 by Paramount Pictures. It is believed that this usage fulfils what is known as Fair Use.)

The Rule: Implement more of a functional style. Pass state through method arguments. Manipulate less object state.

8. Short-circuit equals()

This is a low-hanging fruit. In large object graphs, you can gain significantly in terms of performance, if all your objects’ equals() methods dirt-cheaply compare for identity first:

@Override
public boolean equals(Object other) {
    if (this == other) return true;

    // Rest of equality logic...
}

Note that other short-circuit checks may involve null checks, which should be there as well:

@Override
public boolean equals(Object other) {
    if (this == other) return true;
    if (other == null) return false;

    // Rest of equality logic...
}
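
A common variant short-circuits the type check as well:

@Override
public boolean equals(Object other) {
    if (this == other) return true;
    if (other == null || getClass() != other.getClass()) return false;

    // Rest of equality logic...
}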

The Rule: Short-circuit all your equals() methods to gain performance.

9. Try to make methods final by default

Some will disagree on this, as making things final by default is quite the opposite of what Java developers are used to. But if you’re in full control of all source code, there’s absolutely nothing wrong with making methods final by default, because:

  • If you do need to override a method (do you really?), you can still remove the final keyword
  • You will never accidentally override any method anymore

This applies specifically to static methods, where “overriding” (actually shadowing) hardly ever makes sense. I recently came across a very bad example of static method shadowing in Apache Tika. Consider:

TikaInputStream extends TaggedInputStream and shadows its static get() method with quite a different implementation.

Unlike regular methods, static methods don’t override each other, as the call-site binds a static method invocation at compile-time. If you’re unlucky, you might just get the wrong method accidentally.
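
A simplified sketch of the situation (not the actual Tika code):

class Parent {
    static String get() { return "parent"; }
}

class Child extends Parent {
    // This shadows Parent.get(). It does not override it:
    static String get() { return "child"; }
}

// The invocation is bound at compile-time, via the static type:
Parent p = new Child();
String s = p.get(); // Yields "parent", not "child"!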

The Rule: If you’re in full control of your API, try making as many methods as possible final by default.

10. Avoid the method(T…) signature

There’s nothing wrong with the occasional “accept-all” varargs method that accepts an Object... argument:

void acceptAll(Object... all);

Writing such a method brings a little JavaScript feeling to the Java ecosystem. Of course, you probably want to restrict the actual type to something more confined in a real-world situation, e.g. String.... And because you don’t want to confine too much, you might think it is a good idea to replace Object by a generic T:

<T> void acceptAll(T... all);

But it’s not. T can always be inferred as Object. In fact, you might as well just not use generics with the above methods. More importantly, you may think that you can overload the above method, but you cannot:

<T> void acceptAll(T... all);
<T> void acceptAll(String message, T... all);

This looks as though you could optionally pass a String message to the method. But what happens to this call here?

acceptAll("Message", 123, "abc");

The compiler will infer <? extends Serializable & Comparable<?>> for T, which makes the call ambiguous!

So, whenever you have an “accept-all” signature (even if it is generic), you will never again be able to typesafely overload it. API consumers may just be lucky enough to “accidentally” have the compiler choose the “right”, most specific method. But they may as well be tricked into using the “accept-all” method, or they may not be able to call any method at all.
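
If you really need the convenience method, distinct method names avoid the ambiguity altogether. A sketch:

interface Acceptor {
    <T> void acceptAll(T... all);

    // A distinct name instead of an overload:
    // no ambiguous resolution possible
    <T> void acceptAllWithMessage(String message, T... all);
}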

The Rule: Avoid “accept-all” signatures if you can. And if you cannot, never overload such a method.

Conclusion

Java is a beast. Unlike other, fancier languages, it has evolved slowly to what it is today. And that’s probably a good thing, because even at Java’s speed of development, there are hundreds of caveats, which can only be mastered through years of experience.


Stay tuned for more top 10 lists on the subject!

Do You Put Trust in Vendors’ NoSQL Promises?

Disruptive times lead to disruptive technologies. But moreover, they lead to new buzzwords and thus new “value propositions” for an unsettled clientele who will buy into any promise these vendors make. If you believe those expensive reports that you can buy from companies like Gartner, you will think:

Cloud Computing


“You need to run your business in the cloud”. The advent of what marketers have come to call “The Cloud” (formerly known as “Web 2.0”, or “The Internet”) has started to transform the ways some people think about data. Many vendors trust in “Big Data” becoming the next “Big Paradigm” for software engineering. “Big Data” seems to cry for new data storage technologies, as advertised in this article here: http://www.ncs-london.com/blog/Five-Benefits-of-a-noSQL-approach-to-DBA-management Consider section 4:

4: An end to DBAs

Despite the many manageability enhancements claimed by RDBMS vendors over the years, high-end RDBMS systems can be upheld only with the help of lavish, highly trained DBAs. DBAs are intimately involved in the design, installation, and on-going tweaking of high-end RDBMS systems.

“lavish, highly trained DBAs”. This article assumes that moving away from SQL will make maintaining and managing large amounts of data a piece of cake that your average junior PHP script kiddie can handle.

“Blimey, Dear Rupert! Our customer fancies going to The Cloud with his 20 million transactions per day.”

“Good Lord, Rodney, I believe you are right. We ought to get Clive, our lovely intern youngster on the job. He only costs us a penny a minute.”

People often forget that an average iPhone with 64 GB of memory would have been considered quite “Big Data” 13 years ago, when the average Nokia 3310 could only store 100 phone numbers! 13 years changed a lot in terms of volume, but not that much in terms of software technology. Since then, software engineering has slowly transformed to manage ever larger sets of data, but few companies really need to scale to dimensions that an RDBMS cannot handle. In fact, some database vendors even went to “The Cloud” with their RDBMS themselves, such as Google App Engine’s Cloud SQL. And what about Twitter? Well, they’re actually using a sharded MySQL database, which serves them perfectly fine. Just as Instagram uses a sharded PostgreSQL database. Pinterest, another provider of a large “Cloud” application, uses sharded MySQL (along with Solr, Memcache and Redis).

RDBMS might not be the optimal choice for some data models (hierarchical, unstructured, document-oriented models). But they are extremely powerful when it comes to manipulating relational data. SQL itself has evolved quite a bit since Nokia 3310 times, when 64 GB of memory was “Big Data”. The differences between Oracle 8i and Oracle 12c are amazing. The same applies to the differences between SQL Server 2000 and SQL Server 2012. Moreover, you can employ the very same DBA for the Oracle 12c job as you could for the 8i job back in the day.

While we are certainly living in exciting times where new technologies lead to new ways of thinking (and vice-versa), we should be sceptical of vendors who promise that we will migrate to the next paradigm in the blink of an eye. Your data might endure longer than the new technology you use to store it.

We should be sceptical of vendors who claim that highly trained maintenance people are the main problem we need to solve. They promise cuts in DBA and licensing costs, while hardly anyone even knows about the future “NoDBA” and “NoLicensing” costs.

Do you put trust in those vendors’ NoSQL promises? Or do you believe SQL will catch up, and the old elephants can be taught new tricks? I’m curious to hear your thoughts and experiences.

Apache Derby About to Adopt the Awesome SQL:2003 MERGE Statement

Apache Derby is one of three popular embeddable Java databases (apart from H2 and HSQLDB). It is very SQL and JDBC standards-compliant, but maybe a bit behind on developments of more advanced SQL features. Around 6 years after its first submission, there has recently been some action on the Apache Derby DERBY-3155 ticket. Rick Hillegas has attached a first, promising draft of the MERGE statement specification, which can be seen here:
https://issues.apache.org/jira/secure/attachment/12597795/MergeStatement.html
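
For illustration, a SQL:2003 MERGE statement typically takes this form (hypothetical tables):

MERGE INTO customer c
USING staged_customer s
ON (c.id = s.id)
WHEN MATCHED THEN UPDATE
  SET c.name = s.name
WHEN NOT MATCHED THEN INSERT (id, name)
  VALUES (s.id, s.name)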

Among all 14 of the SQL databases supported by jOOQ, Derby would thus be the 8th to support the SQL:2003 MERGE statement, lining up with:

Other databases support proprietary versions of MERGE:

… or other forms of UPSERT:

Visit DERBY-3155 and show the maintainers some love for implementing this awesome and powerful SQL statement!

Java EE 7: JSRs That Make You Powerful

Tori Wieldt of Oracle has released an overview of all the goodies that are included in Java EE 7:
https://blogs.oracle.com/java/entry/java_ee_7_the_details

… with a couple of video presentations:
http://www.youtube.com/playlist?list=PL74xrT3oGQfCCLFJ2HCTR_iN5hV4penDz

jOOQ Newsletter August 2013

Subscribe to this newsletter here

Overwhelming feedback from the jOOQ User Survey

jOOQ has been the leading Java Open Source SQL building tool for the last three years. Feedback on social media such as Facebook, Twitter, reddit, Stack Overflow has been very positive.

The recent user survey was well received by a steadily growing jOOQ community. It revealed a couple of interesting facts:

  • jOOQ users are highly proficient with Java and SQL
  • jOOQ users mainly concentrate on using H2, MySQL, Oracle, PostgreSQL, and SQL Server
  • jOOQ’s maturity is mostly appreciated for mission-critical applications
  • jOOQ users are generally satisfied or very satisfied with jOOQ’s mission, support and quality

Thank you for taking the time to complete the user survey. Your valuable comments will help jOOQ to take the right direction in the future, providing you with even more value.

SQL popularity still high

In recent years, many NoSQL vendors have surfaced to sell their alternative data stores and query languages. Some of them aim to solve entirely different problems than relational data stores do. But as in pre-ANSI-SQL times, there is now a huge number of incompatible, proprietary query languages and techniques.

SQL is standardised by ANSI / ISO, and its vendors can still grow their market shares as the Big Data buzz increases. A clear sign of SQL’s popularity is the recent success of these Top 10 Mistakes articles on DZone:

jOOQ 3.2 feature outlook

New jOOQ users have contributed new awesome ideas to the community. These include:

A new SPI to hook into query rendering and variable binding

This SPI will allow jOOQ users to perform arbitrary SQL transformations based on query rendering and variable binding events. A very common use-case is to listen for a given set of well-known tables and generate additional filtering predicates in the SQL WHERE clause, centrally restricting access to data.

This not only allows for adding an additional, transparent security layer to your SQL code generation, it also allows for implementing multi-tenancy in shared-schema environments.
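
To give a rough idea of the shape such an SPI might take (made-up names, purely illustrative, not the actual jOOQ API):

// Illustrative sketch only, not the real interface
interface RenderingListener {

    // Called while a query is rendered. Implementations could
    // append additional predicates for well-known tables here
    void renderEnd(QueryContext context);

    // Called while bind values are set on the JDBC statement
    void bindEnd(QueryContext context);
}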

A new SPI to hook into CRUD operations

Many users profit from jOOQ’s ActiveRecord-style CRUD API. This API will be enhanced with a powerful SPI that allows users to hook into store(), insert(), update(), delete(), and refresh() operations.

A common use-case for this is data initialisation prior to inserts, or ID-generation.

What if every object was an array? No more NullPointerExceptions!

To NULL or not to NULL? Programming language designers inevitably have to decide whether they support NULLs or not. And they’ve proven to have a hard time getting this right. NULL is not intuitive in any language, because NULL is an axiom of that language, not a rule that can be derived from lower-level axioms. Take Java for instance, where

// This yields true:
null == null

// These throw an exception (or cannot be compiled)
null.toString();
int value = (Integer) null;

It’s not like there weren’t any alternatives. SQL, for instance, implements a more expressive but probably less intuitive three-value logic, which most developers get wrong in subtle ways once in a while.

At the same time, SQL doesn’t know “NULL” results, only “NULL” column values. From a set theory perspective, there are only empty sets, not NULL sets.

Other languages allow for dereferencing null through special operators, letting the compiler generate tedious null checks for you, behind the scenes. An example for this is Groovy with its null-safe dereferencing operator. This solution is far from being generally accepted, as can be seen in this discussion about a Scala equivalent. Scala uses Option, which Java 8 will imitate using Optional (or @Nullable).

Let’s think about a much broader solution

To me, nullability isn’t a first-class citizen. I personally dislike the fact that Scala’s Option[T] type pollutes my type system by introducing a generic wrapper type (even if it seems to implement similar array-features through the traversable trait). I don’t want to distinguish the types of Option[T] and T. This is specifically true when reasoning about types from a reflection API perspective, where Scala’s (and Java’s) legacy will forever keep me from accessing the type of T at runtime.

But much worse, most of the time, in my application I don’t really want to distinguish between “option” references and “some” references. Heck, I don’t even want to distinguish between having 1 reference and having dozens. jQuery got this quite right. One of the main reasons why jQuery is so popular is that everything you do, you do on a set of wrapped DOM elements. The API never distinguishes between matching 1 or 100 divs. Check out the following code:

// This clearly operates on a single object or none
$('div#unique-id').html('new content')
                  .click(function() { ... });

// This possibly operates on several objects or none
$('div.any-class').html('new content')
                  .click(function() { ... });

This is possible because JavaScript allows you to override the prototype of the JavaScript Array type, modifying arrays in general, at least for the scope of the jQuery library. How much more awesome can it get? .html() and .click() are actions performed on the array as a whole, no matter whether you have zero, one, or 100 elements in your match. What would a more typesafe language look like, where everything behaves like an array (or an ArrayList)? Think about the following model:

class Customer {
  String firstNames;  // Read as String[] firstNames
  String lastName;    // Read as String[] lastName
  Order orders;       // Read as Order[] orders
}

class Order {
  int value;          // Read as int[] value
  boolean shipped() { // Read as boolean[] shipped
    return true;      // (stub implementation)
  }
}

Don’t rant (just yet). Let’s assume this wouldn’t lead to memory or computation overhead. Let’s continue thinking about the advantages of this. So, I want to see if a Customer’s orders have been shipped. Easy:

Customer customer = // ...
boolean shipped = customer.orders.shipped();

This doesn’t look spectacular (yet). But beware of the fact that a customer can have several orders, and the above check is really to see if all orders have been shipped. I really don’t want to write the loop, I find it quite obvious that I want to perform the shipped() check on every order. Consider:

// The length pseudo-field would still be
// present on orders
customer.orders.length;

// In fact, the length pseudo-field is also
// present on customer, in case there are several
customer.length;

// Let's add an order to the customer:
customer.orders.add(new Order());

// Let's reset order
customer.orders.clear();

// Let's calculate the sum of all values
// OO-style:
customer.orders.value.sum();
// Functional style:
sum(customer.orders.value);

Of course there would be a couple of caveats and the above choice of method names might not be the best one. But being able to deal with single references (nullable or non-nullable) or array references (empty, single-valued, multi-valued) in the same syntactic way is just pure syntax awesomeness. Null-checks would be replaced by length checks, but mostly you don’t even have to do those, because each method would always be called on every element in the array. The current single-reference vs. multi-reference semantics would be documented by naming conventions. Clearly, naming something “orders” indicates that multi-references are possible, whereas naming something “customer” indicates that multi-references are improbable.

As users have commented, this technique is commonly referred to as array programming, which is implemented in Matlab or R.

Convinced?

I’m curious to hear your thoughts!

10 More Common Mistakes Java Developers Make when Writing SQL

I was positively surprised to see how popular my recent listing about 10 Common Mistakes Java Developers Make when Writing SQL was, both on my own blog and on my syndication partner DZone. The popularity shows a couple of things.


Anyway, the common mistakes I listed previously are far from complete, so I will treat you to a sequel of 10 subtly less common, yet equally interesting mistakes Java developers make when writing SQL.

1. Not using PreparedStatements

Interestingly, this mistake or misbelief still surfaces on blogs, forums and mailing lists many years after the appearance of JDBC, even though it concerns a very simple thing to remember and understand. It appears that some developers refrain from using PreparedStatements for any of these reasons:

  • They don’t know about PreparedStatements
  • They think that PreparedStatements are slower
  • They think that writing a PreparedStatement takes more effort

First off, let’s bust the above myths. In 96% of the cases, you’re better off writing a PreparedStatement rather than a static statement. Why? For simple reasons:

  • You avoid syntax errors originating from bad string concatenation when inlining bind values.
  • You avoid SQL injection vulnerabilities from bad string concatenation when inlining bind values.
  • You can avoid edge-cases when inlining more “sophisticated” data types, such as TIMESTAMP, binary data, and others.
  • You can keep open PreparedStatements around for a while, reusing them with new bind values instead of closing them immediately (useful in Postgres, for instance).
  • You can make use of adaptive cursor sharing (Oracle-speak) in more sophisticated databases. This helps prevent hard-parsing SQL statements for every new set of bind values.

Convinced? Good. Note that there are some rare cases when you actually do want to inline bind values in order to give your database’s cost-based optimiser some heads-up about what kind of data is really going to be affected by the query. Typically, this results in “constant” predicates such as:

  • DELETED = 1
  • STATUS = 42

But it shouldn’t result in “variable” predicates such as:

  • FIRST_NAME LIKE 'Jon%'
  • AMOUNT > 19.95

Note that modern databases implement bind-variable peeking. Hence, by default, you might as well use bind values for all your query parameters. Note also that higher-level APIs such as JPA CriteriaQuery or jOOQ will help you generate PreparedStatements and bind values very easily and transparently when writing embedded JPQL or embedded SQL.
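
To make the contrast concrete, here is a minimal sketch (assuming a hypothetical AUTHOR table):

// Static statement: the value is inlined through string
// concatenation. Avoid this by default:
try (Statement s = connection.createStatement();
     ResultSet rs = s.executeQuery(
         "SELECT * FROM author WHERE first_name = '" + firstName + "'")) {
    // ...
}

// PreparedStatement: the value is passed as a bind variable:
try (PreparedStatement s = connection.prepareStatement(
         "SELECT * FROM author WHERE first_name = ?")) {
    s.setString(1, firstName);

    try (ResultSet rs = s.executeQuery()) {
        // ...
    }
}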

More background info:

The Cure:

By default, always use PreparedStatements instead of static statements. By default, never inline bind values into your SQL.

2. Returning too many columns

This mistake is quite frequent and can lead to very bad effects both in your database’s execution plan and in your Java application. Let’s look at the second effect first:

Bad effects on the Java application:

If you’re selecting * (star) or a “default” set of 50 columns, which you’re reusing among various DAOs, you’re transferring lots of data from the database into a JDBC ResultSet. Even if you’re not reading the data from the ResultSet, it has been transferred over the wire and loaded into your memory by the JDBC driver. That’s quite a waste of IO and memory if you know that you’re only going to need 2-3 of those columns.

This was obvious, but beware also of…

Bad effects on the database execution plan:

These effects may actually be much worse than the effects on the Java application. Sophisticated databases perform a lot of SQL transformation when calculating the best execution plan for your query. It may well be that some parts of your query can be “transformed away”, knowing that they won’t contribute to the projection (SELECT clause) or to the filtering predicates. I’ve recently blogged about this in the context of schema meta data:
How schema meta data impacts Oracle query transformations

Now, this is quite a beast. Think about a sophisticated SELECT that will join two views:

SELECT *
FROM   customer_view c
JOIN   order_view o
  ON   c.cust_id = o.cust_id

Each of the views joined above might again join data from dozens of tables, such as CUSTOMER_ADDRESS, ORDER_HISTORY, ORDER_SETTLEMENT, etc. Given the SELECT * projection, your database has no choice but to fully load all of those joined tables, when in fact the only thing that you were interested in was this:

SELECT c.first_name, c.last_name, o.amount
FROM   customer_view c
JOIN   order_view o
  ON   c.cust_id = o.cust_id

A good database will transform your SQL in a way that most of the “hidden” joins can be removed, which results in much less IO and memory consumption within the database.

The Cure:

Never execute SELECT *. Never reuse the same projection for various queries. Always try to reduce the projection to the data that you really need.

Note that this can be quite hard to achieve with ORMs.

3. Thinking that JOIN is a SELECT clause

This isn’t a mistake with a lot of impact on performance or SQL correctness, but nevertheless, SQL developers should be aware of the fact that the JOIN clause is not part of the SELECT statement per se. The SQL standard 1992 defines a table reference as such:

6.3 <table reference>

<table reference> ::=
    <table name> [ [ AS ] <correlation name>
      [ <left paren> <derived column list> <right paren> ] ]
  | <derived table> [ AS ] <correlation name>
      [ <left paren> <derived column list> <right paren> ]
  | <joined table>

The FROM clause and also the joined table can then make use of such table references:

7.4 <from clause>

<from clause> ::= 
    FROM <table reference> [ { <comma> <table reference> }... ]

7.5 <joined table>

<joined table> ::=
    <cross join>
  | <qualified join>
  | <left paren> <joined table> <right paren>

<cross join> ::=
    <table reference> CROSS JOIN <table reference>

<qualified join> ::=
    <table reference> [ NATURAL ] [ <join type> ] JOIN
      <table reference> [ <join specification> ]

Relational databases are very table-centric. Many operations are performed on physical, joined or derived tables in one way or another. To write SQL effectively, it is important to understand that the SELECT .. FROM clause expects a comma-separated list of table references in whatever form they may be provided.

Depending on the complexity of the table reference, some databases also accept sophisticated table references in other statements, such as INSERT, UPDATE, DELETE, MERGE. See Oracle’s manuals for instance, explaining how to create updatable views.

The Cure:

Always think of your FROM clause as expecting a table reference as a whole. If you write a JOIN clause, think of this JOIN clause as part of that complex table reference:

SELECT c.first_name, c.last_name, o.amount
FROM

    customer_view c
      JOIN order_view o
      ON c.cust_id = o.cust_id

4. Using pre-ANSI 92 JOIN syntax

Now that we’ve clarified how table references work (see the previous point), it should become a bit more obvious that the pre-ANSI 92 JOIN syntax should be avoided at all costs. To execution plans, it usually makes no difference if you specify join predicates in the JOIN .. ON clause or in the WHERE clause. But from a readability and maintenance perspective, using the WHERE clause for both filtering predicates and join predicates is a major quagmire. Consider this simple example:

SELECT c.first_name, c.last_name, o.amount
FROM   customer_view c,
       order_view o
WHERE  o.amount > 100
AND    c.cust_id = o.cust_id
AND    c.language = 'en'

Can you spot the join predicate? What if we joined dozens of tables? This gets much worse when applying proprietary syntaxes for outer join, such as Oracle’s (+) syntax.
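
Rewritten in ANSI 92 syntax, the join predicate becomes impossible to miss:

SELECT c.first_name, c.last_name, o.amount
FROM   customer_view c
JOIN   order_view o
  ON   c.cust_id = o.cust_id
WHERE  o.amount > 100
AND    c.language = 'en'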

The Cure:

Always use the ANSI 92 JOIN syntax. Never put JOIN predicates into the WHERE clause. There is absolutely no advantage to using the pre-ANSI 92 JOIN syntax.

5. Forgetting to escape input to the LIKE predicate

The SQL standard 1992 specifies the like predicate as such:

8.5 <like predicate>

<like predicate> ::=
    <match value> [ NOT ] LIKE <pattern>
      [ ESCAPE <escape character> ]

The ESCAPE keyword should be used almost always when allowing for user input to be used in your SQL queries. While it may be rare that the percent sign (%) is actually supposed to be part of the data, the underscore (_) might well be:

SELECT *
FROM   t
WHERE  t.x LIKE 'some!_prefix%' ESCAPE '!'
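
On the Java side, a small helper can apply this escaping to user input before embedding it in a pattern. A sketch, assuming '!' as the escape character:

static String escapeLike(String userInput) {
    // Escape the escape character itself first, then the wildcards:
    return userInput.replace("!", "!!")
                    .replace("%", "!%")
                    .replace("_", "!_");
}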

The Cure:

Always think of proper escaping when using the LIKE predicate.

6. Thinking that NOT (A IN (X, Y)) is the boolean inverse of A IN (X, Y)

This one is subtle but very important with respect to NULLs! Let’s review what A IN (X, Y) really means:

                  A IN (X, Y)
is the same as    A = ANY (X, Y)
is the same as    A = X OR A = Y

At the same time, NOT (A IN (X, Y)) really means:

                  NOT (A IN (X, Y))
is the same as    A NOT IN (X, Y)
is the same as    A != ALL (X, Y)
is the same as    A != X AND A != Y

That looks like the boolean inverse of the previous predicate, but it isn’t! If any of X or Y is NULL, the NOT IN predicate will result in UNKNOWN whereas the IN predicate might still return a boolean value.

In other words, while A IN (X, Y) might still yield TRUE or FALSE, NOT(A IN (X, Y)) may yield UNKNOWN where you would expect FALSE or TRUE. Note that this is also true if the right-hand side of the IN predicate is a subquery.

Don’t believe it? See this SQL Fiddle for yourself. It shows that the following query yields no result:

SELECT 1
WHERE     1 IN (NULL)
UNION ALL
SELECT 2
WHERE NOT(1 IN (NULL))

More details can be seen in my previous blog post on that subject, which also shows some SQL dialect incompatibilities in that area.

The Cure:

Beware of the NOT IN predicate when nullable columns are involved!

7. Thinking that NOT (A IS NULL) is the same as A IS NOT NULL

Right, so we remembered that SQL implements three-valued logic when it comes to handling NULL. That’s why we can use the NULL predicate to check for NULL values. Right? Right.

But even the NULL predicate is subtle. Beware that the two following predicates are only equivalent for row value expressions of degree 1:

                   NOT (A IS NULL)
is not the same as A IS NOT NULL

If A is a row value expression with a degree of more than 1, then the truth table is transformed such that:

  • A IS NULL yields true only if all values in A are NULL
  • NOT(A IS NULL) yields false only if all values in A are NULL
  • A IS NOT NULL yields true only if all values in A are NOT NULL
  • NOT(A IS NOT NULL) yields false only if all values in A are NOT NULL
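
A quick sketch of what this means for a row value expression of degree 2:

-- Assuming a = NULL and b = 1:
(a, b) IS NULL            -- FALSE: not all values are NULL
NOT((a, b) IS NULL)       -- TRUE
(a, b) IS NOT NULL        -- FALSE: not all values are NOT NULL
NOT((a, b) IS NOT NULL)   -- TRUE

-- Note how IS NULL and IS NOT NULL are FALSE at the same time!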

See more details in my previous blog post on that subject.

The Cure:

When using row value expressions, beware of the NULL predicate, which might not work as expected.

8. Not using row value expressions where they are supported

Row value expressions are an awesome SQL feature. While SQL is a very table-centric language, tables are also very row-centric. Row value expressions let you describe complex predicates much more easily, by creating local ad-hoc rows that can be compared with other rows of the same degree and row type. A simple example is to query customers for first names and last names at the same time:

SELECT c.address
FROM   customer c
WHERE (c.first_name, c.last_name) = (?, ?)

As can be seen, this syntax is slightly more concise than the equivalent syntax where each column from the predicate’s left-hand side is compared with the corresponding column on the right-hand side. This is particularly true if many independent predicates are combined with AND. Using row value expressions allows you to combine correlated predicates into one. This is most useful for join expressions on composite foreign keys:

SELECT c.first_name, c.last_name, a.street
FROM   customer c
JOIN   address a
  ON  (c.id, c.tenant_id) = (a.id, a.tenant_id)

Unfortunately, not all databases support row value expressions in the same way. But the SQL standard had defined them already in 1992, and if you use them, sophisticated databases like Oracle or Postgres can use them for calculating better execution plans. This is explained on the popular Use The Index, Luke page.

The Cure:

Use row value expressions whenever you can. They will make your SQL more concise and possibly even faster.

9. Not defining enough constraints

So, I’m going to cite Tom Kyte and Use The Index, Luke again. You cannot have enough constraints in your meta data. First off, constraints help you keep your data from being corrupted, which is already very useful. But to me, more importantly, constraints help the database perform SQL transformations, as the database can decide that

  • Some values are equivalent
  • Some clauses are redundant
  • Some clauses are “void” (i.e. they will not return any values)

Some developers may think that constraints are slow. The opposite is the case, unless you insert lots and lots of data, in which case you can either disable constraints for a large operation, or use a temporary “load table” without constraints, transferring data offline to the real table.
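
For example, each of the following constraints states a fact that the optimiser can exploit (a sketch, hypothetical tables):

ALTER TABLE orders ADD CONSTRAINT pk_orders
  PRIMARY KEY (id);

-- Every order has exactly one existing customer:
ALTER TABLE orders ADD CONSTRAINT fk_orders_customer
  FOREIGN KEY (customer_id) REFERENCES customer (id);

-- A predicate like AMOUNT <= 0 can be proven "void":
ALTER TABLE orders ADD CONSTRAINT chk_orders_amount
  CHECK (amount > 0);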

The Cure:

Define as many constraints as you can. They will help your database to perform better when querying.

10. Thinking that 50ms is fast query execution

The NoSQL hype is still ongoing, and many companies still think they’re Twitter or Facebook in dire need of faster, more scalable solutions, escaping ACID and relational models to scale horizontally. Some may succeed (e.g. Twitter or Facebook), others may run into this:

(Image found here: https://twitter.com/codinghorror/status/347070841059692545)

For the others who are forced (or chose) to stick with proven relational databases, don’t be tricked into thinking that modern databases are slow. They’re hyper fast. In fact, they’re so fast, they can parse your 20kb query text, calculate 2000-line execution plans, and actually execute that monster in less than a millisecond, if you and your DBA get along well and tune your database to the max.

They may be slow because of your application misusing a popular ORM, or because that ORM won’t be able to produce fast SQL for your complex querying logic. In that case, you may want to choose a more SQL-centric API like JDBC, jOOQ or MyBatis that will let you get back in control of your SQL.

So, don’t think that a query execution time of 50ms is fast or even acceptable. It’s not. If you get these speeds at development time, make sure you investigate execution plans. Those monsters might explode in production, where you have more complex contexts and data.

Conclusion

SQL is a lot of fun, but also very subtle in various ways. It’s not easy to get it right as my previous blog post about 10 common mistakes has shown. But SQL can be mastered and it’s worth the trouble. Data is your most valuable asset. Treat data with respect and write better SQL.


Conference Gem: The MEDIT Symposium

The MEDIT Symposium is one of this year’s international conference gems that you shouldn’t miss if you’re into Open Source! It features three very interesting tracks:

  • Cloud Computing
  • Open Source
  • Mobile Development

I’m delighted to line up my own talk about jOOQ with Apache committers (Cassandra, Cordova, Isis), Groovy Language Toolkit developers, researchers and book authors. This mix of topics will certainly lead to inspiring discussions at lunch and at the later happy hour.

Apart from the fact that the admission fee is very reasonable, there’s nothing like an escape from the cold northern October to Sicily, in southern Italy.

I’m looking forward to meeting many of you at the MEDIT Symposium in October 2013!

Twitter and the JVM

Here’s an interesting read about Twitter and why they had chosen to migrate to the JVM for scalability. A must-read for everyone working on Java, Scala, or other JVM languages:

http://www.forbes.com/sites/oracle/2013/08/01/theres-java-in-your-tweets/

Use ModelMapper and jOOQ to Regain Control of your Domain Model

One of the things that Hibernate is quite good at is CRUD, i.e. persisting object graphs to the database. This is particularly true if your application runs in a Java domain-model-driven context. Your models are required to adhere to the standards set by JPA/Hibernate, of course. The same applies to mapping relational-model-driven data onto complex object graphs in memory. Again, you’ll have to adhere to the standards set by JPA/Hibernate.

If you’re operating on rather complex relational models, mapping data onto rather complex domain models, then you might want to get back in control of the mapping process, as auto-mapping will cause more headaches than it solves. An interesting approach has been shown recently on the ModelMapper website in an example integration with jOOQ (note: there is also an example integration with JDBI). With the permission of the author Jonathan Halterman, I’m citing this interesting example:

jOOQ Integration

ModelMapper’s jOOQ integration allows you to map a jOOQ Record to a JavaBean.

Setup

To get started, add the modelmapper-jooq Maven dependency to your project:

<dependency>
  <groupId>org.modelmapper</groupId>
  <artifactId>modelmapper</artifactId>
  <version>0.6.1</version>
</dependency>
<dependency>
  <groupId>org.modelmapper</groupId>
  <artifactId>modelmapper-jooq</artifactId>
  <version>0.6.1</version>
</dependency>

Next, configure ModelMapper to support the RecordValueReader, which allows for values to be read and mapped from a jOOQ Record:

modelMapper.getConfiguration()
           .addValueReader(new RecordValueReader());

Example Mapping

Now let’s see an example mapping of a jOOQ record to a JavaBean. Consider the following record representing an order:

order_id | customer_id | customer_address_street | customer_address_city
---------+-------------+-------------------------+----------------------
345      | 678         | 123 Main Street         | SF

We may need to map this to a more complex object model:

// Assume getters and setters are present

public class Order {
  private int id;
  private Customer customer;
}

public class Customer {
  private int id;
  private Address address;
}

public class Address {
  private String street;
  private String city;
}

Since the source Record’s fields in this example use an underscore naming convention, we’ll need to configure ModelMapper to tokenize source property names by underscore:

modelMapper
  .getConfiguration()
  .setSourceNameTokenizer(NameTokenizers.UNDERSCORE);

With that set, mapping an order Record to an Order object is simple:

Order order =
  modelMapper.map(orderRecord, Order.class);

And we can assert that values are mapped as expected:

assertEquals(345, order.getId());
assertEquals(678, order.getCustomer().getId());
assertEquals("123 Main Street",
             order.getCustomer()
                  .getAddress()
                  .getStreet());
assertEquals("SF",
             order.getCustomer()
                  .getAddress()
                  .getCity());

Explicit Mapping

While ModelMapper will do its best to implicitly match Record values to destination properties, sometimes you may need to explicitly define mappings between properties.

Let’s map our Record’s customer_address_street to Order.customer.address.street:

PropertyMap<Record, Order> orderMap =
  new PropertyMap<Record, Order>() {
  protected void configure() {
    map(source("customer_address_street"))
        .getCustomer()
        .getAddress()
        .setStreet(null);
  }
};

Then we can add the mapping to our ModelMapper instance for the orderRecord:

modelMapper.createTypeMap(orderRecord, Order.class)
           .addMappings(orderMap);

(see the ModelMapper manual pages for more details about property mapping)

Things to Note

ModelMapper maintains a TypeMap for each source and destination type, containing the mappings between the two types. For “generic” types such as Record, this can be problematic, since the structure of a Record can vary. In order to distinguish structurally different Records that map to the same destination type, we can provide a type map name to ModelMapper.

Continuing with the example above, let’s map another order Record, this one with a different structure, to the same Order class:

order_id | order_customer_id | order_customer_address_street | order_customer_address_city
---------+-------------------+-------------------------------+----------------------------
444      | 777               | 123 Main Street               | LA

Mapping this Record to an Order is simple, but we’ll need to provide a type map name to distinguish this Record-to-Order mapping from the previous, unnamed mapping:

Order order = modelMapper.map(
    longOrderRecord, Order.class, "long");

Example taken from:
http://modelmapper.org/user-manual/jooq-integration/

More Examples

When choosing ModelMapper, you’re not just choosing an API to map relational data to your domain model. ModelMapper is designed for arbitrary model transformation, which can make it a strategic choice for your stack.

Check out this marvelous Open Source gem on the ModelMapper website.