10 Common Mistakes Java Developers Make when Writing SQL

This article is part of a series. You might also like:

Java developers mix object-oriented thinking with imperative thinking, depending on their levels of:

  • Skill (anyone can code imperatively)
  • Dogma (some use the “Pattern-Pattern”, i.e. the pattern of applying patterns everywhere and giving them names)
  • Mood (true OO is more clumsy to write than imperative code. At first)

But when Java developers write SQL, everything changes. SQL is a declarative language that has nothing to do with either object-oriented or imperative thinking. It is very easy to express a query in SQL. It is not so easy to express it optimally or correctly. Not only do developers need to re-think their programming paradigm, they also need to think in terms of set theory.

Here are common mistakes that a Java developer makes when writing SQL through JDBC or jOOQ (in no particular order). For 10 More Common Mistakes, see this article here.

jOOQ is the best way to write SQL in Java

Here are common mistakes that a Java developer makes when writing SQL (in no particular order):

1. Forgetting about NULL

Misunderstanding NULL is probably the biggest mistake a Java developer can make when writing SQL. This is also (but not exclusively) due to the fact that NULL is also called UNKNOWN. If it were only called UNKNOWN, it would be easier to understand. Another reason is that JDBC maps SQL NULL to Java null when fetching data or when binding variables. This may lead to thinking that NULL = NULL (SQL) would behave the same way as null == null (Java)

One of the crazier examples of misunderstanding NULL is when NULL predicates are used with row value expressions.

Another, subtle problem appears when misunderstanding the meaning of NULL in NOT IN anti-joins.

The Cure:

Train yourself. There’s nothing but explicitly thinking about NULL, every time you write SQL:

  • Is this predicate correct with respect to NULL?
  • Does NULL affect the result of this function?

2. Processing data in Java memory

Few Java developers know SQL very well. The occasional JOIN, the odd UNION, fine. But window functions? Grouping sets? A lot of Java developers load SQL data into memory, transform the data into some appropriate collection type, execute nasty maths on that collection with verbose loop structures.

But some SQL databases support advanced (and SQL standard!) OLAP features that tend to perform a lot better and are much easier to write. A (non-standard) example is Oracle’s awesome MODEL clause. Just let the database do the processing and fetch only the results into Java memory. Because after all some very smart guys have optimised these expensive products. So in fact, by moving OLAP to the database, you gain two things:

  • Simplicity. It’s probably easier to write correctly in SQL than in Java
  • Performance. The database will probably be faster than your algorithm. And more importantly, you don’t have to transmit millions of records over the wire.

The Cure:

Every time you implement a data-centric algorithm in Java, ask yourself: Is there a way to let the database perform that work for me?

3. Using UNION instead of UNION ALL

It’s a shame that UNION ALL needs an extra keyword compared to UNION. It would be much better if the SQL standard had been defined to support:

  • UNION (allowing duplicates)
  • UNION DISTINCT (removing duplicates)

Not only is the removal of duplicates rarely needed (or sometimes even wrong), it is also quite slow for large result sets with many columns, as the two subselects need to be ordered, and each tuple needs to be compared with its subsequent tuple.

Note that even if the SQL standard specifies INTERSECT ALL and EXCEPT ALL, hardly any database implements these less useful set operations.

The Cure:

Every time you write a UNION, think if you actually wanted to write UNION ALL.

4. Using JDBC Pagination to paginate large results

Most databases support some way of paginating ordered results through LIMIT .. OFFSET, TOP .. START AT, OFFSET .. FETCH clauses. In the absence of support for these clauses, there is still the possibility for ROWNUM (Oracle) or ROW_NUMBER() OVER() filtering (DB2, SQL Server 2008 and less), which is much faster than pagination in memory. This is specifically true for large offsets!

The Cure:

Just use those clauses, or a tool (such as jOOQ) that can simulate those clauses for you.

5. Joining data in Java memory

From early days of SQL, some developers still have an uneasy feeling when expressing JOINs in their SQL. There is an inherent fear of JOIN being slow. This can be true if a cost-based optimiser chooses to perform a nested loop, possibly loading complete tables into database memory, before creating a joined table source. But that happens rarely. With appropriate predicates, constraints and indexes, MERGE JOIN and HASH JOIN operations are extremely fast. It’s all about the correct metadata (I cannot cite Tom Kyte often enough for this). Nonetheless, there are probably still quite a few Java developers who will load two tables from separate queries into maps and join them in Java memory in one way or another.

The Cure:

If you’re selecting from various tables in various steps, think again to see if you cannot express your query in a single statement.

6. Using DISTINCT or UNION to remove duplicates from an accidental cartesian product

With heavy joining, one can lose track of all the relations that are playing a role in a SQL statement. Specifically, if multi-column foreign key relationships are involved, it is possible to forget to add the relevant predicates in JOIN .. ON clauses. This might result in duplicate records, but maybe only in exceptional cases. Some developers may then choose to use DISTINCT to remove those duplicates again. This is wrong in three ways:

  • It (may) solve the symptoms but not the problem. It may as well not solve the symptoms in edge-cases.
  • It is slow for large result sets with many columns. DISTINCT performs an ORDER BY operation to remove duplicates.
  • It is slow for large cartesian products, which will still load lots of data into memory

The Cure:

As a rule of thumb, when you get unwanted duplicates, always review your JOIN predicates. There’s probably a subtle cartesian product in there somewhere.

7. Not using the MERGE statement

This isn’t really a mistake, but probably some lack of knowledge or some fear towards the powerful MERGE statement. Some databases know other forms of UPSERT statements, e.g. MySQL’s ON DUPLICATE KEY UPDATE clause. But MERGE is really so powerful, most importantly in databases that heavily extend the SQL standard.

The Cure:

If you’re UPSERTING by chaining INSERT and UPDATE or by chaining SELECT .. FOR UPDATE and then INSERT or UPDATE, think again. Apart from risking race conditions, you might be able to express a simpler MERGE statement.

8. Using aggregate functions instead of window functions

Before the introduction of window functions, the only means to aggregate data in SQL was by using a GROUP BY clause along with aggregate functions in the projection. This works well in many cases, and if aggregation data needed to be enriched with regular data, the grouped query can be pushed down into a joined subquery.

But SQL:2003 defined window functions, which are implemented by many popular database vendors. Window functions can aggregate data on result sets that are not grouped. In fact, each window function supports its own, independent PARTITION BY clause, which is an awesome tool for reporting.

Using window functions will:

  • Lead to more readable SQL (less dedicated GROUP BY clauses in subqueries)
  • Improve performance, as a RDBMS is likely to optimise window functions more easily

The Cure:

When you write a GROUP BY clause in a subquery, think again if this cannot be done with a window function.

9. Using in-memory sorting for sort indirections

The SQL ORDER BY clause supports many types of expressions, including CASE statements, which can be very useful for sort indirections. You should probably never sort data in Java memory because you think that

  • SQL sorting is too slow
  • SQL sorting cannot do it

The Cure:

If you sort any SQL data in memory, think again if you cannot push sorting into your database. This goes along well with pushing pagination into the database.

10. Inserting lots of records one by one

JDBC knows batching, and you should use it. Do not INSERT thousands of records one by one, re-creating a new PreparedStatement every time. If all of your records go to the same table, create a batch INSERT statement with a single SQL statement and multiple bind value sets. Depending on your database and database configuration, you may need to commit after a certain amount of inserted records, in order to keep the UNDO log slim.

The Cure:

Always batch-insert large sets of data.

Some interesting books

Some very interesting books on similar topics are

jOOQ is the best way to write SQL in Java

Liked this article?

You might also like these follow-up articles

29 thoughts on “10 Common Mistakes Java Developers Make when Writing SQL

  1. Many true statements. However, at least some of these problems arise partly through a wish to remain independent of the DBMS resp. the current SQL-dialekt. A prescription of “only ANSI-SQL” or “must be compatible with both Oracle and SQL Server” is not that uncommon—and even projects that focus on one DBMS still often have to support multiple versions.

    (Whether it is also wise is another question: I would rather simply use different methods with different DBMSs.)

    The model-clause, i.e., is entirely Oracle specific (probably even 11g specific) as are many other constructs, and even the null handling/semantik differs between DBMSs.

    An interesting example is how I recently implemented a check for loops in certain self-referential data using the tool DB-Sanity: I, obviously, used connect by for Oracle, found a similar, but entirely incompatible, possibility for SQL Server using a with-clause (that also looked and behaved differently from the Oracle with-clause—having a second incompatibility), and then found out that DB-Sanity refused to execute any check not beginning with the word “select”…

    (I ended up hacking a select-statement for SQL Server that explicitly left-outer-joined the table to it self nine times and doing some nasty hand-coded checks. It worked, but was very ugly—and if I had not been able to assume a maximum depth of the recursion, this would not have worked.)

    1. Thanks for these interesting insights, Michael. I’d be very interested in seeing a blog post about your experience with emulating Oracle’s nice CONNECT BY clause with SQL Server’s recursive common table expressions (CTE).

      Yes, there are lots of differences between the various existing SQL dialects, and that’s good in a way. Many good ideas were developed first at Oracle or Microsoft, before they made it into ANSI SQL. I’m still curious if CONNECT BY or CTE will make it into the standard first. On the other hand, lots of standard features aren’t implemented by any database (at least that I’m aware of), such as the MATCH predicate. Row value expressions have been around forever, but they’re hardly implemented anywhere. e.g. (A, B) = (1, 2)

      I think you might be interested in a couple of other blog posts on this blog, about SQL transformation and how it is implemented in jOOQ (https://www.jooq.org). jOOQ does a lot to standardise SQL as good as possible. Some jOOQ users use jOOQ with as many as 5 different databases, productively…

  2. I don’t disagree with your point, but concerning this : “Simplicity. It’s probably easier to write correctly in SQL than in Java”. This is just wrong.

    You will write better sql for things that matters in sql. But Java provides you with compile-time checking, powerful IDEs, unit testing system and a lot more.
    There’s no sql equivalent to all these bullet-proofs tools.

    1. OK, you’re right. It depends, of course. But I’ve recently replaced hundreds of lines of tedious Java code calculating a running total with a couple of lines of SQL involving window functions, compile-time checking included (wrote a view). But I agree. As far as IDEs and testing is concerned, there’s less tooling. But hey. 100’s of lines of Java or a 5-liner in SQL. In the latter case, I need neither IDE nor a lot of testing :-)

      1. Agree with most of the points but one “regarding data processing” there are quite a few cases where SQL and JAVA has to work in together to process data, e.g calculation, group by, order by has to be done in SQL, whereas formatting grouped by data in hierarchical format and calculating hierarchical sum has to be performed in JAVA unless someone can suggest me how to do it SQL I would be glad to look into it.

        here the exampl
        group by level1, level2, level3

        now if I want represent sum of all level3 as part of level and sum of all level2 as level1 it has to be done in JAVA

        1. You may want to look into GROUPING SETS, CUBE, and ROLLUP.

          A very well executed explanation can be found in the SQL Server manual: http://technet.microsoft.com/en-us/library/bb510427(v=sql.105).aspx

          You don’t have to perform calculations / formatting in the database, just because you can. All I’m saying is that Java developers default to doing things in Java way too often, simply because they’re unaware of equivalent SQL solutions.

    2. It goes both ways. There are certainly some advantages to using Java with its class libraries and tools. But I had a coworker who had been working on some Java method for days, having trouble making it efficient. I asked what it was supposed to do, and when he told me, I said, “why, that’s just an outer join.” I offered to show him how to do that, but he literally said, “I forbid you” from describing an outer join. I think he had invested so much effort into that Java method that he couldn’t bear to discard it.

        1. “When you can’t accept that your lovingly crafted code sucks and should be written off, you have already lost the battle.”

          I have accepted these things as “building character”

    3. Oracle’s SQL Developer provides compile-time checking, it is a powerful IDE, and it features a unit testing system (provided your SQL is put into PL/SQL units).

      1. That’s true, of course. I guess what “autra” was trying to say is that SQL and Java aren’t integrated very well in IDEs (short of using an internal DSL like jOOQ)

    1. There’s no golden rule to that. Surely, you shouldn’t put your business logic all over the place. But sometimes, “business” logic is really “data” logic, or let’s call it “reporting” logic. Then, there’s nothing wrong with having the database execute it.

      For example, when calculating a running total over transactions, you can do that in either SQL (with window functions, hierarchical SQL, or with the Oracle MODEL clause), or in Java. Putting the logic in SQL will make that logic available to non-Java database clients as well, helping to keep things DRY. After all, a running total is just another database column.

      Some may claim that stored procedures are a good place to execute business logic. I guess that has to be evaluated in the context of the overall architecture.

  3. How can save the database as a Java program? I have a large database consisting of dog names and sizes. I would like to export the database to save it as a Java program. I don’t care if it is Java script instead (they are almost same thing just different name).

    Thankyou

  4. There is a SQL plugin for JQuery which solves most of the problems handling SQL these days. I am surprised you didn’t mention it.

  5. I think that it would be better if UNION was not allowed, and we had to explicitly choose UNION ALL or UNION DISTINCT. That way code would be more explicit, easier to read, and since we read SQL 20 times for each time we write it, it’s a good trade-off.

    1. I completely agree with you. In fact, it would be nice if “ALL” was the default, just as with SELECT ALL (the standard specifies the possibility of putting ALL there, even if I’m not aware of any implementation that allows it).

  6. One thing I would like to highlight is that windows functions provide a different solution or rather solve a different requirement than aggregate functions.

    To just state this: “When you write a GROUP BY clause in a subquery, think again if this cannot be done with a window function.” is rather misleading to show that somehow analytic or window functions are better than aggregate functions.

    1. Sure, you’re right. That was not the implication was not intended. The reason I phrased things this way is mostly because I’ve seen so many developers struggle with working around basic GROUP BY facts (namely “losing” non-grouping columns from their results) that I thought, occasionally thinking of window functions as a viable alternative might help. This is obviously only true for:

      1) Aggregating window functions (not ranking functions)
      2) Window functions without ORDER BY or ROWS/RANGE clauses

      In other cases you’re perfectly right. They’re solving a different problem.

  7. I would like to get your thought on some Java guys keep saying DB agnostic design using JPA and spring data and not using any of DB features and write everything in Java. I am really facing struggle to discuss this with one of our new architect who keep saying he will do everything in Java including data processing logic. Your thoughts please.

    1. My best, possibly only, advice is dual: Do not be religious and try to teach the other party not to be religous.

      A big point is that which approach is the better depends on the circumstances. Both sides must be open to the alternative given the right circumstances. Compare e.g. a software written to be sold for use with whatever DBMS the customer already uses, where performance is a minor concern (e.g. due to small amounts of data), with a software written for a single instance deployment within the company it self with data going in the Terabytes. Other factors that play in is e.g. whether the system otherwise is geared for object persistency (including inheritance and polymorphy) or for more relational data storage.

      Remarks:

      o In both cases an incompetent developer that does not understand e.g. how database performance or object-relational-mapping works has a severely increased risk of producing a technically poor software.

      o Working closer to the DBMS tends to be both more fun and more interesting. Moving from JEE to PL/SQL as a field of work is not something I regret, although PL/SQL has many quirks a lacks a few nice features.

      o Some frame-works, notably Hibernate, make things too complicated or unexpected, causing unnecessary risks and errors. (Or require quite in-depth knowledge to develop safely.)

      1. Thanks for your comments. What did you refer to when you mention religiousness? These tips are meant for people who made the choice of using SQL. Once that choice was made, the tips are probably quite objective, no? Did you mean the introduction? That wasn’t meant as some religious preachery but as an eye opener to Java (“OO”) developers who are simply unaware of the fact that SQL is a set based language. I’ve seen many systems that would be redesigned without any further ado, if the team knew what the database can do for them, apart from persisting stuff.

        So, in a post like this, I’m assuming that move has already been decided to be the right choice.

        Regarding your remarks:

        In both cases, an incompetent developer…

        That is true. But again, I have seen incompetence (or rather: ignorance) in only one of two fields, and that can be fixed relatively easily.

        … PL/SQL …

        That’s cool – I kinda share the feeling. What’s your story behind this move?

        … Hibernate …

        Hibernate is a really sophisticated expert tool for advanced object graph persistence, which was unfortunately sold as a simple CRUD solution. Yes it can be the latter for really really simple applications, but it quickly shows its true power in an indeed sometimes unexpected way.

        1. For context, my reply was to a specific recent comment by Hick Kart, not the post in general. I can see how it would be very confusing without that context…

          As for my change of fields, it was originally a coincidence: I was thrown into the role full-time for a few months, because I happened to be the only one available at my then employer’s with some general Oracle experience and ANY PL/SQL experience at all. I was then put back as JEE man, but found that increasingly less rewarding, and decided to make a permanent switch. While I cannot catalogue all the details of the top of my head, they do include JEE having more boring leg work and so many APIs and frame-works (and different ones in every new project…) that I spent too much time just reading documentation, as well as my having had done several projects in a row with Websphere…

          1. I’m sorry for the confusion – you’re right of course. Unfortunately, wordpress doesn’t show whom a reply is directed to in the notification email…

            I see, I completely understand your switch (I’ve done almost the same, in my case from Weblogic).

Leave a Reply to PhillipCancel reply