10 Features I Wish Java Would Steal From the Kotlin Language

This article is overdue. After the hype around the release of Kotlin 1.0 has settled, let’s have a serious look at some Kotlin language features that we should have in Java as well.

In this article, I’m not going to wish for unicorns. But there are some low hanging fruit (as far as I naively can see), which could be introduced into the Java language without great risk. While you’re reading this article, be sure to copy paste examples to http://try.kotlinlang.org, an online REPL for Kotlin

1. Data class

Language designers hardly ever agree on the necessity and the feature scope of what a class is. In Java, curiously, every class always has identity a concept that is not really needed in 80% – 90% of all real world Java classes. Likewise, a Java class always has a monitor on which you can synchronize.

In most cases, when you write a class, you really just want to group values, like Strings, ints, doubles. For instance:

public class Person {
    final String firstName;
    final String lastName;
    public JavaPerson(...) {
    // Getters

    // Hashcode / equals

    // Tostring

    // Egh...

By the time you’ve finished typing all of the above, your fingers will no longer be. Java developers have implemented ugly workarounds for the above, like IDE code generation, or lombok, which is the biggest of all hacks. In a better Java, nothing in Lombok would really be needed.

As, for instance, if Java had Kotlin’s data classes:

data class Person(
  val firstName: String,
  val lastName: String

The above is all we need to declare the equivalent of the previous Java code. Because a data class is used to store data (duh), i.e. values, the implementation of things like hashCode(), equals(), toString() is obvious and can be provided by default. Furthermore, data classes are first class tuples, so they can be used as such, e.g. to destructure them again in individual references:

val jon = Person("Jon", "Doe") 
val (firstName, lastName) = jon

In this case, we may hope. Valhalla / Java 10 is being designed and with it, value types. We’ll see how many features will be provided on the JVM directly, and in the Java language. This will certainly be an exciting addition.

Notice how val is possible in Kotlin: Local variable type inference. This is being discussed for a future Java version right now.

2. Defaulted parameters

How many times do you overload an API like the following:

interface Stream<T> {
    Stream<T> sorted();
    Stream<T> sorted(Comparator<? super T> comparator);

The above are exactly the same JDK Stream operations. The first one simply applies Comparator.naturalOrder() to the second one. So we could write the following, in Kotlin:

fun sorted(comparator : Comparator<T> 
         = Comparator.naturalOrder()) : Stream<T>

The advantage of this isn’t immediately visible, when there is only one defaulted parameter. But imagine a function with tons of optional parameters:

fun reformat(str: String,
             normalizeCase: Boolean = true,
             upperCaseFirstLetter: Boolean = true,
             divideByCamelHumps: Boolean = false,
             wordSeparator: Char = ' ') {

Which can be called in any of the following ways:

reformat(str, true, true, false, '_')
  normalizeCase = true,
  upperCaseFirstLetter = true,
  divideByCamelHumps = false,
  wordSeparator = '_'

The power of defaulted parameters is that they are especially useful when passing arguments by name, rather than by index. This is currently not supported in the JVM, which until Java 8, doesn’t retain the parameter name at all (in Java 8, you can turn on a JVM flag for this, but with all of Java’s legacy, you shouldn’t rely on this yet).

Heck, this feature is something I’m using in PL/SQL every day. Of course, in Java, you can work around this limitation by passing a parameter object.

3. Simplified instanceof checks

If you will, this is really an instanceof switch. Some people may claim that this stuff is evil, bad OO design. Nja nja. I say, this happens every now and then. And apparently, in Java 7, string switches were considered sufficiently common to modify the language to allow them. Why not instanceof switches?

val hasPrefix = when(x) {
  is String -> x.startsWith("prefix")
  else -> false

Not only is this doing an instanceof switch, it is doing it in the form of an assignable expression. Kotlin’s version of this when expression is powerful. You can mix any sort of predicate expressions, similar to SQL’s CASE expression. For instance, this is possible as well:

when (x) {
  in 1..10 -> print("x is in the range")
  in validNumbers -> print("x is valid")
  !in 10..20 -> print("x is outside the range")
  else -> print("none of the above")

Compare to SQL (not implemented in all dialects):

  WHEN BETWEEN 1 AND 10 THEN 'x is in the range'
  WHEN IN (SELECT * FROM validNumbers) THEN 'x is valid'
  WHEN NOT BETWEEN 10 AND 20 'x is outside the range'
  ELSE 'none of the above'

As you can see, only SQL is more powerful than Kotlin.

4. Map key / value traversal

Now this could really be done very easily only with syntax sugar. Granted, having local variable type inference would already be a plus, but check this out

val map: Map<String, Int> = ...

And now, you can do:

for ((k, v) in map) {

After all, most of the time when traversing a map, it’ll be by Map.entrySet(). Map could have been enhanced to extend Iterable<Entry<K, V>> in Java 5, but hasn’t. That’s really a pity. After all, it has been enhanced in Java 8 to allow for internal iteration over the entry set in Java 8 via Map.forEach():

map.forEach((k, v) -> {

It’s not too late, JDK gods. You can still let Map<K, V> extend Iterable<Entry<K, V>>

5. Map access literals

This one is something that would add tons and tons of value to the Java language. We have arrays, like most other languages. And like most other languages, we can access array elements by using square brackets:

int[] array = { 1, 2, 3 };
int value = array[0];

Note also the fact that we have array initialiser literals in Java, which is great. So, why not also allow for accessing map elements with the same syntax?

val map = hashMapOf<String, Int>()
map.put("a", 1)

In fact, x[y] is just syntax sugar for a method call backed by x.get(y). This is so great, we have immediately proceeded with renaming our Record.getValue() methods in jOOQ to Record.get() (leaving the old ones as synonyms, of course), such that you can now dereference your database record values as such, in Kotlin

ctx.select(a.FIRST_NAME, a.LAST_NAME, b.TITLE)
   .orderBy(1, 2, 3)
   .forEach {
               by ${it[a.FIRST_NAME]} ${it[a.LAST_NAME]}""")

Since jOOQ holds all column type information on individual record columns, you can actually know in advance that it[b.TITLE] is a String expression. Great, huh? So, not only can this syntax be used with JDK maps, it can be used with any library that exposes the basic get() and set() methods.

Stay tuned for more jOOQ and Kotlin examples here:

6. Extension functions

This one is a controversial topic, and I can perfectly understand when language designers stay clear of it. But every now and then, extension functions are very useful. The Kotlin syntax here is actually just for a function to pretend to be part of the receiver type:

fun MutableList<Int>.swap(index1: Int, index2: Int) {
  val tmp = this[index1] // 'this' corresponds to the list
  this[index1] = this[index2]
  this[index2] = tmp

This will now allow for swapping elements in a list:

val l = mutableListOf(1, 2, 3)
l.swap(0, 2)

This would be very useful for libraries like jOOλ, which extends the Java 8 Stream API by wrapping it in a jOOλ type (another such library is StreamEx, with a slightly different focus). The jOOλ Seq wrapper type is not really important, as it pretends to be a Stream on steroids. It would be great, if jOOλ methods could be put onto Stream artificially, just by importing them:


The zipWithIndex() method isn’t really there. The above would just translate to the following, less readable code:


In fact, extension methods would even allow to bypass wrapping everything explicitly in a stream(). For instance, you could then do:


As all of jOOλ’s method could be designed to also be applied to Iterable.

Again, this is a controversial topic. For instance, because

While giving the illusion of being virtual, extension functions really are just sugared static methods. It’s a significant risk for object oriented application design to engage in that trickery, which is why this feature probably won’t make it into Java.

7. Safe-call operator (and also: Elvis operator)

Optional is meh. It’s understandable that an Optional type needed to be introduced in order to abstract over the absence of primitive type values, which cannot be null. We now have things like OptionalInt, e.g. to model things like:

OptionalInt result =
IntStream.of(1, 2, 3)
         .filter(i -> i > 3)

// Agressive programming ahead

Optional is a monad

Yes. It allows you to flatMap() the absent value.


Sure, if you want to do sophisticated functional programming, you’ll start typing map() and flatMap() everywhere. Like today, when we’re typing getters and setters. Along will come lombok generating flatmapping calls, and Spring will add some @AliasFor style annotation for flatmapping. And only the enlightened will be able to decipher your code.

When all we needed was just a simple null safety operator before getting back to daily business. Like:

String name = bob?.department?.head?.name

I really like this type of pragmatism in Kotlin. Or do you prefer (flat)mapping?

Optional<String> name = bob

Can you read this? I cannot. Neither can I write this. If you get this wrong, you’ll get boxoxed.

Of course, Ceylon is the only language that got nulls right. But Ceylon has tons of features that Java will not get before version 42, and I’m not wishing for unicorns. I’m wishing for the safe-call operator (and also the elvis operator, which is slightly different), which could be implemented in Java too. The above expression is just syntax sugar for:

String name = null;
if (bob != null) {
    Department d = bob.department
    if (d != null) {
        Person h = d.head;
        if (h != null)
            name = h.name;

What can possibly be wrong with that simplification?

8. Everything is an expression

Now this might just be a unicorn. I don’t know if there is a JLS / parser limitation that will forever keep us in the misery of prehistoric distinction between statement and expression.

At some point in time, people have started using statements for things that yield side-effects, and expressions for more functional-ish things. It is thus not surprising, that all String methods are really expressions, operating on an immutable string, returning a new string all the time.

This doesn’t seem to go well with, for instance, if-else in Java, which is expected to contain blocks and statements, each possibly yielding side-effects.

But is that really a requirement? Can’t we write something like this in Java as well?

val max = if (a > b) a else b

OK, we have this weird conditional expression using ?:. But what about Kotlin’s when (i.e. Java’s switch)?

val hasPrefix = when(x) {
  is String -> x.startsWith("prefix")
  else -> false

Isn’t that much more useful than the following equivalent?

boolean hasPrefix;

if (x instanceof String)
    hasPrefix = x.startsWith("prefix");
    hasPrefix = false;

(yes, I know about ?:. I just find if-else easier to read, and I don’t see why that should be a statement, not an expression. Heck, in Kotlin, even try is an expression, not a statement:

val result = try {
} catch (e: ArithmeticException) {
    throw IllegalStateException(e)


9. Single expression functions

Now this. This would save so much time reading and writing simple glue code. And in fact, we already have the syntax in annotations. Check out Spring’s magical @AliasFor annotation, for instance. It yields:

public @interface AliasFor {
    String value() default "";
    String attribute() default "";

Now, if you squint really hard, these are just methods yielding constant values, because annotations are just interfaces with generated byte code for their implementations. We can discuss syntax. Of course, this irregular usage of default is weird, given that it was not re-used in Java 8 for default methods, but I guess Java always needs the extra syntax so developers feel alive as they can better feel their typing fingers. That’s OK. We can live with that. But then again, why do we have to? Why not just converge to the following?

public @interface AliasFor {
    String value() = "";
    String attribute() = "";

And the same also for class / interface default methods?

// Stop pretending this isn't an interface
public interface AliasFor {
    String value() = "";
    String attribute() = "";

Now that would look nice. But given Java’s existing syntax, this might just be a unicorn, so let’s move on to…

10. Flow-sensitive typing

Now this. THIS!

We’ve blogged about sum types before. Java has sum types with exceptions since Java 7:

try {
catch (IOException | SQLException e) {
    // e can be of type IOException and/or SQLException
    // within this scope

But Java, unfortunately, doesn’t have flow-sensitive typing. Flow-sensitive typing is of the essence in a language that supports sum types, but it is also useful otherwise. For instance, in Kotlin:

when (x) {
    is String -> println(x.length)

We don’t need to cast, obviously, because we already checked that x is String. Conversely, in Java:

if (x instanceof String)
    System.out.println(((String) x).length());

Aaagh, all this typing. IDE autocompletion is smart enough to offer a contextual type’s methods already and then generate the unnecessary cast for you. But it would be great if this was never needed, every time we explicitly narrow a type using control flow structures.

For more info, see this wikipedia entry about flow sensitive typing. A feature that could absolutely be added to the Java language. After all, we already got flow-sensitive final local variables since Java 8.

11. (Bonus) Declaration site variance

Last but not least, better generics via declaration site variance. Many other languages know this, for instance also C#’s IEnumerable:

public interface IEnumerable<out T> : IEnumerable

The keyword out here means that the generic type T is produced from the type IEnumerable (as opposed to in, which stands for consumption). In C#, Scala, Ceylon, Kotlin, and many other languages, we can declare this on the type declaration, rather than on its usage (although, many languages allow for both). In this case, we say that IEnumerable is covariant with its type T, which means again that IEnumerable<Integer> is a subtype of IEnumerable<Object>

In Java, this isn’t possible, which is why we have a bazillion question by Java newbies on Stack Overflow. Why can’t I…

Iterable<String> strings = Arrays.asList("abc");
Iterable<Object> objects = strings; // boom

In languages like Kotlin, the above would be possible. After all, why shouldn’t it? A thing that can produce strings can also produce objects, and we can even use it in this way in Java:

Iterable<String> strings = Arrays.asList("abc");
for (Object o : strings) {
    // Works!

The lack of declaration site variance has made a lot of APIs very intelligible. Consider Stream:

<R> Stream<R> flatMap(Function<? super T, ? extends Stream<? extends R>> mapper);

This is just noise. A function is contravariant with its argument type and covariant with its result type by nature a better definition of Function or Stream would be:

interface Function<in T, out R> {}
interface Stream<out T> {}

If this were possible, all that ? super and ? extends garbage could be removed without losing any functionality.

In case you’re wondering what I’m even talking about? :)

The great news is, this is being discussed for a (near) future version of Java:


Kotlin is a promising language, even if it is very late to a game that already seems to have been decided, not in favour of alternative languages on the JVM. Nonetheless, it is a very interesting language to learn from, and with a lot of very good decisions made about some simple things.

Some of these decisions will hopefully be picked up by the Java language gods and integrated into Java. This list here shows some features that might be “easy” to add.

More info about Kotlin idioms:

Liked this article?

Read on here:

Watch Out For Recursion in Java 8’s [Primitive]Stream.iterate()

An interesting question by Tagir Valeev on Stack Overflow has recently caught my attention. To keep things short (read the question for details), while the following code works:

public static Stream<Long> longs() {
    return Stream.iterate(1L, i ->
        1L + longs().skip(i - 1L)




The following, similar code won’t work:

public static LongStream longs() {
    return LongStream.iterate(1L, i ->
        1L + longs().skip(i - 1L)

Causing a StackOverflowError.

Sure, this kind of recursive iteration is not optimal. It wasn’t prior to Java 8 and it certainly isn’t with the new APIs either. But one might think it should at least work, right? The reason why it doesn’t work is because of a subtle implementation difference between the two iterate() methods in Java 8. While the reference type stream’s Iterator first returns the seed and only then proceeds with iterating by applying the iteration function on the previous value:

final Iterator<T> iterator = new Iterator<T>() {
    T t = (T) Streams.NONE;

    public boolean hasNext() {
        return true;

    public T next() {
        return t = (t == Streams.NONE) ? seed : f.apply(t);

This is not the case for the LongStream.iterate() version (and other primitive streams):

final PrimitiveIterator.OfLong iterator = new PrimitiveIterator.OfLong() {
    long t = seed;

    public boolean hasNext() {
        return true;

    public long nextLong() {
        long v = t;
        t = f.applyAsLong(t);
        return v;

The iteration function is already pre-fetched one value in advance. This is usually not a problem, but can lead to

  1. Optimisation issues when the iteration function is expensive
  2. Infinite recursions when the iterator is used recursively

As a workaround, it might be best to simply avoid recursion with this method in primitive type streams. Luckily, a fix in JDK 9 is already on its way (as a side effect for a feature enhancement):

Time for some Funky SQL: Prefix Sum Calculation

This Stack Overflow question has yet again nerd-sniped me

[finding the] maximum element in the array that would result from performing all M operations

Here’s the question by John that was looking for a Java solution:

With an array of N elements which are initialized to 0. we are given a sequence of M operations of the sort (p; q; r). The operation (p; q; r) signifies that the integer r should be added to all array elements A[p];A[p + 1]; : : : ;A[q]. You are to output the maximum element in the array that would result from performing all M operations. There is a naive solution that simply performs all operations and then returns the maximum value, that takes O(MN) time. We are looking for a more efficient algorithm.

Interesting. Indeed, a naive solution would just perform all the operations as requested. Another naive but less naive solution would transform the operations into signals of the form (x; y) for all (p; r) and for all (q + 1; -r). In other words, we could implement the solution I had presented trivially as such:

// This is just a utility class to model the ops
class Operation {
    final int p;
    final int q;
    final int r;

    Operation(int p, int q, int r) {
        this.p = p;
        this.q = q;
        this.r = r;

// These are some example ops
Operation[] ops = {
    new Operation(4, 12, 2),
    new Operation(2,  8, 3),
    new Operation(6,  7, 1),
    new Operation(3,  7, 2)

// Here, we're calculating the min and max
// values for the combined values of p and q
IntSummaryStatistics stats = Stream
    .flatMapToInt(op -> IntStream.of(op.p, op.q))

// Create an array for all the required elements using
// the min value as "offset"
int[] array = new int[stats.getMax() - stats.getMin()];

// Put +r and -r "signals" into the array for each op
for (Operation op : ops) {
    int lo = op.p     - stats.getMin();
    int hi = op.q + 1 - stats.getMin();

    if (lo >= 0)
        array[lo] = array[lo] + op.r;

    if (hi < array.length)
        array[hi] = array[hi] - op.r;

// Now, calculate the prefix sum sequentially in a
// trivial loop
int maxIndex = Integer.MIN_VALUE;
int maxR = Integer.MIN_VALUE;
int r = 0;

for (int i = 0; i < array.length; i++) {
    r = r + array[i];
    System.out.println((i + stats.getMin()) + ":" + r);

    if (r > maxR) {
        maxIndex = i + stats.getMin();
        maxR = r;

System.out.println(maxIndex + ":" + maxR);

The above program would print out:


So, the maximum value is generated at position 6, and the value is 8.

Faster calculation in Java 8

This can be calculated faster using Java 8’s new Arrays.parallelPrefix() operation. Instead of the loop in the end, just write:

Arrays.parallelPrefix(array, Integer::sum);

Which is awesome, as it can run faster than the sequential O(M+N) solution. Read up about prefix sums here.

Now show me the promised SQL code

In SQL, the naive sequential and linear complexity solution can easily be re-implemented, and I’m showing a solution for PostgreSQL.

How can we do it? We’re using a couple of features here. First off, we’re using common table expressions (also known as the WITH clause). We’re using these to declare table variables. The first variable is the op table, which contains our operation instructions, like in Java:

  op (p, q, r) AS (
      (4, 12, 2),
      (2,  8, 3),
      (6,  7, 1),
      (3,  7, 2)

This is trivial. We’re essentially just generating a couple of example values.

The second table variable is the signal table, where we use the previously described optimisation of putting a +r signal at all p positions, and a -r signal at all q + 1 positions:

  signal(x, r) AS (
    SELECT p, r
    FROM op
    SELECT q + 1, -r
    FROM op

When you run


you would simply get:

x   r
2   3
3   2
4   2
6   1
8  -2
8  -1
9  -3
13 -2

All we need to do now is calculate a running total (which is essentially the same as a prefix sum) as follows:

FROM signal 
x   r
2   3
3   5
4   7
6   8
8   5
8   5
9   2
13  0

Now just find the max value for r, and we’re all set. We’ll take the shortcut by using ORDER BY and LIMIT:

FROM signal 

And we’re back with:

x   r
6   8

Perfect! Here’s the full query:

  op (p, q, r) AS (
      (4, 12, 2),
      (2,  8, 3),
      (6,  7, 1),
      (3,  7, 2)
  signal(x, r) AS (
    SELECT p, r
    FROM op
    SELECT q + 1, -r
    FROM op
FROM signal 

Can you beat the conciseness of this SQL solution? I bet you can’t. Challengers shall write alternatives in the comment section.

Thrilled about the SQL here? Read about how to calculate a subset sum in Oracle SQL.

Improve Your JUnit Experience with this Annotation

JUnit is probably part of 90% of all Java projects. And the exciting thing is, we’ll soon have JUnit 5 with Java 8 support. We’ve blogged about an improvement recently.

Back in JUnit 4 land, there’s this little trick that I can only recommend you put in all of your unit tests. Just add this little annotation here and you’ll be much more happy:

class MyTests {

What does it do? It’s simple. It fixes JUnit’s weird default of not defaulting to any testing order. Sure, not having any order in your tests might help accidentally discover some evil test inter-dependency. But usually, when you’re looking for individual tests and results, e.g. in your IDE, it’s just much much better to be able to visually scan the test suite and find the right method.

E.g. what do you prefer? This?


Or this?


Exactly. Finally, a useful annotation. Just put the following everywhere and help make this a slightly better world:

class MyTests {

10 Easy Steps to a Complete Understanding of SQL

Too many programmers think SQL is a bit of a beast. It is one of the few declarative languages out there, and as such, behaves in an entirely different way from imperative, object-oriented, or even functional languages (although, some say that SQL is also somewhat functional).

As a SQL trainer (do visit our training, it’s great!) I’m writing SQL every day and embracing SQL with our company’s Open Source library jOOQ.

jOOQ: The best way to write SQL in Java

I thus feel compelled to bring the beauty of SQL a bit closer to those of you still struggling with it. The following tutorial is destined for

  • readers who have already worked with SQL but never completely understood it
  • readers who know SQL well but have never really thought about its syntax
  • readers who want to teach SQL to others

This tutorial will focus on SELECT statements only. Other DML statements will be covered in another tutorial.

Note: This tutorial was previously published exclusively on Tech.Pro (see a historic version here). Unfortunately, Tech.Pro went offline. With the permission of Tech.Pro, we’re re-publishing this content again on the jOOQ blog.

Here are…

10 Easy Steps to a Complete Understanding of SQL.

1. SQL is declarative

Get this into your head first. Declarative. The only paradigm where you “just” declare the nature of the results that you would like to get. Not how your computer shall compute those results. Isn’t that wonderful?

SELECT first_name, last_name 
FROM employees 
WHERE salary > 100000

Easy to understand. You don’t care where employee records physically come from. You just want those that have a decent salary.

What do we learn from this?

So if this is so simple, what’s the problem? The problem is that most of us intuitively think in terms of imperative programming. As in: “machine, do this, and then do that, but before, run a check and fail if this-and-that”. This includes storing temporary results in variables, writing loops, iterating, calling functions, etc. etc.

Forget about all that. Think about how to declare things. Not about how to tell the machine to compute things.

2. SQL syntax is not “well-ordered”

A common source of confusion is the simple fact that SQL syntax elements are not ordered in the way they are executed. The lexical ordering is:

  • FROM

For simplicity, not all SQL clauses are listed. This lexical ordering differs fundamentally from the logical order (which may again differ from the order of execution, depending on the optimiser choices):

  • FROM

There are three things to note:

  1. FROM is the first clause, not SELECT. The first thing that happens is loading data from the disk into memory, in order to operate on such data.
  2. SELECT is executed after most other clauses. Most importantly, after FROM and GROUP BY. This is important to understand when you think you can reference stuff that you declare in the SELECT clause from the WHERE clause. The following is not possible:

    SELECT A.x + A.y AS z
    FROM A
    WHERE z = 10 -- z is not available here!

    If you wanted to reuse z, you have two options. Either repeat the expression:

    SELECT A.x + A.y AS z
    FROM A
    WHERE (A.x + A.y) = 10

    … or you resort to derived tables, common table expressions, or views to avoid code repetition. See examples further down.

  3. UNION is placed before ORDER BY in both lexical and logical ordering. Many people think that each UNION subselect can be ordered, but according to the SQL standard and most SQL dialects, that is not true. While some dialects allow for ordering subqueries or derived tables, there is no guarantee that such ordering will be retained after a UNION operation

Note, not all databases implement things the same way. Rule number 2, for instance, does not apply exactly in the above way to MySQL, PostgreSQL, and SQLite.

What do we learn from this?

Always remember both the lexical order and the logical order of SQL clauses to avoid very common mistakes. If you understand that distinction, it will become very obvious why some things work and others don’t.

Of course, it would have been nice if the language was designed in a way that the lexical order actually reflected the logical order, as it is implemented in Microsoft’s LINQ.

3. SQL is about table references

Because of the difference between lexical ordering and logical ordering, most beginners are probably tricked into thinking that column values are the first-class citizens in SQL. They are not. The most important things are table references.

The SQL standard defines the FROM clause as such:

<from clause> ::= 
    FROM <table reference> 
        [ { <comma> <table reference> }... ]

The “output” of the FROM clause is a combined table reference of the combined degree of all table references. Let’s digest this, slowly.

FROM a, b

The above produces a combined table reference of the degree of a + the degree of b. If a has 3 columns and b has 5 columns, then the “output table” will have 8 (3 + 5) columns.

The records contained in this combined table reference are those of the cross product / cartesian product of a x b. In other words, each record of a is paired with each record of b. If a has 3 records and b has 5 records, then the above combined table reference will produce 15 records (3 x 5).

This “output” is “fed” / “piped” into the GROUP BY clause (after filtering in the WHERE clause), where it is transformed into a new “output”. We’ll deal with that later on.

If we’re looking at these things from a relational algebra / set theory perspective, a SQL table is a relation or a set of tuples. And each SQL clause will transform one or several relations in order to produce new relations.

What do we learn from this?

Always think in terms of table references to understand how data is “pipelined” through your SQL clauses.

4. SQL table references can be rather powerful

A table reference is something rather powerful. A simple example of their power is the JOIN keyword, which is actually not part of the SELECT statement, but part of a “special” table reference. The joined table, as defined in the SQL standard (simplified):

<table reference> ::=
    <table name>
  | <derived table>
  | <joined table>

If we take again the example from before:

FROM a, b

a can be a joined table as such:

a1 JOIN a2 ON a1.id = a2.id

Expanding this into the previous expression, we’d get:

FROM a1 JOIN a2 ON a1.id = a2.id, b

While it is discouraged to combine the comma-separated list of table references syntax with the joined table syntax, you can most certainly do this. The resulting, combined table reference will now have a degree of a1+a2+b.

Derived tables are even more powerful than joined tables. We’ll get to that.

What do we learn from this?

Always, always think in terms of table references. Not only is this important to understand how data is “pipelined” through your SQL clauses (see previous section), it will also help you understand how complex table references are constructed.

And, importantly, understand that JOIN is a keyword for constructing joined tables. Not a part of the SELECT statement. Some databases allow for using JOIN in INSERT, UPDATE, DELETE

5. SQL JOIN tables should be used rather than comma-separated tables

Before, we’ve seen this clause:

FROM a, b

Advanced SQL developers will probably tell you that it is discouraged to use the comma-separated list at all, and always fully express your JOINs. This will help you improve readability of your SQL statement, and thus prevent mistakes.

One very common mistake is to forget a JOIN predicate somewhere. Think about the following:

FROM a, b, c, d, e, f, g, h
WHERE a.a1 = b.bx
AND a.a2 = c.c1
AND d.d1 = b.bc
-- etc...

The join table syntax is both

  • Safer, as you can place join predicates close to the joined tables, thus preventing mistakes.
  • More expressive, as you can distinguish between OUTER JOIN, INNER JOIN, etc.

What do we learn from this?

Always use JOIN. Never use comma-separated table references in your FROM clauses.

6. SQL’s different JOIN operations

JOIN operations essentially come with five flavours:


These terms are commonly used in relational algebra. SQL uses different terms for the above concepts, if they exist at all. Let’s have a closer look:


This is the most common JOIN operation. It has two sub-flavours:

  • INNER JOIN (or just JOIN)
  • OUTER JOIN (further sub-flavoured as LEFT, RIGHT, FULL OUTER JOIN)

The difference is best explained by example:

-- This table reference contains authors and their books.
-- There is one record for each book and its author.
-- authors without books are NOT included
author JOIN book ON author.id = book.author_id

-- This table reference contains authors and their books
-- There is one record for each book and its author.
-- ... OR there is an "empty" record for authors without books
-- ("empty" meaning that all book columns are NULL)
author LEFT OUTER JOIN book ON author.id = book.author_id



This relational concept can be expressed in two ways in SQL: Using an IN predicate, or using an EXISTS predicate. “Semi” means “half” in latin. This type of join is used to join only “half” of a table reference. What does that mean? Consider again the above joining of author and book. Let’s imagine that we don’t want author/book combinations, but just those authors who actually also have books. Then we can write:

-- Using IN
FROM author
WHERE author.id IN (SELECT book.author_id FROM book)

-- Using EXISTS
FROM author
WHERE EXISTS (SELECT 1 FROM book WHERE book.author_id = author.id)

While there is no general rule as to whether you should prefer IN or EXISTS, these things can be said:

  • IN predicates tend to be more readable than EXISTS predicates
  • EXISTS predicates tend to be more expressive than IN predicates (i.e. it is easier to express very complex SEMI JOIN)
  • There is no formal difference in performance. There may, however, be a huge performance difference on some databases.

Because INNER JOIN also produces only those authors that actually have books, many beginners may think that they can then remove duplicates using DISTINCT. They think they can express a SEMI JOIN like this:

-- Find only those authors who also have books
SELECT DISTINCT first_name, last_name
FROM author
JOIN book ON author.id = book.author_id

This is very bad practice for two reasons:

  • It is very slow, as the database has to load a lot of data into memory, just to remove duplicates again.
  • It is not entirely correct, even if it produces the correct result in this simple example. But as soon as you JOIN more table references, you will have a very hard time correctly removing duplicates from your results.

Some more information about abuse of DISTINCT can be seen in this blog post.


This relational concept is just the opposite of a SEMI JOIN. You can produce it simply by adding a NOT keyword to the IN or EXISTS predicates. An example, where we’ll select those authors who do not have any books:

-- Using IN
FROM author
WHERE author.id NOT IN (SELECT book.author_id FROM book)

-- Using EXISTS
FROM author
WHERE NOT EXISTS (SELECT 1 FROM book WHERE book.author_id = author.id)

The same rules with respect to performance, readability, expressivity apply. However, there is a small caveat with respect to NULLs when using NOT IN, which is a bit out of scope for this tutorial.


This produces a cross product of the two joined table references, combining every record of the first table reference with every record of the second table reference. We have seen before, that this can be achieved with comma-separated table references in the FROM clause. In the rare cases where this is really desired, you can also write a CROSS JOIN explicitly, in most SQL dialects:

-- Combine every author with every book
author CROSS JOIN book


The relational division is really a beast of its own breed. In short, if JOIN is multiplication, division is the inverse of JOIN. Relational divisions are very tough to express in SQL. As this is a beginners’ tutorial, explaining it is out of scope. For the brave among you, read on about it here, here, and here.

What do we learn from this?

A lot. Again, let’s hammer this into our heads. SQL is about table references. Joined tables are quite sophisticated table references. But there is a difference in relational-speak and SQL-speak. Not all relational join operations are also formal SQL join operations. With a bit of practice and knowledge about relational theory, you will always be able to choose the right type of relational JOIN and be able to translate it to the correct SQL.

7. SQL’s derived tables are like table variables

Before, we’ve learned that SQL is a declarative language, and as such, variables do not have a place (they do in some SQL dialects, though). But you can write something like variables. And those beasts are called derived tables.

A derived table is nothing but a subquery wrapped in parentheses.

-- A derived table

Note that some SQL dialects require derived tables to have a correlation name (also known as alias).

-- A derived table with an alias
FROM (SELECT * FROM author) a

Derived tables are awesome when you want to circumvent the problems caused by the logical ordering of SQL clauses. For instance, if you want to reuse a column expression in both the SELECT and the WHERE clause, just write (Oracle dialect):

-- Get authors' first and last names, and their age in days
SELECT first_name, last_name, age
  SELECT first_name, last_name, current_date - date_of_birth age
  FROM author
-- If the age is greater than 10000 days
WHERE age > 10000

Note that some databases, and the SQL:1999 standard have taken derived tables to the next level, introducing common table expressions. This will allow you to reuse the same derived table several times within a single SQL SELECT statement. The above query would then translate to the (almost) equivalent:

  SELECT first_name, last_name, current_date - date_of_birth age
  FROM author
WHERE age > 10000

Obviously, you could also externalise “a” into a standalone view for even broader reuse of common SQL subselects. Read more about views here.

What do we learn from this?

Again, again, again. SQL is mostly about table references, not columns. Make use of them. Don’t be afraid of writing derived tables or other complex table references.

8. SQL GROUP BY transforms previous table references

Let’s reconsider our previous FROM clause:

FROM a, b

And now, let’s apply a GROUP BY clause to the above combined table reference

GROUP BY A.x, A.y, B.z

The above produces a new table reference with only three remaining columns (!). Let’s digest this again. If you apply GROUP BY, then you reduce the number of available columns in all subsequent logical clauses – including SELECT. This is the syntactical reason why you can only reference columns from the GROUP BY clause in the SELECT clause.

  • Note that other columns may still be available as arguments of aggregate functions:

    SELECT A.x, A.y, SUM(A.z)
    FROM A
    GROUP BY A.x, A.y
  • Note that MySQL, unfortunately, doesn’t adhere to this standard, causing nothing but confusion. Don’t fall for MySQL’s tricks. GROUP BY transforms table references. You can thus only reference columns also referenced in the GROUP BY clause.

What do we learn from this?

GROUP BY, again, operates on table references, transforming them into a new form.

9. SQL SELECT is called projection in relational algebra

I personally like the term “projection”, as it is used in relational algebra. Once you’ve generated your table reference, filtered it, transformed it, you can step to projecting it to another form. The SELECT clause is like a projector. A table function making use of a row value expression to transform each record from the previously constructed table reference into the final outcome.

Within the SELECT clause, you can finally operate on columns, creating complex column expressions as parts of the record / row.

There are a lot of special rules with respect to the nature of available expressions, functions, etc. Most importantly, you should remember these:

  1. You can only use column references that can be produced from the “output” table reference
  2. If you have a GROUP BY clause, you may only reference columns from that clause, or aggregate functions.
  3. You can use window functions instead of aggregate functions, when you don’t have a GROUP BY clause.
  4. If you don’t have a GROUP BY clause, you must not combine aggregate functions with non-aggregate functions.
  5. There are some rules with respect to wrapping regular functions in aggregate functions and vice-versa.
  6. There are …

Well, there are lots of complex rules. They could fill yet another tutorial. For instance, the reason why you cannot combine aggregate functions with non-aggregate functions in the projection of a SELECT statement without GROUP BY clause (rule number 4) is this:

  1. It doesn’t make sense. Intuitively.
  2. If intuition doesn’t help (it hardly does, with a SQL beginner), then syntax rules do. SQL:1999 introduced GROUPING SETS, and SQL:2003 introduced empty grouping sets: GROUP BY (). Whenever an aggregate function is present, and there is no explicit GROUP BY clause, an implicit, empty GROUPING SET is applied (rule number 2). Hence, the original rules about logical ordering aren’t exactly true anymore, and the projection (SELECT) influences the outcome of a logically preceding, yet lexically succeeding clause (GROUP BY).

Confused? Yes. Me too. Let’s get back to simpler things.

What do we learn from this?

The SELECT clause may be one of the most complex clauses in SQL, even if it appears so simple. All other clauses just “pipe” table references from one to another. The SELECT clause messes up the beauty of these table references, by completely transforming them, applying some rules to them retroactively.

In order to understand SQL, it is important to understand everything else first, before trying to tackle SELECT. Even if SELECT is the first clause in lexical ordering, it should be the last.

10. SQL DISTINCT, UNION, ORDER BY, and OFFSET are simple again

After the complicated SELECT, we can get back to simple things again:

  • Set operations (DISTINCT and UNION)
  • Ordering operations (ORDER BY, OFFSET .. FETCH)

Set operations

Set operations operate on “sets”, which are actually nothing other than… tables. Well, almost. Conceptually, they’re easy to understand.

  • DISTINCT removes duplicates after the projection.
  • UNION concatenates two subselects and removes duplicates
  • UNION ALL concatenates two subselects retaining duplicates
  • EXCEPT removes records from the first subselect that are also contained in the second subselect (and then removes duplicates)
  • INTERSECT retains only records contained in both subselects (and then removes duplicates)

All of this removing duplicates is usually non-sense. Most often, you should just use UNION ALL, when you want to concatenate subselects.

Ordering operations

Ordering is not a relational feature. It is a SQL-only feature. It is applied at the very end of both lexical ordering and logical ordering of your SQL statement. Using ORDER BY and OFFSET .. FETCH is the only way to guarantee that records can be accessed by index in a reliable way. All other ordering is always arbitrary and random, even if it may appear to be reproducible.

OFFSET .. FETCH is only one syntax variant. Other variants include MySQL’s and PostgreSQL’s LIMIT .. OFFSET, or SQL Server’s and Sybase’s TOP .. START AT. A good overview of various ways to implement OFFSET .. FETCH can be seen here.

Let’s get to work

As with every language, SQL takes a lot of practice to master. The above 10 simple steps will help you make more sense of the every day SQL that you’re writing. On the other hand, it is also good to learn from common mistakes. The following two articles list lots of common mistakes Java (and other) developers make when writing SQL:

Ceylon Might Just be the Only (JVM) Language that Got Nulls Right

Here we go again. THAT TOPIC.

But hang on. The approach discussed here (and in the Ceylon language) is not something you see every day. At the same time, it is very cunning.

Nulls are baked into the language

… or so it may seem. Indeed, in Ceylon, like in Kotlin (and possibly many other languages), there is a special type “annotation” that you can postfix to any reference type in order to make it nullable. For instance:

String firstName = "Homer";
String? middleName = "J";
String lastName = "Simpson";

In the above example, both firstName and lastName are mandatory values that can never be null, whereas middleName is optional. Most languages that support the above then ship with special operators to access the optional value, e.g. ?. in Ceylon and also in Kotlin.

// Another optional value:
Integer? length = middleName?.length;

// A non-optional value:
Integer length = middleName?.length else 0;

So, what is it about Ceylon that works so smoothly?

The thing that Ceylon got very right is the fact that all of the above is just syntactic sugar that is:

  • Easy to use
  • Maps well to our mindset, where null still is a thing
  • Can interoperate with Java
  • Doesn’t introduce cognitive friction

For us Java folks, we can still pretend that null is an OK-ish, hard to avoid thing (as we’ve claimed before on this blog). But what is null really? Is it the absent value? The unknown value? The uninitialised value?

Java only has one null thingy, and it is (ab-)used for all of the previous things, and more, when in theory, it is only really the uninitialised value, nothing more. On the other hand, when working with JDBC (and thus, SQL), it implicitly means the unknown value (with all the related caveats).

In Ceylon, however, Null is a special type, similar to Void in Java. The only value that can be assigned to the Null type is null:

// Ceylon
Null x = null;

// Java
Void x = null;

But the big difference is, null cannot be assigned to any other type! Wait. Couldn’t we assign null to String? … ? Of course, the following is possible in Ceylon:

String? x = null;

But why is this possible? Because String? is just syntax sugar for String|Null, a union type, i.e. a type that is either the String type or the Null type.

Huh, what are union types?

Let’s look at this more closely. When in the jOOQ API you want to work with SQL functions and expressions, there is always a great set of overloads that provide you with a standard version, and a convenience version where you can pass a bind variable. Take the equals operator, for instance:

interface Field<T> {
    Condition eq(Field<T> field);
    Condition eq(T value);

The above overloads allow you for writing things like the following, without needing to think about the distinction between a SQL expression and a Java bind variable (which is ultimately also a SQL expression):

// Comparing a column with bind variable

// Comparing a column with another column expression

In fact, there are even more overloads, because the right hand side of a comparison operation can have other expressions as well, for instance:

interface Field<T> {
    Condition eq(Field<T> field);
    Condition eq(T value);
    Condition eq(Select<? extends Record1<T>> query);
    Condition eq(QuantifiedSelect<? extends Record1<T>> query);

Now, the same set of overloads needs to be repeated for not equals, greater than, greater or equal, etc. Wouldn’t it be nice to be able to express this “right-hand-side” thingy as a single, reusable type? I.e. a union type of all of the above types?

interface Field<T> {
    Condition eq(
      | T
      | Select<? extends Record1<T>>
      | QuantifiedSelect<? extends Record1<T>> thingy

Or even

// This is called a type alias. Another awesome
// Ceylon language feature (pseudo syntax)
alias Thingy => 
  | T
  | Select<? extends Record1<T>>
  | QuantifiedSelect<? extends Record1<T>>;

interface Field<T> {
    Condition eq(Thingy thingy);

After all, that’s also how the SQL language is defined. Heck, that’s how any BNF notation defines syntactic elements. For instance:

<predicate> ::=
    <comparison predicate>
  | <between predicate>
  | <in predicate>
  | <like predicate>
  | <null predicate>
  | <quantified comparison predicate>
  | <exists predicate>
  | <unique predicate>
  | <match predicate>
  | <overlaps predicate>

OK, granted, a syntactic element is not strictly the same thing as a type, but the intuitive perception is the same.

Oh, and Java has union types, too!

In a brief flash of revelation, the Java 7 expert groups added support for union types in exception handling. You can write things like:

try {
catch (IOException | SQLException e) {
    // e can be any of the above!

And you can emulate union types with generics, which don’t support union types but intersection types in Java.

Back to Ceylon and NULL

Ceylon has gotten Null right. Because, historically, a nullable type is a type that can be the “real” type or the “null” value. We want that. We Java developers crave that. We cannot live without the soothing option of this kind of optional.

But the excellent thing about this approach is that it is extendable. What if I really need to distinguish between “unknown”, “uninitialised”, “undefined”, “42”? I can. Using types. Here’s a String that can model all of the aforementioned “special values”:


And if that’s too verbose, I just assign a name to it

interface TheStringToRuleThemAll
  => String|Unknown|Uninitialised|Undefined|FortyTwo;

But it cannot be Null. Because I don’t want it to be that value, that is everything and nothing. Are you convinced? I bet you are. From now on:

Don’t trust any language that pretends that the Option(al) monad is a decent approach at modelling null. It isn’t.

― me. Just now

Why? Let me illustrate. Kotlin/Ceylon/Groovy style syntax sugar using the elvis operator (regardless of the backing null semantics):

String name = bob?.department?.head?.name

Same thing with Optional monads:

Optional<String> name = bob


Some people claim

Using union types is like driving around in a brand new Ferrari with your mother-in-law in the passenger seat.

by Elvira

Sure. But I claim: Well done, Ceylon. Let’s hope we’ll get union types in Java, too, outside of catch blocks!

Further reading

Liked this article? How about:

Java 10’s new Local-Variable Type Inference

News could hardly get more exciting than this, for a programming language aficionado!

There is now a JEP 286 for Local-Variable Type Inference with status “Candidate”. And a request for feedback by Brian Goetz, which I would love to invite you to participate in:

Please do so, the survey remains open only from March 9 to March 16!

This is not a feature that will be implemented. It might be implemented. Hence, there is no specific Java version yet, which is why I name the Java version “A” (for Awesome).

What is local-variable type inference and why is it good?

Let’s have a look at a feature that various other languages have had for quite a while. In this blog post, I’d like to discuss the general idea, not the possibly specific implementation that might be planned for Java, as that would be too early, and I certainly don’t have the big picture of how this fits into Java.

In Java, as well as in some other languages, types are always declared explicitly and verbosely. For instance, you write things like:

// Java 5 and 6
List<String> list = new ArrayList<String>();

// Java 7
List<String> list = new ArrayList<>();

Notice how in Java 7, some syntax sugar was added via the useful diamond operator <>. It helps removing unnecessary redundancy in the Java way, i.e. by applying “target-typing”, which means the type is defined by the “target”. Possible targets are:

  • Local variable declarations
  • Method arguments (both from the outside and from the inside of the method)
  • Class members

Since in many cases, the target type MUST be declared explicitly (method arguments, class members), Java’s approach makes a lot of sense. In the case of local variables, however, the target type doesn’t really need to be declared. Since the type definition is bound to a very local scope, from which it cannot escape, it may well be inferred by the compiler without the source code ever being explicit about it, from the “source type”. This means, we will be able to do things like:

// Java 10 as suggested in the JEP

// infers ArrayList<String>
var list = new ArrayList<String>();

// infers Stream<String>
val stream = list.stream();

In the above example var stands for a mutable (non-final) local variable, whereas val stands for an immutable (final) local variable. Notice how the type of list was never really needed, just as when we write the following, where the type is already inferred today:

stream = new ArrayList<String>().stream();

This will work no different from lambda expressions, where we already have this kind of type inference in Java 8:

List<String> list = new ArrayList<>();

// infers String
list.forEach(s -> {

Think of lambda arguments as local variables. An alternative syntax for such a lambda expression might have been:

List<String> list = new ArrayList<>();

// infers String
list.forEach((val s) -> {

Other languages have this, but is it good?

Among these other languages: C# and Scala and JavaScript, if you will ;). YAGNI is probably an common reaction to this feature. For most people, it’s mere convenience to be able not to type all types all the time. Some people might prefer to see the type explicitly written down, when reading code. Especially, when you have a complex Java 8 Stream processing pipeline, it can get hard to track all the types that are inferred along the way. An example of this can be seen in our article about jOOλ’s window function support:

BigDecimal currentBalance = new BigDecimal("19985.81");
    tuple(9997, "2014-03-18", new BigDecimal("99.17")),
    tuple(9981, "2014-03-16", new BigDecimal("71.44")),
    tuple(9979, "2014-03-16", new BigDecimal("-94.60")),
    tuple(9977, "2014-03-16", new BigDecimal("-6.96")),
    tuple(9971, "2014-03-15", new BigDecimal("-65.95")))
    .comparing((Tuple3<Integer, String, BigDecimal> t) 
        -> t.v1, reverseOrder())
    .thenComparing(t -> t.v2), Long.MIN_VALUE, -1)
.map(w -> w.value().concat(
     currentBalance.subtract(w.sum(t -> t.v3)

The above implements a running total calculation that yields:

|   v0 | v1         |     v2 |       v3 |
| 9997 | 2014-03-18 |  99.17 | 19985.81 |
| 9981 | 2014-03-16 |  71.44 | 19886.64 |
| 9979 | 2014-03-16 | -94.60 | 19815.20 |
| 9977 | 2014-03-16 |  -6.96 | 19909.80 |
| 9971 | 2014-03-15 | -65.95 | 19916.76 |

While the Tuple3 type needs to be declared because of the existing Java 8’s limited type inference capabilities (see also this article on generalized target type inference), are you able to track all the other types? Can you easily predict the result? Some people prefer the short style, others claim:

On the other hand, do you like to manually write down a type like Tuple3<Integer, String, BigDecimal>? Or, when working with jOOQ, which of the following versions of the same code do you prefer?

// Explicit typing
// ----------------------------------------
for (Record3<String, Integer, Date> record : ctx
) {
    // Do things with record
    String title = record.value1();

// "Don't care" typing
// ----------------------------------------
for (Record record : ctx
) {
    // Do things with record
    String title = record.getValue(0, String.class);

// Implicit typing
// ----------------------------------------
for (val record : ctx
) {
    // Do things with record
    String title = record.value1();

I’m sure that few of you would really like to explicitly write down the whole generic type, but if your compiler can still remember the thing, that would be awesome, wouldn’t it? And it’s an opt-in feature. You can always revert to explicit type declarations.

Edge-cases with use-site variance

There are some things that are not possible without this kind of type inference, and they’re related to use-site variance and the specifics of generics as implemented in Java. With use-site variance and wild cards, it is possible to construct “dangerous” types that cannot be assigned to anything because they’re undecidable. For details, please read Ross Tate’s paper on Taming Wildcards in Java’s Type System.

Use-site variance is also a pain when exposed from method return types, as can be seen in some libraries that either:

  • Didn’t care about this pain they’re inflicting on their users
  • Didn’t find a better solution as Java doesn’t have declaration-site variance
  • Were oblivious to this issue

An example:

interface Node {
    void add(List<? extends Node> children);
    List<? extends Node> children();

Imagine a tree data structure library, where tree nodes return lists of their children. A technically correct children type would be List<? extends Node> because the children are Node subtypes, and it is perfectly OK to use a Node subtype list.

Accepting this type in the add() method is great from an API design perspective. It allows people to add a List<LeafNode>, for instance. Returning it from children() is horrible, though, because the only options are now:

// Raw type. meh
List children = parent.children();

// Wild card. meh
List<?> children = parent.children();

// Full type declaration. Yuk
List<? extends Node> children = parent.children();

With JEP 286, we might be able to work around all of this and have this nice fourth option:

// Awesome. The compiler knows it's 
// List<? extends Node>
val children = parent.children();


Local Variable Type Inference is a hot topic. It’s entirely optional, we don’t need it. But it makes a lot of things much much easier, especially when working with tons of generics. We’ve seen that type inference is a killer feature when working with lambda expressions and complex Java 8 Stream transformations. Sure, it will be harder to track all the types across a long statement, but at the same time, if those types were spelled out, it would make the statement very unreadable (and often also very hard to write).

Type inference helps make developers more productive without giving up on type safety. It actually encourages type safety, because API designers are now less reluctant to expose complex generic types to their users, as users can use these types more easily (see again the jOOQ example).

In fact, this feature is already present in Java in various situations, just not when assigning a value to a local variable, giving it a name.

Whatever your opinion is: Do make sure to share it to the community and answer this survey:

Looking forward to Java 10.

SQL JOIN or EXISTS? Chances Are, You’re Doing it Wrong

I’ve noticed this very consistently with a lot of customers, and also with participants of our Data Geekery SQL Workshop (which I highly recommend to everyone, if you excuse the advertising): A lot of developers get the distinction between JOIN and SEMI-JOIN wrong. Let me explain…

What are JOIN and SEMI-JOIN

A little bit of relational algebra first. What is an (INNER) JOIN? An JOIN is nothing but a filtered cartesian product. And what is a cartesian product? Wikipedia explains this very nicely:

for sets A and B, the Cartesian product A × B is the set of all ordered pairs (a, b) where a ∈ A and b ∈ B.

That was the technical way of putting it. A more understandable way might be the following:

ranks = {A, K, Q, J, 10, 9, 8, 7, 6, 5, 4, 3, 2}
suits = {♠, ♥, ♦, ♣}

so, ranks × suits =
{(A, ♠), (A, ♥), (A, ♦), (A, ♣), (K, ♠),
(3, ♣), (2, ♠), (2, ♥), (2, ♦), (2, ♣)}

Or, as an image:

By Trainler - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=7104281
By Trainler – Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=7104281

The above cartesian product models the combination of each rank with each suite. Simple, right?

In SQL, a cartesian product can be written as either a CROSS JOIN, or a table list in the FROM clause. The following query combines every customer with every staff member:

SELECT c.last_name, s.last_name
FROM customer AS c

-- Table list
SELECT c.last_name, s.last_name
FROM customer AS c,
     staff AS s

Now, as I mentioned before, an (INNER) JOIN is nothing but a filtered CROSS JOIN, where the filter is applied in a dedicated USING or ON clause.

SELECT c.last_name, s.last_name
FROM customer AS c
INNER JOIN staff AS s 
  USING (last_name)

SELECT c.last_name, s.last_name
FROM customer AS c
INNER JOIN staff AS s 
  ON c.last_name = s.last_name

The above query will match only those customers with those users whose last_name are the same. As I’ve told you before, an (INNER) JOIN is just a filtered CROSS JOIN, so the below queries will be semantically equivalent to the above:

SELECT c.last_name, s.last_name
FROM customer AS c
WHERE c.last_name = s.last_name

-- Table list
SELECT c.last_name, s.last_name
FROM customer AS c,
     staff AS s
WHERE c.last_name = s.last_name

Specifically the last version is still used in many SQL codebases, which have not yet migrated to the ANSI JOIN syntax (even if ANSI joins should be preferred for readability reasons).

But that might be wrong

Unfortunately, I’m seeing this mistake all the time, as I’ve mentioned before. JOIN might appear like a useful tool to match rows between tables. But remember one thing, and I’m starting to repeat myself:

(INNER) JOIN is just a filtered CROSS JOIN

This means that if you choose INNER JOIN to find those customers for which there are matching staff, you will create a cartesian product between customer and staff, and then apply a filter. Why is that a problem? Let’s assume the following:

| first_name | last_name |
| John       | Doe       |
| Alice      | Miller    |
| Max        | Doe       |

| first_name | last_name |
| John       | Doe       |
| Alice      | Peterson  |
| Jane       | Doe       |

What happens when you run the above queries that use (INNER) JOIN to match customers with staff? Exactly. You’ll form a cartesian product first:

{ (John Doe, John Doe),
  (John Doe, Alice Peterson),
  (John Doe, Jane Doe),
  (Alice Miller, John Doe),
  (Alice Miller, Alice Peterson),
  (Alice Miller, Jane Doe),
  (Max Doe, John Doe),
  (Max Doe, Alice Peterson),
  (Max Doe, Jane Doe) }

… and then filter out the tuples that shouldn’t be in the result, i.e. the ones that don’t have matching last names (of course, the database might choose to optimise this and not materialise the entire cross product):

{ (John Doe, John Doe),
  (John Doe, Jane Doe),
  (Max Doe, John Doe),
  (Max Doe, Jane Doe) }

We’re now left with 4 tuples. That’s great, if that’s what you were after in the first place. A combination of all customers with all staff, for which the combination shares the same last name. But maybe you were asking yourself something else, namely:

Do we have any customers who are staff family members?

Use-case: Exclude such customers from a raffle (let’s assume that last names are a sufficient criteria here).

In that case, we’ll get “duplicate” records. Because the query that some of you might’ve written would have been:

FROM customer AS c
INNER JOIN staff AS s 
  USING (last_name)


| first_name | last_name |
| John       | Doe       |
| John       | Doe       |
| Max        | Doe       |
| Max        | Doe       |

Bummer. How to remove duplicates? With DISTINCT you might think:

FROM customer AS c
INNER JOIN staff AS s 
  USING (last_name)


| first_name | last_name |
| John       | Doe       |
| Max        | Doe       |

What’s wrong with DISTINCT?

Using DISTINCT in this situation is a big mistake. Why?

  • Your accidental cartesian product loads too many records from disk, and produces too many records in memory, which have to be removed again
  • DISTINCT can be expensive in some databases, that implement it via sorting, rather than via hashing
  • DISTINCT may change the semantics of your SELECT clause, with nasty side-effects
  • In order to prevent those side-effects, you might even resort to wrapping this DISTINCT query in a subselect, making performance even worse

That’s horrible. See also this list of common SQL mistakes:

How to do it right?

By using a SEMI-JOIN. It is called semi join (i.e. “half” join) in relational algebra, because we only care about one side of the JOIN operation in the results, not the other side. In this example, we only care about customers in the result. We don’t want to have any staff records. The relational algebra notation would be

Customer ⋉ Staff

Unfortunately, SQL doesn’t have SEMI JOIN keywords, so the following isn’t possible:

FROM customer AS c
  USING (last_name)

The SQL way to express a SEMI JOIN is by using EXISTS () or IN (). The following two are equivalent:

-- Using EXISTS
FROM customer AS c
  FROM staff AS s
  WHERE c.last_name = s.last_name

-- Using IN
FROM customer
WHERE last_name IN (
  SELECT last_name
  FROM staff

(Note, that NOT EXISTS and NOT IN are NOT equivalent)

Not only are these queries more correct, they are also much faster in most SQL databases for a simple reason. The database can stop searching for staff as soon as it has encountered at least one staff for which there is a matching customer. This is also nicely explained in Dan Martensen’s article SQL Performance of JOIN and WHERE EXISTS. And we’ve blogged about a related topic here: SQL Tip of the Day: Be Wary of SELECT COUNT(*).

Semi Join and Anti Join in jOOQ

We believe that these useful relational operators should be first class citizens in SQL as we have stated in our blog post:

Semi join


Anti join


The above is much easier to write, and will transform into the corresponding (NOT) EXISTS predicate.


There are some databases that may unfortunately show worse performance for some of these semi join / anti join operators. See, for instance this outdated article on MySQL performance:

Do measure first, before you believe any of these articles, though!

Another exception is when you have a primary key / foreign key relationship that guarantees that an (INNER) JOIN produces no duplicate values, i.e. when you’re joining a one-to-one or many-to-one relationship, then JOIN is a correct solution, but it is usually equally fast, so semi join will still be more readable.


If you need to check whether you have any matches between a table A and a table B, but you only really care about the results from table A, do make sure you’re using a SEMI-JOIN (i.e. an EXISTS or IN predicate), not an (INNER) JOIN.

Further reading:

UI Developers! Choose Sensible Default Ordering!

Good decisions come from experience. Experience comes from making bad decisions.

― Mark Twain

Today, let’s look at one piece of experience and how we can turn that into good decisions when implementing UI logic. Please, all UI developers read this.

The bad decision

When UI developers display tabular data, it is very common for the table to offer sorting on each column. This is extremely useful, as it helps the user extract basic insight from the data by just performing a single click. Let’s look at an example where the choice of sorting default seemed to be correct at first, but was wrong later on:

Bing Webmaster tools for the jOOQ blog. When I reach the page traffic website to see the Bing traffic for the last month, I can see this:


The default table ordering is applied to one of the obvious columns: “Appeared in search”, descendingly. Personally, I might have preferred it to be applied to “Clicks from search” per default, but what’s important: The column is ordered descendingly. I really only care about our top 5 best blog posts. Not about the worst.

So, this is good. Let’s see what happens if I do click on “Clicks from search”, however:


I get an overview of our worst-performing blog posts for that week. Yes, there are some posts that don’t attract any audience for an entire month from Bing. Bummer. (Let’s blame it on Bing’s popularity, not on our blog’s). But that’s not what I cared about. I wanted to see the inverse: The top performing blog posts. In order to see those, I have to click again on the column, to invert the sort order.

The experience

I see this behaviour all the time. UI developers mindlessly defaulting to the technical natural sort order on UI table widgets. In many cases, as a user, this is a frustrating experience, because:

  1. Heck, what did I do? Why is it displaying this data
  2. Aaah, it is sorting ascendingly
  3. Crap, I have to click again

The cognitive dissoncance between steps 1) and 2) shouldn’t be underestimated. Depending on the complexity of the task, or the data that is being displayed, a user might first be confused before they realise that the wanted behaviour is 2 clicks away, not 1. While it should be a technical detail that there are things like ascending and descending orderings, in UIs there is a third notion: That of natural ordering.

Why do developers get this wrong so often? Simple! Because there is also a technical natural ordering, and that’s almost always the ascending order. For instance, in Java, when you do this:

TreeSet<Integer> set = new TreeSet<>();

You’ll get the nicely sorted data as such:

[1, 3, 12]

The technical natural ordering depends on the “raw” data type only. In the case of Integer, this is simply the natural ascending integer number ordering.

The UI natural ordering, however, depends on the context of a data type. While a meaningless integer might still be sorted ascendingly, the previous count value (also an integer!) should be sorted descendingly by default.

The good decision

So, are there any rules with respect to the UI natural ordering? Intuition, yes! But also the following more concrete (and far from exhaustive) list of hints:

Data types and contexts in favour of ascending natural ordering

  • names: All sorts of names of things like people, cities, countries, etc. should be ordered ascendingly in their alphabetical (case-insensitive) order. That is how people skim phone books, that’s how they expect names to be ordered
  • phone numbers: (and other similar numbers) should be ordered like integers: ascendingly. But beware of their specific formatting. It is very likely that the special characters in US formats (like (555) 123-4567) shouldn’t matter when comparing numbers (e.g. with +1-800-1112222), or with numbers from other countries
  • row numbers: This is an obvious candidate for ascending ordering because the row number itself already yields an implicit order, by which it was calculated (see also our article on SQL ROW_NUMBER())
  • dates in the future: If you know your dates are in the future, then you should order them ascendingly, as users want to see the closest date first. Think of a calendar, for instance. Do you really want to display a date in the year 8375, just because you happen to celebrate your 6394th birthday? Probably not. But if in doubt, with dates, better sort them descendingly (see below), as you usually have most dates in the past and only few dates in the future.
  • aggregations: There are few aggregations that should be sorted ascendingly by default. One of them is SQL’s MIN() aggregation. If you’re really looking for the lowest value, the lowest value should be on top, followed by higher values. Other aggregations that are OK to default to ascending order are percentiles (e.g. the MEDIAN()), or standard deviations, or linear regression functions, because it is not clear whether the user cares about the highest or the lowest value. In this case, it is OK to simply default to the technical natural ordering. Most other aggregations, however, should be sorted descendingly (see below)
  • members of a period: A period is something like a year, month, week, day. Periods come with a finite, discrete number of members, such as day of year, week of year, month (for year), day of month (for month), weekday (for week), hour, minute, second (for day). The default ordering here is obvious: always ascendingly, in order of period traversal
  • money (price): No one wants to buy the most expensive flight! Obvious, right? But be careful. Prices are expressed in money, and money isn’t always best sorted ascendingly. If in doubt, order money owed to someone ascendingly, and money owned descendingly. What a difference a little "n" makes!

Data types and contexts in favour of descending natural ordering

  • dates: This is a tricky one, but there are few occasions where dates really should be sorted ascendingly, so default to sorting them descendingly, if most dates lie in he past. The reason for this rationale is the fact that users want to see the closest date first, e.g. the most recent date of a bank account transaction.
  • aggregations: When you run SQL COUNT(*) or SUM() or AVG() or MAX() aggregations, users will really care about the highest values only, as we saw in our Bing Webmaster tools example. Please do sort these aggregations descendingly by default!
  • changes: If the change between your current value and e.g. last week’s value is the thing of interest (e.g. stock market, or again Bing Webmaster tools), then both orderings are interesting. The biggest winners / losers are both useful pieces of information. However, let’s stay positive here and order stuff descendingly by default in order to display the biggest winners first. We don’t want to be negative by default. Whether the change in percent or the absolute change is of interest is another story and depends on the domain.
  • file sizes: Probably, the user is looking for the biggest files – e.g. to see what to delete to save most disk space. Order descendingly by default. If in doubt, think of sizes as the COUNT(*) value of any content. And that should clearly be ordered descendingly.
  • booleans: When ordering a column that contains true/false information (e.g. E-Mail does or does not have any attachments), then the true information is usually more interesting. Since true = 1 and false = 0, order these columns descendingly by default.
  • money (balance): Unlike prices (money owed to someone), balances (money owned) should be ordered descendingly. We want to know how many billions we have. No one cares about their worst assets.

Data types without ordering

There are some data types that simply shouldn’t be ordered. Don’t offer ordering by default on them, it might confuse the user and it might kill your server! These include:

  • URLs: In the case of Bing Webmaster tools, there is really no point in ordering URLs. I mean, the natural order would be http vs. https first, then the domain (but not from top level domain down to the irrelevant sharding identifier), then possibly the port (completely useless piece of information for the user), then the path (probably ordered by date in blogs, but pretty random otherwise). Ordering by URLs doesn’t add value, so don’t offer it. Caveat: If you display only a URL part (e.g. the domain name), ordering might make sense.
  • text: Now, plain text (e.g. E-Mail content) is really the very last thing you want to order. Most SQL databases don’t even allow for ordering CLOB content. This should be obvious, just don’t do it.
  • composite data: If data points are structured (like age and sex in one data point, in case the combination matters for your domain), they’re very hard to order correctly. Specifically, sex doesn’t have any non-technical order. If in doubt, better don’t offer ordering, or decompose the data point.

Data types where sorting challenges mean that tables are the wrong tool

Some data types are tricky to sort by default. Mostly, this is because we’re dealing with discrete or continuous values that go in both directions of a “zero” value. E.g. numbers, percentages, dates (where zero=today):

  • dates: As we’ve seen above, dates are a bit of a tricky data type to sort in tables, as the user experience depends on whether dates are mostly in the past (like bank account transactions) or mostly in the future (like appointments), or both (like calendar entries). A much better UI widget to display timelines that expand both into the past and the future are .. well .. timelines, which cannot be sorted by the user. They’re always ordered by date, displaying today’s date by default
  • percentages: If percentages are the most interesting data point in a data set (e.g. stock option changes), then chances are, that the value 0.00% is the center of your data, e.g. in a winners/losers display widget. While they’re the center of your data, they’re not the center of interest. The most interesting values will still be the top winners and the top losers. This is hard to display with sorting only. Filtering (or pagination) will need to apply in order to remove the stocks that are in the middle
  • (approximate) search results: You don’t see any means of ordering Google search results, right? That’s because Google searches are approximate, i.e. their results are already ordered in terms of relevance. You usually don’t want to offer your users to re-order these results (at least not on the relevance scale). One exception might be ordering of exact search results by date (or something else), but this is really hard to get right from a UX perspective, as you risk displaying lots of irrelevant results based on their freshness.

Situations where the above is not true

Now, the above are useful advice for making the right decision in the case of simple and homogeneous tables, like the one exposed in Bing’s Webmaster Tools (all columns are either unsortable (URL) or aggregations). If you display arbitrary data, then it might not be wise to apply these rules as it will confuse the user if one column defaults to descending ordering, and another defaults to ascending order. In that case, revert to sorting all columns ascendingly. The user will understand.


If you’re a UI developer, make natural ordering flags first class citizens of your software design. Pretty much every data type ships with an intuitive, and obvious value for default ordering, i.e. one of:

  • Ascending
  • Descending
  • No ordering

Every time you design a table, please do think of the above. It’s that little extra effort that will make your user interface much more meaningful. And, beware. This is really what you as a UI developer need to do. The backend developers operating on the database cannot specify this, because:

  • Databases contain raw, context-free data (e.g. of type NUMBER or VARCHAR)
  • UI ordering is not necessarily the same as SQL ordering

How to Support Java 6, 8, 9 in a Single API

With jOOQ 3.7, we have finally added formal support for Java 8 features. This opened the door to a lot of nice improvements, such as:

Creating result streams

try (Stream<Record2<String, String>> stream =
        .select(FIRST_NAME, LAST_NAME)
        .stream()) {

    List<String> people =
    stream.map(p -> p.value1() + " " + p.value2())

Calling statements asynchronously (jOOQ 3.8+)

CompletionStage<Record> result =

result.thenComposing(r -> ...);

But obviously, we didn’t want to disappoint our paying customers who are stuck with Java 6 because of their using an older application server, etc.

How to support several Java versions in a single API

This is why we continue publishing a Java 6 version of jOOQ for our commercial customers. How did we do it? Very easily. Our commercial code base (which is our main code base) contains tons of “flags” as in the following example:

public interface Query 
    /* [java-8] */, AutoCloseable /* [/java-8] */ 

    int execute() throws DataAccessException;

    /* [java-8] */
    CompletionStage<Integer> executeAsync();
    CompletionStage<Integer> executeAsync(Executor executor);
    /* [/java-8] */


(Sure, AutoCloseable was available already in Java 7, but we don’t have a Java 7 version).

When we build jOOQ, we build it several times after using a preprocessor to strip logic from the source files:

  • The commercial Java 8 version is built first as is
  • The commercial Java 6 version is built second by stripping all the code between [java-8] and [/java-8] markers
  • The commercial free trial version is built by adding some code to the commercial version
  • The open source version is built third by stripping all the code between [pro] and [/pro] markers

Advantages of this approach

There are several advantages of this approach compared to others:

  • We only have a single source of truth, the original commercial source code.
  • The line numbers are the same in all different versions
  • The APIs are compatible to a certain extent
  • No magic is involved via class loading or reflection

The disadvantages are:

  • Committing to repositories is a bit slower as we have several repositories.
  • Publishing releases takes longer as the different versions need to be built and integration tested several times
  • Sometimes, we simply forget adding a marker and have to re-build again when the Java-6 nightly build crashes
  • We still cannot use lambda expressions in ordinary code that is contained in the Java 6 version (most code)

In our opinion, the advantages outweigh clearly. It’s OK if we can’t implement top-notch Java features as long as our customers can, and as long as those customers who are stuck with old versions can still upgrade to the latest jOOQ version.

We’re looking forward to supporting JDK 9 features, like modularity and the new Flow API without any compromise to existing users.

What about you?

How do you approach cross JDK version compatibility?