How to Translate SQL GROUP BY and Aggregations to Java 8

I couldn’t resist. I have read this question by Hugo Prudente on Stack Overflow. And I knew there had to be a better way than what the JDK has to offer.

The question reads:

I’m looking for a lambda to refine the data already retrieved. I have a raw resultset, if the user do not change the date I want use java’s lambda to group by the results for then. And I’m new to lambdas with java.

The lambda I’m looking for works simliar to this query.

SELECT
    z, w, 
    MIN(x), MAX(x), AVG(x), 
    MIN(y), MAX(y), AVG(y) 
FROM table 
GROUP BY z, w;

SQL is declarative. Functional programming is not.

Before we go on with this discussion, let’s establish a very important fact. SQL is a completely declarative language. Functional (or “functional-ish”, to keep the Haskell-aficionados at peace) programming languages like Java 8 are not declarative. While expressing data transformation algorithms using functions is much more concise than expressing them using objects, or worse, using imperative instructions, you’re still explicitly expressing the algorithm.

When you write SQL, you don’t write any algorithm. You just describe the result you want to have. The SQL engine’s optimiser will figure out the algorithm for you – e.g. based on the fact that you may have an index on Z but not on W or on (Z, W).

While simple examples like these can easily be implemented using Java 8, you will quickly run into Java’s limitations, once you need to do more complex reporting.

Of course, as we’ve blogged before, the optimum is reached when you combine SQL and functional programming.

How can this be written in Java 8?

There are a variety of ways to do it. The essence is to understand all the participants in such a transformation. And no matter if you find this easy or hard, suitable for Java 8 or inadequate, thinking about the different, lesser-known parts of new Stream API is certainly worth the exercise.

The main participants here are:

  • Stream: If you’re using JDK 8 libraries, then the new java.util.stream.Stream type will be your first choice.
  • Collector: The JDK provides us with a rather low-level and thus very powerful new API for data aggregation (also known as “reduction”). This API is summarised by the new java.util.stream.Collector type, a new type from which we have heard only little so far in the blogosphere

Disclaimer

Some of the code displayed here might not work in your favourite IDE. Unfortunately, even if Java 7 reaches its end of life, all major IDEs (Eclipse, IntelliJ, NetBeans), and even the javac compiler still have quite a few bugs related to the combination of generic type inference and lambda expressions. Stay tuned until those bugs are fixed! And report any bug you discover. We’ll all thank you for it!

Let’s go!

Let’s review our SQL statement:

SELECT
    z, w, 
    MIN(x), MAX(x), AVG(x), 
    MIN(y), MAX(y), AVG(y) 
FROM table 
GROUP BY z, w;

In terms of the Stream API, the table itself is the Stream. Let’s just assume that we have a “table type” A as such:

class A {
    final int w;
    final int x;
    final int y;
    final int z;

    A(int w, int x, int y, int z) {
        this.w = w;
        this.x = x;
        this.y = y;
        this.z = z;
    }

    @Override
    public String toString() {
        return "A{" +
                "w=" + w +
                ", x=" + x +
                ", y=" + y +
                ", z=" + z +
                '}';
    }
}

You can also add equals() and hashCode() if you must.

We can now easily compose the Stream using Stream.of(), and some sample data:

Stream<A> stream =
Stream.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5));

Now, the next step is to GROUP BY z, w. The Stream API itself, unfortunately, doesn’t contain such a convenience method. We have to resort to more low-level operations by specifying the more general Stream.collect() operation, and passing a Collector to it that does the grouping. Luckily, a variety of different grouping Collectors are already made available from the Collectors helper class.

So we add that to our stream

Stream.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5))
.collect(Collectors.groupingBy(...));

jool-logo-blackNow the interesting part starts. How do we specify that we want to group by both A.z and A.w? We need to provide this groupingBy method with a function that can extract something like a SQL tuple from the A type. We could write our own tuple, or we simply use that of jOOλ, a library that we have created and open-sourced to improve our jOOQ integration tests.

The Tuple2 type roughly looks like this:

public class Tuple2<T1, T2> {

    public final T1 v1;
    public final T2 v2;

    public T1 v1() {
        return v1;
    }

    public T2 v2() {
        return v2;
    }

    public Tuple2(T1 v1, T2 v2) {
        this.v1 = v1;
        this.v2 = v2;
    }
}

public interface Tuple {
    static <T1, T2> Tuple2<T1, T2> tuple(T1 v1, T2 v2) {
        return new Tuple2<>(v1, v2);
    }
}

It has many more useful features, but these ones will be sufficient for this article.

On a side-note

Why the JDK doesn’t ship with built-in tuples like C#’s or Scala’s escapes me.

Functional programming without tuples is like coffee without sugar: A bitter punch in your face.

Anyway… back on track

So we’re grouping by the (A.z, A.w) tuple, as we would in SQL

Map<Tuple2<Integer, Integer>, List<A>> map =
Stream.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5))
.collect(Collectors.groupingBy(
    a -> tuple(a.z, a.w)
));

As you can see, this produces a verbose but very descriptive type, a map containing our grouping tuple as its key, and a list of collected table records as its value.

Running the following statement

map.entrySet().forEach(System.out::println);

will yield:

(1, 1)=[A{w=1, x=1, y=1, z=1}, A{w=1, x=2, y=3, z=1}]
(4, 9)=[A{w=9, x=8, y=6, z=4}, A{w=9, x=9, y=7, z=4}]
(5, 2)=[A{w=2, x=3, y=4, z=5}, A{w=2, x=4, y=4, z=5}, A{w=2, x=5, y=5, z=5}]

That’s already quite awesome! In fact, this behaves like the SQL:2011 standard COLLECT() aggregate function, that is also available in Oracle 10g+

Now, instead of actually collecting the A records, we prefer to aggregate the individual values of x and y. The JDK provides us with a couple of interesting new types, e.g. the java.util.IntSummaryStatistics, which is available for convenience again from the Collectors type via Collectors.summarizingInt().

On a side note

For my taste, this sledge-hammer data aggregation technique is a bit quirky. The JDK libraries have been left intentionally low level and verbose, perhaps to keep the library footprint small, or to prevent “horrible” consequences when in 5-10 years (after the release of JDK 9 and 10), it becomes obvious that some features may have been added prematurely.

At the same time, there is this all-or-nothing IntSummaryStatistics, that blindly aggregates these popular aggregation values for your collection:

  • COUNT(*)
  • SUM()
  • MIN()
  • MAX()

and obviously, once you have SUM() and COUNT(*), you also have AVG() = SUM() / COUNT(*). So that’s going to be the Java way. IntSummaryStatistics.

In case you were wondering, the SQL:2011 standard specifies these aggregate functions:

AVG, MAX, MIN, SUM, EVERY, ANY, SOME, COUNT, STDDEV_POP, STDDEV_SAMP, VAR_SAMP, VAR_POP, COLLECT, FUSION, INTERSECTION, COVAR_POP, COVAR_SAMP, CORR, REGR_SLOPE, REGR_INTERCEPT, REGR_COUNT, REGR_R2, REGR_AVGX, REGR_AVGY, REGR_SXX, REGR_SYY, REGR_SXY, PERCENTILE_CONT, PERCENTILE_DISC, ARRAY_AGG

And obviously there are many other, vendor-specific aggregate and window functions in SQL. We’ve blogged about them all:

True, MIN, MAX, SUM, COUNT, AVG are certainly the most popular ones. But it would’ve been nicer if they hadn’t been included in these default aggregation types, but made available in a much more composable way.

Anyway… back on track

If you want to stay low-level and use mostly JDK API, you can use the following technique to implement aggregation over two columns:

Map<
    Tuple2<Integer, Integer>, 
    Tuple2<IntSummaryStatistics, IntSummaryStatistics>
> map = Stream.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5))
.collect(Collectors.groupingBy(
    a -> tuple(a.z, a.w),
    Collector.of(

        // When collecting, we'll aggregate data
        // into two IntSummaryStatistics for x and y
        () -> tuple(new IntSummaryStatistics(), 
                    new IntSummaryStatistics()),

        // The accumulator will simply take
        // new t = (x, y) values
        (r, t) -> {
            r.v1.accept(t.x);
            r.v2.accept(t.y);
        },

        // The combiner will merge two partial
        // aggregations, in case this is executed
        // in parallel
        (r1, r2) -> {
            r1.v1.combine(r2.v1);
            r1.v2.combine(r2.v2);

            return r1;
        }
    )
));

map.entrySet().forEach(System.out::println);

The above would now print

(1, 1)=(IntSummaryStatistics{count=2, sum=3, min=1, average=1.500000, max=2}, 
        IntSummaryStatistics{count=2, sum=4, min=1, average=2.000000, max=3})
(4, 9)=(IntSummaryStatistics{count=2, sum=17, min=8, average=8.500000, max=9}, 
        IntSummaryStatistics{count=2, sum=13, min=6, average=6.500000, max=7})
(5, 2)=(IntSummaryStatistics{count=3, sum=12, min=3, average=4.000000, max=5}, 
        IntSummaryStatistics{count=3, sum=13, min=4, average=4.333333, max=5})

But obviously, no one will want to write that much code. The same thing can be achieved with jOOλ with much less code

Map<
    Tuple2<Integer, Integer>, 
    Tuple2<IntSummaryStatistics, IntSummaryStatistics>
> map =

// Seq is like a Stream, but sequential only,
// and with more features
Seq.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5))

// Seq.groupBy() is just short for 
// Stream.collect(Collectors.groupingBy(...))
.groupBy(
    a -> tuple(a.z, a.w),

    // ... because once you have tuples, 
    // why not add tuple-collectors?
    Tuple.collectors(
        Collectors.summarizingInt(a -> a.x),
        Collectors.summarizingInt(a -> a.y)
    )
);

What you see above is probably as close as it gets to the original, very simmple SQL statement:

SELECT
    z, w, 
    MIN(x), MAX(x), AVG(x), 
    MIN(y), MAX(y), AVG(y) 
FROM table 
GROUP BY z, w;

The interesting part here is the fact that we have what we call “tuple-collectors”, a Collector that collects data into tuples of aggregated results for any degree of the tuple (up to 8). Here’s the code for Tuple.collectors:

// All of these generics... sheesh!
static <T, A1, A2, D1, D2> 
       Collector<T, Tuple2<A1, A2>, Tuple2<D1, D2>> 
collectors(
    Collector<T, A1, D1> collector1
  , Collector<T, A2, D2> collector2
) {
    return Collector.of(
        () -> tuple(
            collector1.supplier().get()
          , collector2.supplier().get()
        ),
        (a, t) -> {
            collector1.accumulator().accept(a.v1, t);
            collector2.accumulator().accept(a.v2, t);
        },
        (a1, a2) -> tuple(
            collector1.combiner().apply(a1.v1, a2.v1)
          , collector2.combiner().apply(a1.v2, a2.v2)
        ),
        a -> tuple(
            collector1.finisher().apply(a.v1)
          , collector2.finisher().apply(a.v2)
        )
    );
}

Where the Tuple2<D1, D2> is the aggregation result type that we derive from collector1 (which provides D1) and from collector2 (which provides D2).

That’s it. We’re done!

Conclusion

Java 8 is a first step towards functional programming in Java. Using Streams and lambda expressions, we can already achieve quite a bit. The JDK APIs, however, are extremely low level and the experience when using IDEs like Eclipse, IntelliJ, or NetBeans can still be a bit frustrating. While writing this article (and adding the Tuple.collectors() method), I have reported around 10 bugs to the different IDEs. Some javac compiler bugs are not yet fixed, prior to JDK 1.8.0_40 ea. In other words:

I just keep throwing generic type parameters at the darn thing until the compiler stops bitching at me

But we’re on a good path. I trust that more useful API will ship with JDK 9 and especially with JDK 10, when all of the above will hopefully profit from the new value types and generic type specialization.

jool-logo-blackAnd, of course, if you haven’t already, download and contribute to jOOλ here!

We have created jOOλ to add the missing pieces to the JDK libraries. If you want to go all in on functional programming, i.e. when your vocabulary includes hipster terms (couldn’t resist) like monads, monoids, functors, and all that, we suggest you skip the JDK’s Streams and jOOλ entirely, and go download functionaljava by Mark Perry or vavr by Daniel Dietrich

Three-State Booleans in Java

Every now and then, I miss SQL’s three-valued BOOLEAN semantics in Java. In SQL, we have:

  • TRUE
  • FALSE
  • UNKNOWN (also known as NULL)

Every now and then, I find myself in a situation where I wish I could also express this UNKNOWN or UNINITIALISED semantics in Java, when plain true and false aren’t enough.

Implementing a ResultSetIterator

For instance, when implementing a ResultSetIterator for jOOλ, a simple library modelling SQL streams for Java 8:

SQL.stream(stmt, Unchecked.function(r ->
    new SQLGoodies.Schema(
        r.getString("FIELD_1"),
        r.getBoolean("FIELD_2")
    )
))
.forEach(System.out::println);

In order to implement a Java 8 Stream, we need to construct an Iterator, which we can then pass to the new Spliterators.spliteratorUnknownSize() method:

StreamSupport.stream(
  Spliterators.spliteratorUnknownSize(iterator, 0), 
  false
);

Another example for this can be seen here on Stack Overflow.

When implementing the Iterator interface, we must implement hasNext() and next(). Note that with Java 8, remove() now has a default implementation, so we don’t need to implement it any longer.

While most of the time, a call to next() is preceded by a call to hasNext() exactly once, nothing in the Iterator contract requires this. It is perfectly fine to write:

if (it.hasNext()) {
    // Some stuff

    // Double-check again to be sure
    if (it.hasNext() && it.hasNext()) {

        // Yes, we're paranoid
        if (it.hasNext())
            it.next();
    }
}

How to translate the Iterator calls to backing calls on the JDBC ResultSet? We need to call ResultSet.next().

We could make the following translation:

  • Iterator.hasNext() == !ResultSet.isLast()
  • Iterator.next() == ResultSet.next()

But that translation is:

  • Expensive
  • Not dealing correctly with empty ResultSets
  • Not implemented in all JDBC drivers (Support for the isLast method is optional for ResultSets with a result set type of TYPE_FORWARD_ONLY)

So, we’ll have to maintain a flag, internally, that tells us:

  • If we had already called ResultSet.next()
  • What the result of that call was

Instead of creating a second variable, why not just use a three-valued java.lang.Boolean. Here’s a possible implementation from jOOλ:

class ResultSetIterator<T> implements Iterator<T> {

    final Supplier<? extends ResultSet>  supplier;
    final Function<ResultSet, T>         rowFunction;
    final Consumer<? super SQLException> translator;

    /**
     * Whether the underlying {@link ResultSet} has
     * a next row. This boolean has three states:
     * <ul>
     * <li>null:  it's not known whether there 
     *            is a next row</li>
     * <li>true:  there is a next row, and it
     *            has been pre-fetched</li>
     * <li>false: there aren't any next rows</li>
     * </ul>
     */
    Boolean hasNext;
    ResultSet rs;

    ResultSetIterator(
        Supplier<? extends ResultSet> supplier, 
        Function<ResultSet, T> rowFunction, 
        Consumer<? super SQLException> translator
    ) {
        this.supplier = supplier;
        this.rowFunction = rowFunction;
        this.translator = translator;
    }

    private ResultSet rs() {
        return (rs == null) 
             ? (rs = supplier.get()) 
             :  rs;
    }

    @Override
    public boolean hasNext() {
        try {
            if (hasNext == null) {
                hasNext = rs().next();
            }

            return hasNext;
        }
        catch (SQLException e) {
            translator.accept(e);
            throw new IllegalStateException(e);
        }
    }

    @Override
    public T next() {
        try {
            if (hasNext == null) {
                rs().next();
            }

            return rowFunction.apply(rs());
        }
        catch (SQLException e) {
            translator.accept(e);
            throw new IllegalStateException(e);
        }
        finally {
            hasNext = null;
        }
    }
}

As you can see, the hasNext() method locally caches the hasNext three-valued boolean state only if it was null before. This means that calling hasNext() several times will have no effect until you call next(), which resets the hasNext cached state.

Both hasNext() and next() advance the ResultSet cursor if needed.

Readability?

Some of you may argue that this doesn’t help readability. They’d introduce a new variable like:

boolean hasNext;
boolean hasHasNextBeenCalled;

The trouble with this is the fact that you’re still implementing three-valued boolean state, but distributed to two variables, which are very hard to name in a way that is truly more readable than the actual java.lang.Boolean solution. Besides, there are actually four state values for two boolean variables, so there is a slight increase in the risk of bugs.

Every rule has its exception. Using null for the above semantics is a very good exception to the null-is-bad histeria that has been going on ever since the introduction of Option / Optional

In other words: Which approach is best? There’s no TRUE or FALSE answer, only UNKNOWN ;-)

Be careful with this

However, as we’ve discussed in a previous blog post, you should avoid returning null from API methods if possible. In this case, using null explicitly as a means to model state is fine because this model is encapsulated in our ResultSetIterator. But try to avoid leaking such state to the outside of your API.

Java 8 Friday Goodies: SQL ResultSet Streams

At Data Geekery, we love Java. And as we’re really into jOOQ’s fluent API and query DSL, we’re absolutely thrilled about what Java 8 will bring to our ecosystem. We have blogged a couple of times about some nice Java 8 goodies, and now we feel it’s time to start a new blog series, the…

Java 8 Friday

Every Friday, we’re showing you a couple of nice new tutorial-style Java 8 features, which take advantage of lambda expressions, extension methods, and other great stuff. You’ll find the source code on GitHub.

Java 8 Goodie: SQL ResultSet Streams

Yes, the SQL subject must be dealt with again. Even if last week, we promised an article on concurrency, there is one very important aspect of Java 8 lambdas and interoperability with “legacy” APIs that we need to talk about, first.

Checked Exceptions

Yes. Unfortunately, those beasts from the past still haunt us, more than ever when we’re using Java 8’s lambda expressions. Already before Java 8’s release, there are a couple of Stack Overflow questions related to the subject.

Let’s remember how the IOExceptions caused issues when traversing the file system. Unless you write your own utility, you’ll have to resort to this beauty:

Arrays.stream(dir.listFiles()).forEach(file -> {
    try {
        System.out.println(file.getCanonicalPath());
    }
    catch (IOException e) {
        throw new RuntimeException(e);
    }

    // Ouch, my fingers hurt! All this typing!
});

We think it is safe to say:

Java 8 and checked exceptions don’t match.tweet this

A workaround is to write your own CheckedConsumer that wraps the checked exception. Such a consumer will be highly reusable, but… Did you think of all the other FunctionalInterfaces? There are quite a few of them in the java.util.function package:

Some of the many types in java.util.function
Some of the many types in java.util.function

jOOλ – Fixing lambda in Java 8

jool-logo-blackWhile writing this Java 8 blog series, we’ve constantly run into the need to wrap checked exceptions inside lambda expressions. And what do we geeks do when we frequently run into a problem? We fix it! And we have created jOOλ (also jOOL, jOO-Lambda), ASL 2.0 licensed, where we have duplicated pretty much every FunctionalInterface that is available from the JDK to support checked exceptions. Here’s how you would use jOOλ in the above example:

Arrays.stream(dir.listFiles()).forEach(
    Unchecked.consumer(file -> {
        // Throw all sorts of checked exceptions
        // here, we don't care...
        System.out.println(file.getCanonicalPath());
    })
);

The above example shows how you can simply ignore and pass checked exceptions as RuntimeExceptions. If you actually want to handle them, you can pass an exception handler lambda:

Arrays.stream(dir.listFiles())
      .forEach(Unchecked.consumer(

    file -> {
        System.out.println(file.getCanonicalPath());
    },
    e -> {
        log.info("Log stuff here", e);
        throw new MyRuntimeException(e);
    }
);

The second example now seems equally verbose, but don’t worry. You will probably reuse that exception handler and fall back to this:

Arrays.stream(dir.listFiles())
      .forEach(Unchecked.consumer(
    file -> {
        System.out.println(file.getCanonicalPath());
    },
    myExceptionHandler
);

jOOλ – Providing JDBC ResultSet Streams

Unfortunately, most efforts in the Java 8 Streams API were made in the area of correctly implementing parallelisable streams. While this is very useful for those of us actually doing parallel computing, for most others better integration with legacy APIs would have been better. One API that seriously deserves some lifting is JDBC, and we’ve blogged about this before. With jOOλ, you can now generate Streams directly from ResultSets or even from PreparedStatements. Here’s how you prepare:

Class.forName("org.h2.Driver");
try (Connection c = getConnection()) {
    String sql = "select schema_name, is_default " +
                 "from information_schema.schemata " +
                 "order by schema_name";

    try (PreparedStatement stmt = c.prepareStatement(sql)) {
        // code here
    }
}

Now, all you have to do when using jOOλ is stream your PreparedStatements as such:

SQL.stream(stmt, Unchecked.function(rs ->
    new SQLGoodies.Schema(
        rs.getString("SCHEMA_NAME"),
        rs.getBoolean("IS_DEFAULT")
    )
))
.forEach(System.out::println);

Where SQLGoodies.Schema is just an ordinary POJO. Some of the stream() method’s signatures are these ones:

public static <T> Stream<T> stream(
    PreparedStatement stmt,
    Function<ResultSet, T> rowFunction
);

public static <T> Stream<T> stream(
    PreparedStatement stmt,
    Function<ResultSet, T> rowFunction,
    Consumer<? super SQLException> exceptionHandler
);

Others are available as well.

That is awesome, isn’t it?

JDBC ResultSets should be Java 8 Streams.tweet this

Too bad, the above code didn’t make it into the JDK 8, as this would have been a chance to finally greatly improve on the JDBC API. Another, similar attempt at improving things has been done here by Julian Exenberger.

Java 8 alternatives of writing SQL

We’ve also published a couple of alternatives to jOOλ, using Java 8 with SQL here:

http://www.jooq.org/java-8-and-sql

Conclusion

While Java 8’s lambda expressions are awesome, the new Streams API is pretty incomplete. When implementing the above, we had to implement our own ResultSetIterator, and write all this mess to wrap the iterator in a Stream:

StreamSupport.stream(
    Spliterators.spliteratorUnknownSize(
        new ResultSetIterator<>(
            supplier, 
            rowFunction, 
            exceptionTranslator
        ), 0
    ), false
);

And it shouldn’t be necessary to write an Iterator in the first place, if only we were able to generate finite streams:

// Unfortunately, this method doesn't exist
Stream.generate(
    // Supplier, generating new POJOs
    () -> { 
        rs.next(); 
        return new SQLGoodies.Schema(
            rs.getString("SCHEMA_NAME"),
            rs.getBoolean("IS_DEFAULT")
        );
    },

    // Predicate, terminating the Stream
    () -> { !rs.isLast(); }
);

While jOOλ is an acceptable intermediate solution, and the Guava guys are probably already working out how to fix their library, it is really too bad, that Java 8 is lacking such utility functionality.

But we’re complaining on a high level. Next week, as promised, we’ll see a couple of examples related to concurrency, so stay tuned!

More on Java 8

In the mean time, have a look at Eugen Paraschiv’s awesome Java 8 resources page