Common SQL Clauses and Their Equivalents in Java 8 Streams


Functional programming allows for quasi-declarative programming in a general purpose language. By using powerful fluent APIs like Java 8’s Stream API, or jOOλ’s sequential Stream extension Seq or more sophisticated libraries like javaslang or functionaljava, we can express data transformation algorithms in an extremely concise way. Compare Mario Fusco’s imperative and functional version of the same algorithm:

Using such APIs, functional programming certainly feels like true declarative programming.

The most popular true declarative programming language is SQL. When you join two tables, you don’t tell the RDBMS how to implement that join. It may decide at its discretion whether a nested loop, merge join, hash join, or some other algorithm is the most suitable in the context of the complete query and of all the available meta information. This is extremely powerful because the performance assumptions that are valid for a simple join may no longer be valid for a complex one, where a different algorithm would outperform the original one. By this abstraction, you can just easily modify a query in 30 seconds, without worrying about low-level details like algorithms or performance.

When an API allows you to combine both (e.g. jOOQ and Streams), you will get the best of both worlds – and those worlds aren’t too different.

In the following sections, we’ll compare common SQL constructs with their equivalent expressions written in Java 8 using Streams and jOOλ, in case the Stream API doesn’t offer enough functionality.

Tuples

For the sake of this article, we’re going to assume that SQL rows / records have an equivalent representation in Java. For this, we’ll be using jOOλ’s Tuple type, which is essentially:

public class Tuple2<T1, T2> {

    public final T1 v1;
    public final T2 v2;

    public Tuple2(T1 v1, T2 v2) {
        this.v1 = v1;
        this.v2 = v2;
    }
}

… plus a lot of useful gimmicks like Tuple being Comparable, etc.

Note that we’re assuming the following imports in this and all subsequent examples.

import static org.jooq.lambda.Seq.*;
import static org.jooq.lambda.tuple.Tuple.*;

import java.util.*;
import java.util.function.*;
import java.util.stream.*;

import org.jooq.lambda.*;

Much like SQL rows, a tuple is a “value-based” type, meaning that it doesn’t really have an identity. Two tuples (1, 'A') and (1, 'A') can be considered exactly equivalent. Removing identity from the game makes SQL and functional programming with immutable data structures extremely elegant.

FROM = of(), stream(), etc.

In SQL, the FROM clause logically (but not syntactically) precedes all the other clauses. It is used to produce a set of tuples from at least one table, possibly multiple joined tables. A single-table FROM clause can be trivially mapped to Stream.of(), for instance, or to any other method that simply produces a stream:

SQL

SELECT *
FROM (
  VALUES(1, 1),
        (2, 2)
) t(v1, v2)

yielding

+----+----+
| v1 | v2 |
+----+----+
|  1 |  1 |
|  2 |  2 |
+----+----+

Java

Stream.of(
  tuple(1, 1),
  tuple(2, 2)
).forEach(System.out::println);

yielding

(1, 1)
(2, 2)

CROSS JOIN = flatMap()

Selecting from multiple tables is already more interesting. The easiest way to combine two tables in SQL is by producing a cartesian product, either via a table list or using a CROSS JOIN. The following two are equivalent SQL statements:

SQL

-- Table list syntax
SELECT *
FROM (VALUES( 1 ), ( 2 )) t1(v1), 
     (VALUES('A'), ('B')) t2(v2)

-- CROSS JOIN syntax
SELECT *
FROM       (VALUES( 1 ), ( 2 )) t1(v1)
CROSS JOIN (VALUES('A'), ('B')) t2(v2)

yielding

+----+----+
| v1 | v2 |
+----+----+
|  1 |  A |
|  1 |  B |
|  2 |  A |
|  2 |  B |
+----+----+

In a cross join (or cartesian product), every value from t1 is combined with every value from t2 producing size(t1) * size(t2) rows in total.

Java

In functional programming using Java 8’s Stream, the Stream.flatMap() method corresponds to SQL CROSS JOIN as can be seen in the following example:

List<Integer> s1 = Stream.of(1, 2);
Supplier<Stream<String>> s2 = ()->Stream.of("A", "B");

s1.flatMap(v1 -> s2.get()
                   .map(v2 -> tuple(v1, v2)))
  .forEach(System.out::println);

yielding

(1, A)
(1, B)
(2, A)
(2, B)

Note how we have to wrap the second stream in a Supplier because streams can be consumed only once, but the above algorithm is really implementing a nested loop, combining all elements of stream s2 with each element from stream s1. An alternative would be not to use streams but lists (which we will do in subsequent examples, for simplicity):

List<Integer> s1 = Arrays.asList(1, 2);
List<String> s2 = Arrays.asList("A", "B");

s1.stream()
  .flatMap(v1 -> s2.stream()
                   .map(v2 -> tuple(v1, v2)))
  .forEach(System.out::println);

In fact, CROSS JOIN can be chained easily both in SQL and in Java:

SQL

-- Table list syntax
SELECT *
FROM (VALUES( 1 ), ( 2 )) t1(v1), 
     (VALUES('A'), ('B')) t2(v2), 
     (VALUES('X'), ('Y')) t3(v3)

-- CROSS JOIN syntax
SELECT *
FROM       (VALUES( 1 ), ( 2 )) t1(v1)
CROSS JOIN (VALUES('A'), ('B')) t2(v2)
CROSS JOIN (VALUES('X'), ('Y')) t3(v3)

yielding

+----+----+----+
| v1 | v2 | v3 |
+----+----+----+
|  1 |  A |  X |
|  1 |  A |  Y |
|  1 |  B |  X |
|  1 |  B |  Y |
|  2 |  A |  X |
|  2 |  A |  Y |
|  2 |  B |  X |
|  2 |  B |  Y |
+----+----+----+

Java

List<Integer> s1 = Arrays.asList(1, 2);
List<String> s2 = Arrays.asList("A", "B");
List<String> s3 = Arrays.asList("X", "Y");

s1.stream()
  .flatMap(v1 -> s2.stream()
                   .map(v2 -> tuple(v1, v2)))
  .flatMap(v12-> s3.stream()
                   .map(v3 -> tuple(v12.v1, v12.v2, v3)))
  .forEach(System.out::println);

yielding

(1, A, X)
(1, A, Y)
(1, B, X)
(1, B, Y)
(2, A, X)
(2, A, Y)
(2, B, X)
(2, B, Y)

Note how we explicitly unnested the tuples from the first CROSS JOIN operation to form “flat” tuples in the second operation. This is optional, of course.

Java with jOOλ’s crossJoin()

Us jOOQ developers, we’re a very SQL-oriented people, so it is only natural to have added a crossJoin() convenience method for the above use-case. So our triple-cross join can be written like this:

Seq<Integer> s1 = Seq.of(1, 2);
Seq<String> s2 = Seq.of("A", "B");
Seq<String> s3 = Seq.of("X", "Y");

s1.crossJoin(s2)
  .crossJoin(s3)
  .forEach(System.out::println);

yielding

((1, A), X)
((1, A), Y)
((1, B), X)
((1, B), Y)
((2, A), X)
((2, A), Y)
((2, B), X)
((2, B), Y)

In this case, we didn’t unnest the tuple produced in the first cross join. From a merely relational perspective, this doesn’t matter either. Nested tuples are the same thing as flat tuples. In SQL, we just don’t see the nesting. Of course, we could still unnest as well by adding a single additional mapping:

Seq<Integer> s1 = Seq.of(1, 2);
Seq<String> s2 = Seq.of("A", "B");
Seq<String> s3 = Seq.of("X", "Y");

s1.crossJoin(s2)
  .crossJoin(s3)
  .map(t -> tuple(t.v1.v1, t.v1.v2, t.v2))
  .forEach(System.out::println);

yielding, again

(1, A, X)
(1, A, Y)
(1, B, X)
(1, B, Y)
(2, A, X)
(2, A, Y)
(2, B, X)
(2, B, Y)

(You may have noticed that map() corresponds to SELECT as we’ll see again later on)

INNER JOIN = flatMap() with filter()

The SQL INNER JOIN is essentially just syntactic sugar for a SQL CROSS JOIN with a predicate that reduces the tuple set after cross-joining. In SQL, the following two ways of inner joining are equivalent:

SQL

-- Table list syntax
SELECT *
FROM (VALUES(1), (2)) t1(v1), 
     (VALUES(1), (3)) t2(v2)
WHERE t1.v1 = t2.v2

-- INNER JOIN syntax
SELECT *
FROM       (VALUES(1), (2)) t1(v1)
INNER JOIN (VALUES(1), (3)) t2(v2)
ON t1.v1 = t2.v2

yielding

+----+----+
| v1 | v2 |
+----+----+
|  1 |  1 |
+----+----+

(note that the keyword INNER is optional).

So, the values 2 from t1 and the values 3 from t2 are “thrown away”, as they produce any rows for which the join predicate yields true.

The same can be expressed easily, yet more verbosely in Java

Java (inefficient solution!)

List<Integer> s1 = Arrays.asList(1, 2);
List<Integer> s2 = Arrays.asList(1, 3);

s1.stream()
  .flatMap(v1 -> s2.stream()
                   .map(v2 -> tuple(v1, v2)))
  .filter(t -> Objects.equals(t.v1, t.v2))
  .forEach(System.out::println);

The above correctly yields

(1, 1)

But beware that you’re attaining this result after producing a cartesian product, the nightmare of every DBA! As mentioned at the beginning of this article, unlike in declarative programming, in functional programming you instruct your program to do exactly the order of operations that you specify. In other words:

In functional programming, you define the exact “execution plan” of your query.

In declarative programming, an optimiser may reorganise your “program”

There is no optimiser to transform the above into the much more efficient:

Java (more efficient)

List<Integer> s1 = Arrays.asList(1, 2);
List<Integer> s2 = Arrays.asList(1, 3);

s1.stream()
  .flatMap(v1 -> s2.stream()
                   .filter(v2 -> Objects.equals(v1, v2))
                   .map(v2 -> tuple(v1, v2)))
  .forEach(System.out::println);

The above also yields

(1, 1)

Notice, how the join predicate has moved from the “outer” stream into the “inner” stream, that is produced in the function passed to flatMap().

Java (optimal)

As mentioned previously, functional programming doesn’t necessarily allow you to rewrite algorithms depending on knowledge of the actual data. The above presented implementation for joins always implement nested loop joins going from the first stream to the second. If you join more than two streams, or if the second stream is very large, this approach can be terribly inefficient. A sophisticated RDBMS would never blindly apply nested loop joins like that, but consider constraints, indexes, and histograms on actual data.

Going deeper into that topic would be out of scope for this article, though.

Java with jOOλ’s innerJoin()

Again, inspired by our work on jOOQ we’ve also added an innerJoin() convenience method for the above use-case:

Seq<Integer> s1 = Seq.of(1, 2);
Seq<Integer> s2 = Seq.of(1, 3);

s1.innerJoin(s2, (t, u) -> Objects.equals(t, u))
  .forEach(System.out::println);

yielding

(1, 1)

… because after all, when joining two streams, the only really interesting operation is the join Predicate. All else (flatmapping, etc.) is just boilerplate.

LEFT OUTER JOIN = flatMap() with filter() and a “default”

SQL’s OUTER JOIN works like INNER JOIN, except that additional “default” rows are produced in case the JOIN predicate yields false for a pair of tuples. In terms of set theory / relational algebra, this can be expressed as such:

dd81ee1373d922122ce1b3e0da74cb28

Or in a SQL-esque dialect:

R LEFT OUTER JOIN S ::=

R INNER JOIN S
UNION (
  (R EXCEPT (SELECT R.* FROM R INNER JOIN S))
  CROSS JOIN
  (null, null, ..., null)
)

This simply means that when left outer joining S to R, there will be at least one row in the result for each row in R, with possibly an empty value for S.

Inversely, when right outer joining S to R, there will be at least one row in the result for each row in S, with possibly an empty value for R.

And finally, when full outer joining S to R, there will be at least one row in the result for each row in R with possibly an empty value for S AND for each row in S with possibly an empty value for R.

Let us look at LEFT OUTER JOIN, which is used most often in SQL.

SQL

-- Table list, Oracle syntax (don't use this!)
SELECT *
FROM (SELECT 1 v1 FROM DUAL
      UNION ALL 
      SELECT 2 v1 FROM DUAL) t1, 
     (SELECT 1 v2 FROM DUAL
      UNION ALL
      SELECT 3 v2 FROM DUAL) t2
WHERE t1.v1 = t2.v2 (+)

-- OUTER JOIN syntax
SELECT *
FROM            (VALUES(1), (2)) t1(v1)
LEFT OUTER JOIN (VALUES(1), (3)) t2(v2)
ON t1.v1 = t2.v2

yielding

+----+------+
| v1 |   v2 |
+----+------+
|  1 |    1 |
|  2 | null |
+----+------+

(note that the keyword OUTER is optional).

Java

Unfortunately, the JDK’s Stream API doesn’t provide us with an easy way to produce “at least” one value from a stream, in case the stream is empty. We could be writing a utility function as explained by Stuart Marks on Stack Overflow:

static <T> Stream<T> defaultIfEmpty(
    Stream<T> stream, Supplier<T> supplier) {
    Iterator<T> iterator = stream.iterator();

    if (iterator.hasNext()) {
        return StreamSupport.stream(
            Spliterators.spliteratorUnknownSize(
                iterator, 0
            ), false);
    } else {
        return Stream.of(supplier.get());
    }
}

Or, we just use jOOλ’s Seq.onEmpty()

List<Integer> s1 = Arrays.asList(1, 2);
List<Integer> s2 = Arrays.asList(1, 3);

seq(s1)
.flatMap(v1 -> seq(s2)
              .filter(v2 -> Objects.equals(v1, v2))
              .onEmpty(null)
              .map(v2 -> tuple(v1, v2)))
.forEach(System.out::println);

(notice, we’re putting null in a stream. This might not always be a good idea. We’ll follow up with that in a future blog post)

The above also yields

(1, 1)
(2, null)

How to read the implicit left outer join?

  • We’ll take each value v1 from the left stream s1
  • For each such value v1, we flatmap the right stream s2 to produce a tuple (v1, v2) (a cartesian product, cross join)
  • We’ll apply the join predicate for each such tuple (v1, v2)
  • If the join predicate leaves no tuples for any value v2, we’ll generate a single tuple containing the value of the left stream v1 and null

Java with jOOλ

For convenience, jOOλ also supports leftOuterJoin() which works as described above:

Seq<Integer> s1 = Seq.of(1, 2);
Seq<Integer> s2 = Seq.of(1, 3);

s1.leftOuterJoin(s2, (t, u) -> Objects.equals(t, u))
  .forEach(System.out::println);

yielding

(1, 1)
(2, null)

RIGHT OUTER JOIN = inverse LEFT OUTER JOIN

Trivially, a RIGHT OUTER JOIN is just the inverse of the previous LEFT OUTER JOIN. The jOOλ implementation of rightOuterJoin() looks like this:

default <U> Seq<Tuple2<T, U>> rightOuterJoin(
    Stream<U> other, BiPredicate<T, U> predicate) {
    return seq(other)
          .leftOuterJoin(this, (u, t) -> predicate.test(t, u))
          .map(t -> tuple(t.v2, t.v1));
}

As you can see, the RIGHT OUTER JOIN inverses the results of a LEFT OUTER JOIN, that’s it. For example:

Seq<Integer> s1 = Seq.of(1, 2);
Seq<Integer> s2 = Seq.of(1, 3);

s1.rightOuterJoin(s2, (t, u) -> Objects.equals(t, u))
  .forEach(System.out::println);

yielding

(1, 1)
(null, 3)

WHERE = filter()

The most straight-forward mapping is probably SQL’s WHERE clause having an exact equivalent in the Stream API: Stream.filter().

SQL

SELECT *
FROM (VALUES(1), (2), (3)) t(v)
WHERE v % 2 = 0

yielding

+---+
| v |
+---+
| 2 |
+---+

Java

Stream<Integer> s = Stream.of(1, 2, 3);

s.filter(v -> v % 2 == 0)
 .forEach(System.out::println);

yielding

2

The interesting thing with filter() and the Stream API in general is that the operation can apply at any place in the call chain, unlike the WHERE clause, which is limited to be placed right after the FROM clause – even if SQL’s JOIN .. ON or HAVING clauses are semantically similar.

GROUP BY = collect()

The least straight-forward mapping is GROUP BY vs. Stream.collect().

First off, SQL’s GROUP BY may be a bit tricky to fully understand. It is really part of the FROM clause, transforming the set of tuples produced by FROM .. JOIN .. WHERE into groups of tuples, where each group has an associated set of aggregatable tuples, which can be aggregated in the HAVING, SELECT, and ORDER BY clauses. Things get even more interesting when you use OLAP features like GROUPING SETS, which allow for duplicating tuples according to several grouping combinations.

In most SQL implementations that don’t support ARRAY or MULTISET, the aggregatable tuples are not available as such (i.e. as nested collections) in the SELECT. Here, the Stream API’s feature set excels. On the other hand, the Stream API can group values only as a terminal operation, where in SQL, GROUP BY is applied purely declaratively (and thus, lazily). The execution planner may choose not to execute the GROUP BY at all if it is not needed. For instance:

SELECT *
FROM some_table
WHERE EXISTS (
    SELECT x, sum(y)
    FROM other_table
    GROUP BY x
)

The above query is semantically equivalent to

SELECT *
FROM some_table
WHERE EXISTS (
    SELECT 1
    FROM other_table
)

The grouping in the subquery was unnecessary. Someone may have copy-pasted that subquery in there from somewhere else, or refactored the query as a whole. In Java, using the Stream API, each operation is always executed.

For the sake of simplicity, we’ll stick to the most simple examples here

Aggregation without GROUP BY

A special case is when we do not specify any GROUP BY clause. In that case, we can specify aggregations on all columns of the FROM clause, producing always exactly one record. For instance:

SQL

SELECT sum(v)
FROM (VALUES(1), (2), (3)) t(v)

yielding

+-----+
| sum |
+-----+
|   6 |
+-----+

Java

Stream<Integer> s = Stream.of(1, 2, 3);

int sum = s.collect(Collectors.summingInt(i -> i));
System.out.println(sum);

yielding

6

Aggregation with GROUP BY

A more common case of aggregation in SQL is to specify an explicit GROUP BY clause as explained before. For instance, we may want to group by even and odd numbers:

SQL

SELECT v % 2, count(v), sum(v)
FROM (VALUES(1), (2), (3)) t(v)
GROUP BY v % 2

yielding

+-------+-------+-----+
| v % 2 | count | sum |
+-------+-------+-----+
|     0 |     1 |   2 |
|     1 |     2 |   4 |
+-------+-------+-----+

Java

For this simple grouping / collection use-case, luckily, the JDK offers a utility method called Collectors.groupingBy(), which produces a collector that generates a Map<K, List<V>> type like this:

Stream<Integer> s = Stream.of(1, 2, 3);

Map<Integer, List<Integer>> map = s.collect(
    Collectors.groupingBy(v -> v % 2)
);

System.out.println(map);

yielding

{0=[2], 1=[1, 3]}

This certainly takes care of the grouping. Now we want to produce aggregations for each group. The slightly awkward JDK way to do this would be:

Stream<Integer> s = Stream.of(1, 2, 3);

Map<Integer, IntSummaryStatistics> map = s.collect(
    Collectors.groupingBy(
        v -> v % 2,
        Collectors.summarizingInt(i -> i)
    )
);

System.out.println(map);

we’ll now get:

{0=IntSummaryStatistics{count=1, sum=2, min=2, average=2.000000, max=2},
 1=IntSummaryStatistics{count=2, sum=4, min=1, average=2.000000, max=3}}

As you can see, the count() and sum() values have been calculated somewhere along the lines of the above.

More sophisticated GROUP BY

When doing multiple aggregations with Java 8’s Stream API, you will quickly be forced to wrestle low-level API implementing complicated collectors and accumulators yourself. This is tedious and unnecessary. Consider the following SQL statement:

SQL

CREATE TABLE t (
  w INT,
  x INT,
  y INT,
  z INT
);

SELECT
    z, w, 
    MIN(x), MAX(x), AVG(x), 
    MIN(y), MAX(y), AVG(y) 
FROM t
GROUP BY z, w;

In one go, we want to:

  • Group by several values
  • Aggregate from several values

Java

In a previous article, we’ve explained in detail how this can be achieved using convenience API from jOOλ via Seq.groupBy()

class A {
    final int w;
    final int x;
    final int y;
    final int z;
 
    A(int w, int x, int y, int z) {
        this.w = w;
        this.x = x;
        this.y = y;
        this.z = z;
    }
}

Map<
    Tuple2<Integer, Integer>, 
    Tuple2<IntSummaryStatistics, IntSummaryStatistics>
> map =
Seq.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5))
 
// Seq.groupBy() is just short for 
// Stream.collect(Collectors.groupingBy(...))
.groupBy(
    a -> tuple(a.z, a.w),
 
    // ... because once you have tuples, 
    // why not add tuple-collectors?
    Tuple.collectors(
        Collectors.summarizingInt(a -> a.x),
        Collectors.summarizingInt(a -> a.y)
    )
);

System.out.println(map);

The above yields

{(1, 1)=(IntSummaryStatistics{count=2, sum=3, min=1, average=1.500000, max=2},
         IntSummaryStatistics{count=2, sum=4, min=1, average=2.000000, max=3}),
 (4, 9)=(IntSummaryStatistics{count=2, sum=17, min=8, average=8.500000, max=9},
         IntSummaryStatistics{count=2, sum=13, min=6, average=6.500000, max=7}),
 (5, 2)=(IntSummaryStatistics{count=3, sum=12, min=3, average=4.000000, max=5},
         IntSummaryStatistics{count=3, sum=13, min=4, average=4.333333, max=5})}

For more details, read the full article here.

Notice how using Stream.collect(), or Seq.groupBy() already makes for an implicit SELECT clause, which we are no longer needed to obtain via map() (see below).

HAVING = filter(), again

As mentioned before, there aren’t really different ways of applying predicates with the Stream API, there is only Stream.filter(). In SQL, HAVING is a “special” predicate clause that is syntactically put after the GROUP BY clause. For instance:

SQL

SELECT v % 2, count(v)
FROM (VALUES(1), (2), (3)) t(v)
GROUP BY v % 2
HAVING count(v) > 1

yielding

+-------+-------+
| v % 2 | count |
+-------+-------+
|     1 |     2 |
+-------+-------+

Java

Unfortunately, as we have seen before, collect() is a terminal operation in the Stream API, which means that it eagerly produces a Map, instead of transforming the Stream<T> into a Stream<K, Stream<V>, which would compose much better in complex Stream. This means that any operation that we’d like to implement right after collecting will have to be implemented on a new stream produced from the output Map:

Stream<Integer> s = Stream.of(1, 2, 3);

s.collect(Collectors.groupingBy(
      v -> v % 2,
      Collectors.summarizingInt(i -> i)
  ))
  .entrySet()
  .stream()
  .filter(e -> e.getValue().getCount() > 1)
  .forEach(System.out::println);

yielding

1=IntSummaryStatistics{count=2, sum=4, min=1, average=2.000000, max=3}

As you can see, the type transformation that is applied is:

  • Map<Integer, IntSummaryStatistics>
  • Set<Entry<Integer, IntSummaryStatistics>>
  • Stream<Entry<Integer, IntSummaryStatistics>>

SELECT = map()

The SELECT clause in SQL is nothing more than a tuple transformation function that takes the cartesian product of tuples produced by the FROM clause and transforms it into a new tuple expression, which is fed either to the client, or to some higher-level query if this is a nested SELECT. An illustration:

FROM output

+------+------+------+------+------+
| T1.A | T1.B | T1.C | T2.A | T2.D |
+------+------+------+------+------+
|    1 |    A |    a |    1 |    X |
|    1 |    B |    b |    1 |    Y |
|    2 |    C |    c |    2 |    X |
|    2 |    D |    d |    2 |    Y |
+------+------+------+------+------+

Applying SELECT

SELECT t1.a, t1.c, t1.b || t1.d

+------+------+--------------+
| T1.A | T1.C | T1.B || T1.D |
+------+------+--------------+
|    1 |    a |           AX |
|    1 |    b |           BY |
|    2 |    c |           CX |
|    2 |    d |           DY |
+------+------+--------------+

Using Java 8 Streams, SELECT can be achieved very simply by using Stream.map(), as we’ve already seen in previous examples, where we unnested tuples using map(). The following examples are functionally equivalent:

SQL

SELECT t.v1 * 3, t.v2 + 5
FROM (
  VALUES(1, 1),
        (2, 2)
) t(v1, v2)

yielding

+----+----+
| c1 | c2 |
+----+----+
|  3 |  6 |
|  6 |  7 |
+----+----+

Java

Stream.of(
  tuple(1, 1),
  tuple(2, 2)
).map(t -> tuple(t.v1 * 3, t.v2 + 5))
 .forEach(System.out::println);

yielding

(3, 6)
(6, 7)

DISTINCT = distinct()

The DISTINCT keyword that can be supplied with the SELECT clause simply removes duplicate tuples right after they have been produced by the SELECT clause. An illustration:

FROM output

+------+------+------+------+------+
| T1.A | T1.B | T1.C | T2.A | T2.D |
+------+------+------+------+------+
|    1 |    A |    a |    1 |    X |
|    1 |    B |    b |    1 |    Y |
|    2 |    C |    c |    2 |    X |
|    2 |    D |    d |    2 |    Y |
+------+------+------+------+------+

Applying SELECT DISTINCT

SELECT DISTINCT t1.a

+------+
| T1.A |
+------+
|    1 |
|    2 |
+------+

Using Java 8 Streams, SELECT DISTINCT can be achieved very simply by using Stream.distinct() right after Stream.map(). The following examples are functionally equivalent:

SQL

SELECT DISTINCT t.v1 * 3, t.v2 + 5
FROM (
  VALUES(1, 1),
        (2, 2),
        (2, 2)
) t(v1, v2)

yielding

+----+----+
| c1 | c2 |
+----+----+
|  3 |  6 |
|  6 |  7 |
+----+----+

Java

Stream.of(
  tuple(1, 1),
  tuple(2, 2),
  tuple(2, 2)
).map(t -> tuple(t.v1 * 3, t.v2 + 5))
 .distinct()
 .forEach(System.out::println);

yielding

(3, 6)
(6, 7)

UNION ALL = concat()

Set operations are powerful both in SQL and using the Stream API. The UNION ALL operation maps to Stream.concat(), as can be seen below:

SQL

SELECT *
FROM (VALUES(1), (2)) t(v)
UNION ALL
SELECT *
FROM (VALUES(1), (3)) t(v)

yielding

+---+
| v |
+---+
| 1 |
| 2 |
| 1 |
| 3 |
+---+

Java

Stream<Integer> s1 = Stream.of(1, 2);
Stream<Integer> s2 = Stream.of(1, 3);

Stream.concat(s1, s2)
      .forEach(System.out::println);

yielding

1
2
1
3

Java (using jOOλ)

Unfortunately, concat() exists in Stream only as a static method, while Seq.concat() also exists on instances when working with jOOλ.

Seq<Integer> s1 = Seq.of(1, 2);
Seq<Integer> s2 = Seq.of(1, 3);

s1.concat(s2)
  .forEach(System.out::println);

UNION = concat() and distinct()

In SQL, UNION is defined to remove duplicates after concatenating the two sets via UNION ALL. The following two statements are equivalent:

SELECT * FROM t
UNION
SELECT * FROM u;

-- equivalent

SELECT DISTINCT *
FROM (
  SELECT * FROM t
  UNION ALL
  SELECT * FROM u
);

Let’s put this in action:

SQL

SELECT *
FROM (VALUES(1), (2)) t(v)
UNION
SELECT *
FROM (VALUES(1), (3)) t(v)

yielding

+---+
| v |
+---+
| 1 |
| 2 |
| 3 |
+---+

Java

Stream<Integer> s1 = Stream.of(1, 2);
Stream<Integer> s2 = Stream.of(1, 3);

Stream.concat(s1, s2)
      .distinct()
      .forEach(System.out::println);

ORDER BY = sorted()

The ORDER BY mapping is trivial

SQL

SELECT *
FROM (VALUES(1), (4), (3)) t(v)
ORDER BY v

yielding

+---+
| v |
+---+
| 1 |
| 3 |
| 4 |
+---+

Java

Stream<Integer> s = Stream.of(1, 4, 3);

s.sorted()
 .forEach(System.out::println);

yielding

1
3
4

LIMIT = limit()

The LIMIT mapping is even more trivial

SQL

SELECT *
FROM (VALUES(1), (4), (3)) t(v)
LIMIT 2

yielding

+---+
| v |
+---+
| 1 |
| 4 |
+---+

Java

Stream<Integer> s = Stream.of(1, 4, 3);

s.limit(2)
 .forEach(System.out::println);

yielding

1
4

OFFSET = skip()

The OFFSET mapping is trivial as well

SQL

SELECT *
FROM (VALUES(1), (4), (3)) t(v)
OFFSET 1

yielding

+---+
| v |
+---+
| 4 |
| 3 |
+---+

Java

Stream<Integer> s = Stream.of(1, 4, 3);

s.skip(1)
 .forEach(System.out::println);

yielding

4
3

Conclusion

In the above article, we’ve seen pretty much all the useful SQL SELECT query clauses and how they can be mapped to the Java 8 Stream API, or to jOOλ’s Seq API, in case Stream doesn’t offer sufficient functionality.

The article shows that SQL’s declarative world is not that much different from Java 8’s functional world. SQL clauses can compose ad-hoc queries just as well as Stream methods can be used to compose functional transformation pipelines. But there is a fundamental difference.

While SQL is truly declarative, functional programming is still very instructive. The Stream API does not make optimisation decisions based on constraints, indexes, histograms and other meta information about the data that you’re transforming. Using the Stream API is like using all possible optimisation hints in SQL to force the SQL engine to choose one particular execution plan over another. However, while SQL is a higher level algorithm abstraction, the Stream API may allow you to implement more customisable algorithms.

5 thoughts on “Common SQL Clauses and Their Equivalents in Java 8 Streams

    • Thanks for your nice words.

      “SQL? They still use that?”

      That’s the kind of words that all adolescents use when they try to emancipate from their parents / mentors. When they mature, a couple of years (or decades, at times) later, they will realise “heck, the parents hadn’t been that wrong”.

      I tend to quote Isaac Newton:

      If I have seen further, it is by standing on the shoulders of giants.

      SQL is built on one of the few unchallenged and solid foundations of our industry. The syntax may be weird and feel COBOL-esque at times, but the ideas are timeless. If anything will ever supersede SQL (it won’t), it will be SQL with a better, even more powerful syntax.

  1. A great article! Both Java 8 and JOOQ look powerful.

    If you can supplement “More sophisticated GROUP BY” using java 8 rather than jooq, I think it can be better.😀
    Also, I will try to implement it following you.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s