Using jOOλ to Combine Several Java 8 Collectors into One

With Java 8 being mainstream now, people start using Streams for everything, even in cases where that’s a bit exaggerated (a.k.a. completely nuts, if you were expecting a hyperbole here). For instance, take mykong’s article here, showing how to collect a Map’s entry set stream into a list of keys and a list of values:

http://www.mkyong.com/java8/java-8-convert-map-to-list

The code posted on mykong.com does it in two steps:

package com.mkyong.example;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ConvertMapToList {
    public static void main(String[] args) {
        Map<Integer, String> map = new HashMap<>();
        map.put(10, "apple");
        map.put(20, "orange");
        map.put(30, "banana");
        map.put(40, "watermelon");
        map.put(50, "dragonfruit");

        System.out.println("\n1. Export Map Key to List...");

        List<Integer> result = map.entrySet().stream()
                .map(x -> x.getKey())
                .collect(Collectors.toList());

        result.forEach(System.out::println);

        System.out.println("\n2. Export Map Value to List...");

        List<String> result2 = map.entrySet().stream()
                .map(x -> x.getValue())
                .collect(Collectors.toList());

        result2.forEach(System.out::println);
    }
}

This is probably not what you should do in your own code. First off, if you’re OK with iterating the map twice, the simplest way to collect a map’s keys and values would be this:

List<Integer> result1 = new ArrayList<>(map.keySet());
List<String> result2 = new ArrayList<>(map.values());

There’s absolutely no need to resort to Java 8 streams for this particular example. The above is about as simple and speedy as it gets.

Don’t shoehorn Java 8 Streams into every problem

But if you really want to use streams, then I would personally prefer a solution where you do it in one go. There’s no need to iterate the Map twice in this particular case. For instance, you could do it by using jOOλ’s Tuple.collectors() method, a method that combines two collectors into a new collector that returns a tuple of the individual collections.

Code speaks for itself more clearly than the above description. Mykong.com’s code could be replaced by this:

Tuple2<List<Integer>, List<String>> result = 
map.entrySet()
    .stream()
    .collect(Tuple.collectors(
        Collectors.mapping(Entry::getKey, Collectors.toList()),
        Collectors.mapping(Entry::getValue, Collectors.toList())
    ));

The only jOOλ code put in place here is the call to Tuple.collectors(), which combines the standard JDK collectors that apply mapping on the Map entries before collecting keys and values into lists.

When printing the above result, you’ll get:

([50, 20, 40, 10, 30], [dragonfruit, orange, watermelon, apple, banana])

i.e. a tuple containing the two resulting lists.

Even simpler, don’t use the Java 8 Stream API, use jOOλ’s Seq (for sequential stream) and write this shorty instead:

Tuple2<List<Integer>, List<String>> result = 
Seq.seq(map)
   .collect(
        Collectors.mapping(Tuple2::v1, Collectors.toList()),
        Collectors.mapping(Tuple2::v2, Collectors.toList())
   );

Where Collectable.collect(Collector, Collector) provides awesome syntax sugar over the previous example

Convinced? Get jOOλ here: https://github.com/jOOQ/jOOL

Cyclops-react Organises the Cambrian Explosion of Java 8 Libraries

We’re excited to announce another very interesting guest post on the jOOQ Blog by John Mcclean from AOL.

AOL is a global digital media and technology company, founded in 1985 and once known as America Online, AOL is now part of the Verizon Group. AOL focuses on four areas – video, mobile, ad technology and platforms, and open ecosystems. AOL connects publishers with advertisers across their global, programmatic platforms, tapping into Microsoft inventory and original content brands like TechCrunch, The Huffington Post and MAKERS.

johnmccleanJohn is an Architect at AOL. He works in the ad tech and platforms group, where he leads the advertising demand side forecasting team. A team that builds and runs a system that processes billions of RTB, impression and viewability records in realtime to generate price volume curves and other forecasts for advertising campaigns in milliseconds. John is also the lead developer for AOL open source projects cyclops-react and Microserver. Extracted from AOL’s forecasting system these projects allow AOL to rapidly deploy new features that work at scale, by guiding Java developers along the path of functional, reactive, microservices.

What is Cyclops-react?

The arrival of Lambda expressions and default methods in Java 8 heralded the biggest structural changes to the Java language in a decade. Building on top of this were some new cool APIs, such as Stream, Optional, CompletableFuture – finally Java developers could code in a more functional style. While this was very welcome, for many the enhancements did not quite go far enough.  

Stream, Optional, CompletableFuture all share the same abstract structure and obey the same rules. Yet the APIs don’t agree on common method names, never mind provide a common interface. For example Stream#map / Optional#map becomes CompletableFuture#thenApply. Also, the functionality added to Stream & Optional is missing from collections generally. Where is List#map ?

The JDK Stream implementation performs well, is totally lazy and well designed for extension, but provides only a limited subset of potential operators (constrained, perhaps, by a focus on data parallelism). Into the void stepped libraries such as jOOλ with its sequential Stream extension (called Seq). Seq adds many additional Streaming operators. jOOλ generally adds many missing functional features such as Tuples.

A core goal of cyclops-react, as well as adding original features such as FutureStreams, is to provide a mechanism for joining up both the JDK APIs and the third party functional libraries. There was a Cambrian explosion of cool libraries that emerged after the launch of Java 8. Libraries like vavr & Project Reactor. cyclops-react does this in the first instance by extending the JDK, and by leveraging other libraries such as  jOOλ, pCollections & Agrona. These libraries in turn also extend JDK interfaces where possible to add features such as Persistent Collections and wait free Many Producer Single Consumer Queues.

Beyond reusing and extending JDK interfaces our aims were to make it easy for developers to integrate with external libraries by making use of third party standards such as the reactive-streams API and by building our own abstractions where no set standard existed. The libraries we currently focus on integrating with are Google’s Guava, RxJava, Functional Java, Project Reactor and vavr. We’ve created abstractions for wrapping types like Stream, Optional & CompletableFuture – where no interface existed or was possible before. We chose these goals, because we are using cyclops-react in production across a Microservices architecture and being able to leverage the right technology for a problem and have it integrate smoothly with the rest of our code base is critical.

cyclops-react is quite a large feature rich project, and in addition has a number of integration modules. In the article below I’ll cover some of the available features with a particular goal of showing how cyclops-react helps join up the dots across the JDK and into the brave new world of the pace setting Java 8 open source community.
 

Extending the JDK

cyclops-react extends JDK APIs where possible. For example ReactiveSeq adds functionality for handling errors, asynchronous processing and much more extends extends both JDK Stream and jOOλ’s Seq. cyclops-react Collection extensions, rather than creating new collection implementations, implement and extend the appropriate JDK interfaces. cyclops-react LazyFutureStream in turn extends ReactiveSeq, and allows aggregate operations over Streams of Futures as if it were a simple Stream (this proves to be very useful for handling a large number typical Java I/O operations asynchronously and performantly).

ListX extends List, but adds operators that execute eagerly

ListX<Integer> tenTimes = ListX.of(1,2,3,4)
                               .map(i->i*10);

cyclops-react adds lots of operators for users to explore. We can, for example, apply functions across multiple collections at the same time

The reactive-streams API acts as a natural bridge between producers (publishers) of data and consumers (subscribers). All cyclops-react data types implement the Publisher interface from reactive-streams, and Subscriber implementations that can convert to any cyclops-react type are provided also. This makes direct integration with other reactive-streams based libraries, such as Project Reactor straightforward.

For example we can lazily populate a Reactor Flux from any cyclops publisher, such as SortedSetX, or populate a cyclops-react type from a Reactor type.

Flux<Integer> stream = Flux.from(
  SortedSetX.of(1,2,3,4,5,6,7,8));
//Flux[1,2,3,4,5,6,7,8]

ListX<Character> list = ListX.fromPublisher(
  Flux.just("a","b","c"));

Reactor Flux and Mono types can work directly with cyclops-react For comprehensions (each supported library also has their own set of native For comprehension classes in their integration module). 

// import static com.aol.cyclops.control.For.*;
        
Publishers.each2(
  Flux.just(1,2,3), 
  i -> ReactiveSeq.range(i,5),Tuple::tuple).printOut();
        
/*
(1, 1)
(1, 2)
(1, 3)
(1, 4)
(2, 2)
(2, 3)
(2, 4)
(3, 3)
(3, 4)
*/

A For comprehension is a way of managing nested iteration over types with flatMap and map methods, by cascading calls to the appropriate methods. In cyclops-react, nested statements can access the elements of the previous statements, so For comprehensions can be a very useful way of managing the behavior of existing. For example to ensure that calls to existing methods findId and loadData which may return null values, and will throw NPEs if provided with a null parameter we can make use of a For comprehension that will safely execute loadData only when an Optional with a value is returned from findId()

List<Data> data = 
For.optional(findId())
   .optional(this::loadData);
//loadData is only called if findId() returns a value

Similarly, a type such as Try could be used to handle exceptional results from either findId or loadData, Futures can be used to execute chained methods asynchronously and so on.

Building cross-library abstractions

Java 8 introduced Monads to Java (Stream, Optional, CompletableFuture), but didn’t provide a common interface that would help reuse, in fact the method names used in CompletableFuture differ significantly from those used in Optional & Stream for the same function. So map became thenApply and flatMap thenCompose. Across the Java 8 world monads are becoming an increasingly common pattern, but there is often no way to abstract across them. In cyclops-react, rather than attempt to define an interface to represent monads, we built a set of wrapper interfaces and a number of custom adapters to adapt different instances from across the main functional-style libraries for Java 8 to those wrappers. The wrappers extend AnyM (short for Any Monad) and there are two sub-interfaces – AnyMValue which represents any monadic type that resolves to a single value (like Optional or CompletableFuture) or AnyMSeq that ultimately resolves to a sequence of values (like a Stream or List). The cyclops extension wrappers provide a mechanism to wrap the types from RxJava, Guava, Reactor, FunctionalJava and vavr.

//We can wrap any type from Reactor, RxJava,
//FunctionalJava, vavr, Guava
AnyMSeq<Integer> wrapped = 
  Fj.list(List.list(1,2,3,4,5));

//And manipulate it
AnyMSeq<Integer> timesTen = wrapped.map(i->i*10); 

cyclops-react provides a common set of interfaces that these wrappers (and other cyclops-react types) inherit from, allowing developers to write more generic reusable code. AnyM extends reactive-streams publishers, meaning you can make any vavr, Guava, FunctionalJava or RxJava type a reactive-streams publisher with cyclops-react.

AnyMSeq<Integer> wrapped = 
  Javaslang.traversable(List.of(1,2,3,4,5));

//The wrapped type is a reactive-streams publisher
Flux<Integer> fromVavr = Flux.from(wrapped);

wrapped.forEachWithError(
  System.out::println,
  System.out::err);
  

Furthermore the reactive functionality from cyclops-react is provided directly on the AnyM types. This means we can, for example, schedule data emission from a vavr or FunctionalJava Stream – or execute a reduce operation lazily, or asynchronously.


AnyMSeq<Integer> wrapped = 
  Javaslang.traversable(Stream.of(1,2,3,4,5));

CompletableFuture<Integer> asyncResult = 
  wrapped.futureOperations(Executors.newFixedThreadPool(1))
         .reduce(50, (acc, next) -> acc + next);
//CompletableFuture[1550]

AnyMSeq<Integer> wrapped = 
  FJ.list(list.list(1,2,3,4,5));

Eval<Integer> lazyResult = 
  wrapped.map(i -> i * 10)
         .lazyOperations()
         .reduce(50, (acc,next) -> acc + next);
//Eval[15500]

HotStream<Integer> emitting = wrapped.schedule(
  "0 * * * * ?", 
  Executors.newScheduledThreadPool(1));

emitting.connect()
        .debounce(1,TimeUnit.DAYS)
        .forEachWithError(
           this::logSuccess,
           this::logFailure);

Theres a lot to explore both in cyclops-react and in the new broader Java 8 eco-system, hopefully you’ll have a fun adventure playing with, learning from and extending the Java 8 boundaries yourself!

How to Pattern-Match Files and Display Adjacent Lines in Java

Recently, we’ve published our article about the awesome window function support in jOOλ 0.9.9, which I believe is some of the best additions to the library that we’ve ever done.

Today, we’ll look into an awesome application of window functions in a use-case that is inspired by this Stack Overflow question Sean Nguyen:

How to get lines before and after matching from java 8 stream like grep?

I have a text files that have a lot of string lines in there. If I want to find lines before and after a matching in grep, I will do like this:

grep -A 10 -B 10 "ABC" myfile.txt

How can I implement the equivalent in java 8 using streams?

So the question is:

How can I implement the equivalent in Java 8 using streams?

jOOλ - The Missing Parts in Java 8 jOOλ improves the JDK libraries in areas where the Expert Group's focus was elsewhere.Well, the unix shell and its various “pipable” commands are about the only thing that are even more awesome (and mysterious) than window functions. Being able to grep for a certain string in a file, and then display a “window” of a couple of lines is quite useful.

With jOOλ 0.9.9, however, we can do that very easily in Java 8 as well. Consider this little snippet:

Seq.seq(Files.readAllLines(Paths.get(
        new File("/path/to/Example.java").toURI())))
   .window()
   .filter(w -> w.value().contains("ABC"))
   .forEach(w -> {
       System.out.println();
       System.out.println("-1:" + w.lag().orElse(""));
       System.out.println(" 0:" + w.value());
       System.out.println("+1:" + w.lead().orElse(""));
       // ABC: Just checking
   });

This program will output:

-1: .window()
 0: .filter(w -> w.value().contains("ABC"))
+1: .forEach(w -> {

-1:     System.out.println("+1:" + w.lead().orElse(""));
 0:     // ABC: Just checking
+1: });

So, I’ve run the program on itself and I’ve found all the lines that match “ABC”, plus the previous lines (“lagging” / lag()) and the following lines (leading / lead()). These lead() and lag() functions work just like their SQL equivalents.

But unlike SQL, composing functions in Java (or other general purpose languages) is a bit simpler as there is less syntax clutter involved. We can easily do aggregations over a window frame to collect a generic amount of lines “lagging” and “leading” a match. Consider the following alternative:

int lower = -5;
int upper =  5;
        
Seq.seq(Files.readAllLines(Paths.get(
        new File("/path/to/Example.java").toURI())))
   .window(lower, upper)
   .filter(w -> w.value().contains("ABC"))
   .map(w -> w.window()
              .zipWithIndex()
              .map(t -> tuple(t.v1, t.v2 + lower))
              .map(t -> (t.v2 > 0 
                       ? "+" 
                       : t.v2 == 0 
                       ? " " : "") 
                       + t.v2 + ":" + t.v1)
              .toString("\n"))

And the output that we’re getting is this:

-5:int upper =  5;
-4:        
-3:Seq.seq(Files.readAllLines(Paths.get(
-2:        new File("/path/to/Example.java").toURI())))
-1:   .window(lower, upper)
 0:   .filter(w -> w.value().contains("ABC"))
+1:   .map(w -> w.window()
+2:              .zipWithIndex()
+3:              .map(t -> tuple(t.v1, t.v2 + lower))
+4:              .map(t -> (t.v2 > 0 
+5:                       ? "+" 

Could it get any more concise? I don’t think so. Most of the logic above was just generating the index next to the line.

Conclusion

Window functions are extremely powerful. The recent discussion on reddit about our previous article on jOOλ’s window function support has shown that other languages also support primitives to build similar functionality. But usually, these building blocks aren’t as concise as the ones exposed in jOOλ, which are inspired by SQL.

With jOOλ mimicking SQL’s window functions, there is only little cognitive friction when composing powerful operations on in memory data streams.

Learn more about window functions in these articles here:

2016 Will be the Year Remembered as When Java Finally Had Window Functions!

You heard right. Up until now, the awesome window functions were a feature uniquely reserved to SQL. Even sophisticated functional programming languages still seem to lack this beautiful functionality (correct me if I’m wrong, Haskell folks).

We’ve written tons of blog posts about window functions, evangelising them to our audience, in articles like:

One of my favourite example use-cases for window functions is the running total. I.e. to get from the following bank account transaction table:

| ID   | VALUE_DATE | AMOUNT |
|------|------------|--------|
| 9997 | 2014-03-18 |  99.17 |
| 9981 | 2014-03-16 |  71.44 |
| 9979 | 2014-03-16 | -94.60 |
| 9977 | 2014-03-16 |  -6.96 |
| 9971 | 2014-03-15 | -65.95 |

… to this one, with a calculated balance:

| ID   | VALUE_DATE | AMOUNT |  BALANCE |
|------|------------|--------|----------|
| 9997 | 2014-03-18 |  99.17 | 19985.81 |
| 9981 | 2014-03-16 |  71.44 | 19886.64 |
| 9979 | 2014-03-16 | -94.60 | 19815.20 |
| 9977 | 2014-03-16 |  -6.96 | 19909.80 |
| 9971 | 2014-03-15 | -65.95 | 19916.76 |

With SQL, this is a piece of cake. Observe the usage of SUM(t.amount) OVER(...):

SELECT
  t.*,
  t.current_balance - NVL(
    SUM(t.amount) OVER (
      PARTITION BY t.account_id
      ORDER BY     t.value_date DESC,
                   t.id         DESC
      ROWS BETWEEN UNBOUNDED PRECEDING
           AND     1         PRECEDING
    ),
  0) AS balance
FROM     v_transactions t
WHERE    t.account_id = 1
ORDER BY t.value_date DESC,
         t.id         DESC

How do window functions work?

(don’t forget to book our SQL Masterclass to learn about window functions, and much more!)

Despite the sometimes a bit scary syntax, window functions are really very easy to understand. Windows are “views” of the data produced in your FROM / WHERE / GROUP BY / HAVING clauses. They allow you to access all the other rows relative to the current row, while you calculate something in your SELECT clause (or rarely, in your ORDER BY clause). What the above statement really does is this:

| ID   | VALUE_DATE |  AMOUNT |  BALANCE |
|------|------------|---------|----------|
| 9997 | 2014-03-18 | -(99.17)|+19985.81 |
| 9981 | 2014-03-16 | -(71.44)| 19886.64 |
| 9979 | 2014-03-16 |-(-94.60)| 19815.20 |
| 9977 | 2014-03-16 |   -6.96 |=19909.80 |
| 9971 | 2014-03-15 |  -65.95 | 19916.76 |

I.e. for any given balance, subtract from the current balance the SUM()OVER()” the window of all the rows that are in the same partition as the current row (same bank account), and that are strictly “above” the current row.

Or, in detail:

  • PARTITION BY specifies “OVER()” which rows the window spans
  • ORDER BY specifies how the window is ordered
  • ROWS specifies what ordered row indexes should be considered

Can we do this with Java collections?

jOOλ - The Missing Parts in Java 8 jOOλ improves the JDK libraries in areas where the Expert Group's focus was elsewhere.Yes we can! If you’re using jOOλ: A completely free Open Source, Apache 2.0 licensed library that we designed because we thought that the JDK 8 Stream and Collector APIs just don’t do it.

When Java 8 was designed, a lot of focus went into supporting parallel streams. That’s nice but certainly not the only useful area where functional programming can be applied. We’ve created jOOλ to fill this gap – without implementing an all new, alternative collections API, such as vavr or functional java have.

jOOλ already provides:

  1. Tuple types
  2. More useful stuff for ordered, sequential-only streams

With the recently released jOOλ 0.9.9, we’ve added two main new features:

  1. Tons of new Collectors
  2. Window functions

The many missing collectors in the JDK

The JDK ships with a couple of collectors, but they do seem awkward and verbose, and no one really appreciates writing collectors like the ones exposed in this Stack Overflow question (and many others).

But the use case exposed in the linked question is a very valid one. You want to aggregate several things from a list of person:

public class Person {
    private String firstName;
    private String lastName;
    private int age;
    private double height;
    private double weight;
    // getters / setters

Assuming you have this list:

List<Person> personsList = new ArrayList<Person>();

personsList.add(new Person("John", "Doe", 25, 1.80, 80));
personsList.add(new Person("Jane", "Doe", 30, 1.69, 60));
personsList.add(new Person("John", "Smith", 35, 174, 70));

You now want to get the following aggregations:

  • Number of people
  • Max age
  • Min height
  • Avg weight

This is a ridiculous problem for anyone used to writing SQL:

SELECT count(*), max(age), min(height), avg(weight)
FROM person

Done. How hard can it be in Java? It turns out that a lot of glue code needs to be written with vanilla JDK 8 API. Consider the sophisticated answers given

With jOOλ 0.9.9, solving this problem becomes ridiculously trivial again, and it reads almost like SQL:

Tuple result =
Seq.seq(personsList)
   .collect(
       count(),
       max(Person::getAge),
       min(Person::getHeight),
       avg(Person::getWeight)
   );

System.out.println(result);

And the result yields:

(3, Optional[35], Optional[1.69], Optional[70.0])

Note that this isn’t running a query against a SQL database (that’s what jOOQ is for). We’re running this “query” against an in-memory Java collection.

OK ok, that’s already awesome. Now what about window functions?

Right, the title of this article didn’t promise trivial aggregation stuff. It promised the awesome window functions.

Yet, window functions are nothing else than aggregations (or rankings) on a subset of your data stream. Instead of aggregating all of the stream (or table) into a single record, you want to maintain the original records, and provide the aggregation on each individual record directly.

A nice introductory example for window functions is the one provided in this article that explains the difference between ROW_NUMBER(), RANK(), and DENSE_RANK(). Consider the following PostgreSQL query:

SELECT
  v, 
  ROW_NUMBER() OVER(w),
  RANK()       OVER(w),
  DENSE_RANK() OVER(w)
FROM (
  VALUES('a'),('a'),('a'),('b'),
        ('c'),('c'),('d'),('e')
) t(v)
WINDOW w AS (ORDER BY v);

It yields:

| V | ROW_NUMBER | RANK | DENSE_RANK |
|---|------------|------|------------|
| a |          1 |    1 |          1 |
| a |          2 |    1 |          1 |
| a |          3 |    1 |          1 |
| b |          4 |    4 |          2 |
| c |          5 |    5 |          3 |
| c |          6 |    5 |          3 |
| d |          7 |    7 |          4 |
| e |          8 |    8 |          5 |

The same can be done in Java 8 using jOOλ 0.9.9

System.out.println(
    Seq.of("a", "a", "a", "b", "c", "c", "d", "e")
       .window(naturalOrder())
       .map(w -> tuple(
            w.value(),
            w.rowNumber(),
            w.rank(),
            w.denseRank()
       ))
       .format()
);

Yielding…

+----+----+----+----+
| v0 | v1 | v2 | v3 |
+----+----+----+----+
| a  |  0 |  0 |  0 |
| a  |  1 |  0 |  0 |
| a  |  2 |  0 |  0 |
| b  |  3 |  3 |  1 |
| c  |  4 |  4 |  2 |
| c  |  5 |  4 |  2 |
| d  |  6 |  6 |  3 |
| e  |  7 |  7 |  4 |
+----+----+----+----+

Again, do note that we’re not running any queries against a database. Everything is done in memory.

Notice two things:

  • jOOλ’s window functions return 0-based ranks, as is expected for Java APIs, as opposed to SQL, which is all 1-based.
  • In Java, it is not possible to construct ad-hoc records with named columns. That’s unfortunate, and I do hope a future Java will provide support for such language features.

Let’s review what happens exactly in the code:

System.out.println(

    // This is just enumerating our values
    Seq.of("a", "a", "a", "b", "c", "c", "d", "e")

    // Here, we specify a single window to be
    // ordered by the value T in the stream, in
    // natural order
       .window(naturalOrder())

    // The above window clause produces a Window<T>
    // object (the w here), which exposes...
       .map(w -> tuple(

    // ... the current value itself, of type String...
            w.value(),

    // ... or various rankings or aggregations on
    // the above window.
            w.rowNumber(),
            w.rank(),
            w.denseRank()
       ))

    // Just some nice formatting to produce the table
       .format()
);

That’s it! Easy, isn’t it?

We can do more! Check this out:

System.out.println(
    Seq.of("a", "a", "a", "b", "c", "c", "d", "e")
       .window(naturalOrder())
       .map(w -> tuple(
            w.value(),   // v0 
            w.count(),   // v1
            w.median(),  // v2
            w.lead(),    // v3
            w.lag(),     // v4
            w.toString() // v5
       ))
       .format()
);

What does the above yield?

+----+----+----+---------+---------+----------+
| v0 | v1 | v2 | v3      | v4      | v5       |
+----+----+----+---------+---------+----------+
| a  |  1 | a  | a       | {empty} | a        |
| a  |  2 | a  | a       | a       | aa       |
| a  |  3 | a  | b       | a       | aaa      |
| b  |  4 | a  | c       | a       | aaab     |
| c  |  5 | a  | c       | b       | aaabc    |
| c  |  6 | a  | d       | c       | aaabcc   |
| d  |  7 | b  | e       | c       | aaabccd  |
| e  |  8 | b  | {empty} | d       | aaabccde |
+----+----+----+---------+---------+----------+

Your analytics heart should be jumping, now.

4376565[1]

Wait a second. Can we do frames, too, as in SQL? Yes, we can. Just as in SQL, when we omit the frame clause on a window definition (but we do specify an ORDER BY clause), then the following is applied by default:

RANGE BETWEEN UNBOUNDED PRECEDING
  AND CURRENT ROW

We’ve done this in the previous examples. It can be seen in column v5, where we aggregate the string from the very first value up until the current value. So, let’s specify the frame then:

System.out.println(
    Seq.of("a", "a", "a", "b", "c", "c", "d", "e")
       .window(naturalOrder(), -1, 1) // frame here
       .map(w -> tuple(
            w.value(),   // v0
            w.count(),   // v1
            w.median(),  // v2
            w.lead(),    // v3
            w.lag(),     // v4
            w.toString() // v5
       ))
       .format()
);

And the result is, trivially:

+----+----+----+---------+---------+-----+
| v0 | v1 | v2 | v3      | v4      | v5  |
+----+----+----+---------+---------+-----+
| a  |  2 | a  | a       | {empty} | aa  |
| a  |  3 | a  | a       | a       | aaa |
| a  |  3 | a  | b       | a       | aab |
| b  |  3 | b  | c       | a       | abc |
| c  |  3 | c  | c       | b       | bcc |
| c  |  3 | c  | d       | c       | ccd |
| d  |  3 | d  | e       | c       | cde |
| e  |  2 | d  | {empty} | d       | de  |
+----+----+----+---------+---------+-----+

As expected, lead() and lag() are unaffected, as opposed to count(), median(), and toString()

Awesome! Now, let’s review the running total.

Often, you don’t calculate window functions on the scalar value of the stream itself, as that value is usually not a scalar value but a tuple (or a POJO in Java-speak). Instead, you extract values from the tuple (or POJO) and perform the aggregation on that. So, again, when calculating the BALANCE, we need to extract the AMOUNT first.

| ID   | VALUE_DATE |  AMOUNT |  BALANCE |
|------|------------|---------|----------|
| 9997 | 2014-03-18 | -(99.17)|+19985.81 |
| 9981 | 2014-03-16 | -(71.44)| 19886.64 |
| 9979 | 2014-03-16 |-(-94.60)| 19815.20 |
| 9977 | 2014-03-16 |   -6.96 |=19909.80 |
| 9971 | 2014-03-15 |  -65.95 | 19916.76 |

Here’s how you would write the running total with Java 8 and jOOλ 0.9.9

BigDecimal currentBalance = new BigDecimal("19985.81");

Seq.of(
    tuple(9997, "2014-03-18", new BigDecimal("99.17")),
    tuple(9981, "2014-03-16", new BigDecimal("71.44")),
    tuple(9979, "2014-03-16", new BigDecimal("-94.60")),
    tuple(9977, "2014-03-16", new BigDecimal("-6.96")),
    tuple(9971, "2014-03-15", new BigDecimal("-65.95")))
.window(Comparator
    .comparing((Tuple3<Integer, String, BigDecimal> t) 
        -> t.v1, reverseOrder())
    .thenComparing(t -> t.v2), Long.MIN_VALUE, -1)
.map(w -> w.value().concat(
     currentBalance.subtract(w.sum(t -> t.v3)
                              .orElse(BigDecimal.ZERO))
));

Yielding

+------+------------+--------+----------+
|   v0 | v1         |     v2 |       v3 |
+------+------------+--------+----------+
| 9997 | 2014-03-18 |  99.17 | 19985.81 |
| 9981 | 2014-03-16 |  71.44 | 19886.64 |
| 9979 | 2014-03-16 | -94.60 | 19815.20 |
| 9977 | 2014-03-16 |  -6.96 | 19909.80 |
| 9971 | 2014-03-15 | -65.95 | 19916.76 |
+------+------------+--------+----------+

A couple of things have changed here:

  • The comparator now takes two comparisons into account. Unforunately JEP-101 wasn’t entirely implemented, which is why we need to help the compiler with type inference here.
  • The Window.value() is now a tuple, not a single value. So we need to extract the interesting column from it, the AMOUNT (via t -> t.v3). On the other hand, we can simply concat() that additional value to the tuple

But that’s already it. Apart from the verbosity of the comparator (which we’ll certainly address in a future jOOλ version), writing a window function is a piece of cake.

What else can we do?

This article is not a complete description of all we can do with the new API. We’ll soon write a follow-up blog post with additional examples. For instance:

  • The partition by clause wasn’t described, but is available too
  • You can specify many more windows than the single window exposed here, each with individual PARTITION BY, ORDER BY and frame specifications

Also, the current implementation is rather canonical, i.e. it doesn’t (yet) cache aggregations:

  • For unordered / unframed windows (same value for all the partition)
  • Strictly ascendingly framed windows (aggregation can be based on previous value, for associative collectors like SUM(), or toString())

That’s it from our part. Download jOOλ, play around with it and enjoy the fact that the most awesome SQL feature is now available for all of you Java 8 developers!
https://github.com/jOOQ/jOOL

The Danger of Subtype Polymorphism Applied to Tuples

Java 8 has lambdas and streams, but no tuples, which is a shame. This is why we have implemented tuples in jOOλ – Java 8’s missing parts. Tuples are really boring value type containers. Essentially, they’re just an enumeration of types like these:

public class Tuple2<T1, T2> {
    public final T1 v1;
    public final T2 v2;

    public Tuple2(T1 v1, T2 v2) {
        this.v1 = v1;
        this.v2 = v2;
    }

    // [...]
}


public class Tuple3<T1, T2, T3> {
    public final T1 v1;
    public final T2 v2;
    public final T3 v3;

    public Tuple3(T1 v1, T2 v2, T3 v3) {
        this.v1 = v1;
        this.v2 = v2;
        this.v3 = v3;
    }

    // [...]
}

Writing tuple classes is a very boring task, and it’s best done using a source code generator.

Tuples in other languages and APIs

jOOλ‘s current version features tuples of degrees 0 – 16. C# and other .NET languages have tuple types between 1 – 8. There’s a special library just for tuples called Javatuples with tuples between degrees 1 and 10, and the authors went the extra mile and gave the tuples individual English names:

Unit<A> // (1 element)
Pair<A,B> // (2 elements)
Triplet<A,B,C> // (3 elements)
Quartet<A,B,C,D> // (4 elements)
Quintet<A,B,C,D,E> // (5 elements)
Sextet<A,B,C,D,E,F> // (6 elements)
Septet<A,B,C,D,E,F,G> // (7 elements)
Octet<A,B,C,D,E,F,G,H> // (8 elements)
Ennead<A,B,C,D,E,F,G,H,I> // (9 elements)
Decade<A,B,C,D,E,F,G,H,I,J> // (10 elements)

Why?

because Ennead really rings that sweet bell when I see it

Last, but not least, jOOQ also has a built-in tuple-like type, the org.jooq.Record, which serves as a base type for nice subtypes like Record7<T1, T2, T3, T4, T5, T6, T7>. jOOQ follows Scala and defines records up to a degree of 22.

Watch out when defining tuple type hierarchies

As we have seen in the previous example, Tuple3 has much code in common with Tuple2.

As we’re all massively brain-damaged by decades of object orientation and polymorphic design anti-patters, we might think that it would be a good idea to let Tuple3<T1, T2, T3> extend Tuple2<T1, T2>, as Tuple3 just adds one more attribute to the right of Tuple2, right? So…

public class Tuple3<T1, T2, T3> extends Tuple2<T1, T2> {
    public final T3 v3;

    public Tuple3(T1 v1, T2 v2, T3 v3) {
        super(v1, v2);
        this.v3 = v3;
    }

    // [...]
}

The truth is: That’s about the worst thing you could do, for several reasons. First off, yes. Both Tuple2 and Tuple3 are tuples, so they do have some common features. It’s not a bad idea to group those features in a common super type, such as:

public class Tuple2<T1, T2> implements Tuple {
    // [...]
}

But the degree is not one of those things. Here’s why:

Permutations

Think about all the possible tuples that you can form. If you let tuples extend each other, then a Tuple5 would also be assignment-compatible with a Tuple2, for instance. The following would compile perfectly:

Tuple2<String, Integer> t2 = tuple("A", 1, 2, 3, "B");

When letting Tuple3 extend Tuple2, it may have seemed like a good default choice to just drop the right-most attribute from the tuple in the extension chain.

But in the above example, why don’t I want to re-assign (v2, v4) such that the result is (1, 3), or maybe (v1, v3), such that the result is ("A", 2)?

There are a tremendous amount of permutations of possible attributes that could be of interest when “reducing” a higher degree tuple to a lower degree one. No way a default of dropping the right-most attribute will be sufficiently general for all use-cases

Type systems

Much worse than the above, there would be drastic implications for the type system, if Tuple3 extended Tuple2. Check out the jOOQ API, for instance. In jOOQ, you can safely assume the following:

// Compiles:
TABLE1.COL1.in(select(TABLE2.COL1).from(TABLE2))

// Must not compile:
TABLE1.COL1.in(select(TABLE2.COL1, TABLE2.COL2).from(TABLE2))

The first IN predicate is correct. The left hand side of the predicate has a single column (as opposed to being a row value expression). This means that the right hand side of the predicate must also operate on single-column expressions, e.g. a SELECT subquery that selects a single column (of the same type).

The second example selects too many columns, and the jOOQ API will tell the Java compiler that this is wrong.

This is guaranteed by jOOQ via the Field.in(Select) method, whose signature reads:

public interface Field<T> {
    ...
    Condition in(Select<? extends Record1<T>> select);
    ...
}

So, you can provide a SELECT statement that produces any subtype of the Record1<T> type.

Luckily, Record2 does not extend Record1

If now Record2 extended Record1, which might have seemed like a good idea at first, the second query would suddenly compile:

// This would now compile
TABLE1.COL1.in(select(TABLE2.COL1, TABLE2.COL2).from(TABLE2))

… even if it forms an invalid SQL statement. It would compile because it would generate a Select<Record2<Type1, Type2>> type, which would be a subtype of the expected Select<Record1<Type1>> from the Field.in(Select) method.

Conclusion

Tuple2 and Tuple5 types are fundamentally incompatible types. In strong type systems, you mustn’t be lured into thinking that similar types, or related types should also be compatible types.

Type hierarchies are something very object-oriented, and by object-oriented, I mean the flawed and over-engineered notion of object orientation that we’re still suffering from since the 90s. Even in “the Enterprise”, most people have learned to favour Composition over Inheritance. Composition in the case of tuples means that you can well transform a Tuple5 to a Tuple2. But you cannot assign it.

In jOOλ, such a transformation can be done very easily as follows:

// Produces (1, 3)
Tuple2<String, Integer> t2_4 = 
    tuple("A", 1, 2, 3, "B")
    .map((v1, v2, v3, v4, v5) -> tuple(v2, v4));

// Produces ("A", 2)
Tuple2<String, Integer> t1_3 = 
    tuple("A", 1, 2, 3, "B")
    .map((v1, v2, v3, v4, v5) -> tuple(v1, v3));

The idea is that you operate on immutable values, and that you can easily extract parts of those values and map / recombine them to new values.

Read more

If you’ve enjoyed reading this article, you might also like to learn why recursive generics are a terrible idea (in many situations).

How to use Java 8 Functional Programming to Generate an Alphabetic Sequence

I’ve stumbled upon an interesting Stack Overflow question by user “mip”. The question was:

I’m looking for a way of generating an alphabetic sequence:

A, B, C, ..., Z, AA, AB, AC, ..., ZZ.

This can be quickly recognised as the headings of an Excel spreadsheet, which does precisely that:

excel.

So far, none of the answers employed any Java 8 functional programming, which I accepted as a challenge. We’re going to use jOOλ, because the Java 8 Stream API does not offer enough functionality for this task. (I stand corrected – thank you Sebastian, for this interesting answer)

But first, let’s decompose the algorithm in a functional way. What we need are these components:

  1. A (reproducible) representation of the alphabet
  2. An upper bound, i.e. how many letters we want to produce. The requested sequence goes to ZZ, which means the upper bound would be 2
  3. A way to combine each letter of the alphabet with the previously generated combined letters in a cartesian product

Let’s look into some code:

1. Generating the alphabet

We could be writing the alphabet like this:

List<String> alphabet = Arrays.asList("A", "B", ..., "Z");

but that would be lame. Let’s generate it instead, using jOOλ:

List<String> alphabet = Seq
    .rangeClosed('A', 'Z')
    .map(Object::toString)
    .toList();

The above generates a “closed” range (Java-8-Stream-speak for a range with inclusive upper bound) of characters between A and Z, maps characters to strings and collects them into a list.

So far so good. Now:

2. Using an upper bound

The requested sequence of characters includes:

A .. Z, AA, AB, .. ZZ

But we could easily imagine to extend this requirement generally to produce the following, or even more.

A .. Z, AA, AB, .. ZZ, AAA, AAB, .. ZZZ

For this, we’ll use again rangeClosed():

// 1 = A .. Z, 2 = AA .. ZZ, 3 = AAA .. ZZZ
Seq.rangeClosed(1, 2)
   .flatMap(length -> ...)
   .forEach(System.out::println);

The idea here is to produce a new stream for each individual length in the range [1 .. 2], and to flatten those streams into one single stream. flatMap() is essentially the same as a nested loop in imperative programming.

3. Combine letters in a cartesian product

This is the trickiest part: We need to combine each letter with each letter length times. For this, we’ll use the following stream:

Seq.rangeClosed(1, length - 1)
   .foldLeft(Seq.seq(alphabet), (s, i) -> 
       s.crossJoin(Seq.seq(alphabet))
        .map(t -> t.v1 + t.v2))
    );

We’re using again rangeClosed() to produce values in the range [1 .. length-1]. foldLeft() is the same as reduce(), except that foldLeft() is guaranteed to go from “left to right” in a stream, without requiring the folding function to be associative. Whew.

In other, more understandable words: foldLeft() is nothing else but an imperative loop. The “seed” of the loop, i.e. the loop’s initial value, is a complete alphabet (Seq.seq(alphabet)). Now, for every value in the range [1 .. length-1], we produce a cartesian product (crossJoin()) between the letters “folded” so far and a new alphabet, and we concatenate each combination into a single new string (t.v1 and t.v2).

That’s it!

Combining everything

The following simple program prints all the values from A .. Z, AA .. ZZ, AAA .. ZZZ to the console:

import java.util.List;

import org.jooq.lambda.Seq;

public class Test {
    public static void main(String[] args) {
        int max = 3;

        List<String> alphabet = Seq
            .rangeClosed('A', 'Z')
            .map(Object::toString)
            .toList();

        Seq.rangeClosed(1, max)
           .flatMap(length ->
               Seq.rangeClosed(1, length - 1)
                  .foldLeft(Seq.seq(alphabet), (s, i) -> 
                      s.crossJoin(Seq.seq(alphabet))
                       .map(t -> t.v1 + t.v2)))
           .forEach(System.out::println);
    }
}

Disclaimer

This is certainly not the most optimal algorithm for this particular case. One of the best implementations has been given by an unnamed user on Stack Overflow:

import static java.lang.Math.*;

private static String getString(int n) {
    char[] buf = new char[(int) floor(log(25 * (n + 1)) / log(26))];
    for (int i = buf.length - 1; i >= 0; i--) {
        n--;
        buf[i] = (char) ('A' + n % 26);
        n /= 26;
    }
    return new String(buf);
}

Unnecessary to say that the latter runs much much faster than the previous functional algorithm.

Common SQL Clauses and Their Equivalents in Java 8 Streams

Functional programming allows for quasi-declarative programming in a general purpose language. By using powerful fluent APIs like Java 8’s Stream API, or jOOλ’s sequential Stream extension Seq or more sophisticated libraries like vavr or functionaljava, we can express data transformation algorithms in an extremely concise way. Compare Mario Fusco’s imperative and functional version of the same algorithm:

Using such APIs, functional programming certainly feels like true declarative programming.

The most popular true declarative programming language is SQL. When you join two tables, you don’t tell the RDBMS how to implement that join. It may decide at its discretion whether a nested loop, merge join, hash join, or some other algorithm is the most suitable in the context of the complete query and of all the available meta information. This is extremely powerful because the performance assumptions that are valid for a simple join may no longer be valid for a complex one, where a different algorithm would outperform the original one. By this abstraction, you can just easily modify a query in 30 seconds, without worrying about low-level details like algorithms or performance.

When an API allows you to combine both (e.g. jOOQ and Streams), you will get the best of both worlds – and those worlds aren’t too different.

In the following sections, we’ll compare common SQL constructs with their equivalent expressions written in Java 8 using Streams and jOOλ, in case the Stream API doesn’t offer enough functionality.

Tuples

For the sake of this article, we’re going to assume that SQL rows / records have an equivalent representation in Java. For this, we’ll be using jOOλ’s Tuple type, which is essentially:

public class Tuple2<T1, T2> {

    public final T1 v1;
    public final T2 v2;

    public Tuple2(T1 v1, T2 v2) {
        this.v1 = v1;
        this.v2 = v2;
    }
}

… plus a lot of useful gimmicks like Tuple being Comparable, etc.

Note that we’re assuming the following imports in this and all subsequent examples.

import static org.jooq.lambda.Seq.*;
import static org.jooq.lambda.tuple.Tuple.*;

import java.util.*;
import java.util.function.*;
import java.util.stream.*;

import org.jooq.lambda.*;

Much like SQL rows, a tuple is a “value-based” type, meaning that it doesn’t really have an identity. Two tuples (1, 'A') and (1, 'A') can be considered exactly equivalent. Removing identity from the game makes SQL and functional programming with immutable data structures extremely elegant.

FROM = of(), stream(), etc.

In SQL, the FROM clause logically (but not syntactically) precedes all the other clauses. It is used to produce a set of tuples from at least one table, possibly multiple joined tables. A single-table FROM clause can be trivially mapped to Stream.of(), for instance, or to any other method that simply produces a stream:

SQL

SELECT *
FROM (
  VALUES(1, 1),
        (2, 2)
) t(v1, v2)

yielding

+----+----+
| v1 | v2 |
+----+----+
|  1 |  1 |
|  2 |  2 |
+----+----+

Java

Stream.of(
  tuple(1, 1),
  tuple(2, 2)
).forEach(System.out::println);

yielding

(1, 1)
(2, 2)

CROSS JOIN = flatMap()

Selecting from multiple tables is already more interesting. The easiest way to combine two tables in SQL is by producing a cartesian product, either via a table list or using a CROSS JOIN. The following two are equivalent SQL statements:

SQL

-- Table list syntax
SELECT *
FROM (VALUES( 1 ), ( 2 )) t1(v1), 
     (VALUES('A'), ('B')) t2(v2)

-- CROSS JOIN syntax
SELECT *
FROM       (VALUES( 1 ), ( 2 )) t1(v1)
CROSS JOIN (VALUES('A'), ('B')) t2(v2)

yielding

+----+----+
| v1 | v2 |
+----+----+
|  1 |  A |
|  1 |  B |
|  2 |  A |
|  2 |  B |
+----+----+

In a cross join (or cartesian product), every value from t1 is combined with every value from t2 producing size(t1) * size(t2) rows in total.

Java

In functional programming using Java 8’s Stream, the Stream.flatMap() method corresponds to SQL CROSS JOIN as can be seen in the following example:

List<Integer> s1 = Stream.of(1, 2);
Supplier<Stream<String>> s2 = ()->Stream.of("A", "B");

s1.flatMap(v1 -> s2.get()
                   .map(v2 -> tuple(v1, v2)))
  .forEach(System.out::println);

yielding

(1, A)
(1, B)
(2, A)
(2, B)

Note how we have to wrap the second stream in a Supplier because streams can be consumed only once, but the above algorithm is really implementing a nested loop, combining all elements of stream s2 with each element from stream s1. An alternative would be not to use streams but lists (which we will do in subsequent examples, for simplicity):

List<Integer> s1 = Arrays.asList(1, 2);
List<String> s2 = Arrays.asList("A", "B");

s1.stream()
  .flatMap(v1 -> s2.stream()
                   .map(v2 -> tuple(v1, v2)))
  .forEach(System.out::println);

In fact, CROSS JOIN can be chained easily both in SQL and in Java:

SQL

-- Table list syntax
SELECT *
FROM (VALUES( 1 ), ( 2 )) t1(v1), 
     (VALUES('A'), ('B')) t2(v2), 
     (VALUES('X'), ('Y')) t3(v3)

-- CROSS JOIN syntax
SELECT *
FROM       (VALUES( 1 ), ( 2 )) t1(v1)
CROSS JOIN (VALUES('A'), ('B')) t2(v2)
CROSS JOIN (VALUES('X'), ('Y')) t3(v3)

yielding

+----+----+----+
| v1 | v2 | v3 |
+----+----+----+
|  1 |  A |  X |
|  1 |  A |  Y |
|  1 |  B |  X |
|  1 |  B |  Y |
|  2 |  A |  X |
|  2 |  A |  Y |
|  2 |  B |  X |
|  2 |  B |  Y |
+----+----+----+

Java

List<Integer> s1 = Arrays.asList(1, 2);
List<String> s2 = Arrays.asList("A", "B");
List<String> s3 = Arrays.asList("X", "Y");

s1.stream()
  .flatMap(v1 -> s2.stream()
                   .map(v2 -> tuple(v1, v2)))
  .flatMap(v12-> s3.stream()
                   .map(v3 -> tuple(v12.v1, v12.v2, v3)))
  .forEach(System.out::println);

yielding

(1, A, X)
(1, A, Y)
(1, B, X)
(1, B, Y)
(2, A, X)
(2, A, Y)
(2, B, X)
(2, B, Y)

Note how we explicitly unnested the tuples from the first CROSS JOIN operation to form “flat” tuples in the second operation. This is optional, of course.

Java with jOOλ’s crossJoin()

Us jOOQ developers, we’re a very SQL-oriented people, so it is only natural to have added a crossJoin() convenience method for the above use-case. So our triple-cross join can be written like this:

Seq<Integer> s1 = Seq.of(1, 2);
Seq<String> s2 = Seq.of("A", "B");
Seq<String> s3 = Seq.of("X", "Y");

s1.crossJoin(s2)
  .crossJoin(s3)
  .forEach(System.out::println);

yielding

((1, A), X)
((1, A), Y)
((1, B), X)
((1, B), Y)
((2, A), X)
((2, A), Y)
((2, B), X)
((2, B), Y)

In this case, we didn’t unnest the tuple produced in the first cross join. From a merely relational perspective, this doesn’t matter either. Nested tuples are the same thing as flat tuples. In SQL, we just don’t see the nesting. Of course, we could still unnest as well by adding a single additional mapping:

Seq<Integer> s1 = Seq.of(1, 2);
Seq<String> s2 = Seq.of("A", "B");
Seq<String> s3 = Seq.of("X", "Y");

s1.crossJoin(s2)
  .crossJoin(s3)
  .map(t -> tuple(t.v1.v1, t.v1.v2, t.v2))
  .forEach(System.out::println);

yielding, again

(1, A, X)
(1, A, Y)
(1, B, X)
(1, B, Y)
(2, A, X)
(2, A, Y)
(2, B, X)
(2, B, Y)

(You may have noticed that map() corresponds to SELECT as we’ll see again later on)

INNER JOIN = flatMap() with filter()

The SQL INNER JOIN is essentially just syntactic sugar for a SQL CROSS JOIN with a predicate that reduces the tuple set after cross-joining. In SQL, the following two ways of inner joining are equivalent:

SQL

-- Table list syntax
SELECT *
FROM (VALUES(1), (2)) t1(v1), 
     (VALUES(1), (3)) t2(v2)
WHERE t1.v1 = t2.v2

-- INNER JOIN syntax
SELECT *
FROM       (VALUES(1), (2)) t1(v1)
INNER JOIN (VALUES(1), (3)) t2(v2)
ON t1.v1 = t2.v2

yielding

+----+----+
| v1 | v2 |
+----+----+
|  1 |  1 |
+----+----+

(note that the keyword INNER is optional).

So, the values 2 from t1 and the values 3 from t2 are “thrown away”, as they produce any rows for which the join predicate yields true.

The same can be expressed easily, yet more verbosely in Java

Java (inefficient solution!)

List<Integer> s1 = Arrays.asList(1, 2);
List<Integer> s2 = Arrays.asList(1, 3);

s1.stream()
  .flatMap(v1 -> s2.stream()
                   .map(v2 -> tuple(v1, v2)))
  .filter(t -> Objects.equals(t.v1, t.v2))
  .forEach(System.out::println);

The above correctly yields

(1, 1)

But beware that you’re attaining this result after producing a cartesian product, the nightmare of every DBA! As mentioned at the beginning of this article, unlike in declarative programming, in functional programming you instruct your program to do exactly the order of operations that you specify. In other words:

In functional programming, you define the exact “execution plan” of your query.

In declarative programming, an optimiser may reorganise your “program”

There is no optimiser to transform the above into the much more efficient:

Java (more efficient)

List<Integer> s1 = Arrays.asList(1, 2);
List<Integer> s2 = Arrays.asList(1, 3);

s1.stream()
  .flatMap(v1 -> s2.stream()
                   .filter(v2 -> Objects.equals(v1, v2))
                   .map(v2 -> tuple(v1, v2)))
  .forEach(System.out::println);

The above also yields

(1, 1)

Notice, how the join predicate has moved from the “outer” stream into the “inner” stream, that is produced in the function passed to flatMap().

Java (optimal)

As mentioned previously, functional programming doesn’t necessarily allow you to rewrite algorithms depending on knowledge of the actual data. The above presented implementation for joins always implement nested loop joins going from the first stream to the second. If you join more than two streams, or if the second stream is very large, this approach can be terribly inefficient. A sophisticated RDBMS would never blindly apply nested loop joins like that, but consider constraints, indexes, and histograms on actual data.

Going deeper into that topic would be out of scope for this article, though.

Java with jOOλ’s innerJoin()

Again, inspired by our work on jOOQ we’ve also added an innerJoin() convenience method for the above use-case:

Seq<Integer> s1 = Seq.of(1, 2);
Seq<Integer> s2 = Seq.of(1, 3);

s1.innerJoin(s2, (t, u) -> Objects.equals(t, u))
  .forEach(System.out::println);

yielding

(1, 1)

… because after all, when joining two streams, the only really interesting operation is the join Predicate. All else (flatmapping, etc.) is just boilerplate.

LEFT OUTER JOIN = flatMap() with filter() and a “default”

SQL’s OUTER JOIN works like INNER JOIN, except that additional “default” rows are produced in case the JOIN predicate yields false for a pair of tuples. In terms of set theory / relational algebra, this can be expressed as such:

dd81ee1373d922122ce1b3e0da74cb28

Or in a SQL-esque dialect:

R LEFT OUTER JOIN S ::=

R INNER JOIN S
UNION (
  (R EXCEPT (SELECT R.* FROM R INNER JOIN S))
  CROSS JOIN
  (null, null, ..., null)
)

This simply means that when left outer joining S to R, there will be at least one row in the result for each row in R, with possibly an empty value for S.

Inversely, when right outer joining S to R, there will be at least one row in the result for each row in S, with possibly an empty value for R.

And finally, when full outer joining S to R, there will be at least one row in the result for each row in R with possibly an empty value for S AND for each row in S with possibly an empty value for R.

Let us look at LEFT OUTER JOIN, which is used most often in SQL.

SQL

-- Table list, Oracle syntax (don't use this!)
SELECT *
FROM (SELECT 1 v1 FROM DUAL
      UNION ALL 
      SELECT 2 v1 FROM DUAL) t1, 
     (SELECT 1 v2 FROM DUAL
      UNION ALL
      SELECT 3 v2 FROM DUAL) t2
WHERE t1.v1 = t2.v2 (+)

-- OUTER JOIN syntax
SELECT *
FROM            (VALUES(1), (2)) t1(v1)
LEFT OUTER JOIN (VALUES(1), (3)) t2(v2)
ON t1.v1 = t2.v2

yielding

+----+------+
| v1 |   v2 |
+----+------+
|  1 |    1 |
|  2 | null |
+----+------+

(note that the keyword OUTER is optional).

Java

Unfortunately, the JDK’s Stream API doesn’t provide us with an easy way to produce “at least” one value from a stream, in case the stream is empty. We could be writing a utility function as explained by Stuart Marks on Stack Overflow:

static <T> Stream<T> defaultIfEmpty(
    Stream<T> stream, Supplier<T> supplier) {
    Iterator<T> iterator = stream.iterator();

    if (iterator.hasNext()) {
        return StreamSupport.stream(
            Spliterators.spliteratorUnknownSize(
                iterator, 0
            ), false);
    } else {
        return Stream.of(supplier.get());
    }
}

Or, we just use jOOλ’s Seq.onEmpty()

List<Integer> s1 = Arrays.asList(1, 2);
List<Integer> s2 = Arrays.asList(1, 3);

seq(s1)
.flatMap(v1 -> seq(s2)
              .filter(v2 -> Objects.equals(v1, v2))
              .onEmpty(null)
              .map(v2 -> tuple(v1, v2)))
.forEach(System.out::println);

(notice, we’re putting null in a stream. This might not always be a good idea. We’ll follow up with that in a future blog post)

The above also yields

(1, 1)
(2, null)

How to read the implicit left outer join?

  • We’ll take each value v1 from the left stream s1
  • For each such value v1, we flatmap the right stream s2 to produce a tuple (v1, v2) (a cartesian product, cross join)
  • We’ll apply the join predicate for each such tuple (v1, v2)
  • If the join predicate leaves no tuples for any value v2, we’ll generate a single tuple containing the value of the left stream v1 and null

Java with jOOλ

For convenience, jOOλ also supports leftOuterJoin() which works as described above:

Seq<Integer> s1 = Seq.of(1, 2);
Seq<Integer> s2 = Seq.of(1, 3);

s1.leftOuterJoin(s2, (t, u) -> Objects.equals(t, u))
  .forEach(System.out::println);

yielding

(1, 1)
(2, null)

RIGHT OUTER JOIN = inverse LEFT OUTER JOIN

Trivially, a RIGHT OUTER JOIN is just the inverse of the previous LEFT OUTER JOIN. The jOOλ implementation of rightOuterJoin() looks like this:

default <U> Seq<Tuple2<T, U>> rightOuterJoin(
    Stream<U> other, BiPredicate<T, U> predicate) {
    return seq(other)
          .leftOuterJoin(this, (u, t) -> predicate.test(t, u))
          .map(t -> tuple(t.v2, t.v1));
}

As you can see, the RIGHT OUTER JOIN inverses the results of a LEFT OUTER JOIN, that’s it. For example:

Seq<Integer> s1 = Seq.of(1, 2);
Seq<Integer> s2 = Seq.of(1, 3);

s1.rightOuterJoin(s2, (t, u) -> Objects.equals(t, u))
  .forEach(System.out::println);

yielding

(1, 1)
(null, 3)

WHERE = filter()

The most straight-forward mapping is probably SQL’s WHERE clause having an exact equivalent in the Stream API: Stream.filter().

SQL

SELECT *
FROM (VALUES(1), (2), (3)) t(v)
WHERE v % 2 = 0

yielding

+---+
| v |
+---+
| 2 |
+---+

Java

Stream<Integer> s = Stream.of(1, 2, 3);

s.filter(v -> v % 2 == 0)
 .forEach(System.out::println);

yielding

2

The interesting thing with filter() and the Stream API in general is that the operation can apply at any place in the call chain, unlike the WHERE clause, which is limited to be placed right after the FROM clause – even if SQL’s JOIN .. ON or HAVING clauses are semantically similar.

GROUP BY = collect()

The least straight-forward mapping is GROUP BY vs. Stream.collect().

First off, SQL’s GROUP BY may be a bit tricky to fully understand. It is really part of the FROM clause, transforming the set of tuples produced by FROM .. JOIN .. WHERE into groups of tuples, where each group has an associated set of aggregatable tuples, which can be aggregated in the HAVING, SELECT, and ORDER BY clauses. Things get even more interesting when you use OLAP features like GROUPING SETS, which allow for duplicating tuples according to several grouping combinations.

In most SQL implementations that don’t support ARRAY or MULTISET, the aggregatable tuples are not available as such (i.e. as nested collections) in the SELECT. Here, the Stream API’s feature set excels. On the other hand, the Stream API can group values only as a terminal operation, where in SQL, GROUP BY is applied purely declaratively (and thus, lazily). The execution planner may choose not to execute the GROUP BY at all if it is not needed. For instance:

SELECT *
FROM some_table
WHERE EXISTS (
    SELECT x, sum(y)
    FROM other_table
    GROUP BY x
)

The above query is semantically equivalent to

SELECT *
FROM some_table
WHERE EXISTS (
    SELECT 1
    FROM other_table
)

The grouping in the subquery was unnecessary. Someone may have copy-pasted that subquery in there from somewhere else, or refactored the query as a whole. In Java, using the Stream API, each operation is always executed.

For the sake of simplicity, we’ll stick to the most simple examples here

Aggregation without GROUP BY

A special case is when we do not specify any GROUP BY clause. In that case, we can specify aggregations on all columns of the FROM clause, producing always exactly one record. For instance:

SQL

SELECT sum(v)
FROM (VALUES(1), (2), (3)) t(v)

yielding

+-----+
| sum |
+-----+
|   6 |
+-----+

Java

Stream<Integer> s = Stream.of(1, 2, 3);

int sum = s.collect(Collectors.summingInt(i -> i));
System.out.println(sum);

yielding

6

Aggregation with GROUP BY

A more common case of aggregation in SQL is to specify an explicit GROUP BY clause as explained before. For instance, we may want to group by even and odd numbers:

SQL

SELECT v % 2, count(v), sum(v)
FROM (VALUES(1), (2), (3)) t(v)
GROUP BY v % 2

yielding

+-------+-------+-----+
| v % 2 | count | sum |
+-------+-------+-----+
|     0 |     1 |   2 |
|     1 |     2 |   4 |
+-------+-------+-----+

Java

For this simple grouping / collection use-case, luckily, the JDK offers a utility method called Collectors.groupingBy(), which produces a collector that generates a Map<K, List<V>> type like this:

Stream<Integer> s = Stream.of(1, 2, 3);

Map<Integer, List<Integer>> map = s.collect(
    Collectors.groupingBy(v -> v % 2)
);

System.out.println(map);

yielding

{0=[2], 1=[1, 3]}

This certainly takes care of the grouping. Now we want to produce aggregations for each group. The slightly awkward JDK way to do this would be:

Stream<Integer> s = Stream.of(1, 2, 3);

Map<Integer, IntSummaryStatistics> map = s.collect(
    Collectors.groupingBy(
        v -> v % 2,
        Collectors.summarizingInt(i -> i)
    )
);

System.out.println(map);

we’ll now get:

{0=IntSummaryStatistics{count=1, sum=2, min=2, average=2.000000, max=2},
 1=IntSummaryStatistics{count=2, sum=4, min=1, average=2.000000, max=3}}

As you can see, the count() and sum() values have been calculated somewhere along the lines of the above.

More sophisticated GROUP BY

When doing multiple aggregations with Java 8’s Stream API, you will quickly be forced to wrestle low-level API implementing complicated collectors and accumulators yourself. This is tedious and unnecessary. Consider the following SQL statement:

SQL

CREATE TABLE t (
  w INT,
  x INT,
  y INT,
  z INT
);

SELECT
    z, w, 
    MIN(x), MAX(x), AVG(x), 
    MIN(y), MAX(y), AVG(y) 
FROM t
GROUP BY z, w;

In one go, we want to:

  • Group by several values
  • Aggregate from several values

Java

In a previous article, we’ve explained in detail how this can be achieved using convenience API from jOOλ via Seq.groupBy()

class A {
    final int w;
    final int x;
    final int y;
    final int z;
 
    A(int w, int x, int y, int z) {
        this.w = w;
        this.x = x;
        this.y = y;
        this.z = z;
    }
}

Map<
    Tuple2<Integer, Integer>, 
    Tuple2<IntSummaryStatistics, IntSummaryStatistics>
> map =
Seq.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5))
 
// Seq.groupBy() is just short for 
// Stream.collect(Collectors.groupingBy(...))
.groupBy(
    a -> tuple(a.z, a.w),
 
    // ... because once you have tuples, 
    // why not add tuple-collectors?
    Tuple.collectors(
        Collectors.summarizingInt(a -> a.x),
        Collectors.summarizingInt(a -> a.y)
    )
);

System.out.println(map);

The above yields

{(1, 1)=(IntSummaryStatistics{count=2, sum=3, min=1, average=1.500000, max=2},
         IntSummaryStatistics{count=2, sum=4, min=1, average=2.000000, max=3}),
 (4, 9)=(IntSummaryStatistics{count=2, sum=17, min=8, average=8.500000, max=9},
         IntSummaryStatistics{count=2, sum=13, min=6, average=6.500000, max=7}),
 (5, 2)=(IntSummaryStatistics{count=3, sum=12, min=3, average=4.000000, max=5},
         IntSummaryStatistics{count=3, sum=13, min=4, average=4.333333, max=5})}

For more details, read the full article here.

Notice how using Stream.collect(), or Seq.groupBy() already makes for an implicit SELECT clause, which we are no longer needed to obtain via map() (see below).

HAVING = filter(), again

As mentioned before, there aren’t really different ways of applying predicates with the Stream API, there is only Stream.filter(). In SQL, HAVING is a “special” predicate clause that is syntactically put after the GROUP BY clause. For instance:

SQL

SELECT v % 2, count(v)
FROM (VALUES(1), (2), (3)) t(v)
GROUP BY v % 2
HAVING count(v) > 1

yielding

+-------+-------+
| v % 2 | count |
+-------+-------+
|     1 |     2 |
+-------+-------+

Java

Unfortunately, as we have seen before, collect() is a terminal operation in the Stream API, which means that it eagerly produces a Map, instead of transforming the Stream<T> into a Stream<K, Stream<V>, which would compose much better in complex Stream. This means that any operation that we’d like to implement right after collecting will have to be implemented on a new stream produced from the output Map:

Stream<Integer> s = Stream.of(1, 2, 3);

s.collect(Collectors.groupingBy(
      v -> v % 2,
      Collectors.summarizingInt(i -> i)
  ))
  .entrySet()
  .stream()
  .filter(e -> e.getValue().getCount() > 1)
  .forEach(System.out::println);

yielding

1=IntSummaryStatistics{count=2, sum=4, min=1, average=2.000000, max=3}

As you can see, the type transformation that is applied is:

  • Map<Integer, IntSummaryStatistics>
  • Set<Entry<Integer, IntSummaryStatistics>>
  • Stream<Entry<Integer, IntSummaryStatistics>>

SELECT = map()

The SELECT clause in SQL is nothing more than a tuple transformation function that takes the cartesian product of tuples produced by the FROM clause and transforms it into a new tuple expression, which is fed either to the client, or to some higher-level query if this is a nested SELECT. An illustration:

FROM output

+------+------+------+------+------+
| T1.A | T1.B | T1.C | T2.A | T2.D |
+------+------+------+------+------+
|    1 |    A |    a |    1 |    X |
|    1 |    B |    b |    1 |    Y |
|    2 |    C |    c |    2 |    X |
|    2 |    D |    d |    2 |    Y |
+------+------+------+------+------+

Applying SELECT

SELECT t1.a, t1.c, t1.b || t1.d

+------+------+--------------+
| T1.A | T1.C | T1.B || T1.D |
+------+------+--------------+
|    1 |    a |           AX |
|    1 |    b |           BY |
|    2 |    c |           CX |
|    2 |    d |           DY |
+------+------+--------------+

Using Java 8 Streams, SELECT can be achieved very simply by using Stream.map(), as we’ve already seen in previous examples, where we unnested tuples using map(). The following examples are functionally equivalent:

SQL

SELECT t.v1 * 3, t.v2 + 5
FROM (
  VALUES(1, 1),
        (2, 2)
) t(v1, v2)

yielding

+----+----+
| c1 | c2 |
+----+----+
|  3 |  6 |
|  6 |  7 |
+----+----+

Java

Stream.of(
  tuple(1, 1),
  tuple(2, 2)
).map(t -> tuple(t.v1 * 3, t.v2 + 5))
 .forEach(System.out::println);

yielding

(3, 6)
(6, 7)

DISTINCT = distinct()

The DISTINCT keyword that can be supplied with the SELECT clause simply removes duplicate tuples right after they have been produced by the SELECT clause. An illustration:

FROM output

+------+------+------+------+------+
| T1.A | T1.B | T1.C | T2.A | T2.D |
+------+------+------+------+------+
|    1 |    A |    a |    1 |    X |
|    1 |    B |    b |    1 |    Y |
|    2 |    C |    c |    2 |    X |
|    2 |    D |    d |    2 |    Y |
+------+------+------+------+------+

Applying SELECT DISTINCT

SELECT DISTINCT t1.a

+------+
| T1.A |
+------+
|    1 |
|    2 |
+------+

Using Java 8 Streams, SELECT DISTINCT can be achieved very simply by using Stream.distinct() right after Stream.map(). The following examples are functionally equivalent:

SQL

SELECT DISTINCT t.v1 * 3, t.v2 + 5
FROM (
  VALUES(1, 1),
        (2, 2),
        (2, 2)
) t(v1, v2)

yielding

+----+----+
| c1 | c2 |
+----+----+
|  3 |  6 |
|  6 |  7 |
+----+----+

Java

Stream.of(
  tuple(1, 1),
  tuple(2, 2),
  tuple(2, 2)
).map(t -> tuple(t.v1 * 3, t.v2 + 5))
 .distinct()
 .forEach(System.out::println);

yielding

(3, 6)
(6, 7)

UNION ALL = concat()

Set operations are powerful both in SQL and using the Stream API. The UNION ALL operation maps to Stream.concat(), as can be seen below:

SQL

SELECT *
FROM (VALUES(1), (2)) t(v)
UNION ALL
SELECT *
FROM (VALUES(1), (3)) t(v)

yielding

+---+
| v |
+---+
| 1 |
| 2 |
| 1 |
| 3 |
+---+

Java

Stream<Integer> s1 = Stream.of(1, 2);
Stream<Integer> s2 = Stream.of(1, 3);

Stream.concat(s1, s2)
      .forEach(System.out::println);

yielding

1
2
1
3

Java (using jOOλ)

Unfortunately, concat() exists in Stream only as a static method, while Seq.concat() also exists on instances when working with jOOλ.

Seq<Integer> s1 = Seq.of(1, 2);
Seq<Integer> s2 = Seq.of(1, 3);

s1.concat(s2)
  .forEach(System.out::println);

UNION = concat() and distinct()

In SQL, UNION is defined to remove duplicates after concatenating the two sets via UNION ALL. The following two statements are equivalent:

SELECT * FROM t
UNION
SELECT * FROM u;

-- equivalent

SELECT DISTINCT *
FROM (
  SELECT * FROM t
  UNION ALL
  SELECT * FROM u
);

Let’s put this in action:

SQL

SELECT *
FROM (VALUES(1), (2)) t(v)
UNION
SELECT *
FROM (VALUES(1), (3)) t(v)

yielding

+---+
| v |
+---+
| 1 |
| 2 |
| 3 |
+---+

Java

Stream<Integer> s1 = Stream.of(1, 2);
Stream<Integer> s2 = Stream.of(1, 3);

Stream.concat(s1, s2)
      .distinct()
      .forEach(System.out::println);

ORDER BY = sorted()

The ORDER BY mapping is trivial

SQL

SELECT *
FROM (VALUES(1), (4), (3)) t(v)
ORDER BY v

yielding

+---+
| v |
+---+
| 1 |
| 3 |
| 4 |
+---+

Java

Stream<Integer> s = Stream.of(1, 4, 3);

s.sorted()
 .forEach(System.out::println);

yielding

1
3
4

LIMIT = limit()

The LIMIT mapping is even more trivial

SQL

SELECT *
FROM (VALUES(1), (4), (3)) t(v)
LIMIT 2

yielding

+---+
| v |
+---+
| 1 |
| 4 |
+---+

Java

Stream<Integer> s = Stream.of(1, 4, 3);

s.limit(2)
 .forEach(System.out::println);

yielding

1
4

OFFSET = skip()

The OFFSET mapping is trivial as well

SQL

SELECT *
FROM (VALUES(1), (4), (3)) t(v)
OFFSET 1

yielding

+---+
| v |
+---+
| 4 |
| 3 |
+---+

Java

Stream<Integer> s = Stream.of(1, 4, 3);

s.skip(1)
 .forEach(System.out::println);

yielding

4
3

Conclusion

In the above article, we’ve seen pretty much all the useful SQL SELECT query clauses and how they can be mapped to the Java 8 Stream API, or to jOOλ’s Seq API, in case Stream doesn’t offer sufficient functionality.

The article shows that SQL’s declarative world is not that much different from Java 8’s functional world. SQL clauses can compose ad-hoc queries just as well as Stream methods can be used to compose functional transformation pipelines. But there is a fundamental difference.

While SQL is truly declarative, functional programming is still very instructive. The Stream API does not make optimisation decisions based on constraints, indexes, histograms and other meta information about the data that you’re transforming. Using the Stream API is like using all possible optimisation hints in SQL to force the SQL engine to choose one particular execution plan over another. However, while SQL is a higher level algorithm abstraction, the Stream API may allow you to implement more customisable algorithms.