Using JDK Collectors to De-duplicate parent/child nested collections

In classic SQL (i.e. before jOOQ’s awesome MULTISET operator), nested collections were fetched using ordinary (outer) joins. An example of such a query would be a query running against the sakila database to fetch actors and their films. Using jOOQ:

Result<?> result =
ctx.select(
        ACTOR.ACTOR_ID,
        ACTOR.FIRST_NAME,
        ACTOR.LAST_NAME,
        FILM.FILM_ID,
        FILM.TITLE)
   .from(ACTOR)
   .leftJoin(FILM_ACTOR).on(ACTOR.ACTOR_ID.eq(FILM_ACTOR.ACTOR_ID))
   .leftJoin(FILM).on(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
   .orderBy(
       ACTOR.ACTOR_ID,
       FILM.FILM_ID)
   .fetch();

The result from the jOOQ debug log would look something like this:

+--------+----------+---------+-------+---------------------+
|actor_id|first_name|last_name|film_id|title                |
+--------+----------+---------+-------+---------------------+
|       1|PENELOPE  |GUINESS  |      1|ACADEMY DINOSAUR     |
|       1|PENELOPE  |GUINESS  |     23|ANACONDA CONFESSIONS |
|       1|PENELOPE  |GUINESS  |     25|ANGELS LIFE          |
|       1|PENELOPE  |GUINESS  |    106|BULWORTH COMMANDMENTS|
|       1|PENELOPE  |GUINESS  |    140|CHEAPER CLYDE        |
+--------+----------+---------+-------+---------------------+

As expected for a SQL join operation, this denormalises the data, leading to duplicate entries for each actor, or if you sort differently, e.g. by FILM_ID, then you’d also see the duplicate entries per film:

+--------+----------+---------+-------+----------------+
|actor_id|first_name|last_name|film_id|title           |
+--------+----------+---------+-------+----------------+
|       1|PENELOPE  |GUINESS  |      1|ACADEMY DINOSAUR|
|      10|CHRISTIAN |GABLE    |      1|ACADEMY DINOSAUR|
|      20|LUCILLE   |TRACY    |      1|ACADEMY DINOSAUR|
|      30|SANDRA    |PECK     |      1|ACADEMY DINOSAUR|
|      40|JOHNNY    |CAGE     |      1|ACADEMY DINOSAUR|
+--------+----------+---------+-------+----------------+

This is just how a join operation works. It creates a cartesian product and then filters by primary key / foreign key matches, which JPA users know under the name MultipleBagFetchException.

De-duplicating and nesting the collections with fetchGroups()

What we usually want is some sort of nested data structure, e.g. fetch the films per actor. A simple utility in jOOQ is to just use fetchGroups():

Map<ActorRecord, Result<FilmRecord>> result =
ctx.select(
        ACTOR.ACTOR_ID,
        ACTOR.FIRST_NAME,
        ACTOR.LAST_NAME,
        FILM.FILM_ID,
        FILM.TITLE)
   .from(ACTOR)
   .leftJoin(FILM_ACTOR).on(ACTOR.ACTOR_ID.eq(FILM_ACTOR.ACTOR_ID))
   .leftJoin(FILM).on(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
   .orderBy(
       ACTOR.ACTOR_ID,
       FILM.FILM_ID)
   .fetchGroups(ACTOR, FILM);

This is super convenient, it’s order preserving, but it has a flaw. It is not “smart” enough to remember the LEFT JOIN semantics and do The Right Thing, which is produce an empty list of FilmRecord in case an actor doesn’t have any films. If that’s the case, then there’s a NULL film in the result set in SQL:

+--------+----------+---------+-------+---------------------+
|actor_id|first_name|last_name|film_id|title                |
+--------+----------+---------+-------+---------------------+
|       1|PENELOPE  |GUINESS  |      1|ACADEMY DINOSAUR     |
|       1|PENELOPE  |GUINESS  |     23|ANACONDA CONFESSIONS |
|       1|PENELOPE  |GUINESS  |     25|ANGELS LIFE          |
|       1|PENELOPE  |GUINESS  |    106|BULWORTH COMMANDMENTS|
|       1|PENELOPE  |GUINESS  |    140|CHEAPER CLYDE        |
|     ...|...       |...      |    ...|...                  |
|     201|UNKNOWN   |ACTOR    | {null}|{null}               |
+--------+----------+---------+-------+---------------------+

So, we won’t get an empty list, but a list containing an empty FilmRecord, just like when you GROUP BY actor and COUNT(*) the number of films:

var r =
ctx.select(
        ACTOR.ACTOR_ID,
        ACTOR.FIRST_NAME,
        ACTOR.LAST_NAME,
        count(),
        count(FILM.FILM_ID))
   .from(ACTOR)
   .leftJoin(FILM_ACTOR).on(ACTOR.ACTOR_ID.eq(FILM_ACTOR.ACTOR_ID))
   .leftJoin(FILM).on(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
   .groupBy(
        ACTOR.ACTOR_ID,
        ACTOR.FIRST_NAME,
        ACTOR.LAST_NAME)
   .orderBy(
       ACTOR.ACTOR_ID)
   .fetch();

The result of this query might look like this:

+--------+----------+------------+-----+-----+
|actor_id|first_name|last_name   |count|count|
+--------+----------+------------+-----+-----+
|       1|PENELOPE  |GUINESS     |   19|   19|
|       2|NICK      |WAHLBERG    |   25|   25|
|       3|ED        |CHASE       |   22|   22|
|       4|JENNIFER  |DAVIS       |   22|   22|
|       5|JOHNNY    |LOLLOBRIGIDA|   29|   29|
|     ...|...       |...         |  ...|  ...|
|     201|UNKNOWN   |ACTOR       |    1|    0|
+--------+----------+------------+-----+-----+

Observe how the desired count value of 0 can only be achieved when we pass the nullable FILM.FILM_ID column as an argument. So, what would be the equivalent fetchGroups() call that exposes this behaviour?

Deduplicating and nesting the collections with JDK Collectors

A very much underestimated JDK feature are Collectors. While they were introduced specifically for usage with the Stream API, they could be used with any type of Iterable, in principle, and I’m still hoping that a future JDK will offer Iterable.collect(), among other things.

With jOOQ, you can collect the results of any query by calling ResultQuery.collect(). To translate the above fetchGroups() example, we can write this, producing almost the same result:

Map<ActorRecord, List<FilmRecord>> result =
ctx.select(
        ACTOR.ACTOR_ID,
        ACTOR.FIRST_NAME,
        ACTOR.LAST_NAME,
        FILM.FILM_ID,
        FILM.TITLE)
   .from(ACTOR)
   .leftJoin(FILM_ACTOR).on(ACTOR.ACTOR_ID.eq(FILM_ACTOR.ACTOR_ID))
   .leftJoin(FILM).on(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
   .orderBy(
       ACTOR.ACTOR_ID,
       FILM.FILM_ID)
   .collect(groupingBy(
       r -> r.into(ACTOR), filtering(
           r -> r.get(FILM.FILM_ID) != null, mapping(
               r -> r.into(FILM), toList()
           )
       )
   ));

The above collect() call nests these operations:

  • It groups by actor (just like fetchGroups())
  • It filters group contents by those films whose ID isn’t NULL (this can’t be achieved with fetchGroups()).
  • It maps the group contents to contain only FILM content, not the entire projection.

So, obviously more verbose, but also much more poweful and easy to introduce custom client side aggregation behaviour, in case you can’t move the aggregation logic to your SQL statement.

More on powerful collectors in this article here:

Nesting collections directly in SQL

No article on this blog would be complete without plugging the awesome MULTISET alternative to nest collections. After all, the above deduplication algorithm only really works if you’re joining down a single parent-child relationship path, and it’s quite inefficient when there are a lot of duplicate data sets.

Assuming these auxiliary data types:

record Film(String title) {}
record Actor(String firstName, String lastName) {}
record Category(String name) {}

You can write a query like this:

// We're importing the new Records::mapping method for convenience
import static org.jooq.Records.mapping;

Result<Record3<Film, List<Actor>, List<Category>>> result = ctx
    .select(
        FILM.TITLE.convertFrom(Film::new),
        multiset(
            select(
                FILM_ACTOR.actor().FIRST_NAME,
                FILM_ACTOR.actor().LAST_NAME
            )
            .from(FILM_ACTOR)
            .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
        ).convertFrom(r -> r.map(mapping(Actor::new))),
        multiset(
            select(FILM_CATEGORY.category().NAME)
            .from(FILM_CATEGORY)
            .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))
        ).convertFrom(r -> r.map(mapping(Category::new)))
    )
    .from(FILM)
    .orderBy(FILM.TITLE)
    .fetch();

Notice that we’re now getting a result set containing

  • All the films
  • The actors per film as a nested collection
  • The categories per film as another nested collection

The example is using implicit joins to avoid some of the more verbose join syntaxes inside of the MULTISET expressions, but that’s not strictly relevant for this example.

The result being something like:

+----------------------------+--------------------------------------------------+----------------------------+
|title                       |multiset                                          |multiset                    |
+----------------------------+--------------------------------------------------+----------------------------+
|Film[title=ACADEMY DINOSAUR]|[Actor[firstName=PENELOPE, lastName=GUINESS], A...|[Category[name=Documentary]]|
|Film[title=ACE GOLDFINGER]  |[Actor[firstName=BOB, lastName=FAWCETT], Actor[...|[Category[name=Horror]]     |
|Film[title=ADAPTATION HOLES]|[Actor[firstName=NICK, lastName=WAHLBERG], Acto...|[Category[name=Documentary]]|
|Film[title=AFFAIR PREJUDICE]|[Actor[firstName=JODIE, lastName=DEGENERES], Ac...|[Category[name=Horror]]     |
|Film[title=AFRICAN EGG]     |[Actor[firstName=GARY, lastName=PHOENIX], Actor...|[Category[name=Family]]     |
+----------------------------+--------------------------------------------------+----------------------------+

Conclusion

There are many ways that lead to Rome. Classic SQL based approaches to nest collections used some sort of deduplication technique in the client. In jOOQ, you could always do this with fetchGroups(), and since recently also with collect() directly (collect() was always available via an intermediary stream() call).

For truly powerful nesting of collections, however, our recommendation is always to move your logic into SQL directly, using native array functionality, or using MULTISET