No More MultipleBagFetchException Thanks to Multiset Nested Collections

I’ve recently stumbled upon this interesting Stack Overflow question about Hibernate’s popular MultipleBagFetchException. The question is super popular, and the answers are plenty. The various limitations are discussed throughout the question, it all boils down to a simple fact:

Joins are the wrong tool to nest collections.

Given a schema like the Sakila database:

There are many to many relationships between:

  • ACTOR and FILM
  • FILM and CATEGORY

Nothing extraordinary. The problem is that when using an ORM, given the nature of the O (Object, Object Orientation), we want to represent this data in a hierarchy, as a graph, or at least a tree. The same holds true when we want to represent this in JSON or XML.

For example, in Java, the following DTOs are a natural representation of the above schema:

record Actor(
    String firstName, 
    String lastName
) {}

record Category(
    String name
) {}

record Film(
    String title,
    List<Actor> actors,
    List<Category> categories
) {}

In JSON, the data might look something like this:

[
  {
    "title": "ACADEMY DINOSAUR",
    "actors": [
      {
        "first_name": "PENELOPE",
        "last_name": "GUINESS"
      },
      {
        "first_name": "CHRISTIAN",
        "last_name": "GABLE"
      },
      {
        "first_name": "LUCILLE",
        "last_name": "TRACY"
      },
      {
        "first_name": "SANDRA",
        "last_name": "PECK"
      },

      ...
    ],
    "categories": [
      { "name": "Documentary" }
    ]
  },
  {
    "title": "ACE GOLDFINGER",
    "actors": [
      {
        "first_name": "BOB",
        "last_name": "FAWCETT"
      },
  ...

Emulated nesting using joins

But the problem in Hibernate and in SQL in general is the fact that joins create cartesian products. It’s not actually a problem. It’s a feature of SQL and relational algebra. We have an entire blog post about how our industry has been teaching joins wrong, using Venn Diagrams.

Joins are filtered cartesian products. Here’s an example of a cartesian product (without filter):

venn-cross-product

Now, if you want to produce the previous nested collection representation using joins only, then you’ll probably write something like this:

SELECT *
FROM film AS f
  JOIN film_actor AS fa USING (film_id)
    JOIN actor AS a USING (actor_id)
  JOIN film_category AS fc USING (film_id)
    JOIN category AS c USING (category_id)

I’ve indented the joins deliberately to illustrate the tree structure of this denormalisation. For each film, we join:

  • Many actors (e.g. M)
  • Many categories (e.g. N)

This means, that we duplicate the film M * N times due to the nature of joins being cartesian products. Not just that, but worse, we also duplicate:

  • Each actor N times (once per category)
  • Each category M times (once per actor)

Eventually, this might even lead to wrong results, e.g. when aggregating, as some combinations shouldn’t be combined.

Apart from the potential correctness problem, this is a very big performance problem. As the ubiquitous Vlad has explained in his answer, JOIN FETCH syntax is being suggested along with DISTINCT and multiple queries as a workaround. You then have to re-assemble the results manually (I stand corrected, Hibernate takes care of re-assembling, see Vlad’s comment below) and take proper care about eager and lazy loading (while you generally have to be careful with these, they don’t apply here, see again Vlad’s comment below). Quite the chore if you ask me!

This is my most favourite Google search on the subject matter:

To be fair, the chore used to be also present with jOOQ, in the past – at least you couldn’t shoot yourself in the foot with accidentally loading all the entire database.

Actual nesting

Ever since ORDBMS were introduced (e.g. Informix, Oracle, PostgreSQL), and the more popular SQL/XML and SQL/JSON extensions were added, it is possible to perform actual nesting directly in SQL. I’ve blogged about this many times now on this blog:

The correct way to nest collections is with SQL via one of the above 3 serialisation formats (native, JSON, XML).

With the above techniques, you can nest your data to any nested DTO structure in Java, or to any nested JSON format. This is possible with native SQL or with jOOQ. It might also be possible with Hibernate in the future, or with other ORMs that follow jOOQ’s lead in this area.

Given the popularity of this Stack Overflow question, it is hard to ignore how important of a problem the nesting of multiple to-many relationships is, and how both SQL (the language) and ORMs have ignored this problem for so long, offering only quirky workarounds that leave users to implement their serialisation manually, when jOOQ has shown how simple and transparent it could be.

Try jOOQ’s MULTISET operator today, no need to wait. It’s as simple as this:

List<Film> result =
dsl.select(
      FILM.TITLE,
      multiset(
        select(
          FILM_ACTOR.actor().FIRST_NAME, 
          FILM_ACTOR.actor().LAST_NAME)
        .from(FILM_ACTOR)
        .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
      ).as("actors").convertFrom(r -> r.map(mapping(Actor::new))),
      multiset(
        select(FILM_CATEGORY.category().NAME)
        .from(FILM_CATEGORY)
        .where(FILM_CATEGORY.FILM_ID.eq(FILM.FILM_ID))
      ).as("categories").convertFrom(r -> r.map(Record1::value1))
   )
   .from(FILM)
   .orderBy(FILM.TITLE)
   .fetch(mapping(Film::new));

And the above query is type safe! As soon as you modify your DTO, the query no longer compiles. Not just that! jOOQ also has a parser, so you can pretend your favourite SQL dialect already has MULTISET support today. Try this query here: https://www.jooq.org/translate/

SELECT
  f.title,
  MULTISET(
    SELECT a.first_name, a.last_name
    FROM film_actor AS fa
    JOIN actor AS a USING (actor_id)
    WHERE fa.film_id = f.film_id
  ) AS actors,
  MULTISET(
    SELECT c.name
    FROM film_category AS fc
    JOIN category AS c USING (category_id)
    WHERE fc.film_id = f.film_id
  ) AS categories
FROM film AS f
ORDER BY f.title

jOOQ’s translator will translate this to the following on PostgreSQL:

SELECT
  f.title,
  (
    SELECT coalesce(
      jsonb_agg(jsonb_build_array("v0", "v1")),
      jsonb_build_array()
    )
    FROM (
      SELECT
        a.first_name AS "v0",
        a.last_name AS "v1"
      FROM film_actor AS fa
        JOIN actor AS a
          USING (actor_id)
      WHERE fa.film_id = f.film_id
    ) AS "t"
  ) AS actors,
  (
    SELECT coalesce(
      jsonb_agg(jsonb_build_array("v0")),
      jsonb_build_array()
    )
    FROM (
      SELECT c.name AS "v0"
      FROM film_category AS fc
        JOIN category AS c
          USING (category_id)
      WHERE fc.film_id = f.film_id
    ) AS "t"
  ) AS categories
FROM film AS f
ORDER BY f.title

9 thoughts on “No More MultipleBagFetchException Thanks to Multiset Nested Collections

  1. > JOIN FETCH syntax is being suggested along with DISTINCT and multiple queries as a workaround. You then have to re-assemble the results manually

    Well, no. You don’t have to re-assemble anything. Hibernate assembles everything behind the scenes.

    > and take proper care about eager and lazy loading.

    Wait, what? The query itself defines the projection of what you need to fetch. There’s no extra lazy/eager loading to take care of. Your narrative is flawed.

    > Quite the chore if you ask me!

    Not really! It’s actually very simple. There’s no Cartesian Product, and the extra collections are fetched using secondary queries.

    1. > Well, no. You don’t have to re-assemble anything. Hibernate assembles everything behind the scenes.

      I must have misunderstood your Stack Overflow answer. So, in “The proper solution,” you’re saying that the Post elements of your result list will be patched by the second query? The Nth query somehow remembers that previous queries had fetched additional data?

      > Not really! It’s actually very simple. There’s no Cartesian Product, and the extra collections are fetched using secondary queries.

      I have to think an awful lot about what is happening exactly behind the scenes in order to get it right and avoid all pitfalls. I’m convinced that if Hibernate pulls off https://hibernate.atlassian.net/browse/HHH-9558, you will think the same. Want to bet? :-)

  2. > the Post elements of your result list will be patched by the second query?

    Some Hibernate context:
    Ultimately, a Hibernate Session instances is just a HashMap of JDBC “data values” keyed by primary key – this is also known as the “first level cache” (well, used to be, been a long time since I used HB in anger).
    When HB returns objects to the user as “Entity instances”, it iterates across each instance to be returned and populates all the fields from the first level cache.

    What’s actually happening:
    In the case of running two queries as Vlad suggested, Query-1 primes the first level cache with all the DB data values read from the DB by the first query for the entities and for Collection-1. The actual instances returned at this point do not have the Colletion-2 populated yet. You then issue Query-2, which reads all the DB values for Collection-2. Then, when HB is returned the list of entities from Query-2, all the values for both collections are contained in the first level cache, so it populates them.

    1. Yeah, that’s how I understood Vlad’s criticism as well, thanks for clarifying. Well… I think that’s even worse 😅 A query should return exactly what’s written in the query. I see the value of populating existing data with additional data if done explicitly, and I might actually work on that, myself, in the near future – some way of merging data trees, or even graphs.

      But I guess 1) I’m biased, and 2) it does answer the Stack Overflow question with what seems to be the most efficient solution in Hibernate 5.x.

  3. Great post! Seeing performance comparison on all approaches would be very interesting. And since you fetch the relations in the same query it means you may have less problems when serializable isolation should be used for multiple queries. But using this approach (json,…) without having good ORM can be fatal.

    1. Yes, the performance comparison is still due on this blog. It’s quite a research investment, given:

      – Different emulation possibilities (native using ARRAY and ROW e.g. in PostgreSQL, BigQuery, SQL/XML, SQL/JSON)
      – Different serialisation possibilities (Currently using JSON array of arrays, for example, not array of objects)
      – Different RDBMS and their caveats
      – Different query complexities, including trees of varying depths and breadths (the deeper and wider a tree, the more I’d expect this approach to outperform anything else)
      – Different data set sizes (nesting in this way tends to work with nested loop joins, so only useful for smaller nested data sets. Hash joins might be better for larger ones)

  4. JSON looks like a good solution but it would be more performant (according to payload) to use one of the binary serializations like Protobuf. I know this is challenging because there are variety of databases to support.

    1. What database product can serialise protobuf style binary data from relational operators? E.g. protobuf_record(…) or protobuf_agg() would be needed…

Leave a Reply