Use MULTISET Predicates to Compare Data Sets

Questions that might be a bit more difficult to solve using ordinary SQL are questions of the kind:

What films have the same actors as a given film X?

As always, we’re using the sakila database for this example. What would be a possible way to solve this with SQL (for example, PostgreSQL, to be specific)? The following query gives an overview of actors per film:

SELECT
  film_id,
  array_agg(actor_id ORDER BY actor_id) actors
FROM film_actor
GROUP BY film_id

It produces something like this:

|film_id|actors                            |
|-------|----------------------------------|
|1      |{1,10,20,30,40,53,108,162,188,198}|
|2      |{19,85,90,160}                    |
|3      |{2,19,24,64,123}                  |
|4      |{41,81,88,147,162}                |
|5      |{51,59,103,181,200}               |
|6      |{21,23,62,108,137,169,197}        |
|...    |...                               |

Note that in SQL, arrays behave like lists, i.e. they maintain their ordering, so ordering the array explicitly is important to be able to compare the actors with each other. Now, we want to find all films that share the same actor set, from the above:

WITH t AS (
  -- Previous query
  SELECT
    film_id,
    array_agg(actor_id ORDER BY actor_id) actors
  FROM film_actor
  GROUP BY film_id
)
SELECT 
  array_agg(film_id ORDER BY film_id) AS films,
  actors
FROM t
GROUP BY actors
ORDER BY count(*) DESC, films

The result is now:

|films   |actors                            |
|--------|----------------------------------|
|{97,556}|{65}                              |
|{1}     |{1,10,20,30,40,53,108,162,188,198}|
|{2}     |{19,85,90,160}                    |
|{3}     |{2,19,24,64,123}                  |
|{4}     |{41,81,88,147,162}                |
|{5}     |{51,59,103,181,200}               |
|{6}     |{21,23,62,108,137,169,197}        |
|...     |...                               |

So, as we can see, there are only 2 films which share the same set of actors, and those films are FILM_ID IN (97, 556). (The Sakila database is a bit boring as the data sets are generated).

Using MULTISET comparisons

While the above is already quite cool, in this article, I’d like to showcase a lesser known feature of the jOOQ 3.15 MULTISET support, namely the fact that they can be compared with one another.

And as is the nature of SQL standard MULTISET, ordering is irrelevant, so we don’t have to add any explicit ORDER BY clause for such a comparison. In fact, it’s not 100% irrelevant. You can order a MULTISET for projection purposes, so the ordering will be maintained by jOOQ. But when you use them in predicates, jOOQ will override your ORDER BY clause.

Using jOOQ, we can write:

ctx.select(FILM.FILM_ID, FILM.TITLE)
   .from(FILM)
   .where(
       multiset(
           select(FILM_ACTOR.ACTOR_ID)
           .from(FILM_ACTOR)
           .where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
       ).eq(multiset(
           select(FILM_ACTOR.ACTOR_ID)
           .from(FILM_ACTOR)
           .where(FILM_ACTOR.FILM_ID.eq(97L))
       ))
   )
   .orderBy(FILM_ID)
   .fetch();

It’s a bit less efficient than a query of the previous form as it accesses the FILM_ACTOR table from two subqueries, though only one of them is correlated. Using the default JSONB emulation, the following query is generated:

SELECT film.film_id, film.title
FROM film
WHERE (
  SELECT coalesce(
    jsonb_agg(jsonb_build_array(v0) ORDER BY t.v0),
    jsonb_build_array()
  )
  FROM (
    SELECT film_actor.actor_id AS v0
    FROM film_actor
    WHERE film_actor.film_id = film.film_id
  ) AS t
) = (
  SELECT coalesce(
    jsonb_agg(jsonb_build_array(v0) ORDER BY t.v0),
    jsonb_build_array()
  )
  FROM (
    SELECT film_actor.actor_id AS v0
    FROM film_actor
    WHERE film_actor.film_id = 97
  ) AS t
)
ORDER BY film.film_id

I promised that no ORDER BY clause was needed for MULTISET, and this is still true for the jOOQ code. However, behind the scenes, jOOQ has to order the JSON arrays by their contents to make sure that two MULTISET values are the same, irrespective of their order.

The result is the same two IDs as the previous result showed:

+-------+--------------+
|film_id|title         |
+-------+--------------+
|     97|BRIDE INTRIGUE|
|    556|MALTESE HOPE  |
+-------+--------------+

Comparing MULTISET_AGG, instead

If you prefer using joins and GROUP BY to generate the film’s actor MULTISET, you can do that as well, with jOOQ. This time, we’re using:

  • Implicit joins to simplify access to the FILM.TITLE from FILM_ACTOR
  • A MULTISET predicate in the HAVING clause, using MULTISET_AGG

Here’s the jOOQ version:

ctx.select(FILM_ACTOR.FILM_ID, FILM_ACTOR.film().TITLE)
   .from(FILM_ACTOR)
   .groupBy(FILM_ACTOR.FILM_ID, FILM_ACTOR.film().TITLE)
   .having(multisetAgg(FILM_ACTOR.ACTOR_ID).eq(multiset(
        select(FILM_ACTOR.ACTOR_ID)
        .from(FILM_ACTOR)
        .where(FILM_ACTOR.FILM_ID.eq(97L))
    )))
   .orderBy(FILM_ACTOR.FILM_ID)
   .fetch();

The backing, generated SQL looks like this:

SELECT film_actor.film_id, alias_75379701.title
FROM film_actor
  JOIN film AS alias_75379701
    ON film_actor.film_id = alias_75379701.film_id
GROUP BY film_actor.film_id, alias_75379701.title
HAVING jsonb_agg(
  jsonb_build_array(film_actor.actor_id) ORDER BY film_actor.actor_id
) = (
  SELECT coalesce(
    jsonb_agg(jsonb_build_array(v0) ORDER BY t.v0),
    jsonb_build_array()
  )
  FROM (
    SELECT film_actor.actor_id AS v0
    FROM film_actor
    WHERE film_actor.film_id = 97
  ) AS t
)
ORDER BY film_actor.film_id

Notice how the implicit join is expanded automatically, whereas the HAVING predicate again uses the usual JSONB emulation for MULTISET and MULTISET_AGG.

Alternatives

In the above examples, we’ve compared MULTISET expressions that project single columns, in other words, Result<Record1<Long>> nested collection types. Nothing keeps you from adding more columns to the equation. jOOQ will always ensure that your query type checks and that the generated SQL is correct.

An alternative to using MULTISET would be using ARRAY_AGG and ARRAY (now you have to ORDER BY explicitly, again). With jOOQ:

ctx.select(FILM_ACTOR.FILM_ID, FILM_ACTOR.film().TITLE)
   .from(FILM_ACTOR)
   .groupBy(FILM_ACTOR.FILM_ID, FILM_ACTOR.film().TITLE)
   .having(arrayAgg(FILM_ACTOR.ACTOR_ID)
        .orderBy(FILM_ACTOR.ACTOR_ID).eq(array(
            select(FILM_ACTOR.ACTOR_ID)
            .from(FILM_ACTOR)
            .where(FILM_ACTOR.FILM_ID.eq(97L))
            .orderBy(FILM_ACTOR.ACTOR_ID)
    )))
    .orderBy(FILM_ACTOR.FILM_ID)
    .fetch();

With SQL:

SELECT film_actor.film_id, film.title
FROM film_actor
  JOIN film
    ON film_actor.film_id = film.film_id
GROUP BY film_actor.film_id, film.title
HAVING array_agg(film_actor.actor_id ORDER BY film_actor.actor_id) = 
  ARRAY (
    SELECT film_actor.actor_id
    FROM film_actor
    WHERE film_actor.film_id = 97
    ORDER BY film_actor.actor_id
  )
ORDER BY film_actor.film_id

One thought on “Use MULTISET Predicates to Compare Data Sets

Leave a Reply