# Use MULTISET Predicates to Compare Data Sets

Questions that might be a bit more difficult to solve using ordinary SQL are questions of the kind:

What films have the same actors as a given film X?

As always, we’re using the sakila database for this example. What would be a possible way to solve this with SQL (for example, PostgreSQL, to be specific)? The following query gives an overview of actors per film:

```SELECT
film_id,
array_agg(actor_id ORDER BY actor_id) actors
FROM film_actor
GROUP BY film_id
```

It produces something like this:

```|film_id|actors                            |
|-------|----------------------------------|
|1      |{1,10,20,30,40,53,108,162,188,198}|
|2      |{19,85,90,160}                    |
|3      |{2,19,24,64,123}                  |
|4      |{41,81,88,147,162}                |
|5      |{51,59,103,181,200}               |
|6      |{21,23,62,108,137,169,197}        |
|...    |...                               |```

Note that in SQL, arrays behave like lists, i.e. they maintain their ordering, so ordering the array explicitly is important to be able to compare the actors with each other. Now, we want to find all films that share the same actor set, from the above:

```WITH t AS (
-- Previous query
SELECT
film_id,
array_agg(actor_id ORDER BY actor_id) actors
FROM film_actor
GROUP BY film_id
)
SELECT
array_agg(film_id ORDER BY film_id) AS films,
actors
FROM t
GROUP BY actors
ORDER BY count(*) DESC, films
```

The result is now:

```|films   |actors                            |
|--------|----------------------------------|
|{97,556}|{65}                              |
|{1}     |{1,10,20,30,40,53,108,162,188,198}|
|{2}     |{19,85,90,160}                    |
|{3}     |{2,19,24,64,123}                  |
|{4}     |{41,81,88,147,162}                |
|{5}     |{51,59,103,181,200}               |
|{6}     |{21,23,62,108,137,169,197}        |
|...     |...                               |```

So, as we can see, there are only 2 films which share the same set of actors, and those films are `FILM_ID IN (97, 556)`. (The Sakila database is a bit boring as the data sets are generated).

### Using MULTISET comparisons

While the above is already quite cool, in this article, I’d like to showcase a lesser known feature of the jOOQ 3.15 `MULTISET` support, namely the fact that they can be compared with one another.

And as is the nature of SQL standard `MULTISET`, ordering is irrelevant, so we don’t have to add any explicit `ORDER BY` clause for such a comparison. In fact, it’s not 100% irrelevant. You can order a `MULTISET` for projection purposes, so the ordering will be maintained by jOOQ. But when you use them in predicates, jOOQ will override your `ORDER BY` clause.

Using jOOQ, we can write:

```ctx.select(FILM.FILM_ID, FILM.TITLE)
.from(FILM)
.where(
multiset(
select(FILM_ACTOR.ACTOR_ID)
.from(FILM_ACTOR)
.where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
).eq(multiset(
select(FILM_ACTOR.ACTOR_ID)
.from(FILM_ACTOR)
.where(FILM_ACTOR.FILM_ID.eq(97L))
))
)
.orderBy(FILM_ID)
.fetch();
```

It’s a bit less efficient than a query of the previous form as it accesses the `FILM_ACTOR` table from two subqueries, though only one of them is correlated. Using the default `JSONB` emulation, the following query is generated:

```SELECT film.film_id, film.title
FROM film
WHERE (
SELECT coalesce(
jsonb_agg(jsonb_build_array(v0) ORDER BY t.v0),
jsonb_build_array()
)
FROM (
SELECT film_actor.actor_id AS v0
FROM film_actor
WHERE film_actor.film_id = film.film_id
) AS t
) = (
SELECT coalesce(
jsonb_agg(jsonb_build_array(v0) ORDER BY t.v0),
jsonb_build_array()
)
FROM (
SELECT film_actor.actor_id AS v0
FROM film_actor
WHERE film_actor.film_id = 97
) AS t
)
ORDER BY film.film_id
```

I promised that no `ORDER BY` clause was needed for `MULTISET`, and this is still true for the jOOQ code. However, behind the scenes, jOOQ has to order the JSON arrays by their contents to make sure that two `MULTISET` values are the same, irrespective of their order.

The result is the same two IDs as the previous result showed:

```+-------+--------------+
|film_id|title         |
+-------+--------------+
|     97|BRIDE INTRIGUE|
|    556|MALTESE HOPE  |
+-------+--------------+
```

### Comparing MULTISET_AGG, instead

If you prefer using joins and `GROUP BY` to generate the film’s actor `MULTISET`, you can do that as well, with jOOQ. This time, we’re using:

• Implicit joins to simplify access to the `FILM.TITLE` from `FILM_ACTOR`
• A `MULTISET` predicate in the `HAVING` clause, using `MULTISET_AGG`

Here’s the jOOQ version:

```ctx.select(FILM_ACTOR.FILM_ID, FILM_ACTOR.film().TITLE)
.from(FILM_ACTOR)
.groupBy(FILM_ACTOR.FILM_ID, FILM_ACTOR.film().TITLE)
.having(multisetAgg(FILM_ACTOR.ACTOR_ID).eq(multiset(
select(FILM_ACTOR.ACTOR_ID)
.from(FILM_ACTOR)
.where(FILM_ACTOR.FILM_ID.eq(97L))
)))
.orderBy(FILM_ACTOR.FILM_ID)
.fetch();
```

The backing, generated SQL looks like this:

```SELECT film_actor.film_id, alias_75379701.title
FROM film_actor
JOIN film AS alias_75379701
ON film_actor.film_id = alias_75379701.film_id
GROUP BY film_actor.film_id, alias_75379701.title
HAVING jsonb_agg(
jsonb_build_array(film_actor.actor_id) ORDER BY film_actor.actor_id
) = (
SELECT coalesce(
jsonb_agg(jsonb_build_array(v0) ORDER BY t.v0),
jsonb_build_array()
)
FROM (
SELECT film_actor.actor_id AS v0
FROM film_actor
WHERE film_actor.film_id = 97
) AS t
)
ORDER BY film_actor.film_id
```

Notice how the implicit join is expanded automatically, whereas the `HAVING` predicate again uses the usual `JSONB` emulation for `MULTISET` and `MULTISET_AGG`.

### Alternatives

In the above examples, we’ve compared `MULTISET` expressions that project single columns, in other words, `Result<Record1<Long>>` nested collection types. Nothing keeps you from adding more columns to the equation. jOOQ will always ensure that your query type checks and that the generated SQL is correct.

An alternative to using `MULTISET` would be using `ARRAY_AGG` and `ARRAY` (now you have to `ORDER BY` explicitly, again). With jOOQ:

```ctx.select(FILM_ACTOR.FILM_ID, FILM_ACTOR.film().TITLE)
.from(FILM_ACTOR)
.groupBy(FILM_ACTOR.FILM_ID, FILM_ACTOR.film().TITLE)
.having(arrayAgg(FILM_ACTOR.ACTOR_ID)
.orderBy(FILM_ACTOR.ACTOR_ID).eq(array(
select(FILM_ACTOR.ACTOR_ID)
.from(FILM_ACTOR)
.where(FILM_ACTOR.FILM_ID.eq(97L))
.orderBy(FILM_ACTOR.ACTOR_ID)
)))
.orderBy(FILM_ACTOR.FILM_ID)
.fetch();
```

With SQL:

```SELECT film_actor.film_id, film.title
FROM film_actor
JOIN film
ON film_actor.film_id = film.film_id
GROUP BY film_actor.film_id, film.title
HAVING array_agg(film_actor.actor_id ORDER BY film_actor.actor_id) =
ARRAY (
SELECT film_actor.actor_id
FROM film_actor
WHERE film_actor.film_id = 97
ORDER BY film_actor.actor_id
)
ORDER BY film_actor.film_id
```

## One thought on “Use MULTISET Predicates to Compare Data Sets”

1. Joe says:

Oh for a DIVIDE operator.