Use NATURAL FULL JOIN to compare two tables in SQL

There are a few ways to compare two similar tables in SQL. Assuming PostgreSQL syntax, we might have this schema:

CREATE TABLE t1 (a INT, b INT, c INT);
CREATE TABLE t2 (a INT, b INT, c INT);
INSERT INTO t1 VALUES (1, 2, 3), (4, 5, 6), (7, 8, 9);
INSERT INTO t2 VALUES            (4, 5, 6), (7, 8, 9), (10, 11, 12);

It is now possible to use UNION and EXCEPT as suggested by Chris Saxon:

In PostgreSQL, we can write:

(TABLE t1 EXCEPT TABLE t2) 
UNION
(TABLE t2 EXCEPT TABLE t1)
ORDER BY a, b, c

Notice how TABLE x is just standard SQL, and PostgreSQL, syntax sugar for SELECT * FROM x.

And we’ll get:

a |b |c |
--|--|--|
 1| 2| 3|
10|11|12|

Unfortunately, this requires two accesses to each table. Can we do it with a single access?

Using NATURAL FULL JOIN

Yes! Using NATURAL FULL JOIN, another rare use-case for this esoteric operator.

Assuming there are no NULL values, we can write this:

SELECT *
FROM (
  SELECT 't1' AS t1, t1.* FROM t1
) t1 NATURAL FULL JOIN (
  SELECT 't2' AS t2, t2.* FROM t2
) t2 
WHERE NOT (t1, t2) IS NOT NULL;

This produces:

a |b |c |t1|t2|
--|--|--|--|--|
 1| 2| 3|t1|  |
10|11|12|  |t2|

Why? Because a NATURAL JOIN is syntax sugar for joining using all the shared column names of the two tables, and the FULL JOIN makes sure we can retrieve also the columns that are not matched by the join predicate. Another way to write this is:

-- Use JOIN .. USING, instead of NATURAL JOIN
SELECT *
FROM (
  SELECT 't1' AS t1, t1.* FROM t1
) t1 FULL JOIN (
  SELECT 't2' AS t2, t2.* FROM t2
) t2 USING (a, b, c)
WHERE NOT (t1, t2) IS NOT NULL;

Or:

-- Use JOIN .. ON, instead of JOIN .. USING
SELECT 
  coalesce(t1.a, t2.a) AS a,
  coalesce(t1.b, t2.b) AS b,
  coalesce(t1.c, t2.c) AS c,
  t1.t1,
  t2.t2
FROM (
  SELECT 't1' AS t1, t1.* FROM t1
) t1 FULL JOIN (
  SELECT 't2' AS t2, t2.* FROM t2
) t2 ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) 
WHERE NOT (t1, t2) IS NOT NULL;

Unfortunately, as of PostgreSQL 12, this produces an error:

ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join conditions

Pros and cons

Pros and cons compared to the set operator solution using UNION and EXCEPT:

Pros

  • Each table is accessed only once
  • Comparison is now name based, not column index based, i.e. it can still work if only parts of the columns are the shared

Cons

  • If index based column comparison was desired (because the tables are the same structurally, but do not share the exact same column names), then we’d have to rename each individual column to a common column name.
  • If there’s duplicate data, there’s going to be a cartesian product, which might make this solution quite slower
  • UNION and EXCEPT treat NULL values as “not distinct”. This isn’t the case with NATURAL JOIN. See workaround below

When there are NULL values in the data

In the presence of NULL values, we can no longer use NATURAL JOIN or JOIN .. USING. We could use the DISTINCT predicate:

SELECT 
  coalesce(t1.a, t2.a) AS a,
  coalesce(t1.b, t2.b) AS b,
  coalesce(t1.c, t2.c) AS c,
  t1.t1,
  t2.t2
FROM (
  SELECT 't1' AS t1, t1.* FROM t1
) t1 FULL JOIN (
  SELECT 't2' AS t2, t2.* FROM t2
) t2 ON (t1.a, t1.b, t1.c) IS NOT DISTINCT FROM (t2.a, t2.b, t2.c) 
WHERE NOT (t1, t2) IS NOT NULL;

Row value expression NULL predicate

Observe the usage of the esoteric NULL predicate for row value expressions, which uses the following truth table:

+-----------------------+-----------+---------------+---------------+-------------------+
| Expression            | R IS NULL | R IS NOT NULL | NOT R IS NULL | NOT R IS NOT NULL |
+-----------------------+-----------+---------------+---------------+-------------------+
| degree 1: null        | true      | false         | false         |  true             |
| degree 1: not null    | false     | true          | true          |  false            |
| degree > 1: all null  | true      | false         | false         |  true             |
| degree > 1: some null | false     | false         | true          |  true             |
| degree > 1: none null | false     | true          | true          |  false            |
+-----------------------+-----------+---------------+---------------+-------------------+

Yes. R IS NULL and NOT R IS NOT NULL are not the same thing in SQL…

It’s just another way of writing:

SELECT *
FROM (
  SELECT 't1' AS t1, t1.* FROM t1
) t1 NATURAL FULL JOIN (
  SELECT 't2' AS t2, t2.* FROM t2
) t2 
WHERE t1 IS NULL
OR t2 IS NULL;

How to Write Multiset Conditions With Oracle VARRAY Types

Oracle is one of the few databases that implements the SQL standard ORDBMS extensions, which essentially allow for nested collections. Other databases that have these features to some extent are CUBRID, Informix, PostgreSQL.

Oracle has two types of nested collections:

-- Nested tables
CREATE TYPE t1 AS TABLE OF VARCHAR2(10);
/

-- Varrays
CREATE TYPE t2 AS VARRAY(10) OF VARCHAR2(10);
/

The main difference at first is that a nested table can be of arbitrary size, whereas a varray has a fixed maximum size. Other than that, they behave in similar ways.

When storing a nested collection in a table, there is another difference. Varrays can be inlined into the table just like any other data type, whereas nested tables have to be accompanied by an additional storage clause:

CREATE TABLE t (
  id NUMBER(10),
  t1 t1,
  t2 t2
)
NESTED TABLE t1 STORE AS t1_nt;

This is a minor hassle in terms of DDL. The runtime implications are more significant.

Multiset Conditions

The most important difference is the fact that all the useful multiset conditions are not available with varrays. For instance, consider running these statements:

INSERT INTO t VALUES (1, NULL, NULL);
INSERT INTO t VALUES (2, t1(), t2());
INSERT INTO t VALUES (
  3, 
  t1('abc', 'xyz', 'zzz'), 
  t2('abc', 'xyz', 'zzz')
);
INSERT INTO t VALUES (
  4, 
  t1('dup', 'dup', 'dup'), 
  t2('dup', 'dup', 'dup')
);

SELECT * FROM t WHERE 'abc' MEMBER OF t1;
SELECT * FROM t WHERE 'abc' MEMBER OF t2;

The result of these queries is:

ID  T1                        T2
-----------------------------------------------------
3   T1('abc', 'xyz', 'zzz')   T2('abc', 'xyz', 'zzz')

ORA-00932: inconsistent datatypes: expected UDT got TEST.T2

Bummer. The documentation is a bit unclear about this. It reads (emphasis mine):

he return value is TRUE if expr is equal to a member of the specified nested table or varray. The return value is NULL if expr is null or if the nested table is empty.

There is some explicit mention of varrays supporting these operations, but in most of the documentation, varrays are not mentioned. So, how can we write such operations with varrays? Here’s an list of translations of the nested table operator to the equivalent SQL expression for use with varrays.

These are the multiset conditions:

IS A SET condition

In SQL, everything is a (partially ordered) multiset by default. Sometimes, however, we want to work with sets, i.e. a special type of multiset that has no duplicate values. We can easily check whether nested tables are sets (or whether they aren’t):

-- Nested table version
SELECT * FROM t WHERE t1 IS A SET;

-- Varray version
SELECT * 
FROM t 
WHERE t2 IS NOT NULL
AND (SELECT count(*) FROM TABLE(t2)) 
  = (SELECT count(DISTINCT column_value) FROM TABLE(t2));

The IS A SET operation yields UNKNOWN if the nested table is NULL, so we have to take that into account as well. If it isn’t NULL, we can count the total values in the varray and compare that with the total distinct values in the varray.

The result is:

ID  T1                        T2
-----------------------------------------------------
2   T1()                      T2()
3   T1('abc', 'xyz', 'zzz')   T2('abc', 'xyz', 'zzz')

IS EMPTY condition

This predicate needs no explanation. It can be written as such:

-- Nested table version
SELECT * FROM t WHERE t1 IS EMPTY;

-- Varray version
SELECT * 
FROM t 
WHERE t2 IS NOT NULL
AND NOT EXISTS (
  SELECT * FROM TABLE (t2)
);

The result being:

ID  T1                 T2
---------------------------------------
2   T1()               T2()

MEMBER condition

This handy predicate can help check if a specific value is contained in a nested collection. It can be written as such:

-- Nested table version
SELECT * FROM t WHERE 'abc' MEMBER OF t1;

-- Varray version
SELECT *
FROM t
WHERE t2 IS NOT NULL
AND EXISTS (
  SELECT 1 FROM TABLE(t2) WHERE column_value = 'abc'
);

Yielding:

ID  T1                        T2
-----------------------------------------------------
3   T1('abc', 'xyz', 'zzz')   T2('abc', 'xyz', 'zzz')

SUBMULTISET condition

Just like the previous MEMBER condition, this predicate can help check if specific values (more than one) are contained in a nested collection. This is a bit more tricky than the previous emulations. The MEMBER condition works the same way for sets and multisets, as we’re checking if exactly one element is contained in the (multi)set.

When working with multisets, duplicates are allowed, and in the case of the SUBMULTISET operation, the following can be observed:

-- Equal multisets
t1() SUBMULTISET OF t1();
t1('a', 'a') SUBMULTISET OF t1('a', 'a');

-- Subsets
t1('a') SUBMULTISET OF t1('a', 'a');

-- But this is not true
t1('a', 'a') SUBMULTISET OF t1('a');

When we omit the fact that nested collections can be multisets and pretend we’re working with sets only, then the emulation of the SUBMULTISET operator is relatively easy:

-- Nested table version
SELECT * FROM t WHERE t1('abc', 'xyz') SUBMULTISET OF t1;

-- Varray version
SELECT *
FROM t
WHERE t2 IS NOT NULL
AND EXISTS (
  SELECT 1 FROM TABLE(t2) 
  WHERE column_value = 'abc'
  INTERSECT
  SELECT 1 FROM TABLE(t2) 
  WHERE column_value = 'xyz'
);

Yielding, once more:

ID  T1                        T2
-----------------------------------------------------
3   T1('abc', 'xyz', 'zzz')   T2('abc', 'xyz', 'zzz')

If we’re really working with multisets, things are a bit more tricky:

-- Nested table version
SELECT * FROM t WHERE t1('dup', 'dup') SUBMULTISET OF t1;

-- Varray version
SELECT *
FROM t
WHERE t2 IS NOT NULL
AND NOT EXISTS (
  SELECT column_value, count(*)
  FROM TABLE (t2('dup', 'dup')) x
  GROUP BY column_value
  HAVING count(*) > (
    SELECT count(*)
    FROM TABLE (t2) y
    WHERE y.column_value = x.column_value
  )
);

Yielding:

ID  T1                        T2
-----------------------------------------------------
4   T1('dup', 'dup', 'dup')   T2('dup', 'dup', 'dup')

How does it work? In the NOT EXISTS correlated subquery, we’re counting the number of duplicate values in the potential SUBMULTISET, effectively turning that SUBMULTISET into a SET using the GROUP BY operation.

We’re then comparing that count value from the left operand with the corresponding count value from the right operand. If there is no value in the left operand whose number of occurrences is bigger than the number of occurrences of that value in the right operand, then the whole left operand is a SUBMULTISET of the right operand.

Cool, eh? We’ll talk about performance another time :-)

MULTISET operators

Also very interesting, the multiset operators:

  • MULTISET EXCEPT [ ALL | DISTINCT ]
  • MULTISET INTERSECT [ ALL | DISTINCT ]
  • MULTISET UNION [ ALL | DISTINCT ]

Notice how there are some differences to the ordinary set operators that can be used in SELECT statements. In particular:

  • EXCEPT is used as defined in the standard, not MINUS
  • ALL is supported on all three operators, not just on UNION
  • ALL is the default, not DISTINCT

How can we work with these operators? Consider these queries:

SELECT id, t1 MULTISET EXCEPT t1('aaa', 'abc', 'dup', 'dup') r 
FROM t;

SELECT id, t1 MULTISET EXCEPT ALL t1('aaa', 'abc', 'dup', 'dup') r 
FROM t;

Both yielding:

ID   R
---------------------
1    (null)
2    T1()
3    T1('xyz', 'zzz')
4    T1('dup')

With this operator, we’re removing each element of the right operand once from the left operand:

  • 'aaa' does not appear in the left operand, so nothing happens
  • 'abc' appears on row with ID = 3 and we remove it
  • 'dup' appears on row with ID = 4, 3 times, and we remove it twice, leaving one value

Conversely, when adding DISTINCT, we’ll get:

SELECT t1 MULTISET EXCEPT DISTINCT t1('aaa', 'abc', 'dup') FROM t;

Yielding:

ID   R
---------------------
1    (null)
2    T1()
3    T1('xyz', 'zzz')
4    T1('')

The only difference is on row with ID = 4, where all 'dup' values were removed, regardless how many there were on either side of the MULTISET EXCEPT DISTINCT operator.

How to emulate this for varrays?

DISTINCT version

This is a bit easier, because we can now use MINUS:

-- Nested table version
SELECT t1 MULTISET EXCEPT DISTINCT t1('aaa', 'abc', 'dup', 'dup') 
FROM t;

-- Varray version
SELECT 
  id,
  CASE 
    WHEN t2 IS NULL THEN NULL 
    ELSE 
      CAST(MULTISET(
        SELECT column_value
        FROM TABLE (t2)
        MINUS
        SELECT column_value
        FROM TABLE (t2('aaa', 'abc', 'dup', 'dup'))
      ) AS t2)
  END r
FROM t;

Luckily, we can still cast a structural MULTISET type that we can obtain using the MULTISET() operator to a varray type. This greatly simplifies the task.

ALL version

If we want the MULTISET EXCEPT or MULTISET EXCEPT ALL semantics, things are trickier. Here’s a solution that resorts to using window functions, in order to turn a MULTISET back into a SET:

-- Nested table version
SELECT t1 MULTISET EXCEPT ALL t1('aaa', 'abc', 'dup', 'dup') 
FROM t;

-- Varray version
SELECT 
  id,
  CASE 
    WHEN t2 IS NULL THEN NULL 
    ELSE 
      CAST(MULTISET(
        SELECT column_value
        FROM (
          SELECT 
            column_value,
            row_number() OVER (
              PARTITION BY column_value 
              ORDER BY column_value) rn
          FROM TABLE (t2)
          MINUS
          SELECT 
            column_value, 
            row_number() OVER (
              PARTITION BY column_value 
              ORDER BY column_value) rn
          FROM TABLE (t2('aaa', 'abc', 'dup', 'dup'))
        )
      ) AS t2)
  END r
FROM t;

How does this work? Ideally, we’ll look at what this ROW_NUMBER() evaluates to on each row. For this, we use OUTER APPLY:

SELECT id, t2, column_value, rn
FROM t
OUTER APPLY (
  SELECT 
    column_value,
    row_number() OVER (
      PARTITION BY column_value
      ORDER BY column_value) rn
  FROM TABLE (t2)
);

The result is:

ID      T2                       COLUMN_VALUE  RN
-----------------------------------------------------
1       (null)                   (null)        (null)
2       T2()                     (null)        (null)
3       T2('abc', 'xyz', 'zzz')  abc           1
3       T2('abc', 'xyz', 'zzz')  xyz           1
3       T2('abc', 'xyz', 'zzz')  zzz           1
4       T2('dup', 'dup', 'dup')  dup           1
4       T2('dup', 'dup', 'dup')  dup           2
4       T2('dup', 'dup', 'dup')  dup           3

As can be seen, each duplicate value gets assigned a unique row number due to the nature of how ROW_NUMBER() works (this property can be very useful for solving the gaps-and-islands-problem. See trick #4).

Now that we turned our (COLUMN_VALUE) multiset into a (COLUMN_VALUE, RN) set (without duplicates), we can use MINUS again.

MULTISET INTERSECT and MULTISET UNION

MULTISET INTERSECT works exactly the same way as MULTISET EXCEPT, with the same window function based emulation in the MULTISET INTERSECT ALL case. MULTISET UNION is simpler, because Oracle knows UNION ALL, so we do not need to resort to such trickery.

Conclusion

Nested collections are a very powerful tool in Oracle SQL. Oracle knows two types of nested collections:

  • Nested tables
  • Varrays

Nested tables are trickier to maintain as you have to think of their storage more explicitly. Varrays can just be embedded into ordinary tables like any other column. But there’s a price to pay for using varrays. Oracle regrettably doesn’t support all of the above very useful multiset conditions and multiset operators.

Luckily, when you encounter a situation where you have varrays and cannot change that, you can still emulate each of the operators using more traditional SQL.

You Probably don’t Use SQL INTERSECT or EXCEPT Often Enough

When people talk about SQL JOIN, they often use Venn Diagrams to illustrate inclusion and exclusion of the two joined sets:

venn

While these Venn diagrams are certainly useful to understand (and remember) SQL JOIN syntax, they’re not entirely accurate, because SQL JOIN is a special type of a cartesian product, the CROSS JOIN.

Illustration by Wikipedia user Quartl
Illustration by Wikipedia user Quartl

In a cartesian product between two sets A and B, the result is the multiplication of each set, meaning that each element a ∈ A is combined with each element b ∈ B to form a set of tuples (a, b).

Ordinary SQL JOINs do precisely this. When you join BOOK to AUTHOR, you will probably get a combination of every author ∈ AUTHOR with each book ∈ BOOK, such that for each combination (author, book), the author actually wrote the book.

The true meaning of Venn diagrams

The true meaning of Venn diagrams is much better described by the operations

  • UNION
  • INTERSECT
  • EXCEPT (or MINUS in Oracle)

In the following sections, we’ll see that these operations match exactly the semantics of operations that can be illustrated by Venn diagrams, even if you will be able to “abuse” JOIN operations to achieve the same result.

UNION

The UNION operation is the most well-known among these set operations. It is often also referred to as “concatenation” of two sets of tuples, where the result is the concatenation of a set B to a set A.

In the following example, we’ll see that we might be interested in all the different people from our database, given their first and last names, regardless if they’re customer or staff:

set-union

The original Venn diagrams used FULL OUTER JOIN to model the “same” concept, although the two things are not strictly same. Consider the following query, which we’ll run against the Sakila database:

SELECT first_name, last_name
FROM customer
UNION
SELECT first_name, last_name
FROM staff
ORDER BY first_name, last_name

The result looks like:

first_name   last_name
------------------------------------
AARON        SELBY
ADAM         GOOCH
ADRIAN       CLARY
AGNES        BISHOP
ALAN         KAHN
ALBERT       CROUSE
ALBERTO      HENNING
ALEX         GRESHAM
ALEXANDER    FENNELL
ALFRED       CASILLAS
ALFREDO      MCADAMS
ALICE        STEWART
ALICIA       MILLS
...

Now, run the following “equivalent” query (it isn’t actually equivalent with respect to duplicates):

SELECT first_name, last_name
FROM customer
FULL OUTER JOIN staff 
  USING (first_name, last_name)
ORDER BY first_name, last_name

The result will again yield:

first_name   last_name
------------------------------------
AARON        SELBY
ADAM         GOOCH
ADRIAN       CLARY
AGNES        BISHOP
ALAN         KAHN
ALBERT       CROUSE
ALBERTO      HENNING
...

This only works because we’re using the USING clause, which not every database supports natively. If we did our JOIN with the more commonly used ON clause, we’d have to write the more tedious:

SELECT
  COALESCE(c.first_name, s.first_name) AS first_name,
  COALESCE(c.last_name, s.last_name) AS last_name
FROM customer c
FULL OUTER JOIN staff s
  ON (c.first_name, c.last_name)
  =  (s.first_name, s.last_name)
ORDER BY first_name, last_name

In this case, most people probably default to using UNION already, as it is a much better known operation than FULL OUTER JOIN.

All of jOOQ’s currently supported RDBMS support UNION and UNION ALL (the latter doesn’t remove duplicates).

In the following, we’ll see that equivalent comparisons can be made with other set operations:

INTERSECT

The INTERSECT operation is really useful when you want to keep only those tuples that are present in both sets that are combined using INTERSECT:

set-intersect

As you can see, we may want to retain only those customers that are also actors. Let’s run this query:

SELECT first_name, last_name
FROM customer
INTERSECT
SELECT first_name, last_name
FROM actor
first_name   last_name
------------------------------------
JENNIFER     DAVIS

One of our customers is also an actor. The same query could have been written with an INNER JOIN as such:

SELECT first_name, last_name
FROM customer
INNER JOIN actor 
  USING (first_name, last_name)

… or with the ON syntax

SELECT c.first_name, c.last_name
FROM customer c
INNER JOIN actor a
  ON (c.first_name, c.last_name)
  =  (a.first_name, a.last_name)

This time, no COALESCE is needed, as INNER JOIN retains only those tuples from the cartesian product, which are present on “both sides” of the JOIN, so we can pick any of the tables to prefix our columns.

You may even decide to use a semi-join instead, which would yield the same results:

SELECT first_name, last_name
FROM customer
WHERE (first_name, last_name) IN (
  SELECT first_name, last_name
  FROM actor
)

or, using the more verbose, yet equivalent EXISTS predicate:

SELECT first_name, last_name
FROM customer c
WHERE EXISTS (
  SELECT 1
  FROM actor a
  WHERE (c.first_name, c.last_name)
      = (a.first_name, a.last_name)
)

All of the above, again, yield:

first_name   last_name
------------------------------------
JENNIFER     DAVIS

EXCEPT

The EXCEPT operation is useful when you want to keep only those tuples that are present in one set, but not in another:

set-difference

Running this query:

SELECT first_name, last_name
FROM customer
EXCEPT
SELECT first_name, last_name
FROM staff
ORDER BY first_name, last_name

… will yield:

first_name   last_name
------------------------------------
AARON        SELBY
ADAM         GOOCH
ADRIAN       CLARY
AGNES        BISHOP
ALAN         KAHN
ALBERT       CROUSE
ALBERTO      HENNING
...

According to the original Venn diagrams, this can be tweaked using LEFT JOIN and a IS NULL predicate:

SELECT first_name, last_name
FROM customer
LEFT JOIN staff
  USING (first_name, last_name)
WHERE staff_id IS NULL
ORDER BY first_name, last_name

or with an ON clause:

SELECT c.first_name, c.last_name
FROM customer c
LEFT JOIN staff s
  ON (c.first_name, c.last_name)
  =  (s.first_name, s.last_name)
WHERE staff_id IS NULL
ORDER BY first_name, last_name

This is completely unreadable and doesn’t communicate the fact that we’re removing tuples from a set CUSTOMER, given their presence in another set STAFF.

An equivalent version using anti-join might be more readable (watch out for NULLs in NOT IN predicates, though!):

SELECT c.first_name, c.last_name
FROM customer c
WHERE (first_name, last_name) NOT IN (
  SELECT first_name, last_name
  FROM staff
)
ORDER BY first_name, last_name

… or, using NOT EXISTS:

SELECT c.first_name, c.last_name
FROM customer c
WHERE NOT EXISTS (
  SELECT 1
  FROM staff s
  WHERE (c.first_name, c.last_name)
      = (s.first_name, s.last_name)
)
ORDER BY first_name, last_name

Conclusion

UNION, INTERSECT, and EXCEPT are very simple, yet very useful operations that can add a lot of value every now and then in your daily SQL tasks. While JOIN operations are much more versatile, they are also more complex for the simple tasks that can be solved by UNION, INTERSECT, and EXCEPT

Did you like this article? It’s part of the Data Geekery SQL Training – a 1-day workshop helping you to get the most out of the awesome SQL language.

Read more articles about awesome SQL here: