How to Use SQL INTERSECT to Work Around SQL’s NULL Logic

ANOTHER SQL Post this week? I got nerd-sniped:

Oooooh, challenge accepted!

So, let’s assume we have a table T with columns (A, B, C) like this:

WITH t(a, b, c) AS (
  SELECT 'a', 'b', null FROM dual UNION ALL
  SELECT 'a', null, 'c' FROM dual UNION ALL
  SELECT 'a', 'b', 'c'  FROM dual
)
SELECT * FROM t

As expected, this yields:

A       B       C
-----------------
a       b
a               c
a       b       c

Truly exciting.

Now we want to find all those rows that “match” either ('a', 'b', NULL) or ('a', NULL, 'b'). Clearly, this should produce the first two rows, right?

A       B       C
-----------------
a       b
a               c

Yes. Now the canonical solution would be to tediously write out the entire predicate as such:

WITH t(a, b, c) AS (...)
SELECT * FROM t
WHERE (a = 'a' AND b = 'b' AND c IS NULL)
OR (a = 'a' AND b IS NULL AND c = 'c')

That’s really boring. Sure, we could have factored out the first, common predicate:

WITH t(a, b, c) AS (...)
SELECT * FROM t
WHERE a = 'a' AND (
     (b = 'b' AND c IS NULL)
  OR (b IS NULL AND c = 'c')
)

That’s certainly better from a performance perspective, but Rafael had a nifty idea. Let’s use row value expressions (tuples) in our predicates:

WITH t(a, b, c) AS (...)
SELECT * FROM t
WHERE (a, b, c) IN (('a', 'b', NULL), ('a', NULL, 'c'))

Unfortunately this doesn’t yield any results, because nothing is equal to NULL in SQL (not even NULL itself). The above query is the same as this one:

WITH t(a, b, c) AS (...)
SELECT * FROM t
WHERE (a = 'a' AND b = 'b' AND c = NULL /* oops */)
OR (a = 'a' AND b = NULL /* oops */ AND c = 'c')

D’oh.

Solutions

The lame one

The canonical solution then would be a really lame (but perfectly valid) one. Encode NULL to be some “impossible” string value. Rafael suggested yolo. Fair enough.

This works:

WITH t(a, b, c) AS (...)
SELECT * FROM t
WHERE (a, NVL(b, 'yolo'), NVL(c, 'yolo')) 
  IN (('a', 'b', 'yolo'), ('a', 'yolo', 'c'))

All we have to do now is to always remember the term yolo which means “NULL, but not NULL thank you SQL”

The hipster one

But wait! SQL is painstakingly inconsistent when it comes to NULL. See, NULL really means UNKNOWN in three-valued logic, and this means, we never know if SQL abides to its own rules.

Come in INTERSECT. Like UNION or EXCEPT (MINUS) in Oracle, as well as SELECT DISTINCT, these set operations handle two NULL values as NOT DISTINCT. Yes, they’re not equal but also not distinct. Whatever. Just remember: That’s how it is 🙂

So, we can write this hipster solution to Rafael’s problem:

WITH t(a, b, c) AS (...)
SELECT *
FROM t
WHERE EXISTS (
  SELECT a, b, c FROM dual
  INTERSECT (
    SELECT 'a', 'b', null FROM dual
    UNION ALL
    SELECT 'a', null, 'c' FROM dual
  )
)

We create an intersection of the tuple (a, b, c), the left side of Rafael’s IN predicate, and the desired values on the right side of the IN predicate, and we’re done.

Clearly less tedious than writing the original predicates, right? (We won’t look into performance this time)

Cheers, and a happy weekend.

Say NO to Venn Diagrams When Explaining JOINs

In recent times, there have been a couple of tremendously popular blog posts explaining JOINs using Venn Diagrams. After all, relational algebra and SQL are set oriented theories and languages, so it only makes sense to illustrate set operations like JOINs using Venn Diagrams. Right?

Google seems to say so:

venn-google

Everyone uses Venn Diagrams to explain JOINs. But that’s…

PLAIN WRONG!

Venn Diagrams are perfect to illustrate … actual set operations! SQL knows three of them:

  • UNION
  • INTERSECT
  • EXCEPT

And they can be explained as such:

venn-union

venn-intersection

venn-difference

(all of these slides are taken from our Data Geekery SQL Training, do check it out!)

Most of you use UNION occasionally. INTERSECT and EXCEPT are more exotic, but do come in handy every now and then.

The point here is: these set operations operate on sets of elements (tuples), which are all of the same type. As in the examples above, all elements are people with first and last names. This is also why INTERSECT and EXCEPT are more exotic, because they’re usually not very useful. JOIN is much more useful. For instance, you want to combine the set of actors with their corresponding set of films.

A JOIN is really a cartesian product (also cross product) with a filter. Here’s a nice illustration of a cartesian product:

venn-cross-product

So, what’s a better way to illustrate JOIN operations?

JOIN diagrams! Let’s look at CROSS JOIN first, because all other JOIN types can be derived from CROSS JOIN:

venn-cross-join

Remember, in a cross join (in SQL also written with a comma separated table list, historically) is just taking every item on the left side, and combines it with every item on the right side. When you CROSS JOIN a table of 3 rows with a table of 4 rows, you will get 3×4=12 result rows. See, I’m using an “x” character to write the multiplication. I.e. a “cross”.

INNER JOIN

All other joins are still based on cross joins, but with additional filters, and perhaps unions. Here’s an explanation of each individual JOIN type.

venn-join

In plain text, an INNER JOIN is a CROSS JOIN in which only those combinations are retained which fulfil a given predicate. For instance:

-- "Classic" ANSI JOIN syntax
SELECT *
FROM author a
JOIN book b ON a.author_id = b.author_id

-- "Nice" ANSI JOIN syntax
SELECT *
FROM author a
JOIN book b USING (author_id)

-- "Old" syntax using a "CROSS JOIN"
SELECT *
FROM author a, book b
WHERE a.author_id = b.author_id

OUTER JOIN

OUTER JOIN types help where we want to retain those rows from either the LEFT side or the RIGHT or both (FULL) sides, for which there was no matching row where the predicate yielded true.

A LEFT OUTER JOIN in relational algebra is defined as such:

dd81ee1373d922122ce1b3e0da74cb28

Or more verbosely in SQL:

SELECT *
FROM author a
LEFT JOIN book b USING (author_id)

This will produce all the authors and their books, but if an author doesn’t have any book, we still want to get the author with NULL as their only book value. So, it’s the same as writing:

SELECT *
FROM author a
JOIN book b USING (author_id)

UNION

SELECT a.*, NULL, NULL, NULL, ..., NULL
FROM (
  SELECT a.*
  FROM author a
  
  EXCEPT
  
  SELECT a.*
  FROM author a
  JOIN book b USING (author_id)
) a

But no one wants to write that much SQL, so OUTER JOIN was implemented.

Conclusion: Say NO to Venn Diagrams

JOINs are relatively easy to understand intuitively. And they’re relatively easy to explain using Venn Diagrams. But whenever you do that, remember, that you’re making a wrong analogy. A JOIN is not strictly a set operation that can be described with Venn Diagrams. A JOIN is always a cross product with a predicate, and possibly a UNION to add additional rows to the OUTER JOIN result.

So, if in doubt, please use JOIN diagrams rather than Venn Diagrams. They’re more accurate and visually more useful.

venn-google-say-no

(Remember, all of these slides are taken from our Data Geekery SQL Training, do get in touch, if you’re interested)

You Probably don’t Use SQL INTERSECT or EXCEPT Often Enough

When people talk about SQL JOIN, they often use Venn Diagrams to illustrate inclusion and exclusion of the two joined sets:

venn

While these Venn diagrams are certainly useful to understand (and remember) SQL JOIN syntax, they’re not entirely accurate, because SQL JOIN is a special type of a cartesian product, the CROSS JOIN.

Illustration by Wikipedia user Quartl

Illustration by Wikipedia user Quartl

In a cartesian product between two sets A and B, the result is the multiplication of each set, meaning that each element a ∈ A is combined with each element b ∈ B to form a set of tuples (a, b).

Ordinary SQL JOINs do precisely this. When you join BOOK to AUTHOR, you will probably get a combination of every author ∈ AUTHOR with each book ∈ BOOK, such that for each combination (author, book), the author actually wrote the book.

The true meaning of Venn diagrams

The true meaning of Venn diagrams is much better described by the operations

  • UNION
  • INTERSECT
  • EXCEPT (or MINUS in Oracle)

In the following sections, we’ll see that these operations match exactly the semantics of operations that can be illustrated by Venn diagrams, even if you will be able to “abuse” JOIN operations to achieve the same result.

UNION

The UNION operation is the most well-known among these set operations. It is often also referred to as “concatenation” of two sets of tuples, where the result is the concatenation of a set B to a set A.

In the following example, we’ll see that we might be interested in all the different people from our database, given their first and last names, regardless if they’re customer or staff:

set-union

The original Venn diagrams used FULL OUTER JOIN to model the “same” concept, although the two things are not strictly same. Consider the following query, which we’ll run against the Sakila database:

SELECT first_name, last_name
FROM customer
UNION
SELECT first_name, last_name
FROM staff
ORDER BY 1, 2

The result looks like:

first_name   last_name
------------------------------------
AARON        SELBY
ADAM         GOOCH
ADRIAN       CLARY
AGNES        BISHOP
ALAN         KAHN
ALBERT       CROUSE
ALBERTO      HENNING
ALEX         GRESHAM
ALEXANDER    FENNELL
ALFRED       CASILLAS
ALFREDO      MCADAMS
ALICE        STEWART
ALICIA       MILLS
...

Now, run the following “equivalent” query:

SELECT first_name, last_name
FROM customer
FULL OUTER JOIN staff 
  USING (first_name, last_name)
ORDER BY 1, 2

The result will again yield:

first_name   last_name
------------------------------------
AARON        SELBY
ADAM         GOOCH
ADRIAN       CLARY
AGNES        BISHOP
ALAN         KAHN
ALBERT       CROUSE
ALBERTO      HENNING
...

This only works because we’re using the USING clause, which not every database supports natively. If we did our JOIN with the more commonly used ON clause, we’d have to write the more tedious:

SELECT
  COALESCE(c.first_name, s.first_name) AS first_name,
  COALESCE(c.last_name, s.last_name) AS last_name
FROM customer c
FULL OUTER JOIN staff s
  ON (c.first_name, c.last_name)
  =  (s.first_name, s.last_name)
ORDER BY 1, 2

In this case, most people probably default to using UNION already, as it is a much better known operation than FULL OUTER JOIN.

All of jOOQ’s currently supported RDBMS support UNION and UNION ALL (the latter doesn’t remove duplicates).

In the following, we’ll see that equivalent comparisons can be made with other set operations:

INTERSECT

The INTERSECT operation is really useful when you want to keep only those tuples that are present in both sets that are combined using INTERSECT:

set-intersect

As you can see, we may want to retain only those customers that are also actors. Let’s run this query:

SELECT first_name, last_name
FROM customer
INTERSECT
SELECT first_name, last_name
FROM actor
first_name   last_name
------------------------------------
JENNIFER     DAVIS

One of our customers is also an actor. The same query could have been written with an INNER JOIN as such:

SELECT first_name, last_name
FROM customer
INNER JOIN actor 
  USING (first_name, last_name)

… or with the ON syntax

SELECT c.first_name, c.last_name
FROM customer c
INNER JOIN actor a
  ON (c.first_name, c.last_name)
  =  (a.first_name, a.last_name)

This time, no COALESCE is needed, as INNER JOIN retains only those tuples from the cartesian product, which are present on “both sides” of the JOIN, so we can pick any of the tables to prefix our columns.

You may even decide to use a semi-join instead, which would yield the same results:

SELECT first_name, last_name
FROM customer
WHERE (first_name, last_name) IN (
  SELECT first_name, last_name
  FROM actor
)

or, using the more verbose, yet equivalent EXISTS predicate:

SELECT first_name, last_name
FROM customer c
WHERE EXISTS (
  SELECT 1
  FROM actor a
  WHERE (c.first_name, c.last_name)
      = (a.first_name, a.last_name)
)

All of the above, again, yield:

first_name   last_name
------------------------------------
JENNIFER     DAVIS

EXCEPT

The EXCEPT operation is useful when you want to keep only those tuples that are present in one set, but not in another:

set-difference

Running this query:

SELECT first_name, last_name
FROM customer
EXCEPT
SELECT first_name, last_name
FROM staff
ORDER BY 1, 2

… will yield:

first_name   last_name
------------------------------------
AARON        SELBY
ADAM         GOOCH
ADRIAN       CLARY
AGNES        BISHOP
ALAN         KAHN
ALBERT       CROUSE
ALBERTO      HENNING
...

According to the original Venn diagrams, this can be tweaked using LEFT JOIN and a IS NULL predicate:

SELECT first_name, last_name
FROM customer
LEFT JOIN staff
  USING (first_name, last_name)
WHERE staff_id IS NULL
ORDER BY 1, 2

or with an ON clause:

SELECT c.first_name, c.last_name
FROM customer c
LEFT JOIN staff s
  ON (c.first_name, c.last_name)
  =  (s.first_name, s.last_name)
WHERE staff_id IS NULL
ORDER BY 1, 2

This is completely unreadable and doesn’t communicate the fact that we’re removing tuples from a set CUSTOMER, given their presence in another set STAFF.

An equivalent version using anti-join might be more readable (watch out for NULLs in NOT IN predicates, though!):

SELECT c.first_name, c.last_name
FROM customer c
WHERE (first_name, last_name) NOT IN (
  SELECT first_name, last_name
  FROM staff
)
ORDER BY 1, 2

… or, using NOT EXISTS:

SELECT c.first_name, c.last_name
FROM customer c
WHERE NOT EXISTS (
  SELECT 1
  FROM staff s
  WHERE (c.first_name, c.last_name)
      = (s.first_name, s.last_name)
)
ORDER BY 1, 2

Conclusion

UNION, INTERSECT, and EXCEPT are very simple, yet very useful operations that can add a lot of value every now and then in your daily SQL tasks. While JOIN operations are much more versatile, they are also more complex for the simple tasks that can be solved by UNION, INTERSECT, and EXCEPT

Did you like this article? It’s part of the Data Geekery SQL Training – a 1-day workshop helping you to get the most out of the awesome SQL language.

Read more articles about awesome SQL here: