## How to Use SQL INTERSECT to Work Around SQL’s NULL Logic

ANOTHER SQL Post this week? I got nerd-sniped:
Oooooh, challenge accepted! So, let’s assume we have a table `T` with columns `(A, B, C)` like this:
```
WITH t(a, b, c) AS (
SELECT 'a', 'b', null FROM dual UNION ALL
SELECT 'a', null, 'c' FROM dual UNION ALL
SELECT 'a', 'b', 'c'  FROM dual
)
SELECT * FROM t

```
As expected, this yields:
```A       B       C
-----------------
a       b
a               c
a       b       c
```
Truly exciting. Now we want to find all those rows that “match” either `('a', 'b', NULL)` or `('a', NULL, 'b')`. Clearly, this should produce the first two rows, right?
```A       B       C
-----------------
a       b
a               c
```
Yes. Now the canonical solution would be to tediously write out the entire predicate as such:
```
WITH t(a, b, c) AS (...)
SELECT * FROM t
WHERE (a = 'a' AND b = 'b' AND c IS NULL)
OR (a = 'a' AND b IS NULL AND c = 'c')

```
That’s really boring. Sure, we could have factored out the first, common predicate:
```
WITH t(a, b, c) AS (...)
SELECT * FROM t
WHERE a = 'a' AND (
(b = 'b' AND c IS NULL)
OR (b IS NULL AND c = 'c')
)

```
That’s certainly better from a performance perspective, but Rafael had a nifty idea. Let’s use row value expressions (tuples) in our predicates:
```
WITH t(a, b, c) AS (...)
SELECT * FROM t
WHERE (a, b, c) IN (('a', 'b', NULL), ('a', NULL, 'c'))

```
Unfortunately this doesn’t yield any results, because nothing is equal to `NULL` in SQL (not even `NULL` itself). The above query is the same as this one:
```
WITH t(a, b, c) AS (...)
SELECT * FROM t
WHERE (a = 'a' AND b = 'b' AND c = NULL /* oops */)
OR (a = 'a' AND b = NULL /* oops */ AND c = 'c')

```
D’oh.

### Solutions

The lame one The canonical solution then would be a really lame (but perfectly valid) one. Encode `NULL` to be some “impossible” string value. Rafael suggested `yolo`. Fair enough. This works:
```
WITH t(a, b, c) AS (...)
SELECT * FROM t
WHERE (a, NVL(b, 'yolo'), NVL(c, 'yolo'))
IN (('a', 'b', 'yolo'), ('a', 'yolo', 'c'))

```
All we have to do now is to always remember the term `yolo` which means “`NULL`, but not `NULL` thank you SQL” The hipster one But wait! SQL is painstakingly inconsistent when it comes to `NULL`. See, `NULL` really means `UNKNOWN` in three-valued logic, and this means, we never know if SQL abides to its own rules. Come in `INTERSECT`. Like `UNION` or `EXCEPT` (`MINUS`) in Oracle, as well as `SELECT DISTINCT`, these set operations handle two `NULL` values as `NOT DISTINCT`. Yes, they’re not equal but also not distinct. Whatever. Just remember: That’s how it is :) So, we can write this hipster solution to Rafael’s problem:
```
WITH t(a, b, c) AS (...)
SELECT *
FROM t
WHERE EXISTS (
SELECT a, b, c FROM dual
INTERSECT (
SELECT 'a', 'b', null FROM dual
UNION ALL
SELECT 'a', null, 'c' FROM dual
)
)

```
We create an intersection of the tuple `(a, b, c)`, the left side of Rafael’s `IN` predicate, and the desired values on the right side of the `IN` predicate, and we’re done. Clearly less tedious than writing the original predicates, right? (We won’t look into performance this time) Cheers, and a happy weekend.

## SQL JOIN or EXISTS? Chances Are, You’re Doing it Wrong

I’ve noticed this very consistently with a lot of customers, and also with participants of our Data Geekery SQL Workshop (which I highly recommend to everyone, if you excuse the advertising): A lot of developers get the distinction between JOIN and SEMI-JOIN wrong. Let me explain…

## What are JOIN and SEMI-JOIN

A little bit of relational algebra first. What is an `(INNER) JOIN`? An `JOIN ` is nothing but a filtered cartesian product. And what is a cartesian product? Wikipedia explains this very nicely:
for sets A and B, the Cartesian product A × B is the set of all ordered pairs (a, b) where a ∈ A and b ∈ B.
That was the technical way of putting it. A more understandable way might be the following:
ranks = {A, K, Q, J, 10, 9, 8, 7, 6, 5, 4, 3, 2} suits = {♠, ♥, ♦, ♣} so, ranks × suits = {(A, ♠), (A, ♥), (A, ♦), (A, ♣), (K, ♠), …, (3, ♣), (2, ♠), (2, ♥), (2, ♦), (2, ♣)}
Or, as an image: The above cartesian product models the combination of each rank with each suite. Simple, right? In SQL, a cartesian product can be written as either a `CROSS JOIN`, or a table list in the `FROM` clause. The following query combines every customer with every staff member:
```
-- CROSS JOIN
SELECT c.last_name, s.last_name
FROM customer AS c
CROSS JOIN staff AS s

-- Table list
SELECT c.last_name, s.last_name
FROM customer AS c,
staff AS s

```
Now, as I mentioned before, an `(INNER) JOIN` is nothing but a filtered `CROSS JOIN`, where the filter is applied in a dedicated `USING` or `ON` clause.
```
-- INNER JOIN with USING
SELECT c.last_name, s.last_name
FROM customer AS c
INNER JOIN staff AS s
USING (last_name)

-- INNER JOIN with ON
SELECT c.last_name, s.last_name
FROM customer AS c
INNER JOIN staff AS s
ON c.last_name = s.last_name

```
The above query will match only those customers with those users whose `last_name` are the same. As I’ve told you before, an `(INNER) JOIN` is just a filtered `CROSS JOIN`, so the below queries will be semantically equivalent to the above:
```
-- CROSS JOIN
SELECT c.last_name, s.last_name
FROM customer AS c
CROSS JOIN staff AS s
WHERE c.last_name = s.last_name

-- Table list
SELECT c.last_name, s.last_name
FROM customer AS c,
staff AS s
WHERE c.last_name = s.last_name

```
Specifically the last version is still used in many SQL codebases, which have not yet migrated to the ANSI JOIN syntax (even if ANSI joins should be preferred for readability reasons).

## But that might be wrong

Unfortunately, I’m seeing this mistake all the time, as I’ve mentioned before. `JOIN` might appear like a useful tool to match rows between tables. But remember one thing, and I’m starting to repeat myself:
`(INNER) JOIN` is just a filtered `CROSS JOIN`
This means that if you choose `INNER JOIN` to find those customers for which there are matching staff, you will create a cartesian product between customer and staff, and then apply a filter. Why is that a problem? Let’s assume the following:
```Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| Alice      | Miller    |
| Max        | Doe       |
+------------+-----------+

Staff:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| Alice      | Peterson  |
| Jane       | Doe       |
+------------+-----------+
```
What happens when you run the above queries that use `(INNER) JOIN` to match customers with staff? Exactly. You’ll form a cartesian product first:
```{ (John Doe, John Doe),
(John Doe, Alice Peterson),
(John Doe, Jane Doe),
(Alice Miller, John Doe),
(Alice Miller, Alice Peterson),
(Alice Miller, Jane Doe),
(Max Doe, John Doe),
(Max Doe, Alice Peterson),
(Max Doe, Jane Doe) }
```
… and then filter out the tuples that shouldn’t be in the result, i.e. the ones that don’t have matching last names (of course, the database might choose to optimise this and not materialise the entire cross product):
```{ (John Doe, John Doe),
(John Doe, Jane Doe),
(Max Doe, John Doe),
(Max Doe, Jane Doe) }
```
We’re now left with 4 tuples. That’s great, if that’s what you were after in the first place. A combination of all customers with all staff, for which the combination shares the same last name. But maybe you were asking yourself something else, namely:
Do we have any customers who are staff family members?
Use-case: Exclude such customers from a raffle (let’s assume that last names are a sufficient criteria here). In that case, we’ll get “duplicate” records. Because the query that some of you might’ve written would have been:
```
-- INNER JOIN with USING
SELECT c.*
FROM customer AS c
INNER JOIN staff AS s
USING (last_name)

```
Yielding:
```Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| John       | Doe       |
| Max        | Doe       |
| Max        | Doe       |
+------------+-----------+
```
Bummer. How to remove duplicates? With `DISTINCT` you might think:
```
-- INNER JOIN with USING
SELECT DISTINCT c.*
FROM customer AS c
INNER JOIN staff AS s
USING (last_name)

```
Yielding:
```Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| Max        | Doe       |
+------------+-----------+
```

## What’s wrong with DISTINCT?

Using `DISTINCT` in this situation is a big mistake. Why?
• Your accidental cartesian product loads too many records from disk, and produces too many records in memory, which have to be removed again
• `DISTINCT` can be expensive in some databases, that implement it via sorting, rather than via hashing
• `DISTINCT` may change the semantics of your `SELECT` clause, with nasty side-effects
• In order to prevent those side-effects, you might even resort to wrapping this `DISTINCT` query in a subselect, making performance even worse

## How to do it right?

By using a `SEMI-JOIN`. It is called semi join (i.e. “half” join) in relational algebra, because we only care about one side of the `JOIN` operation in the results, not the other side. In this example, we only care about customers in the result. We don’t want to have any staff records. The relational algebra notation would be
```Customer ⋉ Staff
```
Unfortunately, SQL doesn’t have `SEMI JOIN` keywords, so the following isn’t possible:
```
SELECT *
FROM customer AS c
LEFT SEMI JOIN staff AS s
USING (last_name)

```
The SQL way to express a `SEMI JOIN` is by using `EXISTS ()` or `IN ()`. The following two are equivalent:
```
-- Using EXISTS
SELECT *
FROM customer AS c
WHERE EXISTS (
SELECT *
FROM staff AS s
WHERE c.last_name = s.last_name
)

-- Using IN
SELECT *
FROM customer
WHERE last_name IN (
SELECT last_name
FROM staff
)

```
(Note, that `NOT EXISTS` and `NOT IN` are NOT equivalent) Not only are these queries more correct, they are also much faster in most SQL databases for a simple reason. The database can stop searching for staff as soon as it has encountered at least one staff for which there is a matching customer. This is also nicely explained in Dan Martensen’s article SQL Performance of JOIN and WHERE EXISTS. And we’ve blogged about a related topic here: SQL Tip of the Day: Be Wary of SELECT COUNT(*).

## Semi Join and Anti Join in jOOQ

We believe that these useful relational operators should be first class citizens in SQL as we have stated in our blog post: http://blog.jooq.org/2015/10/13/semi-join-and-anti-join-should-have-its-own-syntax-in-sql Semi join
```
ctx.select()
.from(Employee)
.leftSemiJoin(Dept)
.on(Employee.DeptName.eq(Dept.DeptName))
.fetch();

```
Anti join
```
ctx.select()
.from(Employee)
.leftAntiJoin(Dept)
.on(Employee.DeptName.eq(Dept.DeptName))
.fetch();

```
The above is much easier to write, and will transform into the corresponding `(NOT) EXISTS` predicate.

## Exception

There are some databases that may unfortunately show worse performance for some of these semi join / anti join operators. See, for instance this outdated article on MySQL performance: http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql Do measure first, before you believe any of these articles, though! Another exception is when you have a primary key / foreign key relationship that guarantees that an `(INNER) JOIN` produces no duplicate values, i.e. when you’re joining a one-to-one or many-to-one relationship, then `JOIN` is a correct solution, but it is usually equally fast, so semi join will still be more readable.

## Conclusion

If you need to check whether you have any matches between a table A and a table B, but you only really care about the results from table A, do make sure you’re using a `SEMI-JOIN` (i.e. an `EXISTS` or `IN` predicate), not an `(INNER) JOIN`. Further reading: