SQL JOIN or EXISTS? Chances Are, You’re Doing it Wrong

I’ve noticed this very consistently with a lot of customers, and also with participants of our Data Geekery SQL Workshop (which I highly recommend to everyone, if you excuse the advertising): A lot of developers get the distinction between JOIN and SEMI-JOIN wrong. Let me explain…

What are JOIN and SEMI-JOIN

A little bit of relational algebra first. What is an (INNER) JOIN? An JOIN is nothing but a filtered cartesian product. And what is a cartesian product? Wikipedia explains this very nicely:

for sets A and B, the Cartesian product A × B is the set of all ordered pairs (a, b) where a ∈ A and b ∈ B.

That was the technical way of putting it. A more understandable way might be the following:

ranks = {A, K, Q, J, 10, 9, 8, 7, 6, 5, 4, 3, 2}
suits = {♠, ♥, ♦, ♣}

so, ranks × suits =
{(A, ♠), (A, ♥), (A, ♦), (A, ♣), (K, ♠),
…,
(3, ♣), (2, ♠), (2, ♥), (2, ♦), (2, ♣)}

Or, as an image:

By Trainler - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=7104281
By Trainler – Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=7104281

The above cartesian product models the combination of each rank with each suite. Simple, right?

In SQL, a cartesian product can be written as either a CROSS JOIN, or a table list in the FROM clause. The following query combines every customer with every staff member:

-- CROSS JOIN
SELECT c.last_name, s.last_name
FROM customer AS c
CROSS JOIN staff AS s

-- Table list
SELECT c.last_name, s.last_name
FROM customer AS c,
     staff AS s

Now, as I mentioned before, an (INNER) JOIN is nothing but a filtered CROSS JOIN, where the filter is applied in a dedicated USING or ON clause.

-- INNER JOIN with USING
SELECT c.last_name, s.last_name
FROM customer AS c
INNER JOIN staff AS s 
  USING (last_name)

-- INNER JOIN with ON
SELECT c.last_name, s.last_name
FROM customer AS c
INNER JOIN staff AS s 
  ON c.last_name = s.last_name

The above query will match only those customers with those users whose last_name are the same. As I’ve told you before, an (INNER) JOIN is just a filtered CROSS JOIN, so the below queries will be semantically equivalent to the above:

-- CROSS JOIN
SELECT c.last_name, s.last_name
FROM customer AS c
CROSS JOIN staff AS s
WHERE c.last_name = s.last_name

-- Table list
SELECT c.last_name, s.last_name
FROM customer AS c,
     staff AS s
WHERE c.last_name = s.last_name

Specifically the last version is still used in many SQL codebases, which have not yet migrated to the ANSI JOIN syntax (even if ANSI joins should be preferred for readability reasons).

But that might be wrong

Unfortunately, I’m seeing this mistake all the time, as I’ve mentioned before. JOIN might appear like a useful tool to match rows between tables. But remember one thing, and I’m starting to repeat myself:

(INNER) JOIN is just a filtered CROSS JOIN

This means that if you choose INNER JOIN to find those customers for which there are matching staff, you will create a cartesian product between customer and staff, and then apply a filter. Why is that a problem? Let’s assume the following:

Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| Alice      | Miller    |
| Max        | Doe       |
+------------+-----------+

Staff:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| Alice      | Peterson  |
| Jane       | Doe       |
+------------+-----------+

What happens when you run the above queries that use (INNER) JOIN to match customers with staff? Exactly. You’ll form a cartesian product first:

{ (John Doe, John Doe),
  (John Doe, Alice Peterson),
  (John Doe, Jane Doe),
  (Alice Miller, John Doe),
  (Alice Miller, Alice Peterson),
  (Alice Miller, Jane Doe),
  (Max Doe, John Doe),
  (Max Doe, Alice Peterson),
  (Max Doe, Jane Doe) }

… and then filter out the tuples that shouldn’t be in the result, i.e. the ones that don’t have matching last names (of course, the database might choose to optimise this and not materialise the entire cross product):

{ (John Doe, John Doe),
  (John Doe, Jane Doe),
  (Max Doe, John Doe),
  (Max Doe, Jane Doe) }

We’re now left with 4 tuples. That’s great, if that’s what you were after in the first place. A combination of all customers with all staff, for which the combination shares the same last name. But maybe you were asking yourself something else, namely:

Do we have any customers who are staff family members?

Use-case: Exclude such customers from a raffle (let’s assume that last names are a sufficient criteria here).

In that case, we’ll get “duplicate” records. Because the query that some of you might’ve written would have been:

-- INNER JOIN with USING
SELECT c.*
FROM customer AS c
INNER JOIN staff AS s 
  USING (last_name)

Yielding:

Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| John       | Doe       |
| Max        | Doe       |
| Max        | Doe       |
+------------+-----------+

Bummer. How to remove duplicates? With DISTINCT you might think:

-- INNER JOIN with USING
SELECT DISTINCT c.*
FROM customer AS c
INNER JOIN staff AS s 
  USING (last_name)

Yielding:

Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| Max        | Doe       |
+------------+-----------+

What’s wrong with DISTINCT?

Using DISTINCT in this situation is a big mistake. Why?

  • Your accidental cartesian product loads too many records from disk, and produces too many records in memory, which have to be removed again
  • DISTINCT can be expensive in some databases, that implement it via sorting, rather than via hashing
  • DISTINCT may change the semantics of your SELECT clause, with nasty side-effects
  • In order to prevent those side-effects, you might even resort to wrapping this DISTINCT query in a subselect, making performance even worse

That’s horrible. See also this list of common SQL mistakes:
https://blog.jooq.org/2013/07/30/10-common-mistakes-java-developers-make-when-writing-sql

How to do it right?

By using a SEMI-JOIN. It is called semi join (i.e. “half” join) in relational algebra, because we only care about one side of the JOIN operation in the results, not the other side. In this example, we only care about customers in the result. We don’t want to have any staff records. The relational algebra notation would be

Customer ⋉ Staff

Unfortunately, SQL doesn’t have SEMI JOIN keywords, so the following isn’t possible:

SELECT *
FROM customer AS c
LEFT SEMI JOIN staff AS s 
  USING (last_name)

The SQL way to express a SEMI JOIN is by using EXISTS () or IN (). The following two are equivalent:

-- Using EXISTS
SELECT *
FROM customer AS c
WHERE EXISTS (
  SELECT *
  FROM staff AS s
  WHERE c.last_name = s.last_name
)

-- Using IN
SELECT *
FROM customer
WHERE last_name IN (
  SELECT last_name
  FROM staff
)

(Note, that NOT EXISTS and NOT IN are NOT equivalent)

Not only are these queries more correct, they are also much faster in most SQL databases for a simple reason. The database can stop searching for staff as soon as it has encountered at least one staff for which there is a matching customer. This is also nicely explained in Dan Martensen’s article SQL Performance of JOIN and WHERE EXISTS. And we’ve blogged about a related topic here: SQL Tip of the Day: Be Wary of SELECT COUNT(*).

Semi Join and Anti Join in jOOQ

We believe that these useful relational operators should be first class citizens in SQL as we have stated in our blog post:
https://blog.jooq.org/2015/10/13/semi-join-and-anti-join-should-have-its-own-syntax-in-sql

Semi join

ctx.select()
   .from(Employee)
   .leftSemiJoin(Dept)
   .on(Employee.DeptName.eq(Dept.DeptName))
   .fetch();

Anti join

ctx.select()
   .from(Employee)
   .leftAntiJoin(Dept)
   .on(Employee.DeptName.eq(Dept.DeptName))
   .fetch();

The above is much easier to write, and will transform into the corresponding (NOT) EXISTS predicate.

Exception

There are some databases that may unfortunately show worse performance for some of these semi join / anti join operators. See, for instance this outdated article on MySQL performance:
http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql

Do measure first, before you believe any of these articles, though!

Another exception is when you have a primary key / foreign key relationship that guarantees that an (INNER) JOIN produces no duplicate values, i.e. when you’re joining a one-to-one or many-to-one relationship, then JOIN is a correct solution, but it is usually equally fast, so semi join will still be more readable.

Conclusion

If you need to check whether you have any matches between a table A and a table B, but you only really care about the results from table A, do make sure you’re using a SEMI-JOIN (i.e. an EXISTS or IN predicate), not an (INNER) JOIN.

Further reading:

You Probably don’t Use SQL INTERSECT or EXCEPT Often Enough

When people talk about SQL JOIN, they often use Venn Diagrams to illustrate inclusion and exclusion of the two joined sets:

venn

While these Venn diagrams are certainly useful to understand (and remember) SQL JOIN syntax, they’re not entirely accurate, because SQL JOIN is a special type of a cartesian product, the CROSS JOIN.

Illustration by Wikipedia user Quartl
Illustration by Wikipedia user Quartl

In a cartesian product between two sets A and B, the result is the multiplication of each set, meaning that each element a ∈ A is combined with each element b ∈ B to form a set of tuples (a, b).

Ordinary SQL JOINs do precisely this. When you join BOOK to AUTHOR, you will probably get a combination of every author ∈ AUTHOR with each book ∈ BOOK, such that for each combination (author, book), the author actually wrote the book.

The true meaning of Venn diagrams

The true meaning of Venn diagrams is much better described by the operations

  • UNION
  • INTERSECT
  • EXCEPT (or MINUS in Oracle)

In the following sections, we’ll see that these operations match exactly the semantics of operations that can be illustrated by Venn diagrams, even if you will be able to “abuse” JOIN operations to achieve the same result.

UNION

The UNION operation is the most well-known among these set operations. It is often also referred to as “concatenation” of two sets of tuples, where the result is the concatenation of a set B to a set A.

In the following example, we’ll see that we might be interested in all the different people from our database, given their first and last names, regardless if they’re customer or staff:

set-union

The original Venn diagrams used FULL OUTER JOIN to model the “same” concept, although the two things are not strictly same. Consider the following query, which we’ll run against the Sakila database:

SELECT first_name, last_name
FROM customer
UNION
SELECT first_name, last_name
FROM staff
ORDER BY 1, 2

The result looks like:

first_name   last_name
------------------------------------
AARON        SELBY
ADAM         GOOCH
ADRIAN       CLARY
AGNES        BISHOP
ALAN         KAHN
ALBERT       CROUSE
ALBERTO      HENNING
ALEX         GRESHAM
ALEXANDER    FENNELL
ALFRED       CASILLAS
ALFREDO      MCADAMS
ALICE        STEWART
ALICIA       MILLS
...

Now, run the following “equivalent” query:

SELECT first_name, last_name
FROM customer
FULL OUTER JOIN staff 
  USING (first_name, last_name)
ORDER BY 1, 2

The result will again yield:

first_name   last_name
------------------------------------
AARON        SELBY
ADAM         GOOCH
ADRIAN       CLARY
AGNES        BISHOP
ALAN         KAHN
ALBERT       CROUSE
ALBERTO      HENNING
...

This only works because we’re using the USING clause, which not every database supports natively. If we did our JOIN with the more commonly used ON clause, we’d have to write the more tedious:

SELECT
  COALESCE(c.first_name, s.first_name) AS first_name,
  COALESCE(c.last_name, s.last_name) AS last_name
FROM customer c
FULL OUTER JOIN staff s
  ON (c.first_name, c.last_name)
  =  (s.first_name, s.last_name)
ORDER BY 1, 2

In this case, most people probably default to using UNION already, as it is a much better known operation than FULL OUTER JOIN.

All of jOOQ’s currently supported RDBMS support UNION and UNION ALL (the latter doesn’t remove duplicates).

In the following, we’ll see that equivalent comparisons can be made with other set operations:

INTERSECT

The INTERSECT operation is really useful when you want to keep only those tuples that are present in both sets that are combined using INTERSECT:

set-intersect

As you can see, we may want to retain only those customers that are also actors. Let’s run this query:

SELECT first_name, last_name
FROM customer
INTERSECT
SELECT first_name, last_name
FROM actor
first_name   last_name
------------------------------------
JENNIFER     DAVIS

One of our customers is also an actor. The same query could have been written with an INNER JOIN as such:

SELECT first_name, last_name
FROM customer
INNER JOIN actor 
  USING (first_name, last_name)

… or with the ON syntax

SELECT c.first_name, c.last_name
FROM customer c
INNER JOIN actor a
  ON (c.first_name, c.last_name)
  =  (a.first_name, a.last_name)

This time, no COALESCE is needed, as INNER JOIN retains only those tuples from the cartesian product, which are present on “both sides” of the JOIN, so we can pick any of the tables to prefix our columns.

You may even decide to use a semi-join instead, which would yield the same results:

SELECT first_name, last_name
FROM customer
WHERE (first_name, last_name) IN (
  SELECT first_name, last_name
  FROM actor
)

or, using the more verbose, yet equivalent EXISTS predicate:

SELECT first_name, last_name
FROM customer c
WHERE EXISTS (
  SELECT 1
  FROM actor a
  WHERE (c.first_name, c.last_name)
      = (a.first_name, a.last_name)
)

All of the above, again, yield:

first_name   last_name
------------------------------------
JENNIFER     DAVIS

EXCEPT

The EXCEPT operation is useful when you want to keep only those tuples that are present in one set, but not in another:

set-difference

Running this query:

SELECT first_name, last_name
FROM customer
EXCEPT
SELECT first_name, last_name
FROM staff
ORDER BY 1, 2

… will yield:

first_name   last_name
------------------------------------
AARON        SELBY
ADAM         GOOCH
ADRIAN       CLARY
AGNES        BISHOP
ALAN         KAHN
ALBERT       CROUSE
ALBERTO      HENNING
...

According to the original Venn diagrams, this can be tweaked using LEFT JOIN and a IS NULL predicate:

SELECT first_name, last_name
FROM customer
LEFT JOIN staff
  USING (first_name, last_name)
WHERE staff_id IS NULL
ORDER BY 1, 2

or with an ON clause:

SELECT c.first_name, c.last_name
FROM customer c
LEFT JOIN staff s
  ON (c.first_name, c.last_name)
  =  (s.first_name, s.last_name)
WHERE staff_id IS NULL
ORDER BY 1, 2

This is completely unreadable and doesn’t communicate the fact that we’re removing tuples from a set CUSTOMER, given their presence in another set STAFF.

An equivalent version using anti-join might be more readable (watch out for NULLs in NOT IN predicates, though!):

SELECT c.first_name, c.last_name
FROM customer c
WHERE (first_name, last_name) NOT IN (
  SELECT first_name, last_name
  FROM staff
)
ORDER BY 1, 2

… or, using NOT EXISTS:

SELECT c.first_name, c.last_name
FROM customer c
WHERE NOT EXISTS (
  SELECT 1
  FROM staff s
  WHERE (c.first_name, c.last_name)
      = (s.first_name, s.last_name)
)
ORDER BY 1, 2

Conclusion

UNION, INTERSECT, and EXCEPT are very simple, yet very useful operations that can add a lot of value every now and then in your daily SQL tasks. While JOIN operations are much more versatile, they are also more complex for the simple tasks that can be solved by UNION, INTERSECT, and EXCEPT

Did you like this article? It’s part of the Data Geekery SQL Training – a 1-day workshop helping you to get the most out of the awesome SQL language.

Read more articles about awesome SQL here:

CROSS JOIN, a nice example for a rarely used operation

In 95% of the cases, cartesian products originate from accidental cross join operations and cause unnecessary high load on a database. Maybe the results aren’t even wrong, as someone may have applied a UNION or a DISTINCT keyword, to remove unwanted duplicates.

But there are those 5% of SQL queries, where the cartesian product is intentional, even desireable. Here’s a nice example of such a query. Let’s assume we have these two tables (for simplicity, constraints are omitted):

create table content_hits (
  content_id int,
  subscriber_id int
);

create table content_tags (
  content_id int,
  tag_id int
);

The above tables model some blog content, where content_hits shows how many subscribers are concerned with a given blog entry, and content_tags shows which tag is applied to a given blog entry. So let’s say, we want to have some statistics about how many times any given subscriber/tag combination was actually seen. We cannot just join and group by subscriber_id, tag_id if we want the complete list of combinations (including the ones that have a zero count). So a cross join comes to the rescue!

SELECT combinations.tag_id,
       combinations.subscriber_id,

-- correlated subquery to count the actual hits by tag/subscriber when
-- joining the two tables using content_id

       (SELECT count(*)
        FROM content_hits AS h
        JOIN content_tag AS t ON h.content_id = t.content_id
        WHERE h.subscriber_id = combinations.subscriber_id
        AND t.tag_id = combinations.tag_id) as cnt

-- Create all combinations of tag/subscribers first, before counting
-- anything. This will be necessary to have "zero-counts" for any
-- combination of tag/subscriber

FROM (
  SELECT DISTINCT tag_id, subscriber_id
  FROM content_tag
  CROSS JOIN content_hits
) AS combinations

Instead of using a correlated subquery, for performance reasons (depending on the database) it might be more interesting to left outer join the first subquery to the “combinations” relation and then group by subscriber_id, tag_id, and count(content_id) for every group, resulting in this query:

SELECT combinations.tag_id,
       combinations.subscriber_id,
       count(content_relation.content_id)

-- Create all combinations of tag/subscribers first, before counting
-- anything. This will be necessary to have "zero-counts" for any
-- combination of tag/subscriber

FROM (
  SELECT DISTINCT tag_id, subscriber_id
  FROM content_tag
  CROSS JOIN content_hits
) AS combinations

-- Now, outer join every expected combination to its 
-- actual relation joined on content_id

LEFT OUTER JOIN (
  SELECT h.content_id, 
         h.subscriber_id,
         t.tag_id
  FROM content_hits AS h
  JOIN content_tag AS t ON h.content_id = t.content_id
) AS content_relation
ON combinations.tag_id = content_relation.tag_id
AND combinations.subscriber_id = content_relation.subscriber_id

-- And now, finally, group, in order to get the count(content_id)
GROUP BY combinations.tag_id,
         combinations.subscriber_id

The original Stack Overflow question can be seen here:

http://stackoverflow.com/questions/9924433/mysql-select-query-using-count