A little bit of relational algebra first. What is an (INNER) JOIN? An JOIN is nothing but a filtered cartesian product. And what is a cartesian product? Wikipedia explains this very nicely:
for sets A and B, the Cartesian product A × B is the set of all ordered pairs (a, b) where a ∈ A and b ∈ B.
That was the technical way of putting it. A more understandable way might be the following:
Or, as an image:
By Trainler – Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=7104281
The above cartesian product models the combination of each rank with each suite. Simple, right?
In SQL, a cartesian product can be written as either a CROSS JOIN, or a table list in the FROM clause. The following query combines every customer with every staff member:
-- CROSS JOIN
SELECT c.last_name, s.last_name
FROM customer AS c
CROSS JOIN staff AS s
-- Table list
SELECT c.last_name, s.last_name
FROM customer AS c,
staff AS s
Now, as I mentioned before, an (INNER) JOIN is nothing but a filtered CROSS JOIN, where the filter is applied in a dedicated USING or ON clause.
-- INNER JOIN with USING
SELECT c.last_name, s.last_name
FROM customer AS c
INNER JOIN staff AS s
USING (last_name)
-- INNER JOIN with ON
SELECT c.last_name, s.last_name
FROM customer AS c
INNER JOIN staff AS s
ON c.last_name = s.last_name
The above query will match only those customers with those users whose last_name are the same. As I’ve told you before, an (INNER) JOIN is just a filtered CROSS JOIN, so the below queries will be semantically equivalent to the above:
-- CROSS JOIN
SELECT c.last_name, s.last_name
FROM customer AS c
CROSS JOIN staff AS s
WHERE c.last_name = s.last_name
-- Table list
SELECT c.last_name, s.last_name
FROM customer AS c,
staff AS s
WHERE c.last_name = s.last_name
Specifically the last version is still used in many SQL codebases, which have not yet migrated to the ANSI JOIN syntax (even if ANSI joins should be preferred for readability reasons).
But that might be wrong
Unfortunately, I’m seeing this mistake all the time, as I’ve mentioned before. JOIN might appear like a useful tool to match rows between tables. But remember one thing, and I’m starting to repeat myself:
(INNER) JOIN is just a filtered CROSS JOIN
This means that if you choose INNER JOIN to find those customers for which there are matching staff, you will create a cartesian product between customer and staff, and then apply a filter. Why is that a problem? Let’s assume the following:
Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John | Doe |
| Alice | Miller |
| Max | Doe |
+------------+-----------+
Staff:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John | Doe |
| Alice | Peterson |
| Jane | Doe |
+------------+-----------+
What happens when you run the above queries that use (INNER) JOIN to match customers with staff? Exactly. You’ll form a cartesian product first:
{ (John Doe, John Doe),
(John Doe, Alice Peterson),
(John Doe, Jane Doe),
(Alice Miller, John Doe),
(Alice Miller, Alice Peterson),
(Alice Miller, Jane Doe),
(Max Doe, John Doe),
(Max Doe, Alice Peterson),
(Max Doe, Jane Doe) }
… and then filter out the tuples that shouldn’t be in the result, i.e. the ones that don’t have matching last names (of course, the database might choose to optimise this and not materialise the entire cross product):
{ (John Doe, John Doe),
(John Doe, Jane Doe),
(Max Doe, John Doe),
(Max Doe, Jane Doe) }
We’re now left with 4 tuples. That’s great, if that’s what you were after in the first place. A combination of all customers with all staff, for which the combination shares the same last name. But maybe you were asking yourself something else, namely:
Do we have any customers who are staff family members?
Use-case: Exclude such customers from a raffle (let’s assume that last names are a sufficient criteria here).
In that case, we’ll get “duplicate” records. Because the query that some of you might’ve written would have been:
-- INNER JOIN with USING
SELECT c.*
FROM customer AS c
INNER JOIN staff AS s
USING (last_name)
Yielding:
Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John | Doe |
| John | Doe |
| Max | Doe |
| Max | Doe |
+------------+-----------+
Bummer. How to remove duplicates? With DISTINCT you might think:
-- INNER JOIN with USING
SELECT DISTINCT c.*
FROM customer AS c
INNER JOIN staff AS s
USING (last_name)
Yielding:
Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John | Doe |
| Max | Doe |
+------------+-----------+
What’s wrong with DISTINCT?
Using DISTINCT in this situation is a big mistake. Why?
Your accidental cartesian product loads too many records from disk, and produces too many records in memory, which have to be removed again
DISTINCT can be expensive in some databases, that implement it via sorting, rather than via hashing
By using a SEMI-JOIN. It is called semi join (i.e. “half” join) in relational algebra, because we only care about one side of the JOIN operation in the results, not the other side. In this example, we only care about customers in the result. We don’t want to have any staff records. The relational algebra notation would be
SELECT *
FROM customer AS c
LEFT SEMI JOIN staff AS s
USING (last_name)
The SQL way to express a SEMI JOIN is by using EXISTS () or IN (). The following two are equivalent:
-- Using EXISTS
SELECT *
FROM customer AS c
WHERE EXISTS (
SELECT *
FROM staff AS s
WHERE c.last_name = s.last_name
)
-- Using IN
SELECT *
FROM customer
WHERE last_name IN (
SELECT last_name
FROM staff
)
The above is much easier to write, and will transform into the corresponding (NOT) EXISTS predicate.
Exception
There are some databases that may unfortunately show worse performance for some of these semi join / anti join operators. See, for instance this outdated article on MySQL performance:
http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql
Do measure first, before you believe any of these articles, though!
Another exception is when you have a primary key / foreign key relationship that guarantees that an (INNER) JOIN produces no duplicate values, i.e. when you’re joining a one-to-one or many-to-one relationship, then JOIN is a correct solution, but it is usually equally fast, so semi join will still be more readable.
Conclusion
If you need to check whether you have any matches between a table A and a table B, but you only really care about the results from table A, do make sure you’re using a SEMI-JOIN (i.e. an EXISTS or IN predicate), not an (INNER) JOIN.
Further reading:
SQL syntax makes it a lot more clear what you want.
A SemiJoin is a simple check on a second table, you do not JOIN their data with it, which makes the name quit the oxymoron “A join that does not Join”.
In SQL with “Where exists” it is explicit what is meant – do all the stuff and only give the results where the condition exists/Not exists.
You are entitled to your opinion, of course, but semi join is a standard relational algebra operator. In relational algebra, you can achieve it by inner joining two tables A and B but projecting only columns from A (given that relational algebra operates on sets, unlike SQL, which operates on multisets).
So, thinking of these basic operations in relational algebra does help better express yourself in SQL as well. Yes, it can be implemented with the exists operator. No, they’re not the same thing, formally.
I usually follow another rule which might be helpful. If you do not need columns from a table, then move it from FROM clause to WHERE clause.
That’s a nice way of remembering it.
SQL syntax makes it a lot more clear what you want.
A SemiJoin is a simple check on a second table, you do not JOIN their data with it, which makes the name quit the oxymoron “A join that does not Join”.
In SQL with “Where exists” it is explicit what is meant – do all the stuff and only give the results where the condition exists/Not exists.
You are entitled to your opinion, of course, but semi join is a standard relational algebra operator. In relational algebra, you can achieve it by inner joining two tables A and B but projecting only columns from A (given that relational algebra operates on sets, unlike SQL, which operates on multisets).
So, thinking of these basic operations in relational algebra does help better express yourself in SQL as well. Yes, it can be implemented with the exists operator. No, they’re not the same thing, formally.
Hope this helps