Perhaps the most powerful SQL feature is the
JOIN
operation. It is the envy of all non-relational databases, because the concept is so simple, yet so universally applicable, when you want to “combine” two data sets.
Put simply, when joining two tables, you’re combining every row from one table with every row from another table, for which a certain predicate is true. An illustration from our
SQL Masterclass shows it.

See also our recent article about
using Venn diagrams to illustrate JOINs. The above illustration compares
INNER JOIN
with different
OUTER JOIN
operations, but those are far from all possibilities that we have. Let’s look at things from a more systematic angle.
Do note that whenever in this article I say “X happens before Y”, I mean that “X happens
logically before Y”. The database optimiser may still choose to execute Y before X, because that is faster without changing the outcome.
More information about the syntactic / logical order of operations here.
But let’s look into all the join types, individually!
CROSS JOIN
The most basic
JOIN
operation is really a
cartesian product. It just combines every row from one table with every row from another table. The best example for cartesian products is given by Wikipedia, using a deck of cards for illustration where we’re “cross joining” the ranks table with the suits table:

In real-world scenarios,
CROSS JOIN
can be very useful when doing reports, for instance you could generate a set of dates (e.g. days in a month) and cross join that with all the departments in the database, to create a complete days/departments table. Using PostgreSQL syntax:
SELECT *
-- This just generates all the days in January 2017
FROM generate_series(
'2017-01-01'::TIMESTAMP,
'2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
INTERVAL '1 day'
) AS days(day)
-- Here, we're combining all days with all departments
CROSS JOIN departments
Imagine we have the following data:
+--------+ +------------+
| day | | department |
+--------+ +------------+
| Jan 01 | | Dept 1 |
| Jan 02 | | Dept 2 |
| ... | | Dept 3 |
| Jan 30 | +------------+
| Jan 31 |
+--------+
The result would now look like this:
+--------+------------+
| day | department |
+--------+------------+
| Jan 01 | Dept 1 |
| Jan 01 | Dept 2 |
| Jan 01 | Dept 3 |
| Jan 02 | Dept 1 |
| Jan 02 | Dept 2 |
| Jan 02 | Dept 3 |
| ... | ... |
| Jan 31 | Dept 1 |
| Jan 31 | Dept 2 |
| Jan 31 | Dept 3 |
+--------+------------+
Now, for each days/departments combination, you could calculate the daily revenue for that department, or something similar.
Characteristics
A
CROSS JOIN
is a cartesian product, i.e. a
product as in “multiplication”. The mathematical notation uses the multiplication sign to express this operation:
A × B
, or in our case:
days × departments
.
Just like with “ordinary” arithmetic multiplication, if one of the two tables is empty (size zero), then the result will also be empty (size zero). This makes total sense. If we combine the previous 31 days with 0 departments, we’ll get 0 days/departments combinations. Likewise, if we combine an empty date range with any number of departments, we also get 0 days/departments combinations.
In other words:
size(result) = size(days) * size(departments)
Alternative syntaxes
In the old days, before the ANSI
JOIN
syntax was introduced to SQL, people just wrote comma-separated lists of tables in the
FROM
clause to write a
CROSS JOIN
. The above query is equivalent to this one:
SELECT *
FROM
generate_series(
'2017-01-01'::TIMESTAMP,
'2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
INTERVAL '1 day'
) AS days(day),
departments
In general, I strongly recommend to use the
CROSS JOIN
keyword instead of comma-separated lists of tables, because if you intentionally want to do a
CROSS JOIN
, nothing communicates this intent (to the next developer!) better than using the actual keywords. So many things can go wrong, accidentally, with comma-separated lists of tables, including accidental cross joins. You don’t want those!
INNER JOIN (Theta-JOIN)
Building on top of the previous
CROSS JOIN
operation,
INNER JOIN
(or just simply
JOIN
, sometimes also called
"THETA" JOIN
) allows for filtering the outcome of the cartesian product by some predicate. Most of the times, we’re putting that predicate in an
ON
clause, and it could be something like this:
SELECT *
-- Same as before
FROM generate_series(
'2017-01-01'::TIMESTAMP,
'2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
INTERVAL '1 day'
) AS days(day)
-- Now, exclude all days/departments combinations for
-- days before the department was created
JOIN departments AS d ON day >= d.created_at
In most databases, the
INNER
keyword is optional, so I’m just omitting it in this article.
Notice how the
INNER JOIN
operation allows for putting
arbitrary predicates in the
ON
clause, which is again very useful when doing reporting. Just like in the previous
CROSS JOIN
example, we’re combining all days with all departments, but then we retain only those days/departments combinations for which the department already existed, i.e. for which the department creation preceded the day.
Again, using this data:
+--------+ +------------+------------+
| day | | department | created_at |
+--------+ +------------+------------+
| Jan 01 | | Dept 1 | Jan 10 |
| Jan 02 | | Dept 2 | Jan 11 |
| ... | | Dept 3 | Jan 12 |
| Jan 30 | +------------+------------+
| Jan 31 |
+--------+
The result would now look like this:
+--------+------------+
| day | department |
+--------+------------+
| Jan 10 | Dept 1 |
| Jan 11 | Dept 1 |
| Jan 11 | Dept 2 |
| Jan 12 | Dept 1 |
| Jan 12 | Dept 2 |
| Jan 12 | Dept 3 |
| Jan 13 | Dept 1 |
| Jan 13 | Dept 2 |
| Jan 13 | Dept 3 |
| ... | ... |
| Jan 31 | Dept 1 |
| Jan 31 | Dept 2 |
| Jan 31 | Dept 3 |
+--------+------------+
So, we’d get no results prior to January 10, those rows were filtered out.
Characteristics
INNER JOIN
operations are filtered
CROSS JOIN
operations. This means that if one of the two tables is empty, then the result is also guaranteed to be empty. But unlike with
CROSS JOIN
, because of the predicate, we can always get less results than the
CROSS JOIN
would deliver.
In other words:
size(result) <= size(days) * size(departments)
Alternative syntaxes
While the
ON
clause is mandatory with the
INNER JOIN
operation, you’re not required to place
JOIN
predicates in there (although it is highly recommended from a readability perspective). Most databases will optimise the following, equivalent query in just the same way:
SELECT *
FROM generate_series(
'2017-01-01'::TIMESTAMP,
'2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
INTERVAL '1 day'
) AS days(day)
-- You can always JOIN .. ON true (or 1 = 1 in other DBs)
-- to turn an syntactic INNER JOIN into a semantic CROSS JOIN
JOIN departments AS d ON true
-- ... and then turn the CROSS JOIN back into an INNER JOIN
-- by putting the JOIN predicate in the WHERE clause:
WHERE day >= d.created_at
Of course, again, that is just obfuscating the query for the reader, but you may have your reasons, right? If we take this one step further, the following query is also equivalent, because most optimisers can figure out the equivalence and execute an
INNER JOIN
instead:
SELECT *
FROM generate_series(
'2017-01-01'::TIMESTAMP,
'2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
INTERVAL '1 day'
) AS days(day)
-- Now, this is really a syntactic CROSS JOIN
CROSS JOIN departments AS d
WHERE day >= d.created_at
… and, as seen before,
CROSS JOIN
is just syntax sugar for comma-separated table lists. In this case, we retain the
WHERE
clause to get what people have frequently done prior to the introduction of ANSI
JOIN
syntax:
SELECT *
FROM
generate_series(
'2017-01-01'::TIMESTAMP,
'2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
INTERVAL '1 day'
) AS days(day),
departments AS d
WHERE day >= d.created_at
All of these syntaxes do the same thing, usually without performance penalty, but clearly, they’re all less readable than the original
INNER JOIN
syntax.
EQUI JOIN
Sometimes, e.g. in literature, you will hear the term
EQUI JOIN
where
"EQUI"
isn’t really meant as a SQL
keyword, but just as a specific way of writing a special kind of
INNER JOIN
.
In fact, it is weird that
"EQUI" JOIN
is the special case, because it’s what we’re doing most in SQL, also in OLTP applications where we simply
JOIN
by primary key / foreign key relationship. For instance:
SELECT *
FROM actor AS a
JOIN film_actor AS fa ON a.actor_id = fa.actor_id
JOIN film AS f ON f.film_id = fa.film_id
The above query selects all actors and their films. There are two
INNER JOIN
operations, one connecting actors to the corresponding entries in the film_actor relationship table (because actors can have many films and films can have many actors), and the other connecting each film_actor relationship with additional information about the film itself.
Characteristics
The characteristics of this operation are the same as those of the “general”
INNER JOIN
operation. An
"EQUI" JOIN
is still a cartesian product (
CROSS JOIN
) with a reduced set of results, i.e. containing only those actor/film combinations for which a given actor actually played in the given film.
So again, in other words:
size(result) <= size(actor) * size(film)
The result size can only be equal to the actor size multiplied by the film size, if every actor played in every film, which is unlikely.
Alternative syntaxes: USING
Again, as before, we could write the
INNER JOIN
operation without an actual
INNER JOIN
syntax, but
CROSS JOIN
or comma-separated table lists instead. That’s boring, but much more interesting are the following two alternative syntaxes, one of which is very very useful:
SELECT *
FROM actor
JOIN film_actor USING (actor_id)
JOIN film USING (film_id)
The
USING
clause replaces the
ON
clause and allows for listing a set of columns that have to be present in both sides of the
JOIN
operation. If you carefully design your database in the same way as the
Sakila database, namely where every foreign key column(s) have the same names as their referenced primary key column(s) (e.g.
actor.actor_id = film_actor.actor_id
), then you can use
USING
for
"EQUI" JOIN
in at least these databases:
- Derby
- Firebird
- HSQLDB
- Ingres
- MariaDB
- MySQL
- Oracle
- PostgreSQL
- SQLite
- Vertica
These databases, unfortunately, don’t support the syntax:
- Access
- Cubrid
- DB2
- H2
- HANA
- Informix
- SQL Server
- Sybase ASE
- Sybase SQL Anywhere
While this produces exactly (well almost) the same result as the
ON
clause, it’s much quicker to read and write. I said
almost because some databases (and the SQL standard) specify that any column appearing in the
USING
clause loses its qualification. For instance:
SELECT
f.title, -- Ordinary column, can be qualified
f.film_id, -- USING column, shouldn't be qualified
film_id -- USING column, correct / non-ambiguous here
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
Also, of course, this syntax is a bit limited. Sometimes, you have several foreign keys in your tables, not all of which can have the primary key column name(s). For instance:
CREATE TABLE film (
..
language_id BIGINT REFERENCES language,
original_language_id BIGINT REFERENCES language,
)
If you want to join by
ORIGINAL_LANGUAGE_ID
, you’ll have to resort to the
ON
clause.
Alternative syntaxes: NATURAL JOIN
An more extreme and much less useful form of
"EQUI" JOIN
is the
NATURAL JOIN
clause. The previous example could be further “improved” by replacing
USING
by
NATURAL JOIN
like this:
SELECT *
FROM actor
NATURAL JOIN film_actor
NATURAL JOIN film
Notice how we no longer need to specify any
JOIN
criteria, because a
NATURAL JOIN
will automatically take all the columns that share the same name from both tables that it joins and place them in a “hidden”
USING
clause. As we’ve seen before, as primary keys and foreign keys have the same column name, this appears quite useful.
Truth is: It isn’t useful. In the
Sakila database, every table also has a
LAST_UPDATE
column, which is automatically being taken into consideration by
NATURAL JOIN
. The
NATURAL JOIN
query is thus equivalent to this one:
SELECT *
FROM actor
JOIN film_actor USING (actor_id, last_update)
JOIN film USING (film_id, last_update)
… which of course makes absolutely no sense at all. So, forget about
NATURAL JOIN
immediately (except for some very rare cases, e.g. when joining Oracle’s diagnostics views like
v$sql NATURAL JOIN v$sql_plan
, etc., for ad-hoc analytics)
OUTER JOIN
We’ve seen
INNER JOIN
before, which returns results only for combinations of the left / right table, for which the
ON
predicate yields
true
.
OUTER JOIN
allows us to
retain rows on the left / rigth side, for which we didn’t find a matching combination. Let’s go back to the days and departments example:
SELECT *
FROM generate_series(
'2017-01-01'::TIMESTAMP,
'2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
INTERVAL '1 day'
) AS days(day)
LEFT JOIN departments AS d ON day >= d.created_at
Again, the
OUTER
keyword is optional, so I’m omitting it in the examples.
This query is very subtly different from its
INNER JOIN
counterpart in that it will always return at least one row per day, even if for a given day, there is no department that has already been created at that day. For instance: All departments were created on January 10. The above query will still return January 1-9:
+--------+ +------------+------------+
| day | | department | created_at |
+--------+ +------------+------------+
| Jan 01 | | Dept 1 | Jan 10 |
| Jan 02 | | Dept 2 | Jan 11 |
| ... | | Dept 3 | Jan 12 |
| Jan 30 | +------------+------------+
| Jan 31 |
+--------+
In addition to the rows we got before in the
INNER JOIN
example, we now also have all the days from January 1-9 with
NULL
departments:
+--------+------------+
| day | department |
+--------+------------+
| Jan 01 | | -- Extra rows with no match here
| Jan 02 | | -- Extra rows with no match here
| ... | | -- Extra rows with no match here
| Jan 09 | | -- Extra rows with no match here
| Jan 10 | Dept 1 |
| Jan 11 | Dept 1 |
| Jan 11 | Dept 2 |
| Jan 12 | Dept 1 |
| Jan 12 | Dept 2 |
| Jan 12 | Dept 3 |
| Jan 13 | Dept 1 |
| Jan 13 | Dept 2 |
| Jan 13 | Dept 3 |
| ... | ... |
| Jan 31 | Dept 1 |
| Jan 31 | Dept 2 |
| Jan 31 | Dept 3 |
+--------+------------+
In other words, each day appears at least once in the result.
LEFT JOIN
does this for the
left table, i.e. it retains all rows from the
left table in the result.
Formally, a
LEFT OUTER JOIN
is an
INNER JOIN
with a
UNION
like this:
-- Convenient syntax:
SELECT *
FROM a LEFT JOIN b ON <predicate>
-- Cumbersome, equivalent syntax:
SELECT a.*, b.*
FROM a JOIN b ON <predicate>
UNION ALL
SELECT a.*, NULL, NULL, ..., NULL
FROM a
WHERE NOT EXISTS (
SELECT * FROM b WHERE <predicate>
)
We’ll talk about
NOT EXISTS
further down, when talking about
"SEMI" JOIN
RIGHT OUTER JOIN
RIGHT OUTER JOIN
does exactly the opposite. It retains all the rows from the
right table in the result. Let’s add more departments
+--------+ +------------+------------+
| day | | department | created_at |
+--------+ +------------+------------+
| Jan 01 | | Dept 1 | Jan 10 |
| Jan 02 | | Dept 2 | Jan 11 |
| ... | | Dept 3 | Jan 12 |
| Jan 30 | | Dept 4 | Apr 01 |
| Jan 31 | | Dept 5 | Apr 02 |
+--------+ +------------+------------+
The new departments 4 and 5 will not be in the previous results at all, because they were created on a day after January 31. But it will appear in a
RIGHT JOIN
result, because departments is the
right table of the join operation, and all the rows from the
right table will be retained.
Running this query:
SELECT *
FROM generate_series(
'2017-01-01'::TIMESTAMP,
'2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
INTERVAL '1 day'
) AS days(day)
RIGHT JOIN departments AS d ON day >= d.created_at
Will yield something like this:
+--------+------------+
| day | department |
+--------+------------+
| Jan 10 | Dept 1 |
| Jan 11 | Dept 1 |
| Jan 11 | Dept 2 |
| Jan 12 | Dept 1 |
| Jan 12 | Dept 2 |
| Jan 12 | Dept 3 |
| Jan 13 | Dept 1 |
| Jan 13 | Dept 2 |
| Jan 13 | Dept 3 |
| ... | ... |
| Jan 31 | Dept 1 |
| Jan 31 | Dept 2 |
| Jan 31 | Dept 3 |
| | Dept 4 | -- Extra rows with no match here
| | Dept 5 | -- Extra rows with no match here
+--------+------------+
In most cases (I have to yet run into a situation where this isn’t the case), Each
LEFT OUTER JOIN
expression can be transformed into an equivalent
RIGHT OUTER JOIN
expression, and vice-versa. Because
RIGHT OUTER JOIN
is usually a bit less readable, most people use
LEFT OUTER JOIN
only.
FULL OUTER JOIN
Finally, there is also
FULL OUTER JOIN
, which retains all rows from both sides of the
JOIN
operation. In our example, this means that each day appears at least once in the result, just like each department appears at least once in the result.
Let’s take again this data:
+--------+ +------------+------------+
| day | | department | created_at |
+--------+ +------------+------------+
| Jan 01 | | Dept 1 | Jan 10 |
| Jan 02 | | Dept 2 | Jan 11 |
| ... | | Dept 3 | Jan 12 |
| Jan 30 | | Dept 4 | Apr 01 |
| Jan 31 | | Dept 5 | Apr 02 |
+--------+ +------------+------------+
And now, let’s run this query:
SELECT *
FROM generate_series(
'2017-01-01'::TIMESTAMP,
'2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
INTERVAL '1 day'
) AS days(day)
FULL JOIN departments AS d ON day >= d.created_at
The result will now look something like this:
+--------+------------+
| day | department |
+--------+------------+
| Jan 01 | | -- row from the left table
| Jan 02 | | -- row from the left table
| ... | | -- row from the left table
| Jan 09 | | -- row from the left table
| Jan 10 | Dept 1 |
| Jan 11 | Dept 1 |
| Jan 11 | Dept 2 |
| Jan 12 | Dept 1 |
| Jan 12 | Dept 2 |
| Jan 12 | Dept 3 |
| Jan 13 | Dept 1 |
| Jan 13 | Dept 2 |
| Jan 13 | Dept 3 |
| ... | ... |
| Jan 31 | Dept 1 |
| Jan 31 | Dept 2 |
| Jan 31 | Dept 3 |
| | Dept 4 | -- row from the right table
| | Dept 5 | -- row from the right table
+--------+------------+
If you insist, formally, a
LEFT OUTER JOIN
is an
INNER JOIN
with a
UNION
like this:
-- Convenient syntax:
SELECT *
FROM a LEFT JOIN b ON <predicate>
-- Cumbersome, equivalent syntax:
SELECT a.*, b.*
FROM a JOIN b ON <predicate>
-- LEFT JOIN part
UNION ALL
SELECT a.*, NULL, NULL, ..., NULL
FROM a
WHERE NOT EXISTS (
SELECT * FROM b WHERE <predicate>
)
-- RIGHT JOIN part
UNION ALL
SELECT NULL, NULL, ..., NULL, b.*
FROM b
WHERE NOT EXISTS (
SELECT * FROM a WHERE <predicate>
)
Alternative syntaxes: “EQUI” OUTER JOIN
The above examples again used the “cartesian product with filter” kind of
JOIN
. Much more common, however, is the
"EQUI" OUTER JOIN
approach, where we join on a primary key / foreign key relationship. Let’s go back to the
Sakila database example. Some actors haven’t played in any film, and we might want to query them like this:
SELECT *
FROM actor
LEFT JOIN film_actor USING (actor_id)
LEFT JOIN film USING (film_id)
This query will return all actors at least once, regardless if they played in a film. If we also want all films that do not have any actors (yet), we could use a
FULL OUTER JOIN
to combine the results:
SELECT *
FROM actor
FULL JOIN film_actor USING (actor_id)
FULL JOIN film USING (film_id)
And of course, this also works with
NATURAL LEFT JOIN
,
NATURAL RIGHT JOIN
,
NATURAL FULL JOIN
, but again, these aren’t useful at all, as we’d be joining again
USING (..., LAST_UPDATE)
, which makes no sense at all.
Alternative syntaxes: Oracle and SQL Server style OUTER JOIN
These two databases had
OUTER JOIN
before the ANSI syntax was established. It looked like this:
-- Oracle
SELECT *
FROM actor a, film_actor fa, film f
WHERE a.actor_id = fa.actor_id(+)
AND fa.film_id = f.film_id(+)
-- SQL Server
SELECT *
FROM actor a, film_actor fa, film f
WHERE a.actor_id *= fa.actor_id
AND fa.film_id *= f.film_id
That’s nice, given that at some point in time (in the 80s??), ANSI didn’t specify
OUTER JOIN
yet. But the 80s are more than 30 years ago. So, it’s safe to say this stuff is obsolete.
SQL Server did the right thing and deprecated (and later removed) that syntax a long time ago. Oracle still supports it for backwards compatibility reasons.
But nothing about this syntax is reasonable or readable. Don’t use it. Replace it with ANSI JOIN.
PARTITIONED OUTER JOIN
This is Oracle specific, but I must say, it’s a real shame none of the other databases have stolen the feature yet. Remember the
CROSS JOIN
operation that we used to combine each day with each department? Because, sometimes, that’s what we want in the result: All combinations, and
if there is a match put also the matching values in the row.
That’s hard to describe in words, much more easy with an example. Here’s the query with Oracle syntax:
WITH
-- Using CONNECT BY to generate all dates in January
days(day) AS (
SELECT DATE '2017-01-01' + LEVEL - 1
FROM dual
CONNECT BY LEVEL <= 31
),
-- Our departments
departments(department, created_at) AS (
SELECT 'Dept 1', DATE '2017-01-10' FROM dual UNION ALL
SELECT 'Dept 2', DATE '2017-01-11' FROM dual UNION ALL
SELECT 'Dept 3', DATE '2017-01-12' FROM dual UNION ALL
SELECT 'Dept 4', DATE '2017-04-01' FROM dual UNION ALL
SELECT 'Dept 5', DATE '2017-04-02' FROM dual
)
SELECT *
FROM days
LEFT JOIN departments
PARTITION BY (department) -- This is where the magic happens
ON day >= created_at
Unfortunately,
PARTITION BY
is used in various contexts with different meanings (
e.g. for window functions). In this case, it means that we “partition” our data by the
departments.department
column, creating a “partition” for each department. Now, each partition will get a copy for each day, regardless if there’s a match in our predicate (unlike in the ordinary
LEFT JOIN
case, where we had a bunch of “department-less” days). The result of the above query is now this:
+--------+------------+------------+
| day | department | created_at |
+--------+------------+------------+
| Jan 01 | Dept 1 | | -- Didn't match, but still get row
| Jan 02 | Dept 1 | | -- Didn't match, but still get row
| ... | Dept 1 | | -- Didn't match, but still get row
| Jan 09 | Dept 1 | | -- Didn't match, but still get row
| Jan 10 | Dept 1 | Jan 10 | -- Matches, so get join result
| Jan 11 | Dept 1 | Jan 10 | -- Matches, so get join result
| Jan 12 | Dept 1 | Jan 10 | -- Matches, so get join result
| ... | Dept 1 | Jan 10 | -- Matches, so get join result
| Jan 31 | Dept 1 | Jan 10 | -- Matches, so get join result
| Jan 01 | Dept 2 | | -- Didn't match, but still get row
| Jan 02 | Dept 2 | | -- Didn't match, but still get row
| ... | Dept 2 | | -- Didn't match, but still get row
| Jan 09 | Dept 2 | | -- Didn't match, but still get row
| Jan 10 | Dept 2 | | -- Didn't match, but still get row
| Jan 11 | Dept 2 | Jan 11 | -- Matches, so get join result
| Jan 12 | Dept 2 | Jan 11 | -- Matches, so get join result
| ... | Dept 2 | Jan 11 | -- Matches, so get join result
| Jan 31 | Dept 2 | Jan 11 | -- Matches, so get join result
| Jan 01 | Dept 3 | | -- Didn't match, but still get row
| Jan 02 | Dept 3 | | -- Didn't match, but still get row
| ... | Dept 3 | | -- Didn't match, but still get row
| Jan 09 | Dept 3 | | -- Didn't match, but still get row
| Jan 10 | Dept 3 | | -- Didn't match, but still get row
| Jan 11 | Dept 3 | | -- Didn't match, but still get row
| Jan 12 | Dept 3 | Jan 12 | -- Matches, so get join result
| ... | Dept 3 | Jan 12 | -- Matches, so get join result
| Jan 31 | Dept 3 | Jan 12 | -- Matches, so get join result
| Jan 01 | Dept 4 | | -- Didn't match, but still get row
| Jan 02 | Dept 4 | | -- Didn't match, but still get row
| ... | Dept 4 | | -- Didn't match, but still get row
| Jan 31 | Dept 4 | | -- Didn't match, but still get row
| Jan 01 | Dept 5 | | -- Didn't match, but still get row
| Jan 02 | Dept 5 | | -- Didn't match, but still get row
| ... | Dept 5 | | -- Didn't match, but still get row
| Jan 31 | Dept 5 | | -- Didn't match, but still get row
+--------+------------+
As you can see, I’ve visually created 5 partitions for the 5 departments. Each partition combines the department with each day, but unlike when doing a
CROSS JOIN
, we’re now getting actual
LEFT JOIN .. ON ..
results in case there is a match for the predicate. This is a really nice feature for reporting in Oracle!
SEMI JOIN
In
relational algebra, there is a notion of a
semi join operation, which unfortunately doesn’t have a syntax representation in SQL. If it did, the syntax would probably be
LEFT SEMI JOIN
and RIGHT SEMI JOIN
, just like the
Cloudera Impala syntax extension offers.
What is a
"SEMI" JOIN
?
When writing something like the fictional query below:
SELECT *
FROM actor
LEFT SEMI JOIN film_actor USING (actor_id)
What we really mean is we want all actors that played in films. But we don’t want any films in the results, just the actors. More specifically, we don’t want each actor several times, once per film. We want each actor only once (or zero times) in the result.
Semi is latin for “half”, i.e. we implement only “half the join”, in this case, the left half.
In SQL, there are two alternative syntaxes that can emulate
"SEMI" JOIN
Alternative syntax: EXISTS
This is the more powerful and a bit more verbose syntax
SELECT *
FROM actor a
WHERE EXISTS (
SELECT * FROM film_actor fa
WHERE a.actor_id = fa.actor_id
)
We’re looking for all the actors for which there
exists a film, i.e. actors that played in a film. With this syntax (i.e.
"SEMI" JOIN
placed in the
WHERE
clause), it’s immediately clear that we can get each actor at most once in the result. There’s no actual
JOIN
in the syntax.
Nonetheless, most databases will be able to recognise that what’s going on here is really a
"SEMI" JOIN
, and not just an ordinary
EXISTS()
predicate. For instance, consider the Oracle execution plan for the above query:

Note how Oracle calls the operation “HASH JOIN (SEMI)” – the SEMI keyword is present here. The same is true for PostgreSQL:

Or SQL Server:

Apart from being the optimal solution in terms of correctness,
there are also some performance benefits when using "SEMI" JOIN
rather than
INNER JOIN
, as the database can stop looking for matches as soon as it found the first!
Alternative syntax: IN
IN
and
EXISTS
are exactly equivalent
"SEMI" JOIN
emulations. The following query will produce the same plan in most databases (not MySQL), as the previous
EXISTS
query:
SELECT *
FROM actor
WHERE actor_id IN (
SELECT actor_id FROM film_actor
)
If your database supports both syntaxes for the
"SEMI" JOIN
operation, you may choose whichever you prefer from a stylistic point of view.
This is different with …
ANTI JOIN
In principle,
"ANTI" JOIN
is just the opposite of
"SEMI" JOIN
. When writing something like the fictional query below:
SELECT *
FROM actor
LEFT ANTI JOIN film_actor USING (actor_id)
… what we’re doing is we take all the actors that
didn’t play in any film. Unfortunately, again,
SQL doesn’t have a built-in syntax for this operation, but we can emulate it with
EXISTS
:
Alternative syntax: NOT EXISTS
The following query has exactly the expected semantics:
SELECT *
FROM actor a
WHERE NOT EXISTS (
SELECT * FROM film_actor fa
WHERE a.actor_id = fa.actor_id
)
(Dangerous) alternative syntax: NOT IN
Watch out! While
EXISTS
and
IN
are equivalent,
NOT EXISTS
and
NOT IN
are
not equivalent. This is because of
NULL
values.
In this particular case, the following
NOT IN
query is equivalent to the previous
NOT EXISTS
query, because our
film_actor
table has a
NOT NULL
constraint on
film_actor.actor_id
SELECT *
FROM actor
WHERE actor_id NOT IN (
SELECT actor_id FROM film_actor
)
If, however,
actor_id
became nullable, then the query would be wrong. Don’t believe it? Try running this:
SELECT *
FROM actor
WHERE actor_id NOT IN (1, 2, NULL)
It will not return any records. Why?
Because NULL
is the UNKNOWN
value in SQL. So, the predicate above is the same as this:
SELECT *
FROM actor
WHERE actor_id NOT IN (1, 2, UNKNOWN)
And because we cannot be sure whether
actor_id
is in a set of values of which one value is
UNKNOWN
(is it 4? Or 5? Or -1?), the whole predicate becomes
UNKNOWN
SELECT *
FROM actor
WHERE UNKNOWN
Here’s a nice article about three-valued logic by Joe Celko, if you want to learn more.
So, I cannot say this enough:
Do not use NOT IN
predicates in SQL, ever, unless you put constant, non-null values there.
— Lukas Eder. Just now.
Don’t even gamble on the presence of a
NOT NULL
constraint. Perhaps, some DBA might temporarily turn off the constraint to load some data and your query will be wrong for the time being. Just use
NOT EXISTS
. Or, in some cases…
(Dangerous) alternative syntax: LEFT JOIN / IS NULL
Strangely enough, some people prefer the following syntax:
SELECT *
FROM actor a
LEFT JOIN film_actor fa
USING (actor_id)
WHERE film_id IS NULL
It’s correct as we’re:
- Joining films to their actors
- Retaining all actors without films (
LEFT JOIN
)
- Retaining only actors without films (
film_id IS NULL
)
Now, I personally don’t like this syntax, as it doesn’t communicate the intent of an
"ANTI" JOIN
at all. And chances are, it is slower, because your optimiser doesn’t recognise this as an
"ANTI" JOIN
operation (or in fact, it cannot formally prove that it
probably is). So, again, use
NOT EXISTS
instead.
An interesting (but a bit outdated) blog post comparing the three syntaxes is this one here:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
LATERAL JOIN
LATERAL
is a relatively new keyword in the SQL standard, and it is supported in PostgreSQL and Oracle. SQL Server folks have had a vendor-specific alternative syntax using the
APPLY
keyword forever (which I personally much prefer). Let’s look at an example, using the PostgreSQL / Oracle
LATERAL
keyword:
WITH
departments(department, created_at) AS (
VALUES ('Dept 1', DATE '2017-01-10'),
('Dept 2', DATE '2017-01-11'),
('Dept 3', DATE '2017-01-12'),
('Dept 4', DATE '2017-04-01'),
('Dept 5', DATE '2017-04-02')
)
SELECT *
FROM departments AS d
CROSS JOIN LATERAL generate_series(
d.created_at, -- We can dereference a column from department!
'2017-01-31'::TIMESTAMP,
INTERVAL '1 day'
) AS days(day)
Indeed, instead of a
CROSS JOIN
between
all departments and
all days, why not just generate the necessary days for each department directly? That’s what
LATERAL
does. It is a prefix to the right-hand side of
any JOIN
operation (including
INNER JOIN
,
LEFT OUTER JOIN
, etc.) that allows the right-hand side to access columns from the left hand side.
This of course has nothing to do with relational algebra anymore, because it imposes a
JOIN
order (from left to right). But sometimes, that’s OK and sometimes, your table-valued function (or subquery) is so complex, that’s the only way you can actually use it.
Another very popular use-case is to join a “TOP-N” query to a normal table – e.g. if you want to find each actor, and their TOP 5 best selling films:
SELECT a.first_name, a.last_name, f.*
FROM actor AS a
LEFT OUTER JOIN LATERAL (
SELECT f.title, SUM(amount) AS revenue
FROM film AS f
JOIN film_actor AS fa USING (film_id)
JOIN inventory AS i USING (film_id)
JOIN rental AS r USING (inventory_id)
JOIN payment AS p USING (rental_id)
WHERE fa.actor_id = a.actor_id
GROUP BY f.film_id
ORDER BY revenue DESC
LIMIT 5
) AS f
ON true
The result being something like:

Don’t worry about the long list of joins in the derived table, that’s just how we have to get from the
FILM
table to the
PAYMENT
table in the Sakila database:

.
Essentially, the subquery is calculating thsoe TOP 5 best selling films per actor. So, it isn’t a “classic” derived table, but a correlated subquery that returns more than one row and one column. We’re all used to writing correlated subqueries like these:
SELECT
a.first_name,
a.last_name,
(SELECT count(*)
FROM film_actor AS fa
WHERE fa.actor_id = a.actor_id) AS films
FROM actor AS a
The result of the above correlated subquery is one row and one column. If you ever need to return more than one row and/or more than one column from a correlated subquery,
LATERAL
or
APPLY
will be your friend.
Note also how I’ve used
LEFT OUTER JOIN
together with
LATERAL
, and then I needed to put a dummy
ON true
clause for syntax correctness. Using
LATERAL
with
OUTER JOIN
will again always retain the
left side of the
JOIN
, i.e. we’ll retain actors that didn’t play in any film.
Characteristics:
The
LATERAL
keyword doesn’t really change the semantics of the
JOIN
type that it is applied to. If you’re running a
CROSS JOIN LATERAL
, the result size is still
size(result) = size(left table) * size(right table)
Even if the right table is produced on a per-row-of-the-left-table basis.
You can use
LATERAL
also with
OUTER JOIN
, in case of which the left rows will be retained even if your table function doesn’t return any rows on the right side.
Alternative syntax: APPLY
SQL Server didn’t choose the confusing
LATERAL
keyword, they introduced the
APPLY
keyword (more specifically:
CROSS APPLY
and
OUTER APPLY
) a long time ago, which makes more sense, because we’re applying a function to each row of a table. Let’s assume for a moment that we have a
generate_series()
function in SQL Server:
-- Use with care, this is quite inefficient!
CREATE FUNCTION generate_series(@d1 DATE, @d2 DATE)
RETURNS TABLE AS
RETURN
WITH t(d) AS (
SELECT @d1
UNION ALL
SELECT DATEADD(day, 1, d)
FROM t
WHERE d < @d2
)
SELECT * FROM t;
Then, we could use
CROSS APPLY
to call that function for each department:
WITH
departments AS (
SELECT * FROM (
VALUES ('Dept 1', CAST('2017-01-10' AS DATE)),
('Dept 2', CAST('2017-01-11' AS DATE)),
('Dept 3', CAST('2017-01-12' AS DATE)),
('Dept 4', CAST('2017-04-01' AS DATE)),
('Dept 5', CAST('2017-04-02' AS DATE))
) d(department, created_at)
)
SELECT *
FROM departments AS d
CROSS APPLY dbo.generate_series(
d.created_at, -- We can dereference a column from department!
CAST('2017-01-31' AS DATE)
)
What’s so nice about this syntax is that – again – we’re applying a
function to each row of a table, and that function produces rows. Does it ring a bell? In Java 8, we would use
Stream.flatMap()
for that! Consider the following stream usage:
departments.stream()
.flatMap(department -> generateSeries(
department.createdAt,
LocalDate.parse("2017-01-31"))
.map(day -> tuple(department, day))
);
What’s going on here?
- The
DEPARTMENTS
table is just a Java departments
stream
- We
flatMap
the departments
stream using a function that produces tuples for each department
- Those tuples include the department itself, and a day generated from a series of days starting with the department’s
createdAt
day
Same story! SQL
CROSS APPLY
/
CROSS JOIN LATERAL
is the same thing as Java’s
Stream.flatMap()
.
In fact, SQL and streams aren’t too different anyway. For more info, read this blog post here.
Note: Just like we could write
LEFT OUTER JOIN LATERAL
, we can also write
OUTER APPLY
in case we want to retain the
left side of the
JOIN
expression.
MULTISET
Few databases implement this (in fact, only Oracle), but if you think about it, it’s an awesome
JOIN
type. One that creates nested collections.
If all databases implemented it, we wouldn’t need ORMs!
A hypothetical example (that uses SQL standard syntax, not Oracle’s) looks like this:
SELECT a.*, MULTISET (
SELECT f.*
FROM film AS f
JOIN film_actor AS fa USING (film_id)
WHERE a.actor_id = fa.actor_id
) AS films
FROM actor
The
MULTISET
operator takes a correlated subquery argument, and aggregates all its resulting rows in a nested collection. This works in a similar fashion as a
LEFT OUTER JOIN
(we’re getting all actors, and if they have films, we’re getting all their films too), but instead of duplicating all the actors in the result set, we’re collecting them into a nested collection.
Just like what we would be doing in an ORM, when fetching stuff into this structure:
@Entity
class Actor {
@ManyToMany
List<Film> films;
}
@Entity
class Film {
}
Ignore the incompleteness of used JPA annotations, I just want to show the strength of nested collections. Unlike as in ORMs, the SQL
MULTISET
operator allows for collecting
arbitrary results from correlated subqueries into nested collections – not just actual entities. This is million times more powerful than what any ORM could ever dream of.
Alternative syntaxes: Oracle
As I said, Oracle actually supports
MULTISET
, but you cannot create ad-hoc nested collections. For some reason, Oracle chose to implement nominal typing for these nested collections, rather than the usual SQL-style structural typing. So you have to declare your types in advance:
CREATE TYPE film_t AS OBJECT ( ... );
CREATE TYPE film_tt AS TABLE OF FILM;
SELECT
a.*,
CAST (
MULTISET (
SELECT f.*
FROM film AS f
JOIN film_actor AS fa USING (film_id)
WHERE a.actor_id = fa.actor_id
) AS film_tt
) AS films
FROM actor
A bit more verbose, but still does the trick! Excellent!
Alternative syntaxes: PostgreSQL
For once, the awesome PostgreSQL is missing out on an excellent SQL standard feature, but there’s a workaround: arrays! And this time, we can use structural types, yay! So the following query will return a nested array of rows in PostgreSQL:
SELECT
a AS actor,
array_agg(
f
) AS films
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
GROUP BY a
The result is everyone’s ORDBMS dream! Nested records and collections everywhere (in only two columns):
actor films
-------------------- ----------------------------------------------------
(1,PENELOPE,GUINESS) {(1,ACADEMY DINOSAUR),(23,ANACONDA CONFESSIONS),...}
(2,NICK,WAHLBERG) {(3,ADAPTATION HOLES),(31,APACHE DIVINE),...}
(3,ED,CHASE) {(17,ALONE TRIP),(40,ARMY FLINTSTONES),...}
Tell me that you don’t find this exciting and I don’t know what is.
Conclusion
This was A Probably Incomplete, Comprehensive Guide to the Many Different Ways to JOIN Tables in SQL. I hope you’ve found 1-2 new tricks in this article.
JOIN
is only one of many very interesting SQL operations. If you’ve found this information useful, you will also like:
Like this:
Like Loading...
Very good article. I wish I could participate to your Master Class too.
There will be public performances again, soon. Plus, I’ll have an online version too, soon, hopefully! :) (still undecided whether I should write the book(s) first in 2017, though)
The online version is indeed very convenient. I really need to do the same, but I’ll probably start than in the 2nd half of this year.
I surely agree that NATURAL JOIN is useless. Oftentimes, the USING clause is useless as well, as a typical query looks like
SELECT book b JOIN author a ON b.author_id = a.id
It would work, when the id in table author was called author_id, but I’ve never seen anyone doing that.
Moreover, when joining multiple tables, USING may break because of a column present in multiple tables, right?
What do you think about the “HQL JOIN”, which I’d write using a fantasy syntax like
SELECT book b JOIN (b.author_id) a
It’s bit hacky as it assumes that there’s exactly one foreign key on the column book.author_id. AFAICT this assumption holds pretty often (maybe 90% of my queries could use it).
Yes, that’s also mentioned in the article. I’ve seen such schemas in the past. This is mostly the case when people follow the relational model more thoroughly, in case of which this is really common practice.
Yes it may break if multiple occurrences of the same name have a different semantics. If your schema is normalised, this is often not the case, though.
This exists in jOOQ, you can “JOIN .. ON KEY” and if there is key ambiguity, specify the key or the column list. Some databases also support something like KEY JOIN, in particular Sybase SQL Anywhere.
So, you see, I don’t think it’s hacky at all
My schema was normalized, but I had something like
… a JOIN b USING (owner) …
and then I joined more tables using other fields and one of them had an owner, too. The semantics was right, but the owner of the last entity was unrelated. Actually, I can’t recall what happened. I’d consider reporting an error due to ambiguity to be OK, but simply requiring a.owner = b.owner = z.owner to be dangerous.
I wonder why I called it hacky. Maybe because of it depending on the FK, which is in a sense not part of either table. OTOH it’s a semantically important thing and such a join is very useful. Nice that you support it!
Excellent article! I believe DB2 also supports LATERAL. Regarding the NOT EXISTS/NOT IN (in-)congruency, you’re likely familiar with Dan Tow’s comments on this in Chapter 7 of his book SQL Tuning, but perhaps other visitors might find them helpful (searchable here: https://library.oreilly.com/book/9780596005733/sql-tuning/toc).
Thanks for the hint. No, I don’t know his comments, and chapter 7 is quite large to look for them… ;)
I can not start this design in Oracle. Can you show me how to remake the code?
———-
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 – 64bit Production
Thanks for your message. The article uses PostgreSQL syntax. In Oracle, use:
It would be great if you could show us how to implement the multiset alternative postgres syntax with JOOQ.
I might be able to. What particular query do you have in mind? By the way, this might be a really good question for Stack Overflow. That way, more people would profit!
Regarding the first lateral join example, doesn’t it return the same result set with or without the LATERAL keyword in the CROSS JOIN?
It might, for backwards compatibility reasons, because the standard SQL LATERAL keyword was introduced only after the functionality was supported for some cases.