Faster SQL Through Occasionally Choosing Natural Keys Over Surrogate Keys


There are many, many opinions out there regarding the old surrogate key vs. natural key debate. Most of the time, surrogate keys (e.g. sequence generated IDs) win because they’re much easier to design:

  • They’re easy to keep consistent across a schema (e.g. every table has an ID column, and that’s always the primary key)
  • They’re thus a no-brainer to add. When you create a new table, you don’t need to worry about any candidate keys
  • They’re guaranteed to be unique, because they have absolutely no business value, only a technical value

Great. So why bother using natural keys in the first place?

Well, there is a very compelling reason!

Performance!

Whenever you introduce surrogate keys, this means that your key data becomes completely meaningless. From a design perspective, that’s not too bad. You can easily join that other table to get the interesting, meaningful information that hides behind the surrogate foreign key value. For example, in our Sakila database

… we have a typical many-to-many relationship modelled with a relationship table between the FILM table and the CATEGORY table – state-of-the-art normalisation. But check out this interesting thing:

  • The FILM_CATEGORY relationship table doesn’t contain any interesting information at all. Just the relationships
  • The category table only contains a single useful column: The NAME column
  • The remaining columns (CATEGORY_ID and LAST_UPDATE) have absolutely no meaning

With this in mind, we could design a much simpler schema, where we use the category name as a natural key, and in fact, we don’t even need the CATEGORY table anymore, we can now remove it (that’s optional here; to ensure data correctness, we could keep it around, containing only the NAME column as a primary key). Check this out:

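In the absence of the original schema diagram, here is a minimal sketch of what such a FILM_CATEGORY_NATURAL table could look like. The column types and the optional slimmed-down category table are assumptions, not the original Sakila DDL:

-- A sketch only: types are assumed, not taken from the original Sakila DDL
CREATE TABLE film_category_natural (
  film_id BIGINT NOT NULL REFERENCES film,
  name    VARCHAR(25) NOT NULL,
  PRIMARY KEY (film_id, name)
);

-- Optionally, keep a slimmed-down category table for data correctness:
-- CREATE TABLE category (name VARCHAR(25) NOT NULL PRIMARY KEY);
-- ... and reference it from film_category_natural (name)
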
Now, if we run a query like the following one against our entire Sakila schema:

SELECT c.name, count(*)
FROM film_actor fa
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name
ORDER BY count(*) DESC

The query finds all categories a given actor played in, and the number of films that the given actor played in, per category. For instance, this could be the result:

NAME       COUNT(*)
-------------------
Horror     3
Classics   2
Family     2
New        2
Games      2
Animation  1
Sports     1
Children   1
...

With an alternative schema where the category NAME has been moved to a new FILM_CATEGORY_NATURAL table, we could run this much simpler query:

SELECT fc.name, count(*)
FROM film_actor fa 
JOIN film_category_natural fc 
  USING (film_id)
WHERE actor_id = 1
GROUP BY fc.name
ORDER BY count(*) DESC

Notice how we can omit an entire JOIN.

The execution plans (here on Oracle) are quite different. Check this out:

Before:

After:

Unfortunately, the cost difference (8 vs 5) cannot be taken as a tool to compare actual costs between the two queries/plans. But the plans are otherwise very similar, except that we’re simply missing one table access (CATEGORY and an entire JOIN). That’s a significant improvement for something this simple. Imagine the improvement if we could roll out this kind of better query throughout the system?

We could look into more execution plan measurements (especially from the actual plan results), but what if we simply benchmark the two queries using the same silly benchmark, as always, repeating each statement 100 times:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT c.name, count(*)
      FROM film_actor fa
      JOIN film_category fc USING (film_id)
      JOIN category c USING (category_id)
      WHERE actor_id = 1
      GROUP BY c.name
      ORDER BY count(*) DESC
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT fc.name, count(*)
      FROM film_actor fa 
      JOIN film_category_natural fc 
        USING (film_id)
      WHERE actor_id = 1
      GROUP BY fc.name
      ORDER BY count(*) DESC
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
END;
/

The results are drastic:

Statement 1 : 00:00:00.122070000
Statement 2 : 00:00:00.051179000

A factor of ~2.4x faster!

(Disclaimer: Benchmarks are an inaccurate way of measuring things because they suffer (or benefit) from “unfair” side-effects like heavy caching, but they’re a helpful tool to assess the order of magnitude of a difference between two queries).

The JOIN is completely unnecessary

The only reason why we joined the CATEGORY table each time is because we needed to display the meaningful business value of a category to the user, the CATEGORY.NAME. We could have avoided the JOIN if that was not a requirement, but displaying surrogate keys to the user (the CATEGORY_ID) would be rather harsh, wouldn’t it?

And we’re doing this all the time with all sorts of tables. We have:

  • Category tables (category name is a good candidate key)
  • Translation tables (label is a good candidate key)
  • Country tables (ISO 3166 country codes are a good candidate key)
  • Language tables (ISO 639 language codes are a good candidate key)
  • Currency tables (ISO 4217 currency codes are a good candidate key)
  • Stock symbol tables (ISIN security codes are a good candidate key)
  • … and many more

Working with natural keys can be quite cumbersome. But for some entities, the internationally standardised codes are really good candidate keys, and most of the time, they’re sufficient. What does the LANGUAGE_ID 47 even mean? It means nothing. An experienced DBA will remember, after a while, that it means “English”. But wouldn’t EN be a much better value?

You would have EN as a primary key AND foreign key value, so chances are, because everyone (including frontend developers who probably hard-code some translations anyway) knows what language EN is (but no one knows what 47 means), you will almost never need to join the language table again – except in those rare cases where you want to work with secondary language columns, such as, for instance, DESCRIPTION.
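
As a quick sketch of the kind of join that simply disappears (the LANGUAGE_CODE column is hypothetical here; Sakila’s FILM table actually uses LANGUAGE_ID):

-- Surrogate key: an extra join just to filter by a business value
SELECT f.title
FROM film f
JOIN language l ON f.language_id = l.language_id
WHERE l.name = 'English';

-- Natural key (hypothetical language_code column): no join needed at all
SELECT f.title
FROM film f
WHERE f.language_code = 'EN';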

Now, imagine what happens if we search for English entries, or as in our previous example, for films of category “Classics”? Our entire JOIN graph would be simplified.

(Another place where we don’t need additional surrogate keys is the relationship table. In this particular case, there’s no such key anyway)

Caveats

Our category strings are quite short. If natural keys become longer, then the duplication itself can become a problem at a lower storage level, as you might need more pages and blocks to store the same number of rows.

Please, do take this advice in this article with a grain of salt. The essence here is to not always follow strict rules that were established only as a good default. There are always tradeoffs!

Conclusion

This should be quite a straightforward refactoring for many applications. If your tables are extremely obvious picks for a natural key (like the above), then do use natural keys. Your queries will immediately be faster – not necessarily much faster, but probably you can speed up a significant number of queries, i.e. take load off your entire system. Plus: your database will be more user-friendly.

And all of this at the price of not using the identical table design everywhere. I mean – when was having an identical table design a real business case anyway, right?

Side-note

In some cases, you could take this even one step further and denormalise your schema by putting categories as arrays or XML or JSON data structures directly inside your films. You’ll lose the normalisation benefits, but you could further win in terms of performance. For more details, read this very interesting article (about PostgreSQL) here:

http://www.databasesoup.com/2015/01/tag-all-things.html

A Probably Incomplete, Comprehensive Guide to the Many Different Ways to JOIN Tables in SQL


Perhaps the most powerful SQL feature is the JOIN operation. It is the envy of all non-relational databases, because the concept is so simple, yet so universally applicable, when you want to “combine” two data sets.

Put simply, when joining two tables, you’re combining every row from one table with every row from another table, for which a certain predicate is true. An illustration from our SQL Masterclass shows it.

venn-join

See also our recent article about using Venn diagrams to illustrate JOINs. The above illustration compares INNER JOIN with different OUTER JOIN operations, but those are far from all possibilities that we have. Let’s look at things from a more systematic angle.

Do note that whenever in this article I say “X happens before Y”, I mean that “X happens logically before Y”. The database optimiser may still choose to execute Y before X, because that is faster without changing the outcome. More information about the syntactic / logical order of operations here.

But let’s look into all the join types, individually!

CROSS JOIN

The most basic JOIN operation is really a cartesian product. It just combines every row from one table with every row from another table. The best example for cartesian products is given by Wikipedia, using a deck of cards for illustration where we’re “cross joining” the ranks table with the suits table:

venn-cross-product

In real-world scenarios, CROSS JOIN can be very useful when doing reports, for instance you could generate a set of dates (e.g. days in a month) and cross join that with all the departments in the database, to create a complete days/departments table. Using PostgreSQL syntax:

SELECT *

-- This just generates all the days in January 2017
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)

-- Here, we're combining all days with all departments
CROSS JOIN departments

Imagine we have the following data:

+--------+   +------------+
| day    |   | department |
+--------+   +------------+
| Jan 01 |   | Dept 1     |
| Jan 02 |   | Dept 2     |
| ...    |   | Dept 3     |
| Jan 30 |   +------------+
| Jan 31 |
+--------+

The result would now look like this:

+--------+------------+
| day    | department |
+--------+------------+
| Jan 01 | Dept 1     |
| Jan 01 | Dept 2     |
| Jan 01 | Dept 3     |

| Jan 02 | Dept 1     |
| Jan 02 | Dept 2     |
| Jan 02 | Dept 3     |
| ...    | ...        |
| Jan 31 | Dept 1     |
| Jan 31 | Dept 2     |
| Jan 31 | Dept 3     |
+--------+------------+

Now, for each days/departments combination, you could calculate the daily revenue for that department, or something similar.

Characteristics

A CROSS JOIN is a cartesian product, i.e. a product as in “multiplication”. The mathematical notation uses the multiplication sign to express this operation: A × B, or in our case: days × departments.

Just like with “ordinary” arithmetic multiplication, if one of the two tables is empty (size zero), then the result will also be empty (size zero). This makes total sense. If we combine the previous 31 days with 0 departments, we’ll get 0 days/departments combinations. Likewise, if we combine an empty date range with any number of departments, we also get 0 days/departments combinations.

In other words:

size(result) = size(days) * size(departments)

Alternative syntaxes

In the old days, before the ANSI JOIN syntax was introduced to SQL, people just wrote comma-separated lists of tables in the FROM clause to write a CROSS JOIN. The above query is equivalent to this one:

SELECT *
FROM 
  generate_series(
    '2017-01-01'::TIMESTAMP,
    '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
    INTERVAL '1 day'
  ) AS days(day),
  departments

In general, I strongly recommend using the CROSS JOIN keyword instead of comma-separated lists of tables, because if you intentionally want to do a CROSS JOIN, nothing communicates this intent (to the next developer!) better than using the actual keywords. So many things can go wrong, accidentally, with comma-separated lists of tables, including accidental cross joins. You don’t want those!
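
For instance, here’s a sketch of how easily that happens: leave out a single predicate in the WHERE clause, and the query silently degenerates into a cartesian product, without any syntax error:

-- Intended query: actors and the films they actually played in
SELECT a.last_name, f.title
FROM actor a, film_actor fa, film f
WHERE a.actor_id = fa.actor_id
AND fa.film_id = f.film_id;

-- Accidental cross join: the film predicate was forgotten, so every
-- matching (actor, film_actor) row is now combined with every film
SELECT a.last_name, f.title
FROM actor a, film_actor fa, film f
WHERE a.actor_id = fa.actor_id;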

INNER JOIN (Theta-JOIN)

Building on top of the previous CROSS JOIN operation, INNER JOIN (or just simply JOIN, sometimes also called "THETA" JOIN) allows for filtering the outcome of the cartesian product by some predicate. Most of the time, we’re putting that predicate in an ON clause, and it could be something like this:

SELECT *

-- Same as before
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)

-- Now, exclude all days/departments combinations for
-- days before the department was created
JOIN departments AS d ON day >= d.created_at

In most databases, the INNER keyword is optional, so I’m just omitting it in this article.

Notice how the INNER JOIN operation allows for putting arbitrary predicates in the ON clause, which is again very useful when doing reporting. Just like in the previous CROSS JOIN example, we’re combining all days with all departments, but then we retain only those days/departments combinations for which the department already existed, i.e. for which the department creation preceded the day.

Again, using this data:

+--------+   +------------+------------+
| day    |   | department | created_at |
+--------+   +------------+------------+
| Jan 01 |   | Dept 1     | Jan 10     |
| Jan 02 |   | Dept 2     | Jan 11     |
| ...    |   | Dept 3     | Jan 12     |
| Jan 30 |   +------------+------------+
| Jan 31 |
+--------+

The result would now look like this:

+--------+------------+
| day    | department |
+--------+------------+
| Jan 10 | Dept 1     |

| Jan 11 | Dept 1     |
| Jan 11 | Dept 2     |

| Jan 12 | Dept 1     |
| Jan 12 | Dept 2     |
| Jan 12 | Dept 3     |

| Jan 13 | Dept 1     |
| Jan 13 | Dept 2     |
| Jan 13 | Dept 3     |
| ...    | ...        |
| Jan 31 | Dept 1     |
| Jan 31 | Dept 2     |
| Jan 31 | Dept 3     |
+--------+------------+

So, we’d get no results prior to January 10, those rows were filtered out.

Characteristics

INNER JOIN operations are filtered CROSS JOIN operations. This means that if one of the two tables is empty, then the result is also guaranteed to be empty. But unlike with CROSS JOIN, because of the predicate, we can always get less results than the CROSS JOIN would deliver.

In other words:

size(result) <= size(days) * size(departments)

Alternative syntaxes

While the ON clause is mandatory with the INNER JOIN operation, you’re not required to place JOIN predicates in there (although it is highly recommended from a readability perspective). Most databases will optimise the following, equivalent query in just the same way:

SELECT *
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)

-- You can always JOIN .. ON true (or 1 = 1 in other DBs)
-- to turn a syntactic INNER JOIN into a semantic CROSS JOIN
JOIN departments AS d ON true

-- ... and then turn the CROSS JOIN back into an INNER JOIN
-- by putting the JOIN predicate in the WHERE clause:
WHERE day >= d.created_at

Of course, again, that is just obfuscating the query for the reader, but you may have your reasons, right? If we take this one step further, the following query is also equivalent, because most optimisers can figure out the equivalence and execute an INNER JOIN instead:

SELECT *
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)

-- Now, this is really a syntactic CROSS JOIN
CROSS JOIN departments AS d
WHERE day >= d.created_at

… and, as seen before, CROSS JOIN is just syntax sugar for comma-separated table lists. In this case, we retain the WHERE clause to get what people have frequently done prior to the introduction of ANSI JOIN syntax:

SELECT *
FROM
  generate_series(
    '2017-01-01'::TIMESTAMP,
    '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
    INTERVAL '1 day'
  ) AS days(day),
  departments AS d
WHERE day >= d.created_at

All of these syntaxes do the same thing, usually without performance penalty, but clearly, they’re all less readable than the original INNER JOIN syntax.

EQUI JOIN

Sometimes, e.g. in literature, you will hear the term EQUI JOIN where "EQUI" isn’t really meant as a SQL keyword, but just as a specific way of writing a special kind of INNER JOIN.

In fact, it is weird that "EQUI" JOIN is the special case, because it’s what we’re doing most in SQL, also in OLTP applications where we simply JOIN by primary key / foreign key relationship. For instance:

SELECT *
FROM actor AS a
JOIN film_actor AS fa ON a.actor_id = fa.actor_id
JOIN film AS f ON f.film_id = fa.film_id

The above query selects all actors and their films. There are two INNER JOIN operations, one connecting actors to the corresponding entries in the film_actor relationship table (because actors can have many films and films can have many actors), and the other connecting each film_actor relationship with additional information about the film itself.

Characteristics

The characteristics of this operation are the same as those of the “general” INNER JOIN operation. An "EQUI" JOIN is still a cartesian product (CROSS JOIN) with a reduced set of results, i.e. containing only those actor/film combinations for which a given actor actually played in the given film.

So again, in other words:

size(result) <= size(actor) * size(film)

The result size can only be equal to the actor size multiplied by the film size, if every actor played in every film, which is unlikely.

Alternative syntaxes: USING

Again, as before, we could write the INNER JOIN operation without an actual INNER JOIN syntax, but CROSS JOIN or comma-separated table lists instead. That’s boring, but much more interesting are the following two alternative syntaxes, one of which is very very useful:

SELECT *
FROM actor
JOIN film_actor USING (actor_id)
JOIN film USING (film_id)

The USING clause replaces the ON clause and allows for listing a set of columns that have to be present in both sides of the JOIN operation. If you carefully design your database in the same way as the Sakila database, namely where every foreign key column(s) have the same names as their referenced primary key column(s) (e.g. actor.actor_id = film_actor.actor_id), then you can use USING for "EQUI" JOIN in at least these databases:

  • Derby
  • Firebird
  • HSQLDB
  • Ingres
  • MariaDB
  • MySQL
  • Oracle
  • PostgreSQL
  • SQLite
  • Vertica

These databases, unfortunately, don’t support the syntax:

  • Access
  • Cubrid
  • DB2
  • H2
  • HANA
  • Informix
  • SQL Server
  • Sybase ASE
  • Sybase SQL Anywhere

While this produces exactly (well almost) the same result as the ON clause, it’s much quicker to read and write. I said almost because some databases (and the SQL standard) specify that any column appearing in the USING clause loses its qualification. For instance:

SELECT
  f.title,   -- Ordinary column, can be qualified
  f.film_id, -- USING column, shouldn't be qualified
  film_id    -- USING column, correct / non-ambiguous here
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)

Also, of course, this syntax is a bit limited. Sometimes, you have several foreign keys in your tables, not all of which can have the primary key column name(s). For instance:

CREATE TABLE film (
  ..
  language_id          BIGINT REFERENCES language,
  original_language_id BIGINT REFERENCES language,
)

If you want to join by ORIGINAL_LANGUAGE_ID, you’ll have to resort to the ON clause.
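
For instance, a sketch of such a join, spelled out with the ON clause:

SELECT f.title, l.name AS original_language
FROM film f
JOIN language l ON f.original_language_id = l.language_id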

Alternative syntaxes: NATURAL JOIN

A more extreme and much less useful form of "EQUI" JOIN is the NATURAL JOIN clause. The previous example could be further “improved” by replacing USING by NATURAL JOIN like this:

SELECT *
FROM actor
NATURAL JOIN film_actor
NATURAL JOIN film

Notice how we no longer need to specify any JOIN criteria, because a NATURAL JOIN will automatically take all the columns that share the same name from both tables that it joins and place them in a “hidden” USING clause. As we’ve seen before, as primary keys and foreign keys have the same column name, this appears quite useful.

Truth is: It isn’t useful. In the Sakila database, every table also has a LAST_UPDATE column, which is automatically being taken into consideration by NATURAL JOIN. The NATURAL JOIN query is thus equivalent to this one:

SELECT *
FROM actor
JOIN film_actor USING (actor_id, last_update)
JOIN film USING (film_id, last_update)

… which of course makes absolutely no sense at all. So, forget about NATURAL JOIN immediately (except for some very rare cases, e.g. when joining Oracle’s diagnostics views like v$sql NATURAL JOIN v$sql_plan, etc., for ad-hoc analytics)
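
Such an ad-hoc diagnostics query might look like the sketch below; the columns shared between V$SQL and V$SQL_PLAN (e.g. SQL_ID and CHILD_NUMBER) form the implicit join key:

-- Quick-and-dirty ad-hoc analytics on Oracle's diagnostics views
SELECT sql_text, operation, object_name
FROM v$sql NATURAL JOIN v$sql_plan
WHERE sql_text LIKE '%film%';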

OUTER JOIN

We’ve seen INNER JOIN before, which returns results only for combinations of the left / right table, for which the ON predicate yields true.

OUTER JOIN allows us to retain rows on the left / right side, for which we didn’t find a matching combination. Let’s go back to the days and departments example:

SELECT *
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)
LEFT JOIN departments AS d ON day >= d.created_at

Again, the OUTER keyword is optional, so I’m omitting it in the examples.

This query is very subtly different from its INNER JOIN counterpart in that it will always return at least one row per day, even if, for a given day, there is no department that had already been created by that day. For instance, no department was created before January 10, yet the above query will still return January 1-9:

+--------+   +------------+------------+
| day    |   | department | created_at |
+--------+   +------------+------------+
| Jan 01 |   | Dept 1     | Jan 10     |
| Jan 02 |   | Dept 2     | Jan 11     |
| ...    |   | Dept 3     | Jan 12     |
| Jan 30 |   +------------+------------+
| Jan 31 |
+--------+

In addition to the rows we got before in the INNER JOIN example, we now also have all the days from January 1-9 with NULL departments:

+--------+------------+
| day    | department |
+--------+------------+
| Jan 01 |            | -- Extra rows with no match here
| Jan 02 |            | -- Extra rows with no match here
| ...    |            | -- Extra rows with no match here
| Jan 09 |            | -- Extra rows with no match here
| Jan 10 | Dept 1     |
| Jan 11 | Dept 1     |
| Jan 11 | Dept 2     |
| Jan 12 | Dept 1     |
| Jan 12 | Dept 2     |
| Jan 12 | Dept 3     |
| Jan 13 | Dept 1     |
| Jan 13 | Dept 2     |
| Jan 13 | Dept 3     |
| ...    | ...        |
| Jan 31 | Dept 1     |
| Jan 31 | Dept 2     |
| Jan 31 | Dept 3     |
+--------+------------+

In other words, each day appears at least once in the result. LEFT JOIN does this for the left table, i.e. it retains all rows from the left table in the result.

Formally, a LEFT OUTER JOIN is an INNER JOIN with a UNION like this:

-- Convenient syntax:
SELECT *
FROM a LEFT JOIN b ON <predicate>

-- Cumbersome, equivalent syntax:
SELECT a.*, b.*
FROM a JOIN b ON <predicate>
UNION ALL
SELECT a.*, NULL, NULL, ..., NULL
FROM a
WHERE NOT EXISTS (
  SELECT * FROM b WHERE <predicate>
)

We’ll talk about NOT EXISTS further down, when talking about the "ANTI" JOIN.

RIGHT OUTER JOIN

RIGHT OUTER JOIN does exactly the opposite. It retains all the rows from the right table in the result. Let’s add more departments:

+--------+   +------------+------------+
| day    |   | department | created_at |
+--------+   +------------+------------+
| Jan 01 |   | Dept 1     | Jan 10     |
| Jan 02 |   | Dept 2     | Jan 11     |
| ...    |   | Dept 3     | Jan 12     |
| Jan 30 |   | Dept 4     | Apr 01     |
| Jan 31 |   | Dept 5     | Apr 02     |
+--------+   +------------+------------+

The new departments 4 and 5 will not be in the previous results at all, because they were created on a day after January 31. But they will appear in a RIGHT JOIN result, because departments is the right table of the join operation, and all the rows from the right table will be retained.

Running this query:

SELECT *
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)
RIGHT JOIN departments AS d ON day >= d.created_at

Will yield something like this:

+--------+------------+
| day    | department |
+--------+------------+
| Jan 10 | Dept 1     |
| Jan 11 | Dept 1     |
| Jan 11 | Dept 2     |
| Jan 12 | Dept 1     |
| Jan 12 | Dept 2     |
| Jan 12 | Dept 3     |
| Jan 13 | Dept 1     |
| Jan 13 | Dept 2     |
| Jan 13 | Dept 3     |
| ...    | ...        |
| Jan 31 | Dept 1     |
| Jan 31 | Dept 2     |
| Jan 31 | Dept 3     |
|        | Dept 4     | -- Extra rows with no match here
|        | Dept 5     | -- Extra rows with no match here
+--------+------------+

In most cases (I have yet to run into a situation where this isn’t the case), each LEFT OUTER JOIN expression can be transformed into an equivalent RIGHT OUTER JOIN expression, and vice-versa. Because RIGHT OUTER JOIN is usually a bit less readable, most people use LEFT OUTER JOIN only.
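
For example (a sketch, abbreviating the generated date table from the previous examples as days), these two queries are equivalent; only the position of the tables around the JOIN keyword changes:

-- Retain all days:
SELECT *
FROM days
LEFT JOIN departments AS d ON day >= d.created_at

-- Equivalent, with the tables swapped:
SELECT *
FROM departments AS d
RIGHT JOIN days ON day >= d.created_at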

FULL OUTER JOIN

Finally, there is also FULL OUTER JOIN, which retains all rows from both sides of the JOIN operation. In our example, this means that each day appears at least once in the result, just like each department appears at least once in the result.

Let’s take again this data:

+--------+   +------------+------------+
| day    |   | department | created_at |
+--------+   +------------+------------+
| Jan 01 |   | Dept 1     | Jan 10     |
| Jan 02 |   | Dept 2     | Jan 11     |
| ...    |   | Dept 3     | Jan 12     |
| Jan 30 |   | Dept 4     | Apr 01     |
| Jan 31 |   | Dept 5     | Apr 02     |
+--------+   +------------+------------+

And now, let’s run this query:

SELECT *
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)
FULL JOIN departments AS d ON day >= d.created_at

The result will now look something like this:

+--------+------------+
| day    | department |
+--------+------------+
| Jan 01 |            | -- row from the left table
| Jan 02 |            | -- row from the left table
| ...    |            | -- row from the left table
| Jan 09 |            | -- row from the left table
| Jan 10 | Dept 1     |
| Jan 11 | Dept 1     |
| Jan 11 | Dept 2     |
| Jan 12 | Dept 1     |
| Jan 12 | Dept 2     |
| Jan 12 | Dept 3     |
| Jan 13 | Dept 1     |
| Jan 13 | Dept 2     |
| Jan 13 | Dept 3     |
| ...    | ...        |
| Jan 31 | Dept 1     |
| Jan 31 | Dept 2     |
| Jan 31 | Dept 3     |
|        | Dept 4     | -- row from the right table
|        | Dept 5     | -- row from the right table 
+--------+------------+

If you insist, formally, a FULL OUTER JOIN is an INNER JOIN with two UNIONs like this:

-- Convenient syntax:
SELECT *
FROM a LEFT JOIN b ON <predicate>

-- Cumbersome, equivalent syntax:
SELECT a.*, b.*
FROM a JOIN b ON <predicate>
-- LEFT JOIN part
UNION ALL
SELECT a.*, NULL, NULL, ..., NULL
FROM a
WHERE NOT EXISTS (
  SELECT * FROM b WHERE <predicate>
)
-- RIGHT JOIN part
UNION ALL
SELECT NULL, NULL, ..., NULL, b.*
FROM b
WHERE NOT EXISTS (
  SELECT * FROM a WHERE <predicate>
)

Alternative syntaxes: “EQUI” OUTER JOIN

The above examples again used the “cartesian product with filter” kind of JOIN. Much more common, however, is the "EQUI" OUTER JOIN approach, where we join on a primary key / foreign key relationship. Let’s go back to the Sakila database example. Some actors haven’t played in any film, and we might want to query them like this:

SELECT *
FROM actor
LEFT JOIN film_actor USING (actor_id)
LEFT JOIN film USING (film_id)

This query will return all actors at least once, regardless if they played in a film. If we also want all films that do not have any actors (yet), we could use a FULL OUTER JOIN to combine the results:

SELECT *
FROM actor
FULL JOIN film_actor USING (actor_id)
FULL JOIN film USING (film_id)

And of course, this also works with NATURAL LEFT JOIN, NATURAL RIGHT JOIN, NATURAL FULL JOIN, but again, these aren’t useful at all, as we’d be joining again USING (..., LAST_UPDATE), which makes no sense at all.

Alternative syntaxes: Oracle and SQL Server style OUTER JOIN

These two databases had OUTER JOIN before the ANSI syntax was established. It looked like this:

-- Oracle
SELECT *
FROM actor a, film_actor fa, film f
WHERE a.actor_id = fa.actor_id(+)
AND fa.film_id = f.film_id(+)

-- SQL Server
SELECT *
FROM actor a, film_actor fa, film f
WHERE a.actor_id *= fa.actor_id
AND fa.film_id *= f.film_id

That’s nice, given that at some point in time (in the 80s??), ANSI didn’t specify OUTER JOIN yet. But the 80s are more than 30 years ago. So, it’s safe to say this stuff is obsolete.

SQL Server did the right thing and deprecated (and later removed) that syntax a long time ago. Oracle still supports it for backwards compatibility reasons.

But nothing about this syntax is reasonable or readable. Don’t use it. Replace it with ANSI JOIN.

PARTITIONED OUTER JOIN

This is Oracle specific, but I must say, it’s a real shame none of the other databases have stolen the feature yet. Remember the CROSS JOIN operation that we used to combine each day with each department? Because, sometimes, that’s what we want in the result: all combinations, and if there is a match, also put the matching values in the row.

That’s hard to describe in words, and much easier with an example. Here’s the query with Oracle syntax:

WITH 

  -- Using CONNECT BY to generate all dates in January
  days(day) AS (
    SELECT DATE '2017-01-01' + LEVEL - 1
    FROM dual
    CONNECT BY LEVEL <= 31
  ),

  -- Our departments
  departments(department, created_at) AS (
    SELECT 'Dept 1', DATE '2017-01-10' FROM dual UNION ALL
    SELECT 'Dept 2', DATE '2017-01-11' FROM dual UNION ALL
    SELECT 'Dept 3', DATE '2017-01-12' FROM dual UNION ALL
    SELECT 'Dept 4', DATE '2017-04-01' FROM dual UNION ALL
    SELECT 'Dept 5', DATE '2017-04-02' FROM dual
  )
SELECT *
FROM days 
LEFT JOIN departments 
  PARTITION BY (department) -- This is where the magic happens
  ON day >= created_at

Unfortunately, PARTITION BY is used in various contexts with different meanings (e.g. for window functions). In this case, it means that we “partition” our data by the departments.department column, creating a “partition” for each department. Now, each partition will get a copy for each day, regardless if there’s a match in our predicate (unlike in the ordinary LEFT JOIN case, where we had a bunch of “department-less” days). The result of the above query is now this:

+--------+------------+------------+
| day    | department | created_at |
+--------+------------+------------+
| Jan 01 | Dept 1     |            | -- Didn't match, but still get row
| Jan 02 | Dept 1     |            | -- Didn't match, but still get row
| ...    | Dept 1     |            | -- Didn't match, but still get row
| Jan 09 | Dept 1     |            | -- Didn't match, but still get row
| Jan 10 | Dept 1     | Jan 10     | -- Matches, so get join result
| Jan 11 | Dept 1     | Jan 10     | -- Matches, so get join result
| Jan 12 | Dept 1     | Jan 10     | -- Matches, so get join result
| ...    | Dept 1     | Jan 10     | -- Matches, so get join result
| Jan 31 | Dept 1     | Jan 10     | -- Matches, so get join result

| Jan 01 | Dept 2     |            | -- Didn't match, but still get row
| Jan 02 | Dept 2     |            | -- Didn't match, but still get row
| ...    | Dept 2     |            | -- Didn't match, but still get row
| Jan 09 | Dept 2     |            | -- Didn't match, but still get row
| Jan 10 | Dept 2     |            | -- Didn't match, but still get row
| Jan 11 | Dept 2     | Jan 11     | -- Matches, so get join result
| Jan 12 | Dept 2     | Jan 11     | -- Matches, so get join result
| ...    | Dept 2     | Jan 11     | -- Matches, so get join result
| Jan 31 | Dept 2     | Jan 11     | -- Matches, so get join result

| Jan 01 | Dept 3     |            | -- Didn't match, but still get row
| Jan 02 | Dept 3     |            | -- Didn't match, but still get row
| ...    | Dept 3     |            | -- Didn't match, but still get row
| Jan 09 | Dept 3     |            | -- Didn't match, but still get row
| Jan 10 | Dept 3     |            | -- Didn't match, but still get row
| Jan 11 | Dept 3     |            | -- Didn't match, but still get row
| Jan 12 | Dept 3     | Jan 12     | -- Matches, so get join result
| ...    | Dept 3     | Jan 12     | -- Matches, so get join result
| Jan 31 | Dept 3     | Jan 12     | -- Matches, so get join result

| Jan 01 | Dept 4     |            | -- Didn't match, but still get row
| Jan 02 | Dept 4     |            | -- Didn't match, but still get row
| ...    | Dept 4     |            | -- Didn't match, but still get row
| Jan 31 | Dept 4     |            | -- Didn't match, but still get row

| Jan 01 | Dept 5     |            | -- Didn't match, but still get row
| Jan 02 | Dept 5     |            | -- Didn't match, but still get row
| ...    | Dept 5     |            | -- Didn't match, but still get row
| Jan 31 | Dept 5     |            | -- Didn't match, but still get row
+--------+------------+------------+

As you can see, I’ve visually created 5 partitions for the 5 departments. Each partition combines the department with each day, but unlike when doing a CROSS JOIN, we’re now getting actual LEFT JOIN .. ON .. results in case there is a match for the predicate. This is a really nice feature for reporting in Oracle!

SEMI JOIN

In relational algebra, there is a notion of a semi join operation, which unfortunately doesn’t have a syntax representation in SQL. If it did, the syntax would probably be LEFT SEMI JOIN and RIGHT SEMI JOIN, just like the Cloudera Impala syntax extension offers.

What is a "SEMI" JOIN?

When writing something like the fictional query below:

SELECT *
FROM actor
LEFT SEMI JOIN film_actor USING (actor_id)

What we really mean is we want all actors that played in films. But we don’t want any films in the results, just the actors. More specifically, we don’t want each actor several times, once per film. We want each actor only once (or zero times) in the result.

Semi is Latin for “half”, i.e. we implement only “half the join”, in this case, the left half.

In SQL, there are two alternative syntaxes that can emulate the "SEMI" JOIN:

Alternative syntax: EXISTS

This is the more powerful and a bit more verbose syntax

SELECT *
FROM actor a
WHERE EXISTS (
  SELECT * FROM film_actor fa
  WHERE a.actor_id = fa.actor_id
)

We’re looking for all the actors for which there exists a film, i.e. actors that played in a film. With this syntax (i.e. "SEMI" JOIN placed in the WHERE clause), it’s immediately clear that we can get each actor at most once in the result. There’s no actual JOIN in the syntax.

Nonetheless, most databases will be able to recognise that what’s going on here is really a "SEMI" JOIN, and not just an ordinary EXISTS() predicate. For instance, consider the Oracle execution plan for the above query:

hash-join-semi

Note how Oracle calls the operation “HASH JOIN (SEMI)” – the SEMI keyword is present here. The same is true for PostgreSQL:

nested-loop-semi-join

Or SQL Server:

nested-loops-left-semi-join

Apart from being the optimal solution in terms of correctness, there are also some performance benefits to using a "SEMI" JOIN rather than an INNER JOIN, as the database can stop looking for matches as soon as it finds the first one!
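
This is also why a "SEMI" JOIN is a cleaner tool than the popular JOIN plus DISTINCT workaround. A sketch for comparison:

-- INNER JOIN produces one row per (actor, film) match, so DISTINCT is
-- needed to get each actor only once
SELECT DISTINCT a.actor_id, a.first_name, a.last_name
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id

-- The EXISTS version never produces those duplicates in the first place,
-- and the database can stop probing film_actor after the first match
SELECT a.actor_id, a.first_name, a.last_name
FROM actor a
WHERE EXISTS (
  SELECT * FROM film_actor fa
  WHERE a.actor_id = fa.actor_id
)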

Alternative syntax: IN

IN and EXISTS are exactly equivalent "SEMI" JOIN emulations. The following query will produce the same plan as the previous EXISTS query in most databases (though not in MySQL):

SELECT *
FROM actor
WHERE actor_id IN (
  SELECT actor_id FROM film_actor
)

If your database supports both syntaxes for the "SEMI" JOIN operation, you may choose whichever you prefer from a stylistic point of view.

This is different with …

ANTI JOIN

In principle, "ANTI" JOIN is just the opposite of "SEMI" JOIN. When writing something like the fictional query below:

SELECT *
FROM actor
LEFT ANTI JOIN film_actor USING (actor_id)

… what we’re doing is we take all the actors that didn’t play in any film. Unfortunately, again, SQL doesn’t have a built-in syntax for this operation, but we can emulate it with EXISTS:

Alternative syntax: NOT EXISTS

The following query has exactly the expected semantics:

SELECT *
FROM actor a
WHERE NOT EXISTS (
  SELECT * FROM film_actor fa
  WHERE a.actor_id = fa.actor_id
)

(Dangerous) alternative syntax: NOT IN

Watch out! While EXISTS and IN are equivalent, NOT EXISTS and NOT IN are not equivalent. This is because of NULL values.

In this particular case, the following NOT IN query is equivalent to the previous NOT EXISTS query, because our film_actor table has a NOT NULL constraint on film_actor.actor_id:

SELECT *
FROM actor
WHERE actor_id NOT IN (
  SELECT actor_id FROM film_actor
)

If, however, actor_id became nullable, then the query would be wrong. Don’t believe it? Try running this:

SELECT *
FROM actor
WHERE actor_id NOT IN (1, 2, NULL)

It will not return any records. Why? Because NULL is the UNKNOWN value in SQL. So, the predicate above is the same as this:

SELECT *
FROM actor
WHERE actor_id NOT IN (1, 2, UNKNOWN)

And because we cannot be sure whether actor_id is in a set of values of which one value is UNKNOWN (is it 4? Or 5? Or -1?), the whole predicate becomes UNKNOWN

SELECT *
FROM actor
WHERE UNKNOWN

Here’s a nice article about three-valued logic by Joe Celko, if you want to learn more.

So, I cannot say this enough:

Do not use NOT IN predicates in SQL, ever, unless you put constant, non-null values there.

— Lukas Eder. Just now.

Don’t even gamble on the presence of a NOT NULL constraint. Perhaps, some DBA might temporarily turn off the constraint to load some data and your query will be wrong for the time being. Just use NOT EXISTS. Or, in some cases…

(Dangerous) alternative syntax: LEFT JOIN / IS NULL

Strangely enough, some people prefer the following syntax:

SELECT *
FROM actor a
LEFT JOIN film_actor fa
USING (actor_id)
WHERE film_id IS NULL

It’s correct as we’re:

  • Joining films to their actors
  • Retaining all actors without films (LEFT JOIN)
  • Retaining only actors without films (film_id IS NULL)

Now, I personally don’t like this syntax, as it doesn’t communicate the intent of an "ANTI" JOIN at all. And chances are, it is slower, because your optimiser doesn’t recognise this as an "ANTI" JOIN operation (or, in fact, cannot formally prove that it is one). So, again, use NOT EXISTS instead.

An interesting (but a bit outdated) blog post comparing the three syntaxes is this one here:
https://explainextended.com/2009/09/15/not-in-vs-not-exists-vs-left-join-is-null-sql-server

LATERAL JOIN

LATERAL is a relatively new keyword in the SQL standard, and it is supported in PostgreSQL and Oracle. SQL Server folks have had a vendor-specific alternative syntax using the APPLY keyword forever (which I personally much prefer). Let’s look at an example, using the PostgreSQL / Oracle LATERAL keyword:

WITH 
  departments(department, created_at) AS (
    VALUES ('Dept 1', DATE '2017-01-10'),
           ('Dept 2', DATE '2017-01-11'),
           ('Dept 3', DATE '2017-01-12'),
           ('Dept 4', DATE '2017-04-01'),
           ('Dept 5', DATE '2017-04-02')
  )
SELECT *
FROM departments AS d
CROSS JOIN LATERAL generate_series(
  d.created_at, -- We can dereference a column from department!
  '2017-01-31'::TIMESTAMP, 
  INTERVAL '1 day'
) AS days(day)

Indeed, instead of a CROSS JOIN between all departments and all days, why not just generate the necessary days for each department directly? That’s what LATERAL does. It is a prefix to the right-hand side of any JOIN operation (including INNER JOIN, LEFT OUTER JOIN, etc.) that allows the right-hand side to access columns from the left hand side.

This of course has nothing to do with relational algebra anymore, because it imposes a JOIN order (from left to right). But sometimes, that’s OK and sometimes, your table-valued function (or subquery) is so complex, that’s the only way you can actually use it.

Another very popular use-case is to join a “TOP-N” query to a normal table – e.g. if you want to find each actor, and their TOP 5 best selling films:

SELECT a.first_name, a.last_name, f.*
FROM actor AS a
LEFT OUTER JOIN LATERAL (
  SELECT f.title, SUM(amount) AS revenue
  FROM film AS f
  JOIN film_actor AS fa USING (film_id)
  JOIN inventory AS i USING (film_id)
  JOIN rental AS r USING (inventory_id)
  JOIN payment AS p USING (rental_id)
  WHERE fa.actor_id = a.actor_id
  GROUP BY f.film_id
  ORDER BY revenue DESC
  LIMIT 5
) AS f
ON true

The result being something like:

outer-join-lateral-result

Don’t worry about the long list of joins in the derived table, that’s just how we have to get from the FILM table to the PAYMENT table in the Sakila database:

sakila1.

Essentially, the subquery is calculating those TOP 5 best selling films per actor. So, it isn’t a “classic” derived table, but a correlated subquery that returns more than one row and one column. We’re all used to writing correlated subqueries like these:

SELECT 
  a.first_name, 
  a.last_name, 
  (SELECT count(*) 
   FROM film_actor AS fa 
   WHERE fa.actor_id = a.actor_id) AS films
FROM actor AS a

The result of the above correlated subquery is one row and one column. If you ever need to return more than one row and/or more than one column from a correlated subquery, LATERAL or APPLY will be your friend.

Note also how I’ve used LEFT OUTER JOIN together with LATERAL, and then I needed to put a dummy ON true clause for syntax correctness. Using LATERAL with OUTER JOIN will again always retain the left side of the JOIN, i.e. we’ll retain actors that didn’t play in any film.

Characteristics:

The LATERAL keyword doesn’t really change the semantics of the JOIN type that it is applied to. If you’re running a CROSS JOIN LATERAL, the result size is still

size(result) = size(left table) * size(right table)

Even if the right table is produced on a per-row-of-the-left-table basis.

You can use LATERAL also with OUTER JOIN, in case of which the left rows will be retained even if your table function doesn’t return any rows on the right side.

Alternative syntax: APPLY

SQL Server didn’t choose the confusing LATERAL keyword; instead, it introduced the APPLY keyword (more specifically: CROSS APPLY and OUTER APPLY) a long time ago, which makes more sense, because we’re applying a function to each row of a table. Let’s assume for a moment that we have a generate_series() function in SQL Server:

-- Use with care, this is quite inefficient!
CREATE FUNCTION generate_series(@d1 DATE, @d2 DATE)
RETURNS TABLE AS
RETURN
  WITH t(d) AS (
    SELECT @d1 
    UNION ALL
    SELECT DATEADD(day, 1, d) 
    FROM t
    WHERE d < @d2
  ) 
  SELECT * FROM t;

Then, we could use CROSS APPLY to call that function for each department:

WITH 
  departments AS (
    SELECT * FROM (
      VALUES ('Dept 1', CAST('2017-01-10' AS DATE)),
             ('Dept 2', CAST('2017-01-11' AS DATE)),
             ('Dept 3', CAST('2017-01-12' AS DATE)),
             ('Dept 4', CAST('2017-04-01' AS DATE)),
             ('Dept 5', CAST('2017-04-02' AS DATE))
    ) d(department, created_at)
  )
SELECT *
FROM departments AS d
CROSS APPLY dbo.generate_series(
  d.created_at, -- We can dereference a column from department!
  CAST('2017-01-31' AS DATE)
)

What’s so nice about this syntax is that – again – we’re applying a function to each row of a table, and that function produces rows. Does it ring a bell? In Java 8, we would use Stream.flatMap() for that! Consider the following stream usage:

departments.stream()
           .flatMap(department -> generateSeries(
                             department.createdAt, 
                             LocalDate.parse("2017-01-31"))
                         .map(day -> tuple(department, day))
           );

What’s going on here?

  • The DEPARTMENTS table is just a Java departments stream
  • We flatMap the departments stream using a function that produces tuples for each department
  • Those tuples include the department itself, and a day generated from a series of days starting with the department’s createdAt day

Same story! SQL CROSS APPLY / CROSS JOIN LATERAL is the same thing as Java’s Stream.flatMap(). In fact, SQL and streams aren’t too different anyway. For more info, read this blog post here.

Note: Just like we could write LEFT OUTER JOIN LATERAL, we can also write OUTER APPLY in case we want to retain the left side of the JOIN expression.

MULTISET

Few databases implement this (in fact, only Oracle), but if you think about it, it’s an awesome JOIN type. One that creates nested collections. If all databases implemented it, we wouldn’t need ORMs!

A hypothetical example (that uses SQL standard syntax, not Oracle’s) looks like this:

SELECT a.*, MULTISET (
  SELECT f.*
  FROM film AS f
  JOIN film_actor AS fa USING (film_id)
  WHERE a.actor_id = fa.actor_id
) AS films
FROM actor AS a

The MULTISET operator takes a correlated subquery argument, and aggregates all its resulting rows in a nested collection. This works in a similar fashion as a LEFT OUTER JOIN (we’re getting all actors, and if they have films, we’re getting all their films too), but instead of duplicating all the actors in the result set, we’re collecting them into a nested collection.

Just like what we would be doing in an ORM, when fetching stuff into this structure:

@Entity
class Actor {
  @ManyToMany
  List<Film> films;
}

@Entity
class Film {
}

Ignore the incompleteness of the JPA annotations used; I just want to show the strength of nested collections. Unlike in ORMs, the SQL MULTISET operator allows for collecting arbitrary results from correlated subqueries into nested collections – not just actual entities. This is a million times more powerful than what any ORM could ever dream of.

Alternative syntaxes: Oracle

As I said, Oracle actually supports MULTISET, but you cannot create ad-hoc nested collections. For some reason, Oracle chose to implement nominal typing for these nested collections, rather than the usual SQL-style structural typing. So you have to declare your types in advance:

CREATE TYPE film_t AS OBJECT ( ... );
CREATE TYPE film_tt AS TABLE OF film_t;

SELECT 
  a.*, 
  CAST (
    MULTISET (
      SELECT f.*
      FROM film f
      JOIN film_actor fa USING (film_id)
      WHERE a.actor_id = fa.actor_id
    ) AS film_tt
  ) AS films
FROM actor a

A bit more verbose, but still does the trick! Excellent!

Alternative syntaxes: PostgreSQL

For once, the awesome PostgreSQL is missing out on an excellent SQL standard feature, but there’s a workaround: arrays! And this time, we can use structural types, yay! So the following query will return a nested array of rows in PostgreSQL:

SELECT
  a AS actor,
  array_agg(
    f
  ) AS films
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
GROUP BY a

The result is everyone’s ORDBMS dream! Nested records and collections everywhere (in only two columns):

actor                  films
--------------------   ----------------------------------------------------
(1,PENELOPE,GUINESS)   {(1,ACADEMY DINOSAUR),(23,ANACONDA CONFESSIONS),...}
(2,NICK,WAHLBERG)      {(3,ADAPTATION HOLES),(31,APACHE DIVINE),...}
(3,ED,CHASE)           {(17,ALONE TRIP),(40,ARMY FLINTSTONES),...}

If this doesn’t excite you, then I don’t know what will.

Conclusion

This was A Probably Incomplete, Comprehensive Guide to the Many Different Ways to JOIN Tables in SQL. I hope you’ve found 1-2 new tricks in this article. JOIN is only one of many very interesting SQL operations. If you’ve found this information useful, you will also like:

You Probably don’t Use SQL INTERSECT or EXCEPT Often Enough


When people talk about SQL JOIN, they often use Venn Diagrams to illustrate inclusion and exclusion of the two joined sets:

venn

While these Venn diagrams are certainly useful to understand (and remember) SQL JOIN syntax, they’re not entirely accurate, because SQL JOIN is a special type of a cartesian product, the CROSS JOIN.

Illustration by Wikipedia user Quartl

In a cartesian product between two sets A and B, the result is the multiplication of each set, meaning that each element a ∈ A is combined with each element b ∈ B to form a set of tuples (a, b).

Ordinary SQL JOINs do precisely this. When you join BOOK to AUTHOR, you will probably get a combination of every author ∈ AUTHOR with each book ∈ BOOK, such that for each combination (author, book), the author actually wrote the book.
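
As a sketch (the BOOK and AUTHOR tables and the AUTHOR_ID column are hypothetical here, not part of the Sakila database):

-- Conceptually a cartesian product of AUTHOR and BOOK, filtered down to
-- the (author, book) combinations where the author wrote the book
SELECT a.name, b.title
FROM author a
JOIN book b ON b.author_id = a.author_id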

The true meaning of Venn diagrams

The true meaning of Venn diagrams is much better described by the operations

  • UNION
  • INTERSECT
  • EXCEPT (or MINUS in Oracle)

In the following sections, we’ll see that these operations match exactly the semantics of operations that can be illustrated by Venn diagrams, even if you will be able to “abuse” JOIN operations to achieve the same result.

UNION

The UNION operation is the most well-known among these set operations. It is often also referred to as “concatenation” of two sets of tuples, where the result is the concatenation of a set B to a set A.

In the following example, we’ll see that we might be interested in all the different people from our database, given their first and last names, regardless if they’re customer or staff:

set-union

The original Venn diagrams used FULL OUTER JOIN to model the “same” concept, although the two things are not strictly the same. Consider the following query, which we’ll run against the Sakila database:

SELECT first_name, last_name
FROM customer
UNION
SELECT first_name, last_name
FROM staff
ORDER BY 1, 2

The result looks like:

first_name   last_name
------------------------------------
AARON        SELBY
ADAM         GOOCH
ADRIAN       CLARY
AGNES        BISHOP
ALAN         KAHN
ALBERT       CROUSE
ALBERTO      HENNING
ALEX         GRESHAM
ALEXANDER    FENNELL
ALFRED       CASILLAS
ALFREDO      MCADAMS
ALICE        STEWART
ALICIA       MILLS
...

Now, run the following “equivalent” query:

SELECT first_name, last_name
FROM customer
FULL OUTER JOIN staff 
  USING (first_name, last_name)
ORDER BY 1, 2

The result will again yield:

first_name   last_name
------------------------------------
AARON        SELBY
ADAM         GOOCH
ADRIAN       CLARY
AGNES        BISHOP
ALAN         KAHN
ALBERT       CROUSE
ALBERTO      HENNING
...

This only works because we’re using the USING clause, which not every database supports natively. If we did our JOIN with the more commonly used ON clause, we’d have to write the more tedious:

SELECT
  COALESCE(c.first_name, s.first_name) AS first_name,
  COALESCE(c.last_name, s.last_name) AS last_name
FROM customer c
FULL OUTER JOIN staff s
  ON (c.first_name, c.last_name)
  =  (s.first_name, s.last_name)
ORDER BY 1, 2

In this case, most people probably default to using UNION already, as it is a much better known operation than FULL OUTER JOIN.

All of jOOQ’s currently supported RDBMS support UNION and UNION ALL (the latter doesn’t remove duplicates).
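
For instance, the same query written with UNION ALL simply concatenates the two result sets; a person appearing in both CUSTOMER and STAFF (or twice within either table) would then show up more than once:

SELECT first_name, last_name
FROM customer
UNION ALL
SELECT first_name, last_name
FROM staff
ORDER BY 1, 2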

In the following, we’ll see that equivalent comparisons can be made with other set operations:

INTERSECT

The INTERSECT operation is really useful when you want to keep only those tuples that are present in both sets that are combined using INTERSECT:

set-intersect

As you can see, we may want to retain only those customers that are also actors. Let’s run this query:

SELECT first_name, last_name
FROM customer
INTERSECT
SELECT first_name, last_name
FROM actor

… yielding:

first_name   last_name
------------------------------------
JENNIFER     DAVIS

One of our customers is also an actor. The same query could have been written with an INNER JOIN as such:

SELECT first_name, last_name
FROM customer
INNER JOIN actor 
  USING (first_name, last_name)

… or with the ON syntax

SELECT c.first_name, c.last_name
FROM customer c
INNER JOIN actor a
  ON (c.first_name, c.last_name)
  =  (a.first_name, a.last_name)

This time, no COALESCE is needed, as INNER JOIN retains only those tuples from the cartesian product that are present on “both sides” of the JOIN, so we can pick either of the tables to prefix our columns.

You may even decide to use a semi-join instead, which would yield the same results:

SELECT first_name, last_name
FROM customer
WHERE (first_name, last_name) IN (
  SELECT first_name, last_name
  FROM actor
)

or, using the more verbose, yet equivalent EXISTS predicate:

SELECT first_name, last_name
FROM customer c
WHERE EXISTS (
  SELECT 1
  FROM actor a
  WHERE (c.first_name, c.last_name)
      = (a.first_name, a.last_name)
)

All of the above, again, yield:

first_name   last_name
------------------------------------
JENNIFER     DAVIS

EXCEPT

The EXCEPT operation is useful when you want to keep only those tuples that are present in one set, but not in another:

set-difference

Running this query:

SELECT first_name, last_name
FROM customer
EXCEPT
SELECT first_name, last_name
FROM staff
ORDER BY 1, 2

… will yield:

first_name   last_name
------------------------------------
AARON        SELBY
ADAM         GOOCH
ADRIAN       CLARY
AGNES        BISHOP
ALAN         KAHN
ALBERT       CROUSE
ALBERTO      HENNING
...

According to the original Venn diagrams, this can be tweaked using LEFT JOIN and an IS NULL predicate:

SELECT first_name, last_name
FROM customer
LEFT JOIN staff
  USING (first_name, last_name)
WHERE staff_id IS NULL
ORDER BY 1, 2

or with an ON clause:

SELECT c.first_name, c.last_name
FROM customer c
LEFT JOIN staff s
  ON (c.first_name, c.last_name)
  =  (s.first_name, s.last_name)
WHERE staff_id IS NULL
ORDER BY 1, 2

This is completely unreadable and doesn’t communicate the fact that we’re removing tuples from a set CUSTOMER, given their presence in another set STAFF.

An equivalent version using anti-join might be more readable (watch out for NULLs in NOT IN predicates, though!):

SELECT c.first_name, c.last_name
FROM customer c
WHERE (first_name, last_name) NOT IN (
  SELECT first_name, last_name
  FROM staff
)
ORDER BY 1, 2

… or, using NOT EXISTS:

SELECT c.first_name, c.last_name
FROM customer c
WHERE NOT EXISTS (
  SELECT 1
  FROM staff s
  WHERE (c.first_name, c.last_name)
      = (s.first_name, s.last_name)
)
ORDER BY 1, 2

Conclusion

UNION, INTERSECT, and EXCEPT are very simple, yet very useful operations that can add a lot of value every now and then in your daily SQL tasks. While JOIN operations are much more versatile, they are also more complex for the simple tasks that can be solved by UNION, INTERSECT, and EXCEPT

Did you like this article? It’s part of the Data Geekery SQL Training – a 1-day workshop helping you to get the most out of the awesome SQL language.

Read more articles about awesome SQL here:

No CROSS JOIN in MS Access


For the upcoming jOOQ 3.3, we’re now integrating support for the MS Access database through the JDBC-ODBC bridge, which is included in the JDK up until Java SE 7. Note that it will be removed in Java 8! Alternative access to access databases (pun intended) can be obtained through a hack involving ucanaccess, which is basically combining the HSQLDB parser with Jackcess, a low-level I/O library for MS Access database files.

MS Access is still an immensely popular relational database, ranking in the top 10 at db-engines.com’s DBMS ranking. Yet it has quirks. Many of them. One is the fact that there is no formal CROSS JOIN operation. Remember, most databases support explicit CROSS JOINing as such:

SELECT     p1.name player1, 
           p2.name player2
FROM       player p1 
CROSS JOIN player p2

The above query will generate all the pairings between two players.
The same can be written in pre-ANSI SQL-92 with comma-separated table lists:

SELECT     p1.name player1, 
           p2.name player2
FROM       player p1, 
           player p2

The first syntax, however, is more powerful and more expressive as it can be used in nested JOIN expressions and it shows your intent more clearly. A nice example was given here in a previous blog post.

How to work around a missing CROSS JOIN

Usually, missing support for CROSS JOIN can be emulated trivially using an INNER JOIN with a TRUE predicate as such:

SELECT     p1.name player1, 
           p2.name player2
FROM       player p1 
JOIN       player p2
ON         1 = 1

This is what jOOQ does for the Sybase Adaptive Server Enterprise database. But this doesn’t work for MS Access, because the JOIN operation there explicitly requires a column reference from each of the two tables, one on each side of the comparison. The documentation reads:

Syntax:

FROM table1 INNER JOIN table2 
ON table1.field1 compopr table2.field2

This is quite a bummer from many points of view, not only for CROSS JOIN emulation. Given that any ANSI-92 JOIN syntax can be transformed into an ANSI-86 join expression (table list in the FROM clause and all predicates in the WHERE clause), it is also a bit surprising.
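
For instance, an ordinary inner join (here with hypothetical PLAYER and TEAM tables) transforms mechanically into the older syntax:

-- ANSI-92 style:
SELECT p.name player_name, t.name team_name
FROM player p
JOIN team t ON p.team_id = t.team_id

-- Equivalent ANSI-86 style:
SELECT p.name player_name, t.name team_name
FROM player p, team t
WHERE p.team_id = t.team_id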

A simple workaround that seems to work for some use-cases is to take a numeric column from each table and multiply both by zero:

SELECT     p1.name player1, 
           p2.name player2
FROM       player p1 
JOIN       player p2
ON         p1.id * 0 = p2.id * 0

But if the database itself is already this quirky, I suspect that it might not be able to optimise the above SQL.

In short…

MS Access does not support CROSS JOIN. For the time being, try to work around it using comma-separated table lists, while we work out a more sophisticated SQL transformation in jOOQ.

The Myth About Slow SQL JOIN Operations


In my recent SQL work for a large Swiss bank, I have maintained nested database view monsters whose unnested SQL code amounted to some 5k lines of code, joining the same table over and over again in separate subselects combined via UNION operations. This monster performed in well under 50ms, no matter how we queried it (see “10 more common mistakes” about the speed of queries). Of course, this performance was only achieved after lots of fine-tuning, load-testing and benchmarking. But it worked. Our Oracle database never let us down on these things.

Nonetheless, many SQL users think that JOIN operations are slow. Why? Perhaps because they are, or at least used to be, in MySQL? I’m currently reading this interesting book by Markus Winand, called SQL Performance Explained. He’s also the author of Use-The-Index-Luke.com, where you can get free insight into his book. I still recommend reading the whole book, though. Even SQL old-timers and SQL nerds like me will find 1-2 novel, very interesting approaches, some of which will be incorporated into jOOQ very soon!

In particular, consider this page which explains very well how Hash JOIN operations work:
http://use-the-index-luke.com/sql/join/hash-join-partial-objects

You never stop learning about Oracle features


Oracle‘s name is no coincidence. It is truly an oracle, telling you mystical, secret things about your database. It may take great expertise and skill to optimally extract that knowledge from this monster. While it implements vast parts of the SQL:2008 standard, it ships with a lot of features no other database has, but which are likely to make it into the SQL:2011 and subsequent standards, eventually.

One particular syntax clause is the PARTITION BY clause. It is very similar to the GROUP BY clause, but it can appear in various other contexts. The most trivial one is that of window functions / analytical functions. The most advanced one is that of the MODEL clause (which I shall blog about soon). But one of the most hidden ones is that of the OUTER JOIN clause. With Oracle, you can issue a “partitioned outer join”.
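
Before looking at the outer join variant, a quick reminder of what PARTITION BY looks like in its most familiar, window function context, e.g. counting each actor’s films in the Sakila FILM_ACTOR table:

SELECT actor_id, film_id,
       count(*) OVER (PARTITION BY actor_id) film_count
FROM film_actor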

The formal definition of Oracle’s OUTER JOIN clause, including the optional query_partition_clause, can be found in the Oracle SQL Language Reference.

The query_partition_clause is a regular “PARTITION BY expr” clause. This useful addition can be put to the left or to the right of the actual OUTER JOIN clause, allowing you to specify how to partition either side such that there is at least one (possibly empty) record for every partition, even if the OUTER JOIN does not produce any records for that partition. Whew.

In more intelligible English: this clause helps fill your result set with blank records. There!
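
Here’s a sketch of what that looks like, assuming hypothetical SALES(PRODUCT, SALE_DATE, AMOUNT) and CALENDAR(CAL_DAY) tables. The PARTITION BY clause makes the outer join produce a row for every (product, day) combination, even on days for which a product has no sales:

-- For every product, produce one row per calendar day,
-- with a NULL amount on days without sales:
SELECT s.product, c.cal_day, s.amount
FROM sales s
  PARTITION BY (s.product)
RIGHT OUTER JOIN calendar c
  ON s.sale_date = c.cal_day
ORDER BY s.product, c.cal_day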

So, whenever you feel you need to fill in some blank rows with Oracle, remember that you might just have found yourself a use-case for the partitioned outer join! (This clause will be supported by jOOQ 2.5.0)

Let’s revise the SQL FROM clause


Intuitive SQL

SQL is extremely simple and yet at times, tricky. Most SQL developers have an intuitive (as opposed to formal) understanding of how the language works, for two reasons:

  1. It is designed “intuitively”, like a natural language. Maybe that keeps us from studying it more formally
  2. The formal language specification is not really freely available (except SQL-92), nor easy to read, nor necessary to get started

Of course, there are also those to whom SQL is a mystery. Let’s ignore those and have a look at how a simple SELECT statement works:

SELECT *
FROM my_table
JOIN my_other_table ON id1 = id2
JOIN my_third_table ON id2 = id3
WHERE ...

This looks like we’re using 4 of the SELECT statement’s many clauses:

  • SELECT: specifying the projection
  • FROM: specifying the “main” table
  • JOIN: specifying several “joined” tables
  • WHERE: specifying the selection / filtering

Misinterpretation of the “JOIN clause”

But in fact, our intuition is slightly wrong. Of course, you might already know this, but I feel it’s something we don’t think about every day… How are we wrong? We tend to think there is a formal “JOIN” clause in a SELECT statement. There isn’t. Here’s a simplified T-SQL definition:

<query_specification> ::= 
SELECT [ ALL | DISTINCT ] < select_list > 
    [ FROM { <table_source> } [ ,...n ] ] 
    [ WHERE <search_condition> ] 
    [ <GROUP BY> ] 
    [ HAVING < search_condition > ] 

As you can see, there is no explicit JOIN clause in a SELECT statement, even if we tend to phrase our SQL in a way that would suggest there is. Let’s have a closer look at the (simplified) definition of <table_source>:

<table_source> ::= 
{
    table_or_view_name [ [ AS ] table_alias ]
    | user_defined_function [ [ AS ] table_alias ]
    | derived_table [ AS ] table_alias [ ( column_alias [ ,...n ] ) ] 
    | <joined_table> 
}
<joined_table> ::= 
{
    <table_source> <join_type> <table_source> ON <search_condition> 
    | <table_source> CROSS JOIN <table_source> 
    | [ ( ] <joined_table> [ ) ] 
}
<join_type> ::= 
    [ { INNER | { { LEFT | RIGHT | FULL } [ OUTER ] } } [ <join_hint> ] ]
    JOIN

So in fact, the true reading of our above SQL statement goes like this:

SELECT *
FROM ((my_table JOIN my_other_table ON id1 = id2)
                JOIN my_third_table ON id2 = id3)
WHERE ...

We’re actually creating a tree of table sources, first by joining “my_other_table” to “my_table” and then by joining “my_third_table” to the previously created table source. When this is understood, the following statement will no longer be surprising:

SELECT *
FROM t1 JOIN t2 ON id1 = id2,
     another_table,
     t3 JOIN t4 ON id3 = id4,
     yet_another_table,
     t5 JOIN t6 ON id5 = id6
WHERE ...

Here, we select from an overall cross product of 5 table sources, some of which are actual tables, some of which are created from joined tables…
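
In other words, sticking with the same hypothetical tables, the comma-separated list above could just as well be spelled out as explicit CROSS JOINs between the five table sources:

SELECT *
FROM (t1 JOIN t2 ON id1 = id2)
CROSS JOIN another_table
CROSS JOIN (t3 JOIN t4 ON id3 = id4)
CROSS JOIN yet_another_table
CROSS JOIN (t5 JOIN t6 ON id5 = id6)
WHERE ...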

Implications for jOOQ

In jOOQ, this is about to be supported for release 2.0.3. The org.jooq.Table API will be enhanced with several join methods. You will be able to write things like this:

// Create a specific type of table source
Table<Record> tableSource =
  MY_TABLE.join(MY_OTHER_TABLE).on(ID1.equal(ID2))
          .join(MY_THIRD_TABLE).on(ID2.equal(ID3));

// Use the above table source in a select statement
create.select()
      .from(tableSource)
      .where(...);

// Use several "complex" table sources
create.select()
      .from(T1.join(T2).on(ID1.equal(ID2)),
            ANOTHER_TABLE,
            T3.join(T4).on(ID3.equal(ID4)),
            YET_ANOTHER_TABLE,
            T5.join(T6).on(ID5.equal(ID6)))
      .where(...);

The existing syntax will be maintained as a fully compatible convenience API. These additions are an important step forward for creating more complex table sources, also involving Oracle (and SQL Server) PIVOT and UNPIVOT tables, about which I will blog in one of the next posts. Stay tuned!