Beautiful SQL: Lateral Unnesting of Array Columns


Sometimes, SQL can just be so beautiful. One of the less mainstream features in SQL is the array type (or nested collections). In fact, it’s so not mainstream that only 2 major databases actually support it: Oracle and PostgreSQL (and HSQLDB and H2 in the Java ecosystem).

In PostgreSQL, you can write:

CREATE TABLE blogs (
  id    SERIAL NOT NULL PRIMARY KEY,
  title text   NOT NULL,
  tags  text[]
)

Or in Oracle:

-- Oracle only knows nominal array types, so we have to declare
-- them in advance
CREATE TYPE tag_t AS VARRAY(100) OF VARCHAR2(100 CHAR);

CREATE TABLE blogs (
  id    NUMBER(18) GENERATED BY DEFAULT AS IDENTITY 
                   NOT NULL PRIMARY KEY,
  title VARCHAR2(100 CHAR) NOT NULL,
  tags  tag_t
)

So, roughly the same thing. Now, let’s insert some data. How about the 3 most recent posts on the jOOQ blog, prior to this one:

In PostgreSQL:

INSERT INTO blogs (title, tags)
VALUES (
  'How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ',
  ARRAY[
    'implicit cursor',
    'batch',
    'oracle',
    'jooq',
    'jdbc',
    'resultset'
  ]
), (
  'How to Execute SQL Batches With JDBC and jOOQ',
  ARRAY[
    'batch',
    'batch statement',
    'mysql',
    'jooq',
    'jdbc',
    'sql server',
    'sql'
  ]
), (
  'How to Emulate Partial Indexes in Oracle',
  ARRAY[
    'optimisation',
    'index',
    'partial index',
    'oracle',
    'sql',
    'postgresql',
    't-sql',
    'sql server'
  ]
)

Or in Oracle:

INSERT INTO blogs (title, tags)
VALUES (
  'How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ',
  tag_t(
    'implicit cursor',
    'batch',
    'oracle',
    'jooq',
    'jdbc',
    'resultset'
  ));
INSERT INTO blogs (title, tags)
VALUES (
  'How to Execute SQL Batches With JDBC and jOOQ',
  tag_t(
    'batch',
    'batch statement',
    'mysql',
    'jooq',
    'jdbc',
    'sql server',
    'sql'
  ));
INSERT INTO blogs (title, tags)
VALUES (
  'How to Emulate Partial Indexes in Oracle',
  tag_t(
    'optimisation',
    'index',
    'partial index',
    'oracle',
    'sql',
    'postgresql',
    't-sql',
    'sql server'
  ));

Now, the array type by itself is not very useful. It gets really interesting when we unnest it back into a table. For instance, in PostgreSQL:

SELECT title, tag
FROM blogs, LATERAL unnest(tags) AS tags(tag);

Or in Oracle:

-- Classic style
SELECT title, tags.*
FROM blogs, TABLE(tags) tags;

-- Since Oracle 12c
SELECT title, tags.*
FROM blogs, LATERAL (SELECT * FROM TABLE(tags)) tags;

Note that we’re using the keyword LATERAL in some of the above queries. For those of you who are used to T-SQL syntax, it’s almost the same thing as APPLY. Both LATERAL and APPLY are also very useful with table valued functions (stay tuned for a blog post on those).
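
As a small teaser, here's a sketch of that (assuming PostgreSQL and its built-in regexp_split_to_table() function): the table valued function on the right side of LATERAL can consume a column from the table on its left:

-- Splitting each title into its words, one row per word
SELECT b.id, w.word
FROM blogs AS b
CROSS JOIN LATERAL regexp_split_to_table(b.title, ' ') AS w(word);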

The idea behind LATERAL is that the table (derived table, subquery, function call, array unnesting) on the right side of LATERAL can “laterally” access stuff from the left side of LATERAL in order to produce new tables. In the above query, we’re producing a new table of tags for each blog post, and then we cross join the two tables.

Here’s what the above queries result in:

title                                                         tag
-----------------------------------------------------------------------------
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   implicit cursor
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   batch
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   oracle
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   jooq
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   jdbc
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   resultset
How to Execute SQL Batches With JDBC and jOOQ                 batch
How to Execute SQL Batches With JDBC and jOOQ                 batch statement
How to Execute SQL Batches With JDBC and jOOQ                 mysql
How to Execute SQL Batches With JDBC and jOOQ                 jooq
How to Execute SQL Batches With JDBC and jOOQ                 jdbc
How to Execute SQL Batches With JDBC and jOOQ                 sql server
How to Execute SQL Batches With JDBC and jOOQ                 sql
How to Emulate Partial Indexes in Oracle                      optimisation
How to Emulate Partial Indexes in Oracle                      index
How to Emulate Partial Indexes in Oracle                      partial index
How to Emulate Partial Indexes in Oracle                      oracle
How to Emulate Partial Indexes in Oracle                      sql
How to Emulate Partial Indexes in Oracle                      postgresql
How to Emulate Partial Indexes in Oracle                      t-sql
How to Emulate Partial Indexes in Oracle                      sql server

You can immediately see the cross join semantics here, as we’re combining each tag (per post) with its post.

Looking for ordinals (i.e. the tag number inside of the array) along with the array? Easy!

Just add the powerful WITH ORDINALITY clause after the UNNEST() call in PostgreSQL:

SELECT title, tag, ordinal
FROM blogs, LATERAL unnest(tags) WITH ORDINALITY AS tags(tag, ordinal);

A bit more complicated to emulate in Oracle:

-- Fancy, with a window function
SELECT title, tags.*
FROM blogs, LATERAL (
  SELECT tags.*, ROW_NUMBER() OVER (ORDER BY NULL) AS ordinal
  FROM TABLE(tags) tags
) tags;

-- Classic, with ROWNUM
SELECT title, tags.*
FROM blogs, LATERAL (
  SELECT tags.*, ROWNUM AS ordinal
  FROM TABLE(tags) tags
) tags;

The result now contains the tag “ID”, i.e. the ordinal of the tag inside of the array:

title                                           tag               ordinal
-------------------------------------------------------------------------
How to Fetch ... Cursors with JDBC and jOOQ     implicit cursor   1
How to Fetch ... Cursors with JDBC and jOOQ     batch             2
How to Fetch ... Cursors with JDBC and jOOQ     oracle            3
How to Fetch ... Cursors with JDBC and jOOQ     jooq              4
How to Fetch ... Cursors with JDBC and jOOQ     jdbc              5
How to Fetch ... Cursors with JDBC and jOOQ     resultset         6
How to Execute SQL Batches With JDBC and jOOQ   batch             1
How to Execute SQL Batches With JDBC and jOOQ   batch statement   2
How to Execute SQL Batches With JDBC and jOOQ   mysql             3
How to Execute SQL Batches With JDBC and jOOQ   jooq              4
How to Execute SQL Batches With JDBC and jOOQ   jdbc              5
How to Execute SQL Batches With JDBC and jOOQ   sql server        6
How to Execute SQL Batches With JDBC and jOOQ   sql               7
How to Emulate Partial Indexes in Oracle        optimisation      1
How to Emulate Partial Indexes in Oracle        index             2
How to Emulate Partial Indexes in Oracle        partial index     3
How to Emulate Partial Indexes in Oracle        oracle            4
How to Emulate Partial Indexes in Oracle        sql               5
How to Emulate Partial Indexes in Oracle        postgresql        6
How to Emulate Partial Indexes in Oracle        t-sql             7
How to Emulate Partial Indexes in Oracle        sql server        8

Now, imagine looking for those blog posts that are tagged “jooq”. Easy!

PostgreSQL:

SELECT title
FROM blogs
WHERE 'jooq' = ANY(tags);

Oracle:

SELECT title
FROM blogs
WHERE 'jooq' IN (SELECT * FROM TABLE(tags));

Yielding:

title
-----------------------------------------------------------
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ
How to Execute SQL Batches With JDBC and jOOQ
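
By the way, in PostgreSQL, the same filter can also be written with the array containment operator, which checks whether tags contains all elements of the right-hand array:

SELECT title
FROM blogs
WHERE tags @> ARRAY['jooq'];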

Conclusion

These are just a few nice things we can do when we denormalise our data into nested collections / arrays, and then use features like UNNEST to bring them back to the table level. Both Oracle and PostgreSQL support a variety of really nice features building on top of arrays, so do check them out!
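
For example (a quick PostgreSQL sketch), unnesting composes nicely with re-aggregation via array_agg(), taking us from arrays to rows and back again:

SELECT title, array_agg(tag) AS tags
FROM blogs, LATERAL unnest(tags) AS t(tag)
GROUP BY title;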

How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ


Earlier this week, I blogged about how to execute SQL batches with JDBC and jOOQ. This was useful for the MySQL, SQL Server, and Sybase users among you.

Today, we’ll discuss a slightly more difficult task: how to fetch Oracle 12c implicit cursors – which are essentially the same thing.

What’s an implicit cursor?

Oracle 12c added new procedures to its dynamic SQL API DBMS_SQL. Just run the following block in SQL Developer to see the results:

DECLARE
  c1 sys_refcursor;
  c2 sys_refcursor;
BEGIN
  OPEN c1 FOR SELECT 1 AS a FROM dual;
  dbms_sql.return_result(c1);
  OPEN c2 FOR SELECT 2 AS b FROM dual;
  dbms_sql.return_result(c2);
END;

The anonymous PL/SQL block contains two cursors that are opened and returned to whoever calls this block using DBMS_SQL.RETURN_RESULT. This is kind of magic, as we’re calling a procedure, passing a cursor to it, and somehow, this has a side effect on the client of this program after the program ends.

Not only can you do this in anonymous PL/SQL blocks, you can of course also nest these calls in any procedure. In other words, from Oracle 12c onwards, when you call a procedure, you can no longer know for sure whether there will be more results than what you can see. For instance:

BEGIN
  any_procedure();
END;

The above call might just as well yield some implicit cursors. You can’t know for sure.
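
For illustration, here's what such a hypothetical procedure might look like. Note that nothing in its signature hints at the cursor that will reach the client:

-- A hypothetical sketch of any_procedure()
CREATE OR REPLACE PROCEDURE any_procedure AS
  c sys_refcursor;
BEGIN
  OPEN c FOR SELECT 1 AS a FROM dual;
  dbms_sql.return_result(c);
END;
/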

How to discover implicit cursors with JDBC

With JDBC, if you don’t know for sure what your query will yield as a result, you use the Statement.execute(String), or the PreparedStatement.execute() method to find out. As mentioned in the previous post, this is what you would do:

try (PreparedStatement s = c.prepareStatement(sql)) {
    fetchLoop:
    for (int i = 0, updateCount = 0;; i++) {
        boolean result = (i == 0)
            ? s.execute()
            : s.getMoreResults();
 
        if (result)
            try (ResultSet rs = s.getResultSet()) {
                System.out.println("\nResult:");
 
                while (rs.next())
                    System.out.println("  " + rs.getInt(1));
            }
        else if ((updateCount = s.getUpdateCount()) != -1)
            System.out.println("\nUpdate Count: " + updateCount);
        else
            break fetchLoop;
    }
}

Unfortunately, that won’t work on Oracle, as Oracle’s JDBC driver doesn’t implement the JDBC spec correctly. I’ve documented this flaw at length in this Stack Overflow question here.

Using ojdbc, the following “improved” loop needs to be written:

/* Alternatively, use this for non-PreparedStatements:
try (Statement s = cn.createStatement()) {
    Boolean result = s.execute(sql); */
try (PreparedStatement s = cn.prepareStatement(sql)) {
    // Use good old three-valued boolean logic
    Boolean result = s.execute();

    fetchLoop:
    for (int i = 0;; i++) {

        // Check for more results if not already done in 
        // this iteration
        if (i > 0 && result == null)
            result = s.getMoreResults();
        System.out.println(result);

        if (result) {
            result = null;

            try (ResultSet rs = s.getResultSet()) {
                System.out.println("Fetching result " + i);
            }
            catch (SQLException e) {
                // Ignore ORA-17283: No resultset available (1)
                if (e.getErrorCode() == 17283)
                    continue fetchLoop;
                else
                    throw e;
            }
        }
        else if (s.getUpdateCount() == -1)
            // Ignore -1 value if there is one more result! (2)
            if (result = s.getMoreResults())
                continue fetchLoop;
            else
                break fetchLoop;
    }
}

Two elements of the above logic need more explanation:

  1. There’s a possibility of an ORA-17283: No resultset available error being raised when accessing the Statement.getResultSet() despite the previous call to Statement.execute() yielding true. If that happens, we’ll just ignore the error and try fetching another result set
  2. In case we’re using PreparedStatement, the original call to PreparedStatement.execute() will yield false (!) and the Statement.getUpdateCount() value is -1, which would normally mean that we should stop. Not in this case. Let’s just try one more time to get a result set, and tah-dah, here are our implicit result sets.

Note that the algorithm now works with both static Statement and PreparedStatement, which (very unfortunately) behave differently when calling execute().

The above will now work with any SQL statement. In case you’re using the previous SQL statement returning implicit cursors:

String sql =
    "\nDECLARE"
  + "\n  c1 sys_refcursor;"
  + "\n  c2 sys_refcursor;"
  + "\nBEGIN"
  + "\n  OPEN c1 FOR SELECT 1 AS a FROM dual;"
  + "\n  dbms_sql.return_result(c1);"
  + "\n  OPEN c2 FOR SELECT 2 AS a FROM dual;"
  + "\n  dbms_sql.return_result(c2);"
  + "\nEND;";

… you will now be able to fetch all the results:

true
true
Fetching result 1
true
Fetching result 2
false

How to get those cursors with jOOQ?

With jOOQ 3.10 (as always), you don’t need to worry about those low level JDBC details. Just call the following code:

System.out.println(
    DSL.using(cn).fetchMany(sql)
);

And you’ll get a convenient, object oriented representation of your multiple result sets in the form of an org.jooq.Results:

Result set:
+----+
|   A|
+----+
|   1|
+----+
Result set:
+----+
|   A|
+----+
|   2|
+----+

Even better, when you use a code generator to return multiple implicit cursors like this in a stored procedure, just call the generated stored procedure object like this, to get all the cursors automatically:

MyProcedure p = new MyProcedure();
p.setParameter1(x);
p.setParameter2(y);
p.execute(configuration);
Results results = p.getResults();

for (Result<?> result : results)
  for (Record record : result)
    System.out.println(record);

Done!

How to Execute SQL Batches With JDBC and jOOQ


Some databases (in particular MySQL and T-SQL databases like SQL Server and Sybase) support a very nice feature: They allow for running a “batch” of statements in a single statement. For instance, in SQL Server, you can do something like this:

-- Statement #1
DECLARE @table AS TABLE (id INT);

-- Statement #2
SELECT * FROM @table;

-- Statement #3
INSERT INTO @table VALUES (1),(2),(3);

-- Statement #4
SELECT * FROM @table;

This is a batch of 4 statements, and it can be executed as a single statement both with JDBC and with jOOQ. Let’s see how:

Executing a batch with JDBC

Unfortunately, the term “batch” has several meanings, and in this case, I don’t mean the JDBC Statement.addBatch() method, which is actually a bit clumsy as it doesn’t allow for fetching mixed update counts and result sets.

Instead, what I’ll be doing is this:

String sql =
    "\n  -- Statement #1                              "
  + "\n  DECLARE @table AS TABLE (id INT);            "
  + "\n                                               "
  + "\n  -- Statement #2                              "
  + "\n  SELECT * FROM @table;                        "
  + "\n                                               "
  + "\n  -- Statement #3                              "
  + "\n  INSERT INTO @table VALUES (1),(2),(3);       "
  + "\n                                               "
  + "\n  -- Statement #4                              "
  + "\n  SELECT * FROM @table;                        ";

try (PreparedStatement s = c.prepareStatement(sql)) {
    fetchLoop:
    for (int i = 0, updateCount = 0;; i++) {
        boolean result = (i == 0)
            ? s.execute()
            : s.getMoreResults();

        if (result)
            try (ResultSet rs = s.getResultSet()) {
                System.out.println("\nResult:");

                while (rs.next())
                    System.out.println("  " + rs.getInt(1));
            }
        else if ((updateCount = s.getUpdateCount()) != -1)
            System.out.println("\nUpdate Count: " + updateCount);
        else
            break fetchLoop;
    }
}

The output of the above program being:

Result:

Update Count: 3

Result:
  1
  2
  3

The above API usage is a somewhat “hidden”, or at least not everyday, usage of the JDBC API. Mostly, you’ll be using Statement.executeQuery() when you’re expecting a ResultSet, or Statement.executeUpdate() otherwise.

But in our case, we don’t really know what’s happening. We’re going to discover the result types on the fly, when executing the statement. Here are the main JDBC API features that we’re using, along with an explanation:

  • Statement.execute(): This method should be used if we don’t know the result type. The method returns a boolean, which is true when the first statement in the batch produced a ResultSet and false otherwise.
  • Statement.getMoreResults(): This method returns the same kind of boolean value as the previous Statement.execute() method, but it does so for the next statement in the batch (i.e. for every statement except the first).
  • If the current result is a ResultSet (the boolean was true), then we’ll obtain that ResultSet through Statement.getResultSet() (we can obviously no longer call the usual Statement.executeQuery() to obtain the ResultSet).
  • If the current result is not a ResultSet (the boolean was false), then we’ll check the update count value through Statement.getUpdateCount().
  • If the update count is -1, then we’ve reached the end of the batch.

What a nice state machine!

The nice thing about this is that a batch may be completely nondeterministic. E.g. there may be triggers, T-SQL blocks (e.g. an IF statement), stored procedures, and many other things that contribute result sets and/or update counts. In some cases, we simply don’t know what we’ll get.
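
For instance, here's a contrived T-SQL sketch whose outcome depends on the time of execution: it produces either a result set or just an update count:

DECLARE @t AS TABLE (id INT);

-- Before noon, this batch yields a result set.
-- After noon, it yields an update count.
IF DATEPART(hour, GETDATE()) < 12
  SELECT * FROM @t;
ELSE
  INSERT INTO @t VALUES (1);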

Executing a batch with jOOQ

It’s great that the JDBC API designers have thought of this exotic API usage on a rather low level. This makes JDBC extremely powerful. But who remembers the exact algorithm all the time? After all, the above minimalistic version required around 20 lines of code for something as simple as that.

Compare this to the following jOOQ API usage:

System.out.println(
    DSL.using(c).fetchMany(sql)
);

The result being:

Result set:
+----+
|  id|
+----+
Update count: 3
Result set:
+----+
|  id|
+----+
|   1|
|   2|
|   3|
+----+

Huh! Couldn’t get much simpler than that! Let’s walk through what happens:

The DSLContext.fetchMany() method is intended for use when users know there will be many result sets and/or update counts. Unlike JDBC which reuses ordinary JDBC API, jOOQ has a different API here to clearly distinguish between behaviours. The method then eagerly fetches all the results and update counts in one go (lazy fetching is on the roadmap with issue #4503).

The resulting type is org.jooq.Results, a type that extends List<Result>, which allows for iterating over the results only, for convenience. If a mix of results and update counts needs to be consumed, the Results.resultsOrRows() method can be used.

How to Emulate Partial Indexes in Oracle


A very interesting feature of the SQL Server and PostgreSQL databases (and some others, including SQLite) is the partial index (sometimes also called “filtered index”). That’s an index that contains only “parts” of the table data. For instance, we can write the following index in SQL Server and PostgreSQL:

CREATE INDEX i ON message (deleted) WHERE deleted = 1;

Let’s imagine you have a housekeeping job that periodically removes deleted messages. Now, let’s assume you have discovered that only 0.1% of all messages are really deleted, so an index on the DELETED column is very selective if you’re looking for deleted messages.

On the other hand, it’s not selective at all if you’re looking for non-deleted messages, as such a query would return 99.9% of all messages, in case of which a full table scan is more efficient.

So, since the index is never useful for non-deleted messages, why index those messages at all? If we can avoid indexing non-deleted messages, then we can:

  • Save a lot of disk space, as the index will be much smaller
  • Save a lot of time inserting into the messages table, since we don’t have to update the index all the time
  • Save a lot of time scanning the index, since it will contain a lot less blocks

Unfortunately, Oracle doesn’t support this feature

… but “luckily”, Oracle has another controversial “feature”. In Oracle, all indexes are partial indexes, because they don’t contain NULL values. This is probably due to an ancient optimisation (remember, partial indexes are smaller), which occasionally gets into your way, performance wise, if you do want to query for NULL values.
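
A quick sketch of that behaviour, using a hypothetical table t: given an ordinary index on a nullable column, Oracle cannot answer a NULL lookup from the index, because the NULL rows were never indexed:

-- Hypothetical setup: a nullable column with an ordinary index
CREATE TABLE t (deleted NUMBER(1));
CREATE INDEX t_i ON t (deleted);

-- This predicate cannot be satisfied from index t_i alone:
SELECT * FROM t WHERE deleted IS NULL;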

But in this case, it’s useful. Check this out:

CREATE TABLE message(deleted number(1));

CREATE INDEX i ON message (
  CASE WHEN deleted > 0 THEN deleted END
);

The above index is a function-based index, i.e. an index that contains not the value of the deleted column itself, but an expression based on it. Concretely, it contains only deleted values that are strictly greater than zero, because if the value is zero, then it is turned to NULL by the CASE expression, and Oracle doesn’t index NULL values. Check this out:

INSERT INTO message
SELECT DECODE(LEVEL, 1, 1, 0)
FROM dual
CONNECT BY LEVEL < 100000;

The above query is inserting a single row containing a deleted value of 1, and almost 100k rows containing a value of 0. The insert is very quick, because only one row has to be added to the index. The other almost 100k rows are skipped:

EXEC dbms_stats.gather_table_stats('SANDBOX', 'MESSAGE');

SELECT NUM_ROWS, BLOCKS
FROM SYS.ALL_TAB_STATISTICS
WHERE TABLE_NAME = 'MESSAGE';

SELECT NUM_ROWS, LEAF_BLOCKS
FROM SYS.ALL_IND_STATISTICS
WHERE TABLE_NAME = 'MESSAGE';

The result is:

NUM_ROWS       BLOCKS
---------------------
   99999          152 <-- table size

NUM_ROWS  LEAF_BLOCKS
---------------------
       1            1 <-- index size

The “trouble” with this kind of emulation is: It’s a function-based index. We can use this index only if we really reproduce the same “function” (or in this case, expression) as in the index itself. So, in order to fetch all the deleted messages, we must not write the following query:

SELECT *
FROM message
WHERE deleted = 1;

But this one, instead:

SELECT *
FROM message
WHERE CASE WHEN deleted > 0 THEN deleted END = 1;

Check out execution plans:

EXPLAIN PLAN FOR
SELECT *
FROM message
WHERE deleted = 1;

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR
SELECT *
FROM message
WHERE CASE WHEN deleted > 0 THEN deleted END = 1;

SELECT * FROM TABLE(dbms_xplan.display);

The output being:

------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)|
------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         | 50000 |   146K|    44   (3)|
|*  1 |  TABLE ACCESS FULL| MESSAGE | 50000 |   146K|    44   (3)|
------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("DELETED"=1)

And

----------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |     1 |     3 |     2   (0)|
|   1 |  TABLE ACCESS BY INDEX ROWID| MESSAGE |     1 |     3 |     2   (0)|
|*  2 |   INDEX RANGE SCAN          | I       |     1 |       |     1   (0)|
----------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access(CASE  WHEN "DELETED">0 THEN "DELETED" END =1)

As you can see, the first query runs a full table scan, estimating to retrieve 50% of all the rows, when the actual result is only 1 row as can be seen in the second execution plan!

Insertion speed

What’s even more impressive is the difference in insertion speed. Consider the following code block, which measures the time it takes to insert one million times 0 and one million times 1:

SET SERVEROUTPUT ON
DECLARE
  ts TIMESTAMP;
BEGIN
  ts := SYSTIMESTAMP;
  INSERT INTO message 
  SELECT 0 FROM dual CONNECT BY level <= 1000000;
  dbms_output.put_line(SYSTIMESTAMP - ts);
  
  ts := SYSTIMESTAMP;
  INSERT INTO message 
  SELECT 1 FROM dual CONNECT BY level <= 1000000;
  dbms_output.put_line(SYSTIMESTAMP - ts);
END;
/

The result being:

+000000000 00:00:01.501000000
+000000000 00:00:08.162000000

The insertion is much faster if we don’t have to modify the index!

Conclusion

Partial indexes are a very neat trick in cases where your data is highly skewed and some values in a column are extremely rare and very frequently queried. They may drastically reduce the index size, which greatly improves performance in some situations, including inserting into the table, and querying the index.

In Oracle, they can be emulated using function-based indexes, which means you have to use the exact function expression from the index also in queries, in order to profit. But it may well be worth the trouble!

A Probably Incomplete, Comprehensive Guide to the Many Different Ways to JOIN Tables in SQL


Perhaps the most powerful SQL feature is the JOIN operation. It is the envy of all non-relational databases, because the concept is so simple, yet so universally applicable, when you want to “combine” two data sets.

Put simply, when joining two tables, you’re combining every row from one table with every row from another table, for which a certain predicate is true. An illustration from our SQL Masterclass shows it.

[Figure venn-join: Venn diagrams illustrating JOIN operations]

See also our recent article about using Venn diagrams to illustrate JOINs. The above illustration compares INNER JOIN with different OUTER JOIN operations, but those are far from all possibilities that we have. Let’s look at things from a more systematic angle.

Do note that whenever in this article I say “X happens before Y”, I mean that “X happens logically before Y”. The database optimiser may still choose to execute Y before X, because that is faster without changing the outcome. More information about the syntactic / logical order of operations here.

But let’s look into all the join types, individually!

CROSS JOIN

The most basic JOIN operation is really a cartesian product. It just combines every row from one table with every row from another table. The best example for cartesian products is given by Wikipedia, using a deck of cards for illustration where we’re “cross joining” the ranks table with the suits table:

[Figure venn-cross-product: cartesian product of the ranks and suits tables]

In real-world scenarios, CROSS JOIN can be very useful when doing reports, for instance you could generate a set of dates (e.g. days in a month) and cross join that with all the departments in the database, to create a complete days/departments table. Using PostgreSQL syntax:

SELECT *

-- This just generates all the days in January 2017
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)

-- Here, we're combining all days with all departments
CROSS JOIN departments

Imagine we have the following data:

+--------+   +------------+
| day    |   | department |
+--------+   +------------+
| Jan 01 |   | Dept 1     |
| Jan 02 |   | Dept 2     |
| ...    |   | Dept 3     |
| Jan 30 |   +------------+
| Jan 31 |
+--------+

The result would now look like this:

+--------+------------+
| day    | department |
+--------+------------+
| Jan 01 | Dept 1     |
| Jan 01 | Dept 2     |
| Jan 01 | Dept 3     |

| Jan 02 | Dept 1     |
| Jan 02 | Dept 2     |
| Jan 02 | Dept 3     |
| ...    | ...        |
| Jan 31 | Dept 1     |
| Jan 31 | Dept 2     |
| Jan 31 | Dept 3     |
+--------+------------+

Now, for each days/departments combination, you could calculate the daily revenue for that department, or something similar.
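
A sketch of such a report, assuming a hypothetical revenue table with department, day, and amount columns:

SELECT days.day, d.department, COALESCE(SUM(r.amount), 0) AS revenue
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)
CROSS JOIN departments AS d

-- revenue is a hypothetical table (department, day, amount)
LEFT JOIN revenue AS r
  ON r.department = d.department
  AND r.day = days.day
GROUP BY days.day, d.department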

Characteristics

A CROSS JOIN is a cartesian product, i.e. a product as in “multiplication”. The mathematical notation uses the multiplication sign to express this operation: A × B, or in our case: days × departments.

Just like with “ordinary” arithmetic multiplication, if one of the two tables is empty (size zero), then the result will also be empty (size zero). This makes total sense. If we combine the previous 31 days with 0 departments, we’ll get 0 days/departments combinations. Likewise, if we combine an empty date range with any number of departments, we also get 0 days/departments combinations.

In other words:

size(result) = size(days) * size(departments)

Alternative syntaxes

In the old days, before the ANSI JOIN syntax was introduced to SQL, people just wrote comma-separated lists of tables in the FROM clause to write a CROSS JOIN. The above query is equivalent to this one:

SELECT *
FROM 
  generate_series(
    '2017-01-01'::TIMESTAMP,
    '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
    INTERVAL '1 day'
  ) AS days(day),
  departments

In general, I strongly recommend using the CROSS JOIN keyword instead of comma-separated lists of tables, because if you intentionally want to do a CROSS JOIN, nothing communicates this intent (to the next developer!) better than using the actual keywords. So many things can go wrong, accidentally, with comma-separated lists of tables, including accidental cross joins. You don’t want those!

INNER JOIN (Theta-JOIN)

Building on top of the previous CROSS JOIN operation, INNER JOIN (or just simply JOIN, sometimes also called "THETA" JOIN) allows for filtering the outcome of the cartesian product by some predicate. Most of the times, we’re putting that predicate in an ON clause, and it could be something like this:

SELECT *

-- Same as before
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)

-- Now, exclude all days/departments combinations for
-- days before the department was created
JOIN departments AS d ON day >= d.created_at

In most databases, the INNER keyword is optional, so I’m just omitting it in this article.

Notice how the INNER JOIN operation allows for putting arbitrary predicates in the ON clause, which is again very useful when doing reporting. Just like in the previous CROSS JOIN example, we’re combining all days with all departments, but then we retain only those days/departments combinations for which the department already existed, i.e. for which the department creation preceded the day.

Again, using this data:

+--------+   +------------+------------+
| day    |   | department | created_at |
+--------+   +------------+------------+
| Jan 01 |   | Dept 1     | Jan 10     |
| Jan 02 |   | Dept 2     | Jan 11     |
| ...    |   | Dept 3     | Jan 12     |
| Jan 30 |   +------------+------------+
| Jan 31 |
+--------+

The result would now look like this:

+--------+------------+
| day    | department |
+--------+------------+
| Jan 10 | Dept 1     |

| Jan 11 | Dept 1     |
| Jan 11 | Dept 2     |

| Jan 12 | Dept 1     |
| Jan 12 | Dept 2     |
| Jan 12 | Dept 3     |

| Jan 13 | Dept 1     |
| Jan 13 | Dept 2     |
| Jan 13 | Dept 3     |
| ...    | ...        |
| Jan 31 | Dept 1     |
| Jan 31 | Dept 2     |
| Jan 31 | Dept 3     |
+--------+------------+

So, we’d get no results prior to January 10, those rows were filtered out.

Characteristics

INNER JOIN operations are filtered CROSS JOIN operations. This means that if one of the two tables is empty, then the result is also guaranteed to be empty. But unlike with CROSS JOIN, because of the predicate, we may get fewer results than the CROSS JOIN would deliver.

In other words:

size(result) <= size(days) * size(departments)

Alternative syntaxes

While the ON clause is mandatory with the INNER JOIN operation, you’re not required to place JOIN predicates in there (although it is highly recommended from a readability perspective). Most databases will optimise the following, equivalent query in just the same way:

SELECT *
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)

-- You can always JOIN .. ON true (or 1 = 1 in other DBs)
-- to turn a syntactic INNER JOIN into a semantic CROSS JOIN
JOIN departments AS d ON true

-- ... and then turn the CROSS JOIN back into an INNER JOIN
-- by putting the JOIN predicate in the WHERE clause:
WHERE day >= d.created_at

Of course, again, that is just obfuscating the query for the reader, but you may have your reasons, right? If we take this one step further, the following query is also equivalent, because most optimisers can figure out the equivalence and execute an INNER JOIN instead:

SELECT *
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)

-- Now, this is really a syntactic CROSS JOIN
CROSS JOIN departments AS d
WHERE day >= d.created_at

… and, as seen before, CROSS JOIN is just syntax sugar for comma-separated table lists. In this case, we retain the WHERE clause to get what people have frequently done prior to the introduction of ANSI JOIN syntax:

SELECT *
FROM
  generate_series(
    '2017-01-01'::TIMESTAMP,
    '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
    INTERVAL '1 day'
  ) AS days(day),
  departments AS d
WHERE day >= d.created_at

All of these syntaxes do the same thing, usually without performance penalty, but clearly, they’re all less readable than the original INNER JOIN syntax.

EQUI JOIN

Sometimes, e.g. in literature, you will hear the term EQUI JOIN where "EQUI" isn’t really meant as a SQL keyword, but just as a specific way of writing a special kind of INNER JOIN.

In fact, it is weird that "EQUI" JOIN is the special case, because it’s what we’re doing most in SQL, also in OLTP applications where we simply JOIN by primary key / foreign key relationship. For instance:

SELECT *
FROM actor AS a
JOIN film_actor AS fa ON a.actor_id = fa.actor_id
JOIN film AS f ON f.film_id = fa.film_id

The above query selects all actors and their films. There are two INNER JOIN operations, one connecting actors to the corresponding entries in the film_actor relationship table (because actors can have many films and films can have many actors), and the other connecting each film_actor relationship with additional information about the film itself.

Characteristics

The characteristics of this operation are the same as those of the “general” INNER JOIN operation. An "EQUI" JOIN is still a cartesian product (CROSS JOIN) with a reduced set of results, i.e. containing only those actor/film combinations for which a given actor actually played in the given film.

So again, in other words:

size(result) <= size(actor) * size(film)

The result size can only be equal to the actor size multiplied by the film size, if every actor played in every film, which is unlikely.

Alternative syntaxes: USING

Again, as before, we could write the INNER JOIN operation without an actual INNER JOIN syntax, but CROSS JOIN or comma-separated table lists instead. That’s boring, but much more interesting are the following two alternative syntaxes, one of which is very very useful:

SELECT *
FROM actor
JOIN film_actor USING (actor_id)
JOIN film USING (film_id)

The USING clause replaces the ON clause and allows for listing a set of columns that have to be present in both sides of the JOIN operation. If you carefully design your database in the same way as the Sakila database, namely such that every foreign key column has the same name as its referenced primary key column (e.g. actor.actor_id = film_actor.actor_id), then you can use USING for "EQUI" JOIN in at least these databases:

  • Derby
  • Firebird
  • HSQLDB
  • Ingres
  • MariaDB
  • MySQL
  • Oracle
  • PostgreSQL
  • SQLite
  • Vertica

These databases, unfortunately, don’t support the syntax:

  • Access
  • Cubrid
  • DB2
  • H2
  • HANA
  • Informix
  • SQL Server
  • Sybase ASE
  • Sybase SQL Anywhere

While this produces exactly (well almost) the same result as the ON clause, it’s much quicker to read and write. I said almost because some databases (and the SQL standard) specify that any column appearing in the USING clause loses its qualification. For instance:

SELECT
  f.title,   -- Ordinary column, can be qualified
  f.film_id, -- USING column, shouldn't be qualified
  film_id    -- USING column, correct / non-ambiguous here
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)

Also, of course, this syntax is a bit limited. Sometimes, you have several foreign keys in your tables, not all of which can have the primary key column name(s). For instance:

CREATE TABLE film (
  ..
  language_id          BIGINT REFERENCES language,
  original_language_id BIGINT REFERENCES language,
)

If you want to join by ORIGINAL_LANGUAGE_ID, you’ll have to resort to the ON clause.
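
For instance, a sketch of such a join, where the ON clause spells out which of the two foreign keys we mean:

SELECT *
FROM film AS f
JOIN language AS l ON f.original_language_id = l.language_id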

Alternative syntaxes: NATURAL JOIN

A more extreme and much less useful form of "EQUI" JOIN is the NATURAL JOIN clause. The previous example could be further “improved” by replacing USING with NATURAL JOIN like this:

SELECT *
FROM actor
NATURAL JOIN film_actor
NATURAL JOIN film

Notice how we no longer need to specify any JOIN criteria, because a NATURAL JOIN will automatically take all the columns that share the same name from both tables that it joins and place them in a “hidden” USING clause. As we’ve seen before, as primary keys and foreign keys have the same column name, this appears quite useful.

Truth is: It isn’t useful. In the Sakila database, every table also has a LAST_UPDATE column, which is automatically being taken into consideration by NATURAL JOIN. The NATURAL JOIN query is thus equivalent to this one:

SELECT *
FROM actor
JOIN film_actor USING (actor_id, last_update)
JOIN film USING (film_id, last_update)

… which of course makes absolutely no sense at all. So, forget about NATURAL JOIN immediately (except for some very rare cases, e.g. when joining Oracle’s diagnostics views like v$sql NATURAL JOIN v$sql_plan, etc., for ad-hoc analytics).
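
For completeness, that rare ad-hoc use-case would look something like this in Oracle:

-- The diagnostics views share many column names, so NATURAL
-- JOIN happens to be a convenient ad-hoc shorthand here
SELECT *
FROM v$sql
NATURAL JOIN v$sql_plan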

OUTER JOIN

We’ve seen INNER JOIN before, which returns results only for combinations of the left / right table, for which the ON predicate yields true.

OUTER JOIN allows us to retain rows on the left / right side, for which we didn’t find a matching combination. Let’s go back to the days and departments example:

SELECT *
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)
LEFT JOIN departments AS d ON day >= d.created_at

Again, the OUTER keyword is optional, so I’m omitting it in the examples.

This query is very subtly different from its INNER JOIN counterpart in that it will always return at least one row per day, even if for a given day, there is no department that had already been created by that day. For instance: the first department was only created on January 10, yet the above query will still return January 1-9:

+--------+   +------------+------------+
| day    |   | department | created_at |
+--------+   +------------+------------+
| Jan 01 |   | Dept 1     | Jan 10     |
| Jan 02 |   | Dept 2     | Jan 11     |
| ...    |   | Dept 3     | Jan 12     |
| Jan 30 |   +------------+------------+
| Jan 31 |
+--------+

In addition to the rows we got before in the INNER JOIN example, we now also have all the days from January 1-9 with NULL departments:

+--------+------------+
| day    | department |
+--------+------------+
| Jan 01 |            | -- Extra rows with no match here
| Jan 02 |            | -- Extra rows with no match here
| ...    |            | -- Extra rows with no match here
| Jan 09 |            | -- Extra rows with no match here
| Jan 10 | Dept 1     |
| Jan 11 | Dept 1     |
| Jan 11 | Dept 2     |
| Jan 12 | Dept 1     |
| Jan 12 | Dept 2     |
| Jan 12 | Dept 3     |
| Jan 13 | Dept 1     |
| Jan 13 | Dept 2     |
| Jan 13 | Dept 3     |
| ...    | ...        |
| Jan 31 | Dept 1     |
| Jan 31 | Dept 2     |
| Jan 31 | Dept 3     |
+--------+------------+

In other words, each day appears at least once in the result. LEFT JOIN does this for the left table, i.e. it retains all rows from the left table in the result.

Formally, a LEFT OUTER JOIN is an INNER JOIN with a UNION like this:

-- Convenient syntax:
SELECT *
FROM a LEFT JOIN b ON <predicate>

-- Cumbersome, equivalent syntax:
SELECT a.*, b.*
FROM a JOIN b ON <predicate>
UNION ALL
SELECT a.*, NULL, NULL, ..., NULL
FROM a
WHERE NOT EXISTS (
  SELECT * FROM b WHERE <predicate>
)

We’ll talk about NOT EXISTS further down, when talking about the "ANTI" JOIN.

RIGHT OUTER JOIN

RIGHT OUTER JOIN does exactly the opposite. It retains all the rows from the right table in the result. Let’s add more departments:

+--------+   +------------+------------+
| day    |   | department | created_at |
+--------+   +------------+------------+
| Jan 01 |   | Dept 1     | Jan 10     |
| Jan 02 |   | Dept 2     | Jan 11     |
| ...    |   | Dept 3     | Jan 12     |
| Jan 30 |   | Dept 4     | Apr 01     |
| Jan 31 |   | Dept 5     | Apr 02     |
+--------+   +------------+------------+

The new departments 4 and 5 will not be in the previous results at all, because they were created on days after January 31. But they will appear in a RIGHT JOIN result, because departments is the right table of the join operation, and all the rows from the right table will be retained.

Running this query:

SELECT *
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)
RIGHT JOIN departments AS d ON day >= d.created_at

Will yield something like this:

+--------+------------+
| day    | department |
+--------+------------+
| Jan 10 | Dept 1     |
| Jan 11 | Dept 1     |
| Jan 11 | Dept 2     |
| Jan 12 | Dept 1     |
| Jan 12 | Dept 2     |
| Jan 12 | Dept 3     |
| Jan 13 | Dept 1     |
| Jan 13 | Dept 2     |
| Jan 13 | Dept 3     |
| ...    | ...        |
| Jan 31 | Dept 1     |
| Jan 31 | Dept 2     |
| Jan 31 | Dept 3     |
|        | Dept 4     | -- Extra rows with no match here
|        | Dept 5     | -- Extra rows with no match here
+--------+------------+

In most cases (I have yet to run into a situation where this isn’t the case), each LEFT OUTER JOIN expression can be transformed into an equivalent RIGHT OUTER JOIN expression, and vice-versa. Because RIGHT OUTER JOIN is usually a bit less readable, most people use LEFT OUTER JOIN only.
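
For the record, here's the previous RIGHT JOIN example rewritten as its LEFT JOIN equivalent, simply by swapping the two sides:

SELECT *
FROM departments AS d
LEFT JOIN generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day) ON day >= d.created_at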

FULL OUTER JOIN

Finally, there is also FULL OUTER JOIN, which retains all rows from both sides of the JOIN operation. In our example, this means that each day appears at least once in the result, just like each department appears at least once in the result.

Let’s take again this data:

+--------+   +------------+------------+
| day    |   | department | created_at |
+--------+   +------------+------------+
| Jan 01 |   | Dept 1     | Jan 10     |
| Jan 02 |   | Dept 2     | Jan 11     |
| ...    |   | Dept 3     | Jan 12     |
| Jan 30 |   | Dept 4     | Apr 01     |
| Jan 31 |   | Dept 5     | Apr 02     |
+--------+   +------------+------------+

And now, let’s run this query:

SELECT *
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)
FULL JOIN departments AS d ON day >= d.created_at

The result will now look something like this:

+--------+------------+
| day    | department |
+--------+------------+
| Jan 01 |            | -- row from the left table
| Jan 02 |            | -- row from the left table
| ...    |            | -- row from the left table
| Jan 09 |            | -- row from the left table
| Jan 10 | Dept 1     |
| Jan 11 | Dept 1     |
| Jan 11 | Dept 2     |
| Jan 12 | Dept 1     |
| Jan 12 | Dept 2     |
| Jan 12 | Dept 3     |
| Jan 13 | Dept 1     |
| Jan 13 | Dept 2     |
| Jan 13 | Dept 3     |
| ...    | ...        |
| Jan 31 | Dept 1     |
| Jan 31 | Dept 2     |
| Jan 31 | Dept 3     |
|        | Dept 4     | -- row from the right table
|        | Dept 5     | -- row from the right table 
+--------+------------+

If you insist, formally, a FULL OUTER JOIN is an INNER JOIN with two UNIONs like this:

-- Convenient syntax:
SELECT *
FROM a FULL JOIN b ON <predicate>

-- Cumbersome, equivalent syntax:
SELECT a.*, b.*
FROM a JOIN b ON <predicate>
-- LEFT JOIN part
UNION ALL
SELECT a.*, NULL, NULL, ..., NULL
FROM a
WHERE NOT EXISTS (
  SELECT * FROM b WHERE <predicate>
)
-- RIGHT JOIN part
UNION ALL
SELECT NULL, NULL, ..., NULL, b.*
FROM b
WHERE NOT EXISTS (
  SELECT * FROM a WHERE <predicate>
)

Alternative syntaxes: “EQUI” OUTER JOIN

The above examples again used the “cartesian product with filter” kind of JOIN. Much more common, however, is the "EQUI" OUTER JOIN approach, where we join on a primary key / foreign key relationship. Let’s go back to the Sakila database example. Some actors haven’t played in any film, and we might want to query them like this:

SELECT *
FROM actor
LEFT JOIN film_actor USING (actor_id)
LEFT JOIN film USING (film_id)

This query will return all actors at least once, regardless of whether they played in a film. If we also want all films that do not have any actors (yet), we could use a FULL OUTER JOIN to combine the results:

SELECT *
FROM actor
FULL JOIN film_actor USING (actor_id)
FULL JOIN film USING (film_id)

And of course, this also works with NATURAL LEFT JOIN, NATURAL RIGHT JOIN, and NATURAL FULL JOIN, but again, these aren’t useful at all, as we’d again be joining USING (..., LAST_UPDATE), which makes no sense.

Alternative syntaxes: Oracle and SQL Server style OUTER JOIN

These two databases had OUTER JOIN before the ANSI syntax was established. It looked like this:

-- Oracle
SELECT *
FROM actor a, film_actor fa, film f
WHERE a.actor_id = fa.actor_id(+)
AND fa.film_id = f.film_id(+)

-- SQL Server
SELECT *
FROM actor a, film_actor fa, film f
WHERE a.actor_id *= fa.actor_id
AND fa.film_id *= f.film_id

That’s nice, given that at some point in time (in the 80s??), ANSI didn’t specify OUTER JOIN yet. But the 80s are more than 30 years ago. So, it’s safe to say this stuff is obsolete.

SQL Server did the right thing and deprecated (and later removed) that syntax a long time ago. Oracle still supports it for backwards compatibility reasons.

But nothing about this syntax is reasonable or readable. Don’t use it. Replace it with ANSI JOIN.
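
For reference, here's the ANSI equivalent of the two legacy queries above:

SELECT *
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
LEFT JOIN film f ON fa.film_id = f.film_id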

PARTITIONED OUTER JOIN

This is Oracle specific, but I must say, it’s a real shame none of the other databases have stolen this feature yet. Remember the CROSS JOIN operation that we used to combine each day with each department? Because, sometimes, that’s what we want in the result: all combinations, and if there is a match, also the matching values in that row.

That’s hard to describe in words, and much easier to show with an example. Here’s the query with Oracle syntax:

WITH 

  -- Using CONNECT BY to generate all dates in January
  days(day) AS (
    SELECT DATE '2017-01-01' + LEVEL - 1
    FROM dual
    CONNECT BY LEVEL <= 31
  ),

  -- Our departments
  departments(department, created_at) AS (
    SELECT 'Dept 1', DATE '2017-01-10' FROM dual UNION ALL
    SELECT 'Dept 2', DATE '2017-01-11' FROM dual UNION ALL
    SELECT 'Dept 3', DATE '2017-01-12' FROM dual UNION ALL
    SELECT 'Dept 4', DATE '2017-04-01' FROM dual UNION ALL
    SELECT 'Dept 5', DATE '2017-04-02' FROM dual
  )
SELECT *
FROM days 
LEFT JOIN departments 
  PARTITION BY (department) -- This is where the magic happens
  ON day >= created_at

Unfortunately, PARTITION BY is used in various contexts with different meanings (e.g. for window functions). In this case, it means that we “partition” our data by the departments.department column, creating a “partition” for each department. Now, each partition will get a copy for each day, regardless if there’s a match in our predicate (unlike in the ordinary LEFT JOIN case, where we had a bunch of “department-less” days). The result of the above query is now this:

+--------+------------+------------+
| day    | department | created_at |
+--------+------------+------------+
| Jan 01 | Dept 1     |            | -- Didn't match, but still get row
| Jan 02 | Dept 1     |            | -- Didn't match, but still get row
| ...    | Dept 1     |            | -- Didn't match, but still get row
| Jan 09 | Dept 1     |            | -- Didn't match, but still get row
| Jan 10 | Dept 1     | Jan 10     | -- Matches, so get join result
| Jan 11 | Dept 1     | Jan 10     | -- Matches, so get join result
| Jan 12 | Dept 1     | Jan 10     | -- Matches, so get join result
| ...    | Dept 1     | Jan 10     | -- Matches, so get join result
| Jan 31 | Dept 1     | Jan 10     | -- Matches, so get join result

| Jan 01 | Dept 2     |            | -- Didn't match, but still get row
| Jan 02 | Dept 2     |            | -- Didn't match, but still get row
| ...    | Dept 2     |            | -- Didn't match, but still get row
| Jan 09 | Dept 2     |            | -- Didn't match, but still get row
| Jan 10 | Dept 2     |            | -- Didn't match, but still get row
| Jan 11 | Dept 2     | Jan 11     | -- Matches, so get join result
| Jan 12 | Dept 2     | Jan 11     | -- Matches, so get join result
| ...    | Dept 2     | Jan 11     | -- Matches, so get join result
| Jan 31 | Dept 2     | Jan 11     | -- Matches, so get join result

| Jan 01 | Dept 3     |            | -- Didn't match, but still get row
| Jan 02 | Dept 3     |            | -- Didn't match, but still get row
| ...    | Dept 3     |            | -- Didn't match, but still get row
| Jan 09 | Dept 3     |            | -- Didn't match, but still get row
| Jan 10 | Dept 3     |            | -- Didn't match, but still get row
| Jan 11 | Dept 3     |            | -- Didn't match, but still get row
| Jan 12 | Dept 3     | Jan 12     | -- Matches, so get join result
| ...    | Dept 3     | Jan 12     | -- Matches, so get join result
| Jan 31 | Dept 3     | Jan 12     | -- Matches, so get join result

| Jan 01 | Dept 4     |            | -- Didn't match, but still get row
| Jan 02 | Dept 4     |            | -- Didn't match, but still get row
| ...    | Dept 4     |            | -- Didn't match, but still get row
| Jan 31 | Dept 4     |            | -- Didn't match, but still get row

| Jan 01 | Dept 5     |            | -- Didn't match, but still get row
| Jan 02 | Dept 5     |            | -- Didn't match, but still get row
| ...    | Dept 5     |            | -- Didn't match, but still get row
| Jan 31 | Dept 5     |            | -- Didn't match, but still get row
+--------+------------+------------+

As you can see, I’ve visually created 5 partitions for the 5 departments. Each partition combines the department with each day, but unlike when doing a CROSS JOIN, we’re now getting actual LEFT JOIN .. ON .. results in case there is a match for the predicate. This is a really nice feature for reporting in Oracle!
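
If your database doesn't support PARTITION BY, here's a sketch of an emulation in standard SQL (reusing the days and departments tables from above): CROSS JOIN all departments with all days first, then LEFT JOIN the matching rows back in:

SELECT days.day, d.department, d2.created_at
FROM days
CROSS JOIN departments AS d
LEFT JOIN departments AS d2
  ON d.department = d2.department
  AND days.day >= d2.created_at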

SEMI JOIN

In relational algebra, there is a notion of a semi join operation, which unfortunately doesn’t have a syntax representation in SQL. If it did, the syntax would probably be LEFT SEMI JOIN and RIGHT SEMI JOIN, just like the Cloudera Impala syntax extension offers.

What is a "SEMI" JOIN?

When writing something like the fictional query below:

SELECT *
FROM actor
LEFT SEMI JOIN film_actor USING (actor_id)

What we really mean is we want all actors that played in films. But we don’t want any films in the results, just the actors. More specifically, we don’t want each actor several times, once per film. We want each actor only once (or zero times) in the result.

Semi is Latin for “half”, i.e. we implement only “half the join”, in this case, the left half.

In SQL, there are two alternative syntaxes that can emulate the "SEMI" JOIN:

Alternative syntax: EXISTS

This is the more powerful, but also slightly more verbose, syntax:

SELECT *
FROM actor a
WHERE EXISTS (
  SELECT * FROM film_actor fa
  WHERE a.actor_id = fa.actor_id
)

We’re looking for all the actors for which there exists a film, i.e. actors that played in a film. With this syntax (i.e. "SEMI" JOIN placed in the WHERE clause), it’s immediately clear that we can get each actor at most once in the result. There’s no actual JOIN in the syntax.

Nonetheless, most databases will be able to recognise that what’s going on here is really a "SEMI" JOIN, and not just an ordinary EXISTS() predicate. For instance, consider the Oracle execution plan for the above query:

[Figure hash-join-semi: Oracle execution plan showing a HASH JOIN (SEMI) operation]

Note how Oracle calls the operation “HASH JOIN (SEMI)” – the SEMI keyword is present here. The same is true for PostgreSQL:

[Figure nested-loop-semi-join: PostgreSQL execution plan showing a Nested Loop Semi Join]

Or SQL Server:

[Figure nested-loops-left-semi-join: SQL Server execution plan showing Nested Loops (Left Semi Join)]

Apart from being the optimal solution in terms of correctness, there are also some performance benefits when using a "SEMI" JOIN rather than an INNER JOIN, as the database can stop looking for matches as soon as it has found the first!

Alternative syntax: IN

IN and EXISTS are exactly equivalent "SEMI" JOIN emulations. The following query will produce the same plan as the previous EXISTS query in most databases (not in MySQL):

SELECT *
FROM actor
WHERE actor_id IN (
  SELECT actor_id FROM film_actor
)

If your database supports both syntaxes for the "SEMI" JOIN operation, you may choose whichever you prefer from a stylistic point of view.

This is different with …

ANTI JOIN

In principle, "ANTI" JOIN is just the opposite of "SEMI" JOIN. When writing something like the fictional query below:

SELECT *
FROM actor
LEFT ANTI JOIN film_actor USING (actor_id)

… what we really mean is that we want all the actors that didn’t play in any film. Unfortunately, again, SQL doesn’t have a built-in syntax for this operation, but we can emulate it with NOT EXISTS:

Alternative syntax: NOT EXISTS

The following query has exactly the expected semantics:

SELECT *
FROM actor a
WHERE NOT EXISTS (
  SELECT * FROM film_actor fa
  WHERE a.actor_id = fa.actor_id
)

(Dangerous) alternative syntax: NOT IN

Watch out! While EXISTS and IN are equivalent, NOT EXISTS and NOT IN are not equivalent. This is because of NULL values.

In this particular case, the following NOT IN query is equivalent to the previous NOT EXISTS query, because our film_actor table has a NOT NULL constraint on film_actor.actor_id:

SELECT *
FROM actor
WHERE actor_id NOT IN (
  SELECT actor_id FROM film_actor
)

If, however, actor_id became nullable, then the query would be wrong. Don’t believe it? Try running this:

SELECT *
FROM actor
WHERE actor_id NOT IN (1, 2, NULL)

It will not return any records. Why? Because NULL is the UNKNOWN value in SQL. So, the predicate above is the same as this:

SELECT *
FROM actor
WHERE actor_id NOT IN (1, 2, UNKNOWN)

And because we cannot be sure whether actor_id is in a set of values of which one value is UNKNOWN (is it 4? Or 5? Or -1?), the whole predicate becomes UNKNOWN

SELECT *
FROM actor
WHERE UNKNOWN

Here’s a nice article about three-valued logic by Joe Celko, if you want to learn more.

So, I cannot say this enough:

Do not use NOT IN predicates in SQL, ever, unless you put constant, non-null values there.

— Lukas Eder. Just now.

Don’t even gamble on the presence of a NOT NULL constraint. Perhaps, some DBA might temporarily turn off the constraint to load some data and your query will be wrong for the time being. Just use NOT EXISTS. Or, in some cases…

(Dangerous) alternative syntax: LEFT JOIN / IS NULL

Strangely enough, some people prefer the following syntax:

SELECT *
FROM actor a
LEFT JOIN film_actor fa
USING (actor_id)
WHERE film_id IS NULL

It’s correct as we’re:

  • Joining films to their actors
  • Retaining all actors, even those without films (LEFT JOIN)
  • Keeping only the actors without films (film_id IS NULL)

Now, I personally don’t like this syntax, as it doesn’t communicate the intent of an "ANTI" JOIN at all. And chances are, it is slower, because your optimiser doesn’t recognise this as an "ANTI" JOIN operation (or in fact, it cannot formally prove that it actually is one). So, again, use NOT EXISTS instead.

An interesting (but a bit outdated) blog post comparing the three syntaxes is this one here:
https://explainextended.com/2009/09/15/not-in-vs-not-exists-vs-left-join-is-null-sql-server

LATERAL JOIN

LATERAL is a relatively new keyword in the SQL standard, and it is supported in PostgreSQL and Oracle. SQL Server folks have had a vendor-specific alternative syntax using the APPLY keyword forever (which I personally much prefer). Let’s look at an example, using the PostgreSQL / Oracle LATERAL keyword:

WITH 
  departments(department, created_at) AS (
    VALUES ('Dept 1', DATE '2017-01-10'),
           ('Dept 2', DATE '2017-01-11'),
           ('Dept 3', DATE '2017-01-12'),
           ('Dept 4', DATE '2017-04-01'),
           ('Dept 5', DATE '2017-04-02')
  )
SELECT *
FROM departments AS d
CROSS JOIN LATERAL generate_series(
  d.created_at, -- We can dereference a column from department!
  '2017-01-31'::TIMESTAMP, 
  INTERVAL '1 day'
) AS days(day)

Indeed, instead of a CROSS JOIN between all departments and all days, why not just generate the necessary days for each department directly? That’s what LATERAL does. It is a prefix to the right-hand side of any JOIN operation (including INNER JOIN, LEFT OUTER JOIN, etc.) that allows the right-hand side to access columns from the left-hand side.
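
To see why the keyword is needed, try leaving it out. An ordinary derived table cannot see the columns of its siblings in the FROM clause (a sketch; note that PostgreSQL happens to treat bare function calls in FROM as implicitly lateral, so the failure is easiest to reproduce with a subquery):

-- This fails: without LATERAL, the derived table cannot
-- dereference d.created_at
SELECT *
FROM departments AS d
CROSS JOIN (
  SELECT *
  FROM generate_series(
    d.created_at, -- ERROR: invalid reference to FROM-clause entry for table "d"
    '2017-01-31'::TIMESTAMP,
    INTERVAL '1 day'
  )
) AS days(day)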

This of course has nothing to do with relational algebra anymore, because it imposes a JOIN order (from left to right). But sometimes, that’s OK, and sometimes, your table-valued function (or subquery) is so complex that this is the only way you can actually use it.

Another very popular use-case is to join a “TOP-N” query to a normal table – e.g. if you want to find each actor, and their TOP 5 best selling films:

SELECT a.first_name, a.last_name, f.*
FROM actor AS a
LEFT OUTER JOIN LATERAL (
  SELECT f.title, SUM(amount) AS revenue
  FROM film AS f
  JOIN film_actor AS fa USING (film_id)
  JOIN inventory AS i USING (film_id)
  JOIN rental AS r USING (inventory_id)
  JOIN payment AS p USING (rental_id)
  WHERE fa.actor_id = a.actor_id
  GROUP BY f.film_id
  ORDER BY revenue DESC
  LIMIT 5
) AS f
ON true

The result being something like:

(Screenshot: the query result – each actor with their top 5 films by revenue)

Don’t worry about the long list of joins in the derived table, that’s just how we have to get from the FILM table to the PAYMENT table in the Sakila database:

(Diagram: the Sakila database schema)

Essentially, the subquery is calculating those TOP 5 best selling films per actor. So, it isn’t a "classic" derived table, but a correlated subquery that returns more than one row and more than one column. We’re all used to writing correlated subqueries like these:

SELECT 
  a.first_name, 
  a.last_name, 
  (SELECT count(*) 
   FROM film_actor AS fa 
   WHERE fa.actor_id = a.actor_id) AS films
FROM actor AS a

The result of the above correlated subquery is one row and one column. If you ever need to return more than one row and/or more than one column from a correlated subquery, LATERAL or APPLY will be your friend.
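
In fact, the above count(*) example can also be written with LATERAL (a quick sketch in PostgreSQL syntax), which becomes handy as soon as we need more than one row or column:

SELECT a.first_name, a.last_name, films.films
FROM actor AS a
LEFT OUTER JOIN LATERAL (
  SELECT count(*) AS films
  FROM film_actor AS fa
  WHERE fa.actor_id = a.actor_id
) AS films ON true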

Note also how I’ve used LEFT OUTER JOIN together with LATERAL, and then I needed to put a dummy ON true clause for syntax correctness. Using LATERAL with OUTER JOIN will again always retain the left side of the JOIN, i.e. we’ll retain actors that didn’t play in any film.

Characteristics:

The LATERAL keyword doesn’t really change the semantics of the JOIN type that it is applied to. If you’re running a CROSS JOIN LATERAL, the result size is still

size(result) = size(left table) * size(right table)

… even if the right table is produced on a per-row-of-the-left-table basis. In that case, the multiplication applies to each left row individually, and the total result size is the sum of the right table sizes over all left rows.

You can use LATERAL also with OUTER JOIN, in case of which the left rows will be retained even if your table function doesn’t return any rows on the right side.
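
Reusing the departments example from above, here’s a minimal sketch of that behaviour (in PostgreSQL, generate_series() produces no rows when the start date lies after the end date):

SELECT d.department, days.day
FROM departments AS d
LEFT OUTER JOIN LATERAL generate_series(
  d.created_at,
  '2017-01-31'::TIMESTAMP,
  INTERVAL '1 day'
) AS days(day)
ON true
-- Dept 4 and Dept 5 were created in April, after the end of the
-- series, so no days are generated for them. They are retained
-- nonetheless, with day = NULL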

Alternative syntax: APPLY

SQL Server didn’t choose the confusing LATERAL keyword; instead, it introduced the APPLY keyword (more specifically: CROSS APPLY and OUTER APPLY) a long time ago, which makes more sense, because we’re applying a function to each row of a table. Let’s assume for a moment that we have a generate_series() function in SQL Server:

-- Use with care, this is quite inefficient!
CREATE FUNCTION generate_series(@d1 DATE, @d2 DATE)
RETURNS TABLE AS
RETURN
  WITH t(d) AS (
    SELECT @d1 
    UNION ALL
    SELECT DATEADD(day, 1, d) 
    FROM t
    WHERE d < @d2
  ) 
  SELECT * FROM t;

Then, we could use CROSS APPLY to call that function for each department:

WITH 
  departments AS (
    SELECT * FROM (
      VALUES ('Dept 1', CAST('2017-01-10' AS DATE)),
             ('Dept 2', CAST('2017-01-11' AS DATE)),
             ('Dept 3', CAST('2017-01-12' AS DATE)),
             ('Dept 4', CAST('2017-04-01' AS DATE)),
             ('Dept 5', CAST('2017-04-02' AS DATE))
    ) d(department, created_at)
  )
SELECT *
FROM departments AS d
CROSS APPLY dbo.generate_series(
  d.created_at, -- We can dereference a column from department!
  CAST('2017-01-31' AS DATE)
)

What’s so nice about this syntax is that – again – we’re applying a function to each row of a table, and that function produces rows. Does it ring a bell? In Java 8, we would use Stream.flatMap() for that! Consider the following stream usage:

departments.stream()
           .flatMap(department -> generateSeries(
                             department.createdAt, 
                             LocalDate.parse("2017-01-31"))
                         .map(day -> tuple(department, day))
           );

What’s going on here?

  • The DEPARTMENTS table is just a Java departments stream
  • We flatMap the departments stream using a function that produces tuples for each department
  • Those tuples include the department itself, and a day generated from a series of days starting with the department’s createdAt day

Same story! SQL CROSS APPLY / CROSS JOIN LATERAL is the same thing as Java’s Stream.flatMap(). In fact, SQL and streams aren’t too different anyway. For more info, read this blog post here.

Note: Just like we could write LEFT OUTER JOIN LATERAL, we can also write OUTER APPLY in case we want to retain the left side of the JOIN expression.
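
For instance, reusing the dbo.generate_series() function from above (a sketch; the inner WHERE clause is needed because the function’s anchor row always emits the start date, even when it lies after the end date):

SELECT *
FROM departments AS d
OUTER APPLY (
  SELECT t.d
  FROM dbo.generate_series(
    d.created_at,
    CAST('2017-01-31' AS DATE)
  ) AS t
  WHERE t.d <= CAST('2017-01-31' AS DATE)
) AS days
-- Dept 4 and Dept 5 produce no rows on the right-hand side, but
-- are still retained, with days.d = NULL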

MULTISET

Few databases implement this (in fact, only Oracle), but if you think about it, it’s an awesome JOIN type. One that creates nested collections. If all databases implemented it, we wouldn’t need ORMs!

A hypothetical example (that uses SQL standard syntax, not Oracle’s) looks like this:

SELECT a.*, MULTISET (
  SELECT f.*
  FROM film AS f
  JOIN film_actor AS fa USING (film_id)
  WHERE a.actor_id = fa.actor_id
) AS films
FROM actor AS a

The MULTISET operator takes a correlated subquery argument, and aggregates all its resulting rows into a nested collection. This works in a similar fashion to a LEFT OUTER JOIN (we’re getting all actors, and if they have films, we’re getting all their films too), but instead of duplicating the actors in the result set once per film, we’re collecting the films into a nested collection.

Just like what we would be doing in an ORM, when fetching stuff into this structure:

@Entity
class Actor {
  @ManyToMany
  List<Film> films;
}

@Entity
class Film {
}

Ignore the incompleteness of the JPA annotations used; I just want to show the strength of nested collections. Unlike ORMs, the SQL MULTISET operator allows for collecting arbitrary results from correlated subqueries into nested collections – not just actual entities. This is a million times more powerful than what any ORM could ever dream of.

Alternative syntaxes: Oracle

As I said, Oracle actually supports MULTISET, but you cannot create ad-hoc nested collections. For some reason, Oracle chose to implement nominal typing for these nested collections, rather than the usual SQL-style structural typing. So you have to declare your types in advance:

CREATE TYPE film_t AS OBJECT ( ... );
CREATE TYPE film_tt AS TABLE OF film_t;

SELECT
  a.*,
  CAST (
    MULTISET (
      SELECT f.*
      FROM film f
      JOIN film_actor fa USING (film_id)
      WHERE a.actor_id = fa.actor_id
    ) AS film_tt
  ) AS films
FROM actor a

A bit more verbose, but still does the trick! Excellent!

Alternative syntaxes: PostgreSQL

For once, the awesome PostgreSQL is missing out on an excellent SQL standard feature, but there’s a workaround: arrays! And this time, we can use structural types, yay! So the following query will return a nested array of rows in PostgreSQL:

SELECT
  a AS actor,
  array_agg(
    f
  ) AS films
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
GROUP BY a

The result is everyone’s ORDBMS dream! Nested records and collections everywhere (in only two columns):

actor                  films
--------------------   ----------------------------------------------------
(1,PENELOPE,GUINESS)   {(1,ACADEMY DINOSAUR),(23,ANACONDA CONFESSIONS),...}
(2,NICK,WAHLBERG)      {(3,ADAPTATION HOLES),(31,APACHE DIVINE),...}
(3,ED,CHASE)           {(17,ALONE TRIP),(40,ARMY FLINTSTONES),...}

If this isn’t exciting, I don’t know what is.
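
One caveat with this emulation: the inner joins silently drop actors that didn’t play in any film, whereas MULTISET would retain them. To match those semantics, here’s a sketch using LEFT JOIN and a FILTER clause (PostgreSQL 9.4+):

SELECT
  a AS actor,
  array_agg(f) FILTER (WHERE f.film_id IS NOT NULL) AS films
FROM actor AS a
LEFT JOIN film_actor AS fa USING (actor_id)
LEFT JOIN film AS f USING (film_id)
GROUP BY a
-- Actors without films now appear with films = NULL, instead of
-- being removed from the result entirely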

Conclusion

This was A Probably Incomplete, Comprehensive Guide to the Many Different Ways to JOIN Tables in SQL. I hope you’ve found 1-2 new tricks in this article. JOIN is only one of many very interesting SQL operations. If you’ve found this information useful, you will also like:

jOOQ 3.10 will Support SQL Server’s Table Valued Parameters


SQL Server has this nice feature called table-valued parameters (TVP), where users can pass table variables to a stored procedure for bulk data processing. This is particularly nice when the procedure is itself a table valued function, i.e. a function that returns a table as well. For instance:

CREATE TYPE numbers AS TABLE (i INTEGER);

CREATE FUNCTION cross_multiply (
  @numbers numbers READONLY
)
RETURNS @result TABLE (
  i1 INTEGER,
  i2 INTEGER,
  product INTEGER
)
AS
BEGIN
  INSERT INTO @result
  SELECT n1.i, n2.i, n1.i * n2.i
  FROM @numbers n1
  CROSS JOIN @numbers n2

  RETURN
END

The above function creates a cross product of the argument table with itself, and multiplies the values in each combination. So, when calling it with the following table argument:

DECLARE @arg NUMBERS;
INSERT INTO @arg VALUES (1),(2),(3),(4);
SELECT * FROM cross_multiply(@arg);

We’re getting the following, nice result:

i1	i2	product
-----------------------
1	1	1
2	1	2
3	1	3
4	1	4
1	2	2
2	2	4
3	2	6
4	2	8
1	3	3
2	3	6
3	3	9
4	3	12
1	4	4
2	4	8
3	4	12
4	4	16

Easy, eh?

Call the above from Java with JDBC

The SQL Server JDBC driver has recently started to support TVPs, if you’re ready to use a vendor-specific API. If you want to run this T-SQL batch:

DECLARE @arg NUMBERS;
INSERT INTO @arg VALUES (1),(2),(3),(4);
SELECT * FROM cross_multiply(@arg);

In Java, you’d write something along the lines of this:

SQLServerDataTable table = new SQLServerDataTable();
table.addColumnMetadata("i", java.sql.Types.INTEGER);
table.addRow(1);
table.addRow(2);
table.addRow(3);
table.addRow(4);

try (SQLServerPreparedStatement stmt =
    (SQLServerPreparedStatement) connection.prepareStatement(
        "SELECT * FROM cross_multiply(?)")) {

    // Magic here:
    stmt.setStructured(1, "dbo.numbers", table);

    try (ResultSet rs = stmt.executeQuery()) {
        ...
    }
}

This is a bit tedious, as you have to work through all of this API and remember:

  • type names
  • column names
  • column positions

But it works.

Now, call the above from Java, with jOOQ

No problem with jOOQ 3.10. Don’t worry about the boring JDBC data type binding details, as the jOOQ code generator has you covered. As always, all routines are generated classes / methods, and this time, the TABLE type is also a generated type. Let the code speak for itself. Instead of this SQL statement:

DECLARE @arg NUMBERS;
INSERT INTO @arg VALUES (1),(2),(3),(4);
SELECT * FROM cross_multiply(@arg);

You can write the following with jOOQ:

Numbers numbers = new NumbersRecord(
    new NumbersElementTypeRecord(1),
    new NumbersElementTypeRecord(2),
    new NumbersElementTypeRecord(3),
    new NumbersElementTypeRecord(4)
);

// Standalone function call:
Result<CrossMultiplyRecord> r1 = 
    crossMultiply(configuration, numbers);

// Embedded table-valued function call, with predicate
Result<CrossMultiplyRecord> r2 = 
DSL.using(configuration)
   .selectFrom(crossMultiply(numbers))
   .where(F_CROSS_MULTIPLY.PRODUCT.gt(5))
   .fetch();

System.out.println(r1);
System.out.println(r2);

And the nice printed output will be:

+----+----+-------+
|  i1|  i2|product|
+----+----+-------+
|   1|   1|      1|
|   2|   1|      2|
|   3|   1|      3|
|   4|   1|      4|
|   1|   2|      2|
|   2|   2|      4|
|   3|   2|      6|
|   4|   2|      8|
|   1|   3|      3|
|   2|   3|      6|
|   3|   3|      9|
|   4|   3|     12|
|   1|   4|      4|
|   2|   4|      8|
|   3|   4|     12|
|   4|   4|     16|
+----+----+-------+

+----+----+-------+
|  i1|  i2|product|
+----+----+-------+
|   3|   2|      6|
|   4|   2|      8|
|   2|   3|      6|
|   3|   3|      9|
|   4|   3|     12|
|   2|   4|      8|
|   3|   4|     12|
|   4|   4|     16|
+----+----+-------+

Not only does jOOQ understand table-valued parameters – since jOOQ 3.5, we have also supported table-valued functions, which can be used like any ordinary table:

Result<CrossMultiplyRecord> r2 = 
DSL.using(configuration)
   .selectFrom(crossMultiply(numbers))
   .where(F_CROSS_MULTIPLY.PRODUCT.gt(5))
   .fetch();

As you can see, the function call can be embedded in the FROM clause; it even returns well-typed CrossMultiplyRecord elements (if you’re not using any projection), and you can form predicates on table columns (i.e. function return values), join the table, etc.
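
For comparison, here’s the same embedded function call with a predicate in plain T-SQL (a sketch):

DECLARE @arg NUMBERS;
INSERT INTO @arg VALUES (1),(2),(3),(4);

-- The function call sits in the FROM clause like an ordinary table
SELECT i1, i2, product
FROM cross_multiply(@arg)
WHERE product > 5;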

Excellent! Let’s start using table-valued parameters!

SQL, Streams, For Comprehension… It’s All the Same


Recently, at Devoxx, I saw this beautiful slide in a talk by Kevlin Henney.

In his talk, he was displaying a variety of approaches to solve the FizzBuzz “problem”, including a couple of very elegant solutions in completely declarative approaches and languages.

In this particular slide, Kevlin used a notation that is derived from maths: the set-builder notation. Here’s an example from Wikipedia:

{ n ∈ ℤ | ∃ k ∈ ℤ : n = 2k }

The example reads: For all n in ℤ (the set of all integer numbers), take those for which there exists (∃) another integer k, for which the following equation is satisfied: n = 2k.

Or in plain English: All even integers. (because for even integers, there exists another integer that is half the even integer)

Beautiful, eh? In imperative programming, we’d probably do something like this instead:

List<Integer> even = new ArrayList<>();
for (int i = /* hmm...? */; i < /* what to put here */; i++)
    even.add(i * 2);

Or this:

List<Integer> even = new ArrayList<>();
for (int i = /* hmm...? */; i < /* what to put here */; i = i + 2)
    even.add(i);

But there are several problems with the imperative approach:

  • We have to realistically start somewhere
  • We have to realistically end somewhere
  • We have to store all values in an intermediate collection

Sure, those aren’t severe limitations in everyday use-cases, because we’re probably solving a real-world problem where we don’t actually need an infinite number of even integers, and storing them in an intermediate collection won’t consume all of our memory. But still, the declarative, mathematical approach is much leaner: we can answer those questions about where to start and where to end later, and we never need to materialise any intermediate collection before we make those final decisions.

For instance, we can declare X to be that set, and then declare Y to be a set that is derived from X, and finally materialise Z, which is a very tiny set derived from Y. For this, we may have never needed to materialise all the (even) integers.

How this compares to SQL

Kevlin made a cunning comparison. Of course, all functional programming aficionados will immediately recognise that languages like Scala have something called a “for comprehension”, which models precisely the mathematical set-builder notation.

Java 8 now has the Streams API, which allows us, to some extent, to model something similar (although not as powerful). But Kevlin didn’t use those "modern" languages. He used SQL as a comparison. That "arcane" declarative programming language that has been around forever, and that we love so much. Yes, here’s how we can declare all the even numbers in SQL:

SELECT n
FROM integers
WHERE EXISTS (
  SELECT k
  FROM integers
  WHERE n = 2 * k
)

If optimisers were perfect, this semi-self-join between the two references of the integers “table” could be optimised perfectly. In most databases, we’d probably manually transform the above notation to this equivalent one:

SELECT n
FROM integers
WHERE MOD(n, 2) = 0

Yes, indeed. The set-builder notation and the SQL language are very similar beasts. The former prefers using mathematical symbols for brevity and conciseness, the latter prefers using English words to connect the different operators, but it’s the same thing. And if you squint hard enough, you’ll see that Java 8 Streams, for instance, are also pretty much the same thing:

(Slide: everything is a table)

I’ve blogged about this recently where all the Java 8 Streams operations are compared to their SQL clause counterparts:
https://blog.jooq.org/2015/08/13/common-sql-clauses-and-their-equivalents-in-java-8-streams

How is this better?

It’s simple. Both the set-builder notation, and the SQL language (and in principle, other languages’ for comprehensions) are declarative. They are expressions, which can be composed to other, more complex expressions, without necessarily executing them.

Remember the imperative approach? We tell the machine exactly what to do:

  • Start counting from this particular minimal integer value
  • Stop counting at this particular maximal integer value
  • Store all even integers in between in this particular intermediate collection

What if we don’t actually need negative integers? What if we just wanted to have a utility that calculates even integers and then reuse that to list all positive integers? Or, all positive integers less than 100? Etc.

In the imperative approach, we have to refactor constantly, to avoid the overhead of

  • Producing too many integers
  • Storing too many integers (or storing them at all)

In truly declarative languages like SQL, we’re just describing “even integers” with an expression, possibly assigning the expression a name:

CREATE VIEW even_integers AS
SELECT n
FROM integers
WHERE EXISTS (
  SELECT k
  FROM integers
  WHERE n = 2 * k
)

So, when we actually use and materialise the even integers, e.g. positive integers less than 100, the optimiser can optimise away the double access to the integer table and produce only the exact number of values that we’re requesting (without materialising them in intermediate collections):

SELECT n
FROM even_integers
WHERE n BETWEEN 0 AND 100
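
Conceptually, the optimiser is free to transform the view access into something like this (a sketch of the logically equivalent query, not an actual execution plan):

SELECT n
FROM integers
WHERE MOD(n, 2) = 0
AND n BETWEEN 0 AND 100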

Conclusion

Thinking in terms of sets, in terms of declaring sets, has always been our dream as software engineers. The approach is extremely compelling and elegant. We can delegate a lot of boring algorithmic work to the implementation engine of the declarative programming language. In the case of SQL, that engine is the database optimiser, which figures out a great deal of optimisations that we might not have thought of.

The above example is trivial. We can perfectly live in a world where we manually iterate over a local integer variable that goes from 0 to 100:

for (int i = 0; i <= 100; i++)
  doSomething(i);

But stuff gets hairy quite quickly. Compare the two versions of the same algorithm in Mario Fusco’s famous tweet.

This also applies to SQL, and what’s even better in SQL than with Streams: The SQL statement is a declarative expression tree, not a formally ordered set of stream pipeline operations. The optimiser can freely reorder / transform the expression tree into something that it thinks is more optimal. This isn’t just a promise. This works in modern SQL databases every day, for very complex queries, which you can write in a matter of seconds, rather than hours.

Stay tuned for a short series of blog posts on the jOOQ blog illustrating what modern cost-based optimisation can do for you, when you’re using the SQL language.