A Probably Incomplete, Comprehensive Guide to the Many Different Ways to JOIN Tables in SQL


Perhaps the most powerful SQL feature is the JOIN operation. It is the envy of all non-relational databases, because the concept is so simple, yet so universally applicable, when you want to “combine” two data sets.

Put simply, when joining two tables, you’re combining every row from one table with every row from another table, for which a certain predicate is true. An illustration from our SQL Masterclass shows this:

venn-join

See also our recent article about using Venn diagrams to illustrate JOINs. The above illustration compares INNER JOIN with different OUTER JOIN operations, but those are far from all possibilities that we have. Let’s look at things from a more systematic angle.

Do note that whenever in this article I say “X happens before Y”, I mean that “X happens logically before Y”. The database optimiser may still choose to execute Y before X, because that is faster without changing the outcome. More information about the syntactic / logical order of operations here.

But let’s look into all the join types, individually!

CROSS JOIN

The most basic JOIN operation is really a cartesian product. It just combines every row from one table with every row from another table. The best example for cartesian products is given by Wikipedia, using a deck of cards for illustration where we’re “cross joining” the ranks table with the suits table:

venn-cross-product

In real-world scenarios, CROSS JOIN can be very useful when doing reports, for instance you could generate a set of dates (e.g. days in a month) and cross join that with all the departments in the database, to create a complete days/departments table. Using PostgreSQL syntax:

SELECT *

-- This just generates all the days in January 2017
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)

-- Here, we're combining all days with all departments
CROSS JOIN departments

Imagine we have the following data:

+--------+   +------------+
| day    |   | department |
+--------+   +------------+
| Jan 01 |   | Dept 1     |
| Jan 02 |   | Dept 2     |
| ...    |   | Dept 3     |
| Jan 30 |   +------------+
| Jan 31 |
+--------+

The result would now look like this:

+--------+------------+
| day    | department |
+--------+------------+
| Jan 01 | Dept 1     |
| Jan 01 | Dept 2     |
| Jan 01 | Dept 3     |

| Jan 02 | Dept 1     |
| Jan 02 | Dept 2     |
| Jan 02 | Dept 3     |
| ...    | ...        |
| Jan 31 | Dept 1     |
| Jan 31 | Dept 2     |
| Jan 31 | Dept 3     |
+--------+------------+

Now, for each days/departments combination, you could calculate the daily revenue for that department, or something similar.
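
For example, here’s a sketch of such a report. Hedged: the revenue table and its department, payment_date and amount columns are hypothetical, they’re not part of the example schema:

SELECT d.day, dep.department, COALESCE(SUM(r.amount), 0) AS revenue
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS d(day)
CROSS JOIN departments AS dep

-- Hypothetical revenue table, joined to each day/department combination
LEFT JOIN revenue AS r
  ON r.department = dep.department
  AND r.payment_date = d.day
GROUP BY d.day, dep.department
ORDER BY d.day, dep.department

Thanks to the CROSS JOIN, days without any payments still show up in the report, with a zero total.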

Characteristics

A CROSS JOIN is a cartesian product, i.e. a product as in “multiplication”. The mathematical notation uses the multiplication sign to express this operation: A × B, or in our case: days × departments.

Just like with “ordinary” arithmetic multiplication, if one of the two tables is empty (size zero), then the result will also be empty (size zero). This makes total sense. If we combine the previous 31 days with 0 departments, we’ll get 0 days/departments combinations. Likewise, if we combine an empty date range with any number of departments, we also get 0 days/departments combinations.

In other words:

size(result) = size(days) * size(departments)
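
A quick sanity check of that formula, assuming the departments table from above and materialising the days in a CTE, could look like this:

-- With 31 days and 3 departments, both columns should contain 31 * 3 = 93
WITH days(day) AS (
  SELECT * FROM generate_series(
    '2017-01-01'::TIMESTAMP,
    '2017-01-31'::TIMESTAMP,
    INTERVAL '1 day'
  )
)
SELECT
  (SELECT count(*) FROM days CROSS JOIN departments) AS actual,
  (SELECT count(*) FROM days) *
  (SELECT count(*) FROM departments) AS expected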

Alternative syntaxes

In the old days, before the ANSI JOIN syntax was introduced to SQL, people just wrote comma-separated lists of tables in the FROM clause to write a CROSS JOIN. The above query is equivalent to this one:

SELECT *
FROM 
  generate_series(
    '2017-01-01'::TIMESTAMP,
    '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
    INTERVAL '1 day'
  ) AS days(day),
  departments

In general, I strongly recommend using the CROSS JOIN keyword instead of comma-separated lists of tables, because if you intentionally want to do a CROSS JOIN, nothing communicates this intent (to the next developer!) better than using the actual keywords. So many things can go wrong, accidentally, with comma-separated lists of tables, including accidental cross joins. You don’t want those!
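
To see how easily that happens, consider this sketch (assuming days is a table or CTE containing the generated dates):

SELECT *
FROM days, departments AS d
-- If this predicate is forgotten (or accidentally deleted), the query
-- silently degenerates into a full cartesian product
WHERE day >= d.created_at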

INNER JOIN (Theta-JOIN)

Building on top of the previous CROSS JOIN operation, INNER JOIN (or just simply JOIN, sometimes also called "THETA" JOIN) allows for filtering the outcome of the cartesian product by some predicate. Most of the time, we’re putting that predicate in an ON clause, and it could be something like this:

SELECT *

-- Same as before
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)

-- Now, exclude all days/departments combinations for
-- days before the department was created
JOIN departments AS d ON day >= d.created_at

In most databases, the INNER keyword is optional, so I’m just omitting it in this article.

Notice how the INNER JOIN operation allows for putting arbitrary predicates in the ON clause, which is again very useful when doing reporting. Just like in the previous CROSS JOIN example, we’re combining all days with all departments, but then we retain only those days/departments combinations for which the department already existed, i.e. for which the department creation preceded the day.

Again, using this data:

+--------+   +------------+------------+
| day    |   | department | created_at |
+--------+   +------------+------------+
| Jan 01 |   | Dept 1     | Jan 10     |
| Jan 02 |   | Dept 2     | Jan 11     |
| ...    |   | Dept 3     | Jan 12     |
| Jan 30 |   +------------+------------+
| Jan 31 |
+--------+

The result would now look like this:

+--------+------------+
| day    | department |
+--------+------------+
| Jan 10 | Dept 1     |

| Jan 11 | Dept 1     |
| Jan 11 | Dept 2     |

| Jan 12 | Dept 1     |
| Jan 12 | Dept 2     |
| Jan 12 | Dept 3     |

| Jan 13 | Dept 1     |
| Jan 13 | Dept 2     |
| Jan 13 | Dept 3     |
| ...    | ...        |
| Jan 31 | Dept 1     |
| Jan 31 | Dept 2     |
| Jan 31 | Dept 3     |
+--------+------------+

So, we’d get no results prior to January 10; those rows were filtered out.

Characteristics

INNER JOIN operations are filtered CROSS JOIN operations. This means that if one of the two tables is empty, then the result is also guaranteed to be empty. But unlike with CROSS JOIN, because of the predicate, we can get fewer results than the CROSS JOIN would deliver.

In other words:

size(result) <= size(days) * size(departments)

Alternative syntaxes

While the ON clause is mandatory with the INNER JOIN operation, you’re not required to place JOIN predicates in there (although it is highly recommended from a readability perspective). Most databases will optimise the following, equivalent query in just the same way:

SELECT *
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)

-- You can always JOIN .. ON true (or 1 = 1 in other DBs)
-- to turn a syntactic INNER JOIN into a semantic CROSS JOIN
JOIN departments AS d ON true

-- ... and then turn the CROSS JOIN back into an INNER JOIN
-- by putting the JOIN predicate in the WHERE clause:
WHERE day >= d.created_at

Of course, again, that is just obfuscating the query for the reader, but you may have your reasons, right? If we take this one step further, the following query is also equivalent, because most optimisers can figure out the equivalence and execute an INNER JOIN instead:

SELECT *
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)

-- Now, this is really a syntactic CROSS JOIN
CROSS JOIN departments AS d
WHERE day >= d.created_at

… and, as seen before, CROSS JOIN is just syntax sugar for comma-separated table lists. In this case, we retain the WHERE clause to get what people have frequently done prior to the introduction of ANSI JOIN syntax:

SELECT *
FROM
  generate_series(
    '2017-01-01'::TIMESTAMP,
    '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
    INTERVAL '1 day'
  ) AS days(day),
  departments AS d
WHERE day >= d.created_at

All of these syntaxes do the same thing, usually without performance penalty, but clearly, they’re all less readable than the original INNER JOIN syntax.

EQUI JOIN

Sometimes, e.g. in literature, you will hear the term EQUI JOIN where "EQUI" isn’t really meant as a SQL keyword, but just as a specific way of writing a special kind of INNER JOIN.

In fact, it is weird that "EQUI" JOIN is the special case, because it’s what we’re doing most in SQL, also in OLTP applications where we simply JOIN by primary key / foreign key relationship. For instance:

SELECT *
FROM actor AS a
JOIN film_actor AS fa ON a.actor_id = fa.actor_id
JOIN film AS f ON f.film_id = fa.film_id

The above query selects all actors and their films. There are two INNER JOIN operations, one connecting actors to the corresponding entries in the film_actor relationship table (because actors can have many films and films can have many actors), and the other connecting each film_actor relationship with additional information about the film itself.

Characteristics

The characteristics of this operation are the same as those of the “general” INNER JOIN operation. An "EQUI" JOIN is still a cartesian product (CROSS JOIN) with a reduced set of results, i.e. containing only those actor/film combinations for which a given actor actually played in the given film.

So again, in other words:

size(result) <= size(actor) * size(film)

The result size can only be equal to the actor size multiplied by the film size if every actor played in every film, which is unlikely.

Alternative syntaxes: USING

Again, as before, we could write the INNER JOIN operation without an actual INNER JOIN syntax, but CROSS JOIN or comma-separated table lists instead. That’s boring, but much more interesting are the following two alternative syntaxes, one of which is very very useful:

SELECT *
FROM actor
JOIN film_actor USING (actor_id)
JOIN film USING (film_id)

The USING clause replaces the ON clause and allows for listing a set of columns that have to be present in both sides of the JOIN operation. If you carefully design your database in the same way as the Sakila database, namely such that every foreign key column has the same name as its referenced primary key column (e.g. actor.actor_id = film_actor.actor_id), then you can use USING for "EQUI" JOIN in at least these databases:

  • Derby
  • Firebird
  • HSQLDB
  • Ingres
  • MariaDB
  • MySQL
  • Oracle
  • PostgreSQL
  • SQLite
  • Vertica

These databases, unfortunately, don’t support the syntax:

  • Access
  • Cubrid
  • DB2
  • H2
  • HANA
  • Informix
  • SQL Server
  • Sybase ASE
  • Sybase SQL Anywhere

While this produces exactly (well almost) the same result as the ON clause, it’s much quicker to read and write. I said almost because some databases (and the SQL standard) specify that any column appearing in the USING clause loses its qualification. For instance:

SELECT
  f.title,   -- Ordinary column, can be qualified
  f.film_id, -- USING column, shouldn't be qualified
  film_id    -- USING column, correct / non-ambiguous here
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)

Also, of course, this syntax is a bit limited. Sometimes, you have several foreign keys in your tables, not all of which can have the primary key column name(s). For instance:

CREATE TABLE film (
  ..
  language_id          BIGINT REFERENCES language,
  original_language_id BIGINT REFERENCES language
)

If you want to join by ORIGINAL_LANGUAGE_ID, you’ll have to resort to the ON clause.
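
For instance, here’s a sketch of such a query, joining the language table twice, once per foreign key (the second join is a LEFT JOIN, hedging against ORIGINAL_LANGUAGE_ID being NULL):

SELECT *
FROM film AS f

-- This join could also use USING (language_id) ...
JOIN language AS l
  ON f.language_id = l.language_id

-- ... but this one needs the explicit ON clause
LEFT JOIN language AS ol
  ON f.original_language_id = ol.language_id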

Alternative syntaxes: NATURAL JOIN

A more extreme and much less useful form of "EQUI" JOIN is the NATURAL JOIN clause. The previous example could be further “improved” by replacing USING by NATURAL JOIN like this:

SELECT *
FROM actor
NATURAL JOIN film_actor
NATURAL JOIN film

Notice how we no longer need to specify any JOIN criteria, because a NATURAL JOIN will automatically take all the columns that share the same name from both tables that it joins and place them in a “hidden” USING clause. As we’ve seen before, as primary keys and foreign keys have the same column name, this appears quite useful.

Truth is: It isn’t useful. In the Sakila database, every table also has a LAST_UPDATE column, which is automatically taken into consideration by NATURAL JOIN. The NATURAL JOIN query is thus equivalent to this one:

SELECT *
FROM actor
JOIN film_actor USING (actor_id, last_update)
JOIN film USING (film_id, last_update)

… which of course makes absolutely no sense at all. So, forget about NATURAL JOIN immediately (except for some very rare cases, e.g. when joining Oracle’s diagnostics views like v$sql NATURAL JOIN v$sql_plan, etc., for ad-hoc analytics).
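
If you’re curious, that rare ad-hoc use-case looks something like this. A sketch only: the exact set of shared columns (SQL_ID, ADDRESS, CHILD_NUMBER, etc.), and thus the join, depends on your Oracle version:

-- Spelling out all the shared join columns would be tedious here
SELECT *
FROM v$sql
NATURAL JOIN v$sql_plan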

OUTER JOIN

We’ve seen INNER JOIN before, which returns results only for combinations of the left / right table, for which the ON predicate yields true.

OUTER JOIN allows us to retain rows on the left / right side, for which we didn’t find a matching combination. Let’s go back to the days and departments example:

SELECT *
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)
LEFT JOIN departments AS d ON day >= d.created_at

Again, the OUTER keyword is optional, so I’m omitting it in the examples.

This query is very subtly different from its INNER JOIN counterpart in that it will always return at least one row per day, even if for a given day, there is no department that had already been created by that day. For instance: all departments were created on January 10 or later. The above query will still return January 1-9:

+--------+   +------------+------------+
| day    |   | department | created_at |
+--------+   +------------+------------+
| Jan 01 |   | Dept 1     | Jan 10     |
| Jan 02 |   | Dept 2     | Jan 11     |
| ...    |   | Dept 3     | Jan 12     |
| Jan 30 |   +------------+------------+
| Jan 31 |
+--------+

In addition to the rows we got before in the INNER JOIN example, we now also have all the days from January 1-9 with NULL departments:

+--------+------------+
| day    | department |
+--------+------------+
| Jan 01 |            | -- Extra rows with no match here
| Jan 02 |            | -- Extra rows with no match here
| ...    |            | -- Extra rows with no match here
| Jan 09 |            | -- Extra rows with no match here
| Jan 10 | Dept 1     |
| Jan 11 | Dept 1     |
| Jan 11 | Dept 2     |
| Jan 12 | Dept 1     |
| Jan 12 | Dept 2     |
| Jan 12 | Dept 3     |
| Jan 13 | Dept 1     |
| Jan 13 | Dept 2     |
| Jan 13 | Dept 3     |
| ...    | ...        |
| Jan 31 | Dept 1     |
| Jan 31 | Dept 2     |
| Jan 31 | Dept 3     |
+--------+------------+

In other words, each day appears at least once in the result. LEFT JOIN does this for the left table, i.e. it retains all rows from the left table in the result.

Formally, a LEFT OUTER JOIN is an INNER JOIN with a UNION like this:

-- Convenient syntax:
SELECT *
FROM a LEFT JOIN b ON <predicate>

-- Cumbersome, equivalent syntax:
SELECT a.*, b.*
FROM a JOIN b ON <predicate>
UNION ALL
SELECT a.*, NULL, NULL, ..., NULL
FROM a
WHERE NOT EXISTS (
  SELECT * FROM b WHERE <predicate>
)

We’ll talk about NOT EXISTS further down, when talking about "SEMI" JOIN and "ANTI" JOIN.

RIGHT OUTER JOIN

RIGHT OUTER JOIN does exactly the opposite. It retains all the rows from the right table in the result. Let’s add more departments:

+--------+   +------------+------------+
| day    |   | department | created_at |
+--------+   +------------+------------+
| Jan 01 |   | Dept 1     | Jan 10     |
| Jan 02 |   | Dept 2     | Jan 11     |
| ...    |   | Dept 3     | Jan 12     |
| Jan 30 |   | Dept 4     | Apr 01     |
| Jan 31 |   | Dept 5     | Apr 02     |
+--------+   +------------+------------+

The new departments 4 and 5 will not be in the previous results at all, because they were created on a day after January 31. But they will appear in a RIGHT JOIN result, because departments is the right table of the join operation, and all the rows from the right table will be retained.

Running this query:

SELECT *
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)
RIGHT JOIN departments AS d ON day >= d.created_at

… will yield something like this:

+--------+------------+
| day    | department |
+--------+------------+
| Jan 10 | Dept 1     |
| Jan 11 | Dept 1     |
| Jan 11 | Dept 2     |
| Jan 12 | Dept 1     |
| Jan 12 | Dept 2     |
| Jan 12 | Dept 3     |
| Jan 13 | Dept 1     |
| Jan 13 | Dept 2     |
| Jan 13 | Dept 3     |
| ...    | ...        |
| Jan 31 | Dept 1     |
| Jan 31 | Dept 2     |
| Jan 31 | Dept 3     |
|        | Dept 4     | -- Extra rows with no match here
|        | Dept 5     | -- Extra rows with no match here
+--------+------------+

In most cases (I have yet to run into a situation where this isn’t the case), each LEFT OUTER JOIN expression can be transformed into an equivalent RIGHT OUTER JOIN expression, and vice-versa. Because RIGHT OUTER JOIN is usually a bit less readable, most people use LEFT OUTER JOIN only.
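
The transformation is mechanical: swap the two sides of the JOIN and flip the keyword. Here’s the previous RIGHT JOIN rewritten as a LEFT JOIN; only the column order of SELECT * changes, the rows are the same:

SELECT *
FROM departments AS d
LEFT JOIN generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day) ON day >= d.created_at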

FULL OUTER JOIN

Finally, there is also FULL OUTER JOIN, which retains all rows from both sides of the JOIN operation. In our example, this means that each day appears at least once in the result, just like each department appears at least once in the result.

Let’s take again this data:

+--------+   +------------+------------+
| day    |   | department | created_at |
+--------+   +------------+------------+
| Jan 01 |   | Dept 1     | Jan 10     |
| Jan 02 |   | Dept 2     | Jan 11     |
| ...    |   | Dept 3     | Jan 12     |
| Jan 30 |   | Dept 4     | Apr 01     |
| Jan 31 |   | Dept 5     | Apr 02     |
+--------+   +------------+------------+

And now, let’s run this query:

SELECT *
FROM generate_series(
  '2017-01-01'::TIMESTAMP,
  '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
  INTERVAL '1 day'
) AS days(day)
FULL JOIN departments AS d ON day >= d.created_at

The result will now look something like this:

+--------+------------+
| day    | department |
+--------+------------+
| Jan 01 |            | -- row from the left table
| Jan 02 |            | -- row from the left table
| ...    |            | -- row from the left table
| Jan 09 |            | -- row from the left table
| Jan 10 | Dept 1     |
| Jan 11 | Dept 1     |
| Jan 11 | Dept 2     |
| Jan 12 | Dept 1     |
| Jan 12 | Dept 2     |
| Jan 12 | Dept 3     |
| Jan 13 | Dept 1     |
| Jan 13 | Dept 2     |
| Jan 13 | Dept 3     |
| ...    | ...        |
| Jan 31 | Dept 1     |
| Jan 31 | Dept 2     |
| Jan 31 | Dept 3     |
|        | Dept 4     | -- row from the right table
|        | Dept 5     | -- row from the right table 
+--------+------------+

If you insist, formally, a FULL OUTER JOIN is an INNER JOIN with two UNIONs like this:

-- Convenient syntax:
SELECT *
FROM a LEFT JOIN b ON <predicate>

-- Cumbersome, equivalent syntax:
SELECT a.*, b.*
FROM a JOIN b ON <predicate>
-- LEFT JOIN part
UNION ALL
SELECT a.*, NULL, NULL, ..., NULL
FROM a
WHERE NOT EXISTS (
  SELECT * FROM b WHERE <predicate>
)
-- RIGHT JOIN part
UNION ALL
SELECT NULL, NULL, ..., NULL, b.*
FROM b
WHERE NOT EXISTS (
  SELECT * FROM a WHERE <predicate>
)

Alternative syntaxes: “EQUI” OUTER JOIN

The above examples again used the “cartesian product with filter” kind of JOIN. Much more common, however, is the "EQUI" OUTER JOIN approach, where we join on a primary key / foreign key relationship. Let’s go back to the Sakila database example. Some actors haven’t played in any film, and we might want to query them like this:

SELECT *
FROM actor
LEFT JOIN film_actor USING (actor_id)
LEFT JOIN film USING (film_id)

This query will return all actors at least once, regardless if they played in a film. If we also want all films that do not have any actors (yet), we could use a FULL OUTER JOIN to combine the results:

SELECT *
FROM actor
FULL JOIN film_actor USING (actor_id)
FULL JOIN film USING (film_id)

And of course, this also works with NATURAL LEFT JOIN, NATURAL RIGHT JOIN, NATURAL FULL JOIN, but again, these aren’t useful at all, as we’d be joining USING (..., LAST_UPDATE), which makes no sense.

Alternative syntaxes: Oracle and SQL Server style OUTER JOIN

These two databases had OUTER JOIN before the ANSI syntax was established. It looked like this:

-- Oracle
SELECT *
FROM actor a, film_actor fa, film f
WHERE a.actor_id = fa.actor_id(+)
AND fa.film_id = f.film_id(+)

-- SQL Server
SELECT *
FROM actor a, film_actor fa, film f
WHERE a.actor_id *= fa.actor_id
AND fa.film_id *= f.film_id

That’s nice, given that at some point in time (in the 80s??), ANSI didn’t specify OUTER JOIN yet. But the 80s are more than 30 years ago. So, it’s safe to say this stuff is obsolete.

SQL Server did the right thing and deprecated (and later removed) that syntax a long time ago. Oracle still supports it for backwards compatibility reasons.

But nothing about this syntax is reasonable or readable. Don’t use it. Replace it with ANSI JOIN.
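
For reference, the ANSI rewrite of the two legacy queries above looks like this:

SELECT *
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
LEFT JOIN film f ON fa.film_id = f.film_id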

PARTITIONED OUTER JOIN

This is Oracle specific, but I must say, it’s a real shame none of the other databases have stolen the feature yet. Remember the CROSS JOIN operation that we used to combine each day with each department? Because, sometimes, that’s what we want in the result: all combinations, and if there is a match, also the matching values in the row.

That’s hard to describe in words, and much easier with an example. Here’s the query with Oracle syntax:

WITH 

  -- Using CONNECT BY to generate all dates in January
  days(day) AS (
    SELECT DATE '2017-01-01' + LEVEL - 1
    FROM dual
    CONNECT BY LEVEL <= 31
  ),

  -- Our departments
  departments(department, created_at) AS (
    SELECT 'Dept 1', DATE '2017-01-10' FROM dual UNION ALL
    SELECT 'Dept 2', DATE '2017-01-11' FROM dual UNION ALL
    SELECT 'Dept 3', DATE '2017-01-12' FROM dual UNION ALL
    SELECT 'Dept 4', DATE '2017-04-01' FROM dual UNION ALL
    SELECT 'Dept 5', DATE '2017-04-02' FROM dual
  )
SELECT *
FROM days 
LEFT JOIN departments 
  PARTITION BY (department) -- This is where the magic happens
  ON day >= created_at

Unfortunately, PARTITION BY is used in various contexts with different meanings (e.g. for window functions). In this case, it means that we “partition” our data by the departments.department column, creating a “partition” for each department. Now, each partition will get a copy for each day, regardless if there’s a match in our predicate (unlike in the ordinary LEFT JOIN case, where we had a bunch of “department-less” days). The result of the above query is now this:

+--------+------------+------------+
| day    | department | created_at |
+--------+------------+------------+
| Jan 01 | Dept 1     |            | -- Didn't match, but still get row
| Jan 02 | Dept 1     |            | -- Didn't match, but still get row
| ...    | Dept 1     |            | -- Didn't match, but still get row
| Jan 09 | Dept 1     |            | -- Didn't match, but still get row
| Jan 10 | Dept 1     | Jan 10     | -- Matches, so get join result
| Jan 11 | Dept 1     | Jan 10     | -- Matches, so get join result
| Jan 12 | Dept 1     | Jan 10     | -- Matches, so get join result
| ...    | Dept 1     | Jan 10     | -- Matches, so get join result
| Jan 31 | Dept 1     | Jan 10     | -- Matches, so get join result

| Jan 01 | Dept 2     |            | -- Didn't match, but still get row
| Jan 02 | Dept 2     |            | -- Didn't match, but still get row
| ...    | Dept 2     |            | -- Didn't match, but still get row
| Jan 09 | Dept 2     |            | -- Didn't match, but still get row
| Jan 10 | Dept 2     |            | -- Didn't match, but still get row
| Jan 11 | Dept 2     | Jan 11     | -- Matches, so get join result
| Jan 12 | Dept 2     | Jan 11     | -- Matches, so get join result
| ...    | Dept 2     | Jan 11     | -- Matches, so get join result
| Jan 31 | Dept 2     | Jan 11     | -- Matches, so get join result

| Jan 01 | Dept 3     |            | -- Didn't match, but still get row
| Jan 02 | Dept 3     |            | -- Didn't match, but still get row
| ...    | Dept 3     |            | -- Didn't match, but still get row
| Jan 09 | Dept 3     |            | -- Didn't match, but still get row
| Jan 10 | Dept 3     |            | -- Didn't match, but still get row
| Jan 11 | Dept 3     |            | -- Didn't match, but still get row
| Jan 12 | Dept 3     | Jan 12     | -- Matches, so get join result
| ...    | Dept 3     | Jan 12     | -- Matches, so get join result
| Jan 31 | Dept 3     | Jan 12     | -- Matches, so get join result

| Jan 01 | Dept 4     |            | -- Didn't match, but still get row
| Jan 02 | Dept 4     |            | -- Didn't match, but still get row
| ...    | Dept 4     |            | -- Didn't match, but still get row
| Jan 31 | Dept 4     |            | -- Didn't match, but still get row

| Jan 01 | Dept 5     |            | -- Didn't match, but still get row
| Jan 02 | Dept 5     |            | -- Didn't match, but still get row
| ...    | Dept 5     |            | -- Didn't match, but still get row
| Jan 31 | Dept 5     |            | -- Didn't match, but still get row
+--------+------------+------------+

As you can see, I’ve visually created 5 partitions for the 5 departments. Each partition combines the department with each day, but unlike when doing a CROSS JOIN, we’re now getting actual LEFT JOIN .. ON .. results in case there is a match for the predicate. This is a really nice feature for reporting in Oracle!
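
If you’re not on Oracle, here’s a sketch of how to emulate the feature, assuming days is a table or CTE and department values are unique: produce all the combinations with a CROSS JOIN first, then re-attach the matching values with a LEFT JOIN:

SELECT d.day, dep.department, dep2.created_at
FROM days AS d
CROSS JOIN departments AS dep

-- Self-join the departments to re-attach created_at only where
-- the original predicate matches
LEFT JOIN departments AS dep2
  ON dep2.department = dep.department
  AND d.day >= dep2.created_at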

SEMI JOIN

In relational algebra, there is a notion of a semi join operation, which unfortunately doesn’t have a syntax representation in SQL. If it did, the syntax would probably be LEFT SEMI JOIN and RIGHT SEMI JOIN, just like the Cloudera Impala syntax extension offers.

What is a "SEMI" JOIN?

When writing something like the fictional query below:

SELECT *
FROM actor
LEFT SEMI JOIN film_actor USING (actor_id)

What we really mean is we want all actors that played in films. But we don’t want any films in the results, just the actors. More specifically, we don’t want each actor several times, once per film. We want each actor only once (or zero times) in the result.

Semi is Latin for “half”, i.e. we implement only “half the join”, in this case, the left half.

In SQL, there are two alternative syntaxes that can emulate "SEMI" JOIN:

Alternative syntax: EXISTS

This is the more powerful, and a bit more verbose, syntax:

SELECT *
FROM actor a
WHERE EXISTS (
  SELECT * FROM film_actor fa
  WHERE a.actor_id = fa.actor_id
)

We’re looking for all the actors for which there exists a film, i.e. actors that played in a film. With this syntax (i.e. "SEMI" JOIN placed in the WHERE clause), it’s immediately clear that we can get each actor at most once in the result. There’s no actual JOIN in the syntax.

Nonetheless, most databases will be able to recognise that what’s going on here is really a "SEMI" JOIN, and not just an ordinary EXISTS() predicate. For instance, consider the Oracle execution plan for the above query:

hash-join-semi

Note how Oracle calls the operation “HASH JOIN (SEMI)” – the SEMI keyword is present here. The same is true for PostgreSQL:

nested-loop-semi-join

Or SQL Server:

nested-loops-left-semi-join

Apart from being the optimal solution in terms of correctness, there are also some performance benefits when using "SEMI" JOIN rather than INNER JOIN, as the database can stop looking for matches as soon as it finds the first one!
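
For comparison, here’s the often seen DISTINCT workaround. It produces the same actors, but it forces the database to produce every actor/film_actor match first, and to remove the duplicates afterwards:

-- Usually slower than the EXISTS version above
SELECT DISTINCT a.*
FROM actor AS a
JOIN film_actor AS fa ON a.actor_id = fa.actor_id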

Alternative syntax: IN

IN and EXISTS are exactly equivalent "SEMI" JOIN emulations. The following query will produce the same plan as the previous EXISTS query in most databases (not MySQL):

SELECT *
FROM actor
WHERE actor_id IN (
  SELECT actor_id FROM film_actor
)

If your database supports both syntaxes for the "SEMI" JOIN operation, you may choose whichever you prefer from a stylistic point of view.

This is different with …

ANTI JOIN

In principle, "ANTI" JOIN is just the opposite of "SEMI" JOIN. When writing something like the fictional query below:

SELECT *
FROM actor
LEFT ANTI JOIN film_actor USING (actor_id)

… what we’re doing is we take all the actors that didn’t play in any film. Unfortunately, again, SQL doesn’t have a built-in syntax for this operation, but we can emulate it with EXISTS:

Alternative syntax: NOT EXISTS

The following query has exactly the expected semantics:

SELECT *
FROM actor a
WHERE NOT EXISTS (
  SELECT * FROM film_actor fa
  WHERE a.actor_id = fa.actor_id
)

(Dangerous) alternative syntax: NOT IN

Watch out! While EXISTS and IN are equivalent, NOT EXISTS and NOT IN are not equivalent. This is because of NULL values.

In this particular case, the following NOT IN query is equivalent to the previous NOT EXISTS query, because our film_actor table has a NOT NULL constraint on film_actor.actor_id:

SELECT *
FROM actor
WHERE actor_id NOT IN (
  SELECT actor_id FROM film_actor
)

If, however, actor_id became nullable, then the query would be wrong. Don’t believe it? Try running this:

SELECT *
FROM actor
WHERE actor_id NOT IN (1, 2, NULL)

It will not return any records. Why? Because NULL is the UNKNOWN value in SQL. So, the predicate above is the same as this:

SELECT *
FROM actor
WHERE actor_id NOT IN (1, 2, UNKNOWN)

And because we cannot be sure whether actor_id is in a set of values of which one value is UNKNOWN (is it 4? Or 5? Or -1?), the whole predicate becomes UNKNOWN:

SELECT *
FROM actor
WHERE UNKNOWN

Here’s a nice article about three-valued logic by Joe Celko, if you want to learn more.

So, I cannot say this enough:

Do not use NOT IN predicates in SQL, ever, unless you put constant, non-null values there.

— Lukas Eder. Just now.

Don’t even gamble on the presence of a NOT NULL constraint. Perhaps, some DBA might temporarily turn off the constraint to load some data and your query will be wrong for the time being. Just use NOT EXISTS. Or, in some cases…

(Dangerous) alternative syntax: LEFT JOIN / IS NULL

Strangely enough, some people prefer the following syntax:

SELECT *
FROM actor a
LEFT JOIN film_actor fa
USING (actor_id)
WHERE film_id IS NULL

It’s correct as we’re:

  • Joining films to their actors
  • Retaining all actors without films (LEFT JOIN)
  • Retaining only actors without films (film_id IS NULL)

Now, I personally don’t like this syntax, as it doesn’t communicate the intent of an "ANTI" JOIN at all. And chances are, it is slower, because your optimiser doesn’t recognise this as an "ANTI" JOIN operation (or in fact, it cannot formally prove that it is one). So, again, use NOT EXISTS instead.

An interesting (but a bit outdated) blog post comparing the three syntaxes is this one here:
https://explainextended.com/2009/09/15/not-in-vs-not-exists-vs-left-join-is-null-sql-server

LATERAL JOIN

LATERAL is a relatively new keyword in the SQL standard, and it is supported in PostgreSQL and Oracle. SQL Server folks have had a vendor-specific alternative syntax using the APPLY keyword forever (which I personally much prefer). Let’s look at an example, using the PostgreSQL / Oracle LATERAL keyword:

WITH 
  departments(department, created_at) AS (
    VALUES ('Dept 1', DATE '2017-01-10'),
           ('Dept 2', DATE '2017-01-11'),
           ('Dept 3', DATE '2017-01-12'),
           ('Dept 4', DATE '2017-04-01'),
           ('Dept 5', DATE '2017-04-02')
  )
SELECT *
FROM departments AS d
CROSS JOIN LATERAL generate_series(
  d.created_at, -- We can dereference a column from department!
  '2017-01-31'::TIMESTAMP, 
  INTERVAL '1 day'
) AS days(day)

Indeed, instead of a CROSS JOIN between all departments and all days, why not just generate the necessary days for each department directly? That’s what LATERAL does. It is a prefix to the right-hand side of any JOIN operation (including INNER JOIN, LEFT OUTER JOIN, etc.) that allows the right-hand side to access columns from the left-hand side.

This of course has nothing to do with relational algebra anymore, because it imposes a JOIN order (from left to right). But sometimes, that’s OK and sometimes, your table-valued function (or subquery) is so complex, that’s the only way you can actually use it.

Another very popular use-case is to join a “TOP-N” query to a normal table – e.g. if you want to find each actor, and their TOP 5 best selling films:

SELECT a.first_name, a.last_name, f.*
FROM actor AS a
LEFT OUTER JOIN LATERAL (
  SELECT f.title, SUM(amount) AS revenue
  FROM film AS f
  JOIN film_actor AS fa USING (film_id)
  JOIN inventory AS i USING (film_id)
  JOIN rental AS r USING (inventory_id)
  JOIN payment AS p USING (rental_id)
  WHERE fa.actor_id = a.actor_id
  GROUP BY f.film_id
  ORDER BY revenue DESC
  LIMIT 5
) AS f
ON true

The result being something like:

outer-join-lateral-result

Don’t worry about the long list of joins in the derived table, that’s just how we have to get from the FILM table to the PAYMENT table in the Sakila database:

sakila1

Essentially, the subquery is calculating those TOP 5 best selling films per actor. So, it isn’t a “classic” derived table, but a correlated subquery that returns more than one row and more than one column. We’re all used to writing correlated subqueries like these:

SELECT 
  a.first_name, 
  a.last_name, 
  (SELECT count(*) 
   FROM film_actor AS fa 
   WHERE fa.actor_id = a.actor_id) AS films
FROM actor AS a

The result of the above correlated subquery is one row and one column. If you ever need to return more than one row and/or more than one column from a correlated subquery, LATERAL or APPLY will be your friend.

Note also how I’ve used LEFT OUTER JOIN together with LATERAL, and then I needed to put a dummy ON true clause for syntax correctness. Using LATERAL with OUTER JOIN will again always retain the left side of the JOIN, i.e. we’ll retain actors that didn’t play in any film.

Characteristics:

The LATERAL keyword doesn’t really change the semantics of the JOIN type that it is applied to. If you’re running a CROSS JOIN LATERAL, the result size is still

size(result) = size(left table) * size(right table)

Even if the right table is produced on a per-row-of-the-left-table basis.

You can also use LATERAL with OUTER JOIN, in which case the left rows will be retained even if your table function doesn’t return any rows on the right side.
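
A minimal sketch of that, using the departments example again. Departments created after January 31 yield no rows from the function, but are still retained with a NULL day:

SELECT *
FROM departments AS d
LEFT JOIN LATERAL generate_series(
  d.created_at,
  '2017-01-31'::TIMESTAMP,
  INTERVAL '1 day'
) AS days(day) ON true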

Alternative syntax: APPLY

SQL Server didn’t choose the confusing LATERAL keyword, they introduced the APPLY keyword (more specifically: CROSS APPLY and OUTER APPLY) a long time ago, which makes more sense, because we’re applying a function to each row of a table. Let’s assume for a moment that we have a generate_series() function in SQL Server:

-- Use with care, this is quite inefficient!
CREATE FUNCTION generate_series(@d1 DATE, @d2 DATE)
RETURNS TABLE AS
RETURN
  WITH t(d) AS (
    SELECT @d1 
    UNION ALL
    SELECT DATEADD(day, 1, d) 
    FROM t
    WHERE d < @d2
  ) 
  SELECT * FROM t;

Then, we could use CROSS APPLY to call that function for each department:

WITH 
  departments AS (
    SELECT * FROM (
      VALUES ('Dept 1', CAST('2017-01-10' AS DATE)),
             ('Dept 2', CAST('2017-01-11' AS DATE)),
             ('Dept 3', CAST('2017-01-12' AS DATE)),
             ('Dept 4', CAST('2017-04-01' AS DATE)),
             ('Dept 5', CAST('2017-04-02' AS DATE))
    ) d(department, created_at)
  )
SELECT *
FROM departments AS d
CROSS APPLY dbo.generate_series(
  d.created_at, -- We can dereference a column from department!
  CAST('2017-01-31' AS DATE)
)

What’s so nice about this syntax is that – again – we’re applying a function to each row of a table, and that function produces rows. Does it ring a bell? In Java 8, we would use Stream.flatMap() for that! Consider the following stream usage:

departments.stream()
           .flatMap(department -> generateSeries(
                             department.createdAt, 
                             LocalDate.parse("2017-01-31"))
                         .map(day -> tuple(department, day))
           );

What’s going on here?

  • The DEPARTMENTS table is just a Java departments stream
  • We flatMap the departments stream using a function that produces tuples for each department
  • Those tuples include the department itself, and a day generated from a series of days starting with the department’s createdAt day

Same story! SQL CROSS APPLY / CROSS JOIN LATERAL is the same thing as Java’s Stream.flatMap(). In fact, SQL and streams aren’t too different anyway. For more info, read this blog post here.

Note: Just like we could write LEFT OUTER JOIN LATERAL, we can also write OUTER APPLY in case we want to retain the left side of the JOIN expression.

MULTISET

Few databases implement this (in fact, only Oracle), but if you think about it, it’s an awesome JOIN type. One that creates nested collections. If all databases implemented it, we wouldn’t need ORMs!

A hypothetical example (that uses SQL standard syntax, not Oracle’s) looks like this:

SELECT a.*, MULTISET (
  SELECT f.*
  FROM film AS f
  JOIN film_actor AS fa USING (film_id)
  WHERE a.actor_id = fa.actor_id
) AS films
FROM actor AS a

The MULTISET operator takes a correlated subquery argument, and aggregates all its resulting rows in a nested collection. This works in a similar fashion as a LEFT OUTER JOIN (we’re getting all actors, and if they have films, we’re getting all their films too), but instead of duplicating all the actors in the result set, we’re collecting them into a nested collection.

Just like what we would be doing in an ORM, when fetching stuff into this structure:

@Entity
class Actor {
  @ManyToMany
  List<Film> films;
}

@Entity
class Film {
}

Ignore the incompleteness of the JPA annotations used, I just want to show the strength of nested collections. Unlike in ORMs, the SQL MULTISET operator allows for collecting arbitrary results from correlated subqueries into nested collections – not just actual entities. This is a million times more powerful than what any ORM could ever dream of.

Alternative syntaxes: Oracle

As I said, Oracle actually supports MULTISET, but you cannot create ad-hoc nested collections. For some reason, Oracle chose to implement nominal typing for these nested collections, rather than the usual SQL-style structural typing. So you have to declare your types in advance:

CREATE TYPE film_t AS OBJECT ( ... );
CREATE TYPE film_tt AS TABLE OF film_t;

SELECT 
  a.*, 
  CAST (
    MULTISET (
      SELECT f.*
      FROM film f
      JOIN film_actor fa USING (film_id)
      WHERE a.actor_id = fa.actor_id
    ) AS film_tt
  ) AS films
FROM actor a

A bit more verbose, but still does the trick! Excellent!

Alternative syntaxes: PostgreSQL

For once, the awesome PostgreSQL is missing out on an excellent SQL standard feature, but there’s a workaround: arrays! And this time, we can use structural types, yay! So the following query will return a nested array of rows in PostgreSQL:

SELECT
  a AS actor,
  array_agg(
    f
  ) AS films
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
GROUP BY a

The result is everyone’s ORDBMS dream! Nested records and collections everywhere (in only two columns):

actor                  films
--------------------   ----------------------------------------------------
(1,PENELOPE,GUINESS)   {(1,ACADEMY DINOSAUR),(23,ANACONDA CONFESSIONS),...}
(2,NICK,WAHLBERG)      {(3,ADAPTATION HOLES),(31,APACHE DIVINE),...}
(3,ED,CHASE)           {(17,ALONE TRIP),(40,ARMY FLINTSTONES),...}

If you don’t find this exciting, I don’t know what is.

Conclusion

This was A Probably Incomplete, Comprehensive Guide to the Many Different Ways to JOIN Tables in SQL. I hope you’ve found 1-2 new tricks in this article. JOIN is only one of many very interesting SQL operations. If you’ve found this information useful, you will also like:

jOOQ 3.10 will Support SQL Server’s Table Valued Parameters


SQL Server has this nice feature called table-valued parameters (TVP), where users can pass table variables to a stored procedure for bulk data processing. This is particularly nice when the stored procedure is an inline table valued function, i.e. a function that returns a table as well. For instance:

CREATE TYPE numbers AS TABLE (i INTEGER);

CREATE FUNCTION cross_multiply (
  @numbers numbers READONLY
)
RETURNS @result TABLE (
  i1 INTEGER,
  i2 INTEGER,
  product INTEGER
)
AS
BEGIN
  INSERT INTO @result
  SELECT n1.i, n2.i, n1.i * n2.i
  FROM @numbers n1
  CROSS JOIN @numbers n2

  RETURN
END

The above function creates a cross product of a table with itself, and multiplies each possible combination. So, when calling this with the following table argument:

DECLARE @arg NUMBERS;
INSERT INTO @arg VALUES (1),(2),(3),(4);
SELECT * FROM cross_multiply(@arg);

We’re getting the following, nice result:

i1	i2	product
-----------------------
1	1	1
2	1	2
3	1	3
4	1	4
1	2	2
2	2	4
3	2	6
4	2	8
1	3	3
2	3	6
3	3	9
4	3	12
1	4	4
2	4	8
3	4	12
4	4	16

Easy, eh?

Call the above from Java with JDBC

The SQL Server JDBC driver has recently started supporting TVPs, if you’re ready to use vendor-specific API. If you want to run this T-SQL batch:

DECLARE @arg NUMBERS;
INSERT INTO @arg VALUES (1),(2),(3),(4);
SELECT * FROM cross_multiply(@arg);

In Java, you’d write something along the lines of this:

SQLServerDataTable table = new SQLServerDataTable();
table.addColumnMetadata("i", java.sql.Types.INTEGER);
table.addRow(1);
table.addRow(2);
table.addRow(3);
table.addRow(4);
  
try (SQLServerPreparedStatement stmt=   
    (SQLServerPreparedStatement) connection.prepareStatement(  
       "SELECT * FROM cross_multiply(?)")) {

    // Magic here:
    stmt.setStructured(1, "dbo.numbers", table);  

    try (ResultSet rs = stmt.executeQuery()) {
        ...
    }
}

This is a bit tedious as you have to work through all this API and remember:

  • type names
  • column names
  • column positions

But it works.

Now, call the above from Java, with jOOQ

No problem with jOOQ 3.10. Don’t worry about the boring JDBC data type binding details, as the jOOQ code generator has you covered. As always, all routines are generated classes / methods, and this time, the TABLE type is also a generated type. Let the code speak for itself. Instead of this SQL statement:

DECLARE @arg NUMBERS;
INSERT INTO @arg VALUES (1),(2),(3),(4);
SELECT * FROM cross_multiply(@arg);

You can write the following with jOOQ:

Numbers numbers = new NumbersRecord(
    new NumbersElementTypeRecord(1),
    new NumbersElementTypeRecord(2),
    new NumbersElementTypeRecord(3),
    new NumbersElementTypeRecord(4)
);

// Standalone function call:
Result<CrossMultiplyRecord> r1 = 
    crossMultiply(configuration, numbers);

// Embedded table-valued function call, with predicate
Result<CrossMultiplyRecord> r2 = 
DSL.using(configuration)
   .selectFrom(crossMultiply(numbers))
   .where(F_CROSS_MULTIPLY.PRODUCT.gt(5))
   .fetch();

System.out.println(r1);
System.out.println(r2);

And the nice printed output will be:

+----+----+-------+
|  i1|  i2|product|
+----+----+-------+
|   1|   1|      1|
|   2|   1|      2|
|   3|   1|      3|
|   4|   1|      4|
|   1|   2|      2|
|   2|   2|      4|
|   3|   2|      6|
|   4|   2|      8|
|   1|   3|      3|
|   2|   3|      6|
|   3|   3|      9|
|   4|   3|     12|
|   1|   4|      4|
|   2|   4|      8|
|   3|   4|     12|
|   4|   4|     16|
+----+----+-------+

+----+----+-------+
|  i1|  i2|product|
+----+----+-------+
|   3|   2|      6|
|   4|   2|      8|
|   2|   3|      6|
|   3|   3|      9|
|   4|   3|     12|
|   2|   4|      8|
|   3|   4|     12|
|   4|   4|     16|
+----+----+-------+

Not only does jOOQ understand table-valued parameters; since jOOQ 3.5, we have also supported table-valued functions, which can be used like any ordinary table:

Result<CrossMultiplyRecord> r2 = 
DSL.using(configuration)
   .selectFrom(crossMultiply(numbers))
   .where(F_CROSS_MULTIPLY.PRODUCT.gt(5))
   .fetch();

As you can see, the function call can be embedded in the FROM clause, it even returns type-safe CrossMultiplyRecord elements (if you’re not using any projection), and you can form predicates on table columns (i.e. function return values), you can join the table, etc.

Excellent! Let’s start using table-valued parameters!

How to Write a Quick and Dirty Converter in jOOQ


One of jOOQ‘s most powerful features is the capability of introducing custom data types, pretending the database actually understands them. For instance, when working with SQL TIMESTAMP types, users mostly want to use the new JSR-310 LocalDateTime, rather than the JDBC java.sql.Timestamp type.

In jOOQ 3.9+, this is a no brainer, as we’ve finally introduced the <javaTimeTypes> flag to automatically generate JSR 310 types instead of JDBC types. But sometimes, you want some custom conversion behaviour, so you write a Converter.

To the rescue come our new jOOQ 3.9+ converter constructors, which essentially take two lambdas to construct a converter for you. For instance:

Converter<Timestamp, LocalDateTime> converter =
Converter.of(
    Timestamp.class,
    LocalDateTime.class,
    t -> t == null ? null : t.toLocalDateTime(),
    u -> u == null ? null : Timestamp.valueOf(u)
);

And you’re set! Even easier, if you don’t need any special null encoding (as above), just write this equivalent converter, instead:

Converter<Timestamp, LocalDateTime> converter =
Converter.ofNullable(
    Timestamp.class,
    LocalDateTime.class,
    Timestamp::toLocalDateTime,
    Timestamp::valueOf
);

Where’s that useful? The code generator needs a concrete converter class, so you cannot use these ad-hoc converters with the code generator, but there are many other places in the jOOQ API where converters are useful, including when you write plain SQL like this:

DSL.field(
    "my_table.my_timestamp", 
    SQLDataType.TIMESTAMP.asConvertedDataType(
        Converter.ofNullable(...)
));

How to Prevent JDBC Resource Leaks with JDBC and with jOOQ


In a recent consulting gig, I was analysing a client’s connection pool issue in a production system, where during some peak loads, all the Java processes involving database interactions just started queueing up until nothing really worked anymore. No exceptions, though, and when the peak load was gone in the evening, everything returned to normal. The database load looked pretty healthy at the time, so no actual database problem was involved – the problem had to be a client side problem.

Weblogic operations teams quickly identified the connection pool to be the bottleneck. All the connections were constantly allocated to some client process. The immediate thought was: A resource leak is happening, and it didn’t show before because this was an exceptional situation: Around the beginning of the new year when everyone wanted to download their electronic documents from the bank (and some new features introduced many more document related database calls).

The obvious problem

That particular system still runs a lot of legacy code in Java 6 style, which means, there are tons of code elements of the following kind:

Connection connection = null;
try {

  // Get the connection from the pool through JNDI
  connection = JDBCHelper.getConnection();
}
finally {

  // Release the connection
  JDBCHelper.close(connection);  
}

While the above code is perfectly fine, and 99% of all database interactions were of the above type, there was an occasional instance of someone badly copy-pasting some code and doing something like this:

Connection connection = JDBCHelper.getConnection();
PreparedStatement stmt = null;

try {
  stmt = connection.prepareStatement("SELECT ...");
}
finally {

  // Release the statement
  JDBCHelper.close(stmt);
}

// But the connection is never released

Sometimes, things were even more subtle, as a utility method expected a connection like this:

// Utility method doesn't have to close the connection:
public void databaseCalls(Connection connection) {
  PreparedStatement stmt = null;

  try {
    stmt = connection.prepareStatement("SELECT ...");
  }
  finally {

    // Release the statement
    JDBCHelper.close(stmt);
  }
}

public void businessLogic() {
  // Oops, subtle connection leak
  databaseCalls(JDBCHelper.getConnection());
}

Thoroughly fixing these things

There’s a quick fix to all these problems. The easiest fix is to just continue rigorously using the JDBCHelper.close() method (or just call connection.close() with appropriate error handling) every time. But apparently, that’s not easy enough as there will always be a non-vigilant developer (or a junior developer who doesn’t know these things), who will get it wrong, who will simply forget things.

I mean, even the official JDBC tutorial gets it “wrong” on their first page:
https://docs.oracle.com/javase/tutorial/jdbc/overview/index.html

The bad example being:

public void connectToAndQueryDatabase(
    String username, String password) {

    Connection con = DriverManager.getConnection(
                         "jdbc:myDriver:myDatabase",
                         username,
                         password);

    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery(
        "SELECT a, b, c FROM Table1");

    while (rs.next()) {
        int x = rs.getInt("a");
        String s = rs.getString("b");
        float f = rs.getFloat("c");
    }
}

All resources leak in this example!

Of course, it’s just an example, and of course, it’s not a terrible situation, because resources can usually clean up themselves when they go out of scope, i.e. when the GC kicks in. But as software engineers we shouldn’t rely on that, and as the production issues have shown, there are always edge cases, where precisely this lack of vigilance will cause great harm. After all,

It works on my machine

… is simply not an excuse. We should design our software for production use.

Fix #1: Use try-with-resources. Always

If you want to stay on the safe side, always follow this rule:

The scope that acquires the resource, closes the resource

As long as you’re working with JDBC, save yourself the trouble of writing those JDBCUtilities classes that close non-null resources and safely catch exceptions that may arise. Just use try-with-resources, all the time. For instance, take the example from the Oracle JDBC tutorial, which should read:

public void connectToAndQueryDatabase(
     String username, String password) {

    // All of these resources are allocated in this method. Thus,
    // this method's responsibility is to also close / free all
    // these resources.
    try (Connection con = DriverManager.getConnection(
            "jdbc:myDriver:myDatabase", username, password);
         Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery(
            "SELECT a, b, c FROM Table1")) {

        while (rs.next()) {
            int x = rs.getInt("a");
            String s = rs.getString("b");
            float f = rs.getFloat("c");
        }
    }
}

This already feels that much better and cleaner, doesn’t it? All the resources are acquired in the above method, and the try-with-resources block will close all of them when they go out of scope. It’s just syntax sugar for something we’ve been doing manually all the time. But now, we will (hopefully) never again forget!

Of course, you could introduce automatic leak detection in your integration tests, because it’s rather easy to proxy the JDBC DataSource and count all connection acquisitions and closings. An example can be seen in this post:
The best way to detect database connection leaks

Fix #2: Use jOOQ, which manages resources for you

Historically, JDBC works on lazy resources that are kept around for a while. The assumption in 1997 (when JDBC was introduced) was that database interactions were relatively slow and it made sense to fetch and process one record at a time, even for moderately sized result sets.

In fact, it was even common to abort fetching records from a cursor when we’ve had enough results and close it eagerly before consuming all the rows.

Today, these assumptions are (mostly) no longer true, and jOOQ (like other, more modern database APIs) inverts the lazy/eager API default behaviour. In jOOQ, the JDBC types have the following corresponding counterparts:

  • JDBC DataSource / Connection => jOOQ ConnectionProvider:
    jOOQ doesn’t know the concept of an “open connection” like JDBC. jOOQ only has this ConnectionProvider which works in a similar way to JDBC’s / JavaEE’s DataSource. The semantics here is that the connection / session is “managed” and jOOQ will acquire / release it once per statement. This happens automatically, so users don’t have to worry about any connection resource.
  • JDBC Statement (and subtypes) => jOOQ Query:
    While the JDBC statement (especially the PreparedStatement) is a resource that binds some server-side objects, such as an execution plan, for instance, jOOQ again doesn’t have such a resourceful thing. The Query just wraps the SQL string (or AST) and bind variables. All resources are created lazily only when the query is actually executed – and released immediately after execution. Again, users don’t have to worry about any statement resource.
  • JDBC ResultSet => jOOQ Result:
    The JDBC ResultSet corresponds to a server-side cursor, another object that possibly binds quite a few resources, depending on your fetch mode. Again, in jOOQ no resources are bound / exposed, because jOOQ by default eagerly fetches your entire result set – the assumption being that a low-level optimisation here doesn’t add much value for moderately sized result sets.

With the above inverted defaults (from lazy to eager resource allocation / freeing), the jOOQ-ified Oracle JDBC tutorial code would look like this:

Working with a standalone Connection

public void connectToAndQueryDatabase(
    String username, String password) {

    // If you're using a standalone connection, you can pass that
    // one to jOOQ, but you're still responsible for closing it
    // again:
    try (Connection con = DriverManager.getConnection(
            "jdbc:myDriver:myDatabase", username, password)) {

        // There is no statement resource anymore, and the result
        // is fetched eagerly from the database, so you don't have
        // to worry about it
        for (Record record : DSL.using(con).fetch(
                "SELECT a, b, c FROM Table1")) {
            int x = record.get("a", int.class);
            String s = record.get("b", String.class);
            float f = record.get("c", float.class);
        }
    }
}

Working with a connection pool / DataSource

// You probably have some means of injecting / discovering
// a JDBC DataSource, e.g. from Spring, or from your JavaEE
// container, etc.
@Inject
DataSource ds;

public void connectToAndQueryDatabase(
    String username, String password) {

    // With a DataSource, jOOQ will automatically acquire and
    // close the JDBC Connection for you, so the last remaining
    // resource has also disappeared from your client code.
    for (Record record : DSL
           .using(ds, SQLDialect.ORACLE)
           .fetch("SELECT a, b, c FROM Table1")) {
        int x = record.get("a", int.class);
        String s = record.get("b", String.class);
        float f = record.get("c", float.class);
    }
}

With jOOQ, all resource management is automatic, by default, because by default, you don’t want to worry about this low level stuff. It’s not 1997 anymore. The JDBC API really is too low level for most use-cases.

If you do want to optimise resource management and not fetch everything eagerly, you can, of course. jOOQ will allow you to fetch your results lazily, in two ways:

Using a Cursor

@Inject
DataSource ds;

public void connectToAndQueryDatabase(
    String username, String password) {

    // jOOQ's Cursor type is a resource, just like JDBC's
    // ResultSet. It actually keeps a reference to an open
    // ResultSet, internally. This is an opt-in
    // feature, though, only to be used if desired.
    try (Cursor<Record> cursor = DSL
            .using(ds, SQLDialect.ORACLE)
            .fetchLazy("SELECT a, b, c FROM Table1")) {

        for (Record record : cursor) {
            int x = record.get("a", int.class);
            String s = record.get("b", String.class);
            float f = record.get("c", float.class);
        }
    }
}

Using a Java 8 Stream (lazy, resourceful version)

@Inject
DataSource ds;

public void connectToAndQueryDatabase(
    String username, String password) {

    // This can also work with a stream
    try (Stream<Record> stream = DSL
        .using(ds, SQLDialect.ORACLE)
        .fetchStream("SELECT a, b, c FROM Table1")) {

        stream.forEach(record -> {
            int x = record.get("a", int.class);
            String s = record.get("b", String.class);
            float f = record.get("c", float.class);
        });
    }
}

Unfortunately, there are no auto-closing streams in Java, which is why we have to resort to using the try-with-resources statement, breaking the fluency of jOOQ’s API.

Do note, though, that you can use the Stream API in an eager fashion:

Using a Java 8 Stream (eager version)

@Inject
DataSource ds;

public void connectToAndQueryDatabase(
    String username, String password) {

    // Fetch the jOOQ Result eagerly into memory, then stream it
    // Again, no resource management
    DSL.using(ds, SQLDialect.ORACLE)
       .fetch("SELECT a, b, c FROM Table1")
       .stream()
       .forEach(record -> {
            int x = record.get("a", int.class);
            String s = record.get("b", String.class);
            float f = record.get("c", float.class);
        });
}

Conclusion

Developers, unfortunately, often suffer from

Works on my machine

This leads to problems that can be discovered only in production, under load. When it comes to resources, it is important to constantly remind ourselves that …

The scope that acquires the resource, closes the resource

JDBC (and the JDK’s IO APIs) “unfortunately” deal with resources on a very low level. The upside is that their default behaviour is very resource-efficient. For instance, when you only need to read a file header, you don’t load the entire file into memory through the InputStream. You can explicitly, manually, load only the first few lines.
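
For instance, here is a minimal sketch of that manual, low-level style (the path and the four-byte magic number are assumptions for illustration): only the file header is read, never the whole file.

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class HeaderCheck {

    // Reads only the first four bytes of the file and closes the
    // stream again - the rest of the file is never loaded into memory
    static int readMagicNumber(String path) throws IOException {
        try (DataInputStream in =
                new DataInputStream(new FileInputStream(path))) {
            return in.readInt();
        }
    }
}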

But in many applications, this default and its low-level nature get in the way of correctness (accidental resource leaks are easy to create) and convenience (a lot of boilerplate code needs to be written).

With database interactions, it’s usually best to migrate your JDBC code towards a more modern API like jOOQ, which abstracts resource handling away in its API and inverts the lazy/eager semantics: Eager by default, lazy on demand.

More information about the differences between jOOQ and JDBC can be seen here, in the manual.

jOOQ Tuesdays: Mario Fusco Talks About Functional and Declarative Programming


Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.


I’m very excited to feature today Mario Fusco, author of LambdaJ, working on Red Hat’s drools, a Java Champion, and a frequent speaker at Java conferences on all things functional programming.

Mario, a long time ago, I have already stumbled upon your name when looking up the author of Lambdaj – a library that went to the extreme to bring lambdas to Java 5 or earlier. How does it work? And what’s the most peculiar hack you implemented to make it work?

When I started developing Lambdaj in 2007 I thought of it just as a proof-of-concept to check how far I could push Java 5. I never expected that it could become something that somebody other than myself might actually want to use. In reality, given the limited, or I should say non-existent, capabilities of Java 5 as a functional language, Lambdaj was entirely a big hack. Despite this, people started using and somewhat loving it, and this made me (and possibly somebody else) realize that Java developers, or at least part of them, were tired of the purely imperative paradigm imposed by the language and ready to experiment with something more functional.

The main feature of Lambdaj, and what made its DSL quite nice to use, was the possibility of referencing a method of a class in a static and type-safe way and passing it to another method. In this way you could, for example, sort a list of persons by their age doing something like:

sort(persons, on(Person.class).getAge());

As anticipated, what happened under the hood was a big hack: the on() method created a proxy of the Person class so you could safely call the getAge() method on it. The proxy didn’t do anything useful other than registering the method call. However, it had to return something of the same type as the value returned by the actual method, to avoid a ClassCastException. For this purpose it had a mechanism to generate a reasonably unique instance of that type, an int in my example. Before returning that value it also associated it, using a WeakHashMap, with the invoked method. In this way the sort() method was actually invoked with a list and the value generated by my proxy. It then retrieved from the map the Java method associated with that value and invoked it on all the items of the list, performing the operation (a sorting, in this case) that it was supposed to execute.
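
To make the trick more concrete, here is a minimal, hypothetical sketch of the idea (not Lambdaj’s actual code). It uses java.lang.reflect.Proxy, so it only works on interfaces and, for brevity, only on int-returning methods, whereas Lambdaj proxied classes like Person via bytecode generation:

import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.WeakHashMap;

public class OnSketch {

    interface Person { int getAge(); }

    // Associates each generated placeholder value with the method
    // call that produced it
    static final Map<Object, Method> REGISTRY = new WeakHashMap<>();
    static int nextPlaceholder = 1000;

    @SuppressWarnings("unchecked")
    static <T> T on(Class<T> type) {
        return (T) Proxy.newProxyInstance(
            type.getClassLoader(),
            new Class<?>[] { type },
            (proxy, method, args) -> {
                // Return a "reasonably unique" value of the right type
                // and remember which method produced it
                Integer placeholder = nextPlaceholder++;
                REGISTRY.put(placeholder, method);
                return placeholder;
            });
    }

    // A sort(persons, on(Person.class).getAge()) implementation would
    // then look up REGISTRY.get(placeholder) to find getAge() and
    // invoke it on every element of the list.
}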

That’s crazy 🙂 I’m sure you’re happy that a lot of Lambdaj features are now deprecated. You’re now touring the world with your functional programming talks. What makes you so excited about this topic?

The whole Lambdaj project is now deprecated and abandoned. The new functional features introduced with Java 8 simply made it obsolete. Nevertheless, it not only had the merit of making developers curious and interested about functional programming, but also of letting them experiment with new patterns and ideas that, in the end, also influenced the Java 8 syntax. Take, for instance, how you can sort a list of persons by age using a method reference:

persons.sort(Comparator.comparing(Person::getAge));

It seems evident that method references were at least partly inspired by Lambdaj’s on() method.

There are a number of things that I love about functional programming:

  1. The readability: a snippet of code written in functional style looks like a story, while too often the equivalent code in imperative style resembles a puzzle.
  2. The declarative nature: in functional programming it is enough to declare the result that you want to achieve, rather than specifying the steps to obtain it. You only care about the what, without getting lost in the details of the how.
  3. The possibility of treating data and behaviours uniformly: functional programming allows you to pass to a method both data (the list of persons to be sorted) and computation (the function to be applied to each person in the list). This idea is fundamental for many algorithms, like map/reduce, for example: since data and computation are the same thing, and the computation is typically orders of magnitude smaller, you are free to send it to the machine holding the data instead of the opposite (see the sketch after this list).
  4. The higher level of abstraction: the possibility of encapsulating computations in functions and passing them around to other functions allows both a dramatic reduction of code duplication and the design of more generic and expressive APIs.
  5. Immutability and referential transparency: using immutable values and side-effect-free functions makes it far easier to reason about your code, test it, and ensure its correctness.
  6. The parallelism friendliness: all the features listed above also enable the parallelization of your software in a simpler and more reliable way. It is no coincidence that functional programming started becoming more popular around 10 years ago, which is also when multicore CPUs began to be available on commodity hardware.
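
To illustrate points 3 and 4, here is a small sketch (the Person type and its data are assumptions for illustration): the same method receives both the data and the behaviour to apply to it.

import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;

public class DataAndBehaviour {

    static final class Person {
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
        int getAge() { return age; }
    }

    // Receives both data (the list) and behaviour (the function
    // to apply to each element)
    static int sum(List<Person> people, ToIntFunction<Person> f) {
        return people.stream().mapToInt(f).sum();
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
            new Person("Anna", 30), new Person("Ben", 25));

        // The computation travels to the data, not the other way round
        System.out.println(sum(people, Person::getAge)); // 55
    }
}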

Our readers love SQL (or at least, they use it frequently). How does functional programming compare to SQL?

The most evident thing that FP and SQL have in common is their declarative paradigm. To some extent, SQL, or at least its data selection part, can be seen as a functional language specialized in manipulating data in tabular format.

The data modification part is a totally different story, though. Most SQL users normally change data in a destructive way, overwriting or even deleting existing data. This is clearly in contrast with the immutability mantra of functional programming. However, this is only how SQL is most commonly used; nothing dictates that it couldn’t also be employed in a non-destructive, append-only way. I wish to see SQL used this way more often in the future.

In your day job, you’re working for Red Hat, on drools. Business rules sound enterprisey. How does that get along with your fondness of functional programming?

From a user’s point of view, a rule engine in general, and drools in particular, is an extreme form of declarative programming, second only to Prolog. For this reason, developers who are only familiar with the imperative paradigm struggle to use it, because they try to force it to work in an imperative way. Conversely, programmers more used to thinking in functional (and thus declarative) terms are more often able to use it correctly when they approach it for the first time.

As for me, my work as a developer of both the core engine and the compiler of drools allows me to experiment every day in the fields of language design and algorithmic invention and optimization. To cut it short, it’s a challenging job and there’s lots of fun in it: don’t tell my employer, but I cannot stop being surprised that they let me play with this every day and even pay me for it.

You’re also on the board of VoxxedDays Ticino, Zurich, and CERN (wow, how geeky is that? A Large Hadron Collider Java conference!). Why is Voxxed such a big success for you?

I must admit that, before being involved in this, I didn’t imagine the amount of work that organizing a conference requires. However, this effort is totally rewarded. In particular, the great advantage of VoxxedDays is that they are local, one-day events, made by developers for developers, that practically anybody can afford.

I remember that the most common feedback I received after the first VoxxedDays Ticino, which we did 2 years ago, was something like: “This has been the very first conference I attended in my life and I didn’t imagine it could be such an amazing experience, both from a technical and, even more, from a social point of view. Thanks a lot for that; I eagerly look forward to attending next year”. Can you imagine something more rewarding for a conference organizer?

The other important thing for me is giving speakers who aren’t rock stars (yet) the possibility to talk in public and share their experience with a competent audience. I know that for at least some of them this is only the first step in discovering their own capabilities as public speakers, launching them toward bigger conferences like Devoxx.

Thank you very much, Mario

If you want to learn more about Mario’s insights on functional programming, please do watch his interesting talks at Devoxx from the recent past:

What we Need is Standardised Non-OSS Licenses


If you’ve followed the recent (fake) news, you’ve probably already heard it. Oracle is “massively ramping up audits of Java customers it claims are in breach of its licences”

After a quick check on the source (The Register), here’s a more realistic, probably more accurate version of that headline:

Oracle is thinking about auditing 1-2 companies that massively ran the commercial Java extensions in production without paying

There, fixed. Also:

But there is a deeper problem to this discussion

Of course, all sorts of (ex) Red Hat and/or Pivotal employees quickly jumped to conclusions of the kind: Hey, this wouldn’t happen with us – the good guys – the OSS guys.

For example:

That’s not surprising, of course. What’s also not surprising is that people who are already strongly opinionated will see their opinions reinforced. Another random example:

If you want more examples, just search Twitter for the article URL. There are tons of reactions.

The latter case is not very interesting. The former, however, is. Aleksey Shipilëv obviously has a good point.

use products with unambiguous licenses, like OSS

… and, of course, he’s not right at all. 🙂 There are some very ambiguous licenses in the OSS field, including many copyleft licenses. Take, for instance, LGPL 2.1, which is a very long license and contains ridiculous things like:

If such an object file uses only numerical parameters, data structure layouts and accessors, and small macros and small inline functions (ten lines or less in length), then the use of the object file is unrestricted, regardless of whether it is legally a derivative work. (Executables containing this object code plus portions of the Library will still fall under Section 6.)

(Emphasis mine.) Ten lines of code. What’s a line? Everything between two \n characters? On Windows, does a line have to end in \r\n for this clause to be applicable? What if I remove formatting and have 10,000-character lines? Such functions aren’t small, but they’re certainly less than 10 lines. Right? RIGHT?

Hmm…

Not to mention that this single ambiguity (there are more) infects the entire rest of the license text, because it introduces unrestricted use into an otherwise rather restrictive license. Think that’s nuts? Go check Hibernate’s license. Most of it (and thus YOUR application, if you patched Hibernate) is affected.

Licensing = restricting

At the end of the day, pretty much every license will restrict rights in some way (except for the public domain “license”). The problem with commercial licenses, however, is that each one is unique, whereas OSS licenses are almost always the same (mostly some [X]GPL or ASL, MIT, BSD). In other words, OSS licenses are standardised and thus pretty well understood. And thus: much less risky.

That’s not the case with commercial licenses. Take the jOOQ license for instance. As of the end of 2016, it’s 23 pages strong (including the annex containing pricing). What does the license mean to our customers? Here’s a TL;DR version (obviously, if in doubt: the actual license will apply, not this TL;DR version):

  • Developer workstations need a time-limited or perpetual license
  • All server workstations are licensed for free, perpetually
  • Object code may be distributed and sublicensed
  • Source code may be used (e.g. for maintenance), but not distributed

And, of course, there are different price plans, but those aren’t really part of the license. So, jOOQ feels like Open Source: source code is shipped, may be used for documentation purposes, may be patched, recompiled, but not distributed, i.e. it isn’t free as in freedom (of course not, that would be the end of our business).

But what does it mean that the source code may be used? The license explicitly allows “modification”, but what does that mean? Are you also allowed to document such modification, just not ship it? E.g. in a public GitHub issue? Such that other users who are affected may profit from your fix?

If in doubt, the best way forward is to ask the vendor. In our case, we’re very open-minded and quick to answer – and also quick to improve the license when it is not clear.

In Oracle’s case, a bit less. Of course, because Oracle is a huge company, and who are you even going to ask? Who will take the time to answer an individual question? It’s simply not possible.

The solution: Standardised commercial licenses

There aren’t too many business models with software. First off, there are a few different categories of software, e.g.:

  • SaaS: This is still the wild west. But essentially, you don’t license the software, you rent an access point.
  • Servers: Databases, programming environments, operating systems, they all fall into this category. These are systems that run your software (and/or data).
  • Libraries: Things like jOOQ, Hibernate. These are programs that are embedded in other programs (e.g. SaaS or Servers)
  • Tools: Things like IntelliJ, JRebel. These are programs to create and manipulate data, but they aren’t needed to run it. They can be easily removed.

Each category works entirely differently. For instance, copyleft doesn’t really affect SaaS and tools categories (unless you want to protect your trade secrets, of course), whereas it’s a killer for libraries.

SaaS, libraries and tools are usually licensed per seat, whereas servers are usually licensed per core – i.e. whatever scales better for both the vendor and the customer.

This is an extremely simplified overview of commercial licensing, but imagine if all vendors in each of the above categories could just pick a couple of yes/no answers to a standardised set of questions (e.g. what may be distributed? what may be modified? what may be run?), using only well-understood standard wording for these concepts. Everything would be much clearer.
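
Purely as a thought experiment (none of these terms exist as a standard today), such standardised answers could be as compact and machine-readable as this:

import java.util.EnumSet;

public class LicenceTermsDemo {

    // Purely hypothetical vocabulary - no such standard exists today
    enum Grant {
        DISTRIBUTE_OBJECT_CODE,
        USE_SOURCE,
        DISTRIBUTE_SOURCE,
        PER_SEAT_PRICING,
        PERPETUAL_SERVER_USE
    }

    public static void main(String[] args) {
        // Roughly the jOOQ TL;DR from above, encoded as standard answers
        EnumSet<Grant> jooq = EnumSet.of(
            Grant.DISTRIBUTE_OBJECT_CODE,
            Grant.USE_SOURCE,
            Grant.PER_SEAT_PRICING,
            Grant.PERPETUAL_SERVER_USE);

        System.out.println(
            jooq.contains(Grant.DISTRIBUTE_SOURCE)); // false
    }
}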

Back to the original Oracle auditing story

In the linked article, Oracle allegedly starts auditing Java users. The OracleJDK obviously isn’t “free” (as in freedom), but it is partially “free” (as in beer), because there are a variety of use-cases where you don’t pay. However, there are some features that are “commercial” (i.e. non-free-as-in-beer), such as JMC and the Flight Recorder.

The interesting thing is that both of these features (and some others) ship with the “free” (as in beer) OracleJDK, but they’re part of the “COMMERCIAL FEATURES” (legal yelling) and those features must even be documented in YOUR LICENSE using this notice, such that YOUR end users may also not use them for free:

Use of the Commercial Features for any commercial or production purpose requires a separate license from Oracle. “Commercial Features” means those features identified Table 1-1 (Commercial Features In Java SE Product Editions) of the Java SE documentation accessible at http://www.oracle.com/technetwork/java/javase/documentation/index.html

(Did you know that? If you’re using OracleJDK in your application, you have to embed the above in your own EULA).

But do note, outside of these cryptic licenses, I’ve found several references to

Java Mission Control is available free of charge for development

E.g. here: http://download.oracle.com/technology/products/missioncontrol/updatesites/base/5.2.0/eclipse

Of course, this has absolutely no legal value; it might have been true at some point, but it may be outdated by now. Still, that’s how I remember it: I can use Java Mission Control for free for development (not for productive use). Now, we’re back to this discussion. What’s productive use?

  • Can I profile a simple test program for free? Probably yes.
  • Can I profile my entire program (e.g. jOOQ) for free? Probably yes.
  • Can I run the profiler in a CI environment to detect regressions for free? Hmmm.

And how is that understanding of “free” encoded in the actual license?

Standardised wording

Oracle has a long tradition of giving away software free-as-in-beer to developers. Back in the day (before OSS, when there were only Oracle and IBM), that was a cunning move, because the money is not in development. It’s in operations. So, if developers get top-notch software for free, they become evangelists. They’ll love the products and convince the end users.

But again. Who are developers? When do they stop developing and start operating? When they test? When they ship?

We’ll never know for sure – as every vendor writes their own, unique license.

What we need is a standardised set of well-understood commercial licenses, just like the OSS folks have their standardised set of well-understood OSS licenses. For our industry as a whole, this would be of immense value, because the little fish (like ourselves) could compete much better with the big ones, without having to give away all of our IP for free under the terms of an OSS license. Our customers would no longer run into legal issues, and all risks from weird license texts would be removed.

And hopefully, this would put pressure on the big ones. And prevent articles like the one from The Register.

Do You Really Have to Name Everything in Software?


This is one of software engineering’s oldest battles. No, I’m not talking about where to put curly braces, or whether to use tabs or spaces. I mean the eternal battle between nominal typing and structural typing.

This article is inspired by a very vocal blogger who eloquently reminds us to …

[…] Please Avoid Functional Vomit

Read the full article here:
https://dzone.com/articles/using-java-8-please-avoid-functional-vomit

What’s the post really about?

It is about naming things. As we all know:

There are only two hard things in Computer Science: cache invalidation and naming things.

— Phil Karlton

Now, for some reason, there is a group of people who want constant pain and suffering through explicitly naming everything, including rather abstract concepts and algorithmic components, such as compound predicates. Those people like nominal typing and all the features derived from it. What is nominal typing (as opposed to structural typing)?

Structural typing

SQL is a good example to study the two worlds. When you write SQL statements, you’re creating structural row types all the time. For instance, when you write:

SELECT first_name, last_name
FROM customer

… what you’re really doing is you’re creating a new rowtype of the structure (in pseudo-SQL):

TYPE (
  first_name VARCHAR,
  last_name VARCHAR
)

The type has the following properties:

  • It is a tuple or record (as always in SQL)
  • It contains two attributes or columns
  • Those two attributes / columns are called first_name and last_name
  • Their type is VARCHAR

This is a structural type, because the SQL statement that produces the type only declares the type’s structure implicitly, by producing a set of column expressions.

In Java, we know lambda expressions, which are (incomplete) structural types, such as:

// A type that can check for i to be even
i -> i % 2 == 0

Nominal typing

Nominal typing takes things one step further. In SQL, nominal typing is perfectly possible as well. For instance, in the above statement, we selected from a well-known table by the name of customer. Nominal typing assigns a name to a structural type (and possibly stores the type somewhere, for reuse).

If we want to name our (first_name, last_name) type, we could do things like:

-- By using a derived table:
SELECT *
FROM (
  SELECT first_name, last_name
  FROM customer
) AS people

-- By using a common table expression:
WITH people AS (
  SELECT first_name, last_name
  FROM customer
)
SELECT *
FROM people

-- By using a view
CREATE VIEW people AS
SELECT first_name, last_name
FROM customer

In all cases, we’ve assigned the name people to the structural type (first_name, last_name). The only difference being the scope for which the name (and the corresponding content) is defined.

In Java, we can only use lambda expressions once we assign them to a typed name, either by using an assignment or by passing the expression to a method that takes a named type argument:

// Naming the lambda expression itself
Predicate<Integer> p = i -> i % 2 == 0;

// Passing the lambda expression to a method
Stream.of(1, 2, 3)
      .filter(i -> i % 2 == 0);

Back to the article

The article claims that giving a name to things is always better. For instance, the author proposes giving a name to what we would commonly refer to as a “predicate”:

//original, less clear code
if(barrier.value() > LIMIT && barrier.value() > 0){
//extracted out to helper function. More code, more clear
if(barrierHasPositiveLimitBreach()){

So, the author thinks that extracting a rather trivial predicate into an external function is better because a future reader of such code will better understand what’s going on. At least in the article’s opinion. Let’s refute this claim for the sake of the argument:

  • The proposed name is verbose and requires quite some thinking.
  • What does breach mean?
  • Is breach the same as >= or the same as >?
  • Is LIMIT a constant? From where?
  • Where is barrier? Who owns it?
  • What does the verb “has” mean, here? Does it depend on something outside of barrier? E.g. some shared state?
  • What happens if there’s a negative limit?

By naming the predicate (remember, naming things is hard), the OP has added several layers of cognitive complexity for the reader, while quite possibly introducing subtle bugs, because probably both LIMIT and barrier should be function arguments, rather than global (im)mutable state that the function assumes to be there.
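
For instance, a minimal sketch (the names are hypothetical, not the OP’s) of what the extraction would have to look like to avoid that hidden state:

public class BarrierCheck {

    // Hypothetical rewrite of the OP's helper: the dependencies are
    // now explicit parameters instead of surrounding state
    static boolean hasPositiveLimitBreach(double barrier, double limit) {
        return barrier > limit && barrier > 0;
    }

    public static void main(String[] args) {
        // At the call site, at least the inputs are now visible
        System.out.println(hasPositiveLimitBreach(42.0, 10.0)); // true
    }
}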

The name introduced several concepts (“to have a breach”, “positive limit”, “breach”) that are not well defined and need some deciphering. How do we decipher it? Probably by looking inside the function and reading the actual code. So what do we gain? Better reuse, perhaps? But is this really reusable?

Finally, there is a (very slight) risk of introducing a performance penalty through the additional indirection. If we translate this to SQL, we could have written a stored function and then queried:

SELECT *
FROM orders -- Just an assumption here
WHERE barrier_has_positive_limit_breach(orders.barrier)

If this were some really complicated business logic depending on a huge number of things, perhaps extracting the function might’ve been worthwhile. But in this particular case, is it really better than:

SELECT *
FROM orders
WHERE barrier > :limit AND barrier > 0

or even

SELECT *
FROM orders
WHERE barrier > GREATEST(:limit, 0)

Conclusion

There are some people in our industry who constantly want to see the world in black and white. As soon as they’ve had one small success story (e.g. reusing a very common predicate 4-5 times by extracting it into a function), they conclude with a general rule that this approach is always superior.

They struggle with the notion of “it depends”. Nominal typing and structural typing are both very interesting concepts. Structural typing is extremely powerful, whereas nominal typing helps us humans keep track of complexity. In SQL, we’ve always liked to structure our huge SQL statements, e.g. in nameable views. Likewise, Java programmers structure their code in nameable classes and methods.

But it should be immediately clear to anyone reading the linked article that the author seems to like hyperbole and probably wasn’t really serious, given the silly example he came up with. The message he’s conveying is wrong, because it claims that naming things is always better. It’s not true.

Be pragmatic. Name things where it really helps. Don’t name things where it doesn’t. Or as Leon Bambrick amended Phil Karlton’s quote:

There are only two hard things in Computer Science: cache invalidation, naming things, and off-by-one errors

Here’s my advice to you, dear nominal-typing-loving blogger. There are only two ways of typing: nominal typing and structural typing. And “it depends” typing.