JOIN Elimination: An Essential Optimiser Feature for Advanced SQL Usage

The SQL language has one great advantage over procedural, object oriented, and “ordinary” functional programming languages. The fact that it is truly declarative (i.e. a 4GL / fourth generation programming language) means that a sophisticated optimiser can easily transform one SQL expression into another, equivalent SQL expression, which might be faster to execute.

How does this work?

Expression Tree Equivalence

Here’s an introduction to equivalent expression trees from my SQL Masterclass: Let’s assume we want to evaluate the following expression:

A + (A - B)

Now in maths, it can be proven trivially that by the laws of associativity, the above expression is really the same as this one:

(A + A) - B

We didn’t really win anything yet, but we can equally trivially turn the above addition into a multiplication that can be proven to be exactly equivalent:

(2 * A) - B

Now, imagine that A is an extremely “expensive” value, e.g. a value that we have to fetch from the disk, or worse, from the network. The cost of accessing A is thus very high and if we can avoid one access to A by the above transformation, then the resulting expression is much faster to evaluate than the original one, even if mathematically speaking, it does not matter at all.

That’s what optimisers do all the time. They transform expressions into equivalent expressions which are faster to execute. And they constantly adapt, so if the DBA chooses to move A from a remote server to the local server, thus reducing the cost of access to A, perhaps, suddenly, the original plan will be better again, because of the cost of multiplication (just an example).

A SQL Example

An equally trivial SQL example from my SQL Masterclass shows that it really doesn’t matter mathematically if we run this query:

SELECT first_name, last_name
FROM customer
WHERE first_name = 'JAMIE'

Or this one:

SELECT *
FROM (
  SELECT first_name, last_name
  FROM customer
)
WHERE first_name = 'JAMIE'

With the SQL language, it may be a bit harder to see that these are exactly equivalent SQL statements, but if we translate the above queries to relational algebra, it may become more visible:

Selection before projection

… or WHERE before SELECT:

Projection then selection

… or SELECT before WHERE:

Don’t be fooled by relational algebra‘s term “selection”. It does not correspond to the SELECT clause, but to the WHERE clause!

We can prove (let’s leave the exercise to the reader), that both expressions are exactly equivalent, so optimisers can pick whichever they have a more efficient matching algorithm for:

  • Ordinary row stores will probably apply the selection first, before projecting, as the expensive operation is accessing rows and if we can filter rows early (e.g. through indexes) then the whole query is less expensive)
  • A column store might (just a hypothesis) apply the projection first, as the expensive operation is accessing the columns. We then might have to traverse the entire column anyway to filter out the rows

Let’s Talk About JOINs (and Their Elimination)

JOIN elimination is one of the most basic and yet most powerful SQL transformations, which is implemented by most modern databases in one way or another. It is so powerful, because we’re writing (potentially useless) JOINs all the time, when writing a bit more complex queries. See also our article about JOINs for a general overview.

Now, consider the following simplified schema, taken from the Sakila database:

CREATE TABLE address (
  address_id INT NOT NULL,
  address VARCHAR(50) NOT NULL,
  CONSTRAINT pk_address PRIMARY KEY (address_id)
);

CREATE TABLE customer (
  customer_id INT NOT NULL,
  first_name VARCHAR(45) NOT NULL,
  last_name VARCHAR(45) NOT NULL,
  address_id INT NOT NULL,
  CONSTRAINT pk_customer PRIMARY KEY  (customer_id),
  CONSTRAINT fk_customer_address FOREIGN KEY (address_id) 
    REFERENCES address(address_id)
);

Let’s ignore indexing and other useful features for this example.

INNER JOIN Elimination

The following query shows a common JOIN use-case, joining a parent table (ADDRESS) to a child table (CUSTOMER) in a to-one relationship:

SELECT c.*
FROM customer c
JOIN address a 
ON c.address_id = a.address_id

We intended to fetch all customers and their addresses. But observe: We project only columns from the CUSTOMER table and we don’t have any predicates at all, specifically not predicates using the ADDRESS table. So, we’re completely ignoring any contributions from the ADDRESS table. We never really needed the JOIN in the first place!

And in fact, the optimiser can prove this too, because of the FOREIGN KEY constraint on C.ADDRESS_ID, which guarantees that every CUSTOMER record has exactly one corresponding ADDRESS record. The JOIN does not duplicate, nor remove any CUSTOMER rows, so it is unneeded and thus eliminated (by some, not all databases, will list each database at the end of the article).

So, the database can rewrite the SQL statement to the following, equivalent SQL statement in the presence of said FOREIGN KEY:

SELECT *
FROM customer c

Now, quite obviously, this query will be faster than the previous one, if the entire JOIN can be avoided, and thus the entire access to the ADDRESS table. Neat, huh? Who would have thought that FOREIGN KEYs can be so useful in terms of performance.

The above works if there’s also a NOT NULL constraint on the FOREIGN KEY. If there isn’t, e.g. as in this query:

SELECT title
FROM film f
JOIN language l ON f.original_language_id = l.language_id

The JOIN can still be eliminated, but there needs to be a replacement NOT NULL predicate, as such:

SELECT title
FROM film
WHERE original_language_id IS NOT NULL

OUTER JOIN Elimination

A LEFT [ OUTER ] JOIN will JOIN the right table to the left table but keep rows from the left table if there is no match (again, an explanation of joins can be seen here). When we apply LEFT JOIN to the previous query…

SELECT c.*
FROM customer c
LEFT JOIN address a 
ON c.address_id = a.address_id

… then we’ll fetch all rows from CUSTOMER regardless if that customer has any ADDRESS. This is useful if the FOREIGN KEY is optional (nullable), or completely absent, e.g. through:

ALTER TABLE customer DROP CONSTRAINT fk_customer_address

OUTER JOIN is even easier to eliminate, as it doesn’t require any FOREIGN KEY constraint for the database to prove that it is unneeded. A UNIQUE constraint on the parent table (here: ADDRESS.ADDRESS_ID) is sufficient to show that for every CUSTOMER there can be at most one ADDRESS, so the LEFT JOIN won’t duplicate any CUSTOMER rows (Unlike INNER JOIN, OUTER JOIN never remove rows).

Hence, the above query can again be rewritten to the more optimal:

SELECT *
FROM customer c

OUTER JOIN Elimination with DISTINCT

Another interesting case of OUTER JOIN elimination is the following one, which unfortunately didn’t work on Oracle for a customer of ours, recently, in a complex query that ran rogue. Let’s look at some other tables of the Sakila database, which expose a to-many relationship:

CREATE TABLE actor (
  actor_id numeric NOT NULL ,
  first_name VARCHAR(45) NOT NULL,
  last_name VARCHAR(45) NOT NULL,
  CONSTRAINT pk_actor PRIMARY KEY (actor_id)
);

CREATE TABLE film (
  film_id int NOT NULL,
  title VARCHAR(255) NOT NULL,
  CONSTRAINT pk_film PRIMARY KEY (film_id)
);

CREATE TABLE film_actor (
  actor_id INT NOT NULL,
  film_id  INT NOT NULL,
  CONSTRAINT pk_film_actor PRIMARY KEY (actor_id, film_id),
  CONSTRAINT fk_film_actor_actor FOREIGN KEY (actor_id)
    REFERENCES actor (actor_id),
  CONSTRAINT fk_film_actor_film FOREIGN KEY (film_id)
    REFERENCES film (film_id)
);

Now, consider this query:

SELECT DISTINCT first_name, last_name
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id

We’re looking for all actors and their films, but then we project only distinct actors. Again, the JOIN to FILM_ACTOR doesn’t contribute anything to the result, but because we’re joining a to-many relationship here (from parent table ACTOR to child table FILM_ACTOR), the JOIN is producing duplicate rows. Without DISTINCT, we’d get something like:

FIRST_NAME   LAST_NAME
----------------------
...
PENELOPE     GUINESS
PENELOPE     GUINESS
PENELOPE     GUINESS
NICK         WAHLBERG
NICK         WAHLBERG
...

But thanks to the DISTINCT keyword, the result is (provably) no different from the result of this much simpler query:

SELECT DISTINCT first_name, last_name
FROM actor a

(Note, DISTINCT cannot be eliminated, unless we already have a UNIQUE constraint on (FIRST_NAME, LAST_NAME)).

Why not Just Refactor the SQL Manually?

Of course, all of this shouldn’t be needed if our SQL were perfect. In the above trivial examples, the SQL can (and should) be re-written manually to improve quality. But note that:

  • Developers make mistakes, and those mistakes may be very subtle when queries get more complex. I’ll show an example below.
  • The presence of this feature actually encourages writing more complex SQL, especially when using reusable views. I’ll show another example below.
  • Finally, I’ve previously advocated avoiding needless, mandatory work, like SELECT *. Such work is mandatory because the optimiser cannot prove its needlessness. In the case of these JOINs, the optimiser can prove the needlessness, so the work is no longer mandatory. It can be eliminated.

Here are some complex examples as promised, where this optimiser feature really shines:

Subtle Mistakes

Let’s consider the following query (in PostgreSQL syntax):

SELECT c.name, count(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
JOIN film f USING (film_id)
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name
ORDER BY count(*) DESC

What does it do? For ACTOR_ID = 1 (Penelope Guiness), we’re looking for all the different film categories she played in, and the number of films per category. This is easier to understand when we look at the result:

NAME         COUNT
Horror       3
Family       2
New          2
Classics     2
Games        2
Music        1
Sci-Fi       1
Animation    1
Sports       1
Children     1
Comedy       1
Documentary  1
Foreign      1

Now, can you spot the unneeded JOINs? In fact, we never needed ACTOR, nor did we need FILM

SELECT c.name, count(*)
FROM film_actor fa
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name
ORDER BY count(*) DESC

Cool, eh? The JOINs can be eliminated (again, in some databases, see below) and our “mistake” is no longer relevant to the query. The mistake could have also snuck (or sneaked?) in from a previous query version, which may have looked like this, projecting also the actor information and the list of films per category, in case of which the additional JOIN are needed:

SELECT 
  c.name, count(*), 
  a.first_name, a.last_name, 
  array_agg(f.title ORDER BY f.title)
FROM actor a
JOIN film_actor fa USING (actor_id)
JOIN film f USING (film_id)
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name, a.first_name, a.last_name
ORDER BY count(*) DESC

The result being:

NAME         COUNT  FIRST_NAME  LAST_NAME  FILMS
Horror       3      PENELOPE    GUINESS    {"ELEPHANT TROJAN","LADY STAGE","RULES HUMAN"}
Family       2      PENELOPE    GUINESS    {"KING EVOLUTION","SPLASH GUMP"}
New          2      PENELOPE    GUINESS    {"ANGELS LIFE","OKLAHOMA JUMANJI"}
Classics     2      PENELOPE    GUINESS    {"COLOR PHILADELPHIA","WESTWARD SEABISCUIT"}
Games        2      PENELOPE    GUINESS    {"BULWORTH COMMANDMENTS","HUMAN GRAFFITI"}
Music        1      PENELOPE    GUINESS    {"WIZARD COLDBLOODED"}
Sci-Fi       1      PENELOPE    GUINESS    {"CHEAPER CLYDE"}
Animation    1      PENELOPE    GUINESS    {"ANACONDA CONFESSIONS"}
Sports       1      PENELOPE    GUINESS    {"GLEAMING JAWBREAKER"}
Children     1      PENELOPE    GUINESS    {"LANGUAGE COWBOY"}
Comedy       1      PENELOPE    GUINESS    {"VERTIGO NORTHWEST"}
Documentary  1      PENELOPE    GUINESS    {"ACADEMY DINOSAUR"}
Foreign      1      PENELOPE    GUINESS    {"MULHOLLAND BEAST"}

As you can see, this optimisation can be very useful on your legacy SQL, because if we maintain a complex query, we might not always be able to see all the JOINs that are really needed.

Reusable Views

Sometimes, we simply add additional JOINs for convenience, when building complex queries from simpler ones, e.g. by using views (which is a completely underrated RDBMS feature! You should all write more views).

Consider this view:

CREATE VIEW v_customer AS
SELECT 
  c.first_name, c.last_name, 
  a.address, ci.city, co.country
FROM customer c
JOIN address a USING (address_id)
JOIN city ci USING (city_id)
JOIN country co USING (country_id)

It’s not unlikely that we will write a view like this, simply because we’re incredibly bored to constantly join all these tables all the time. Every time we do something with customers and addresses, we need the CITY and COUNTRY table as well.

From now on (with this view), we can simply select from the view and “forget” about how it came to be. Now, let’s consider we completely forget about the underlying table, because the view was so useful all the time. We could think about doing this:

SELECT first_name, last_name
FROM v_customer

What do you think will happen? Exactly. JOIN elimination. A view isn’t really anything special, just a “macro” of some stored SQL (beware of some databases, where this isn’t always the case, e.g. MySQL, which Bill Karwin was kind enough to hint me at). So the above statement will be transformed into:

SELECT first_name, last_name
FROM (
  SELECT 
    c.first_name, c.last_name, 
    a.address, ci.city, co.country
  FROM customer c
  JOIN address a USING (address_id)
  JOIN city ci USING (city_id)
  JOIN country co USING (country_id)
) v_customer

… which can be transformed into this (we don’t need all columns in the nested select):

SELECT first_name, last_name
FROM (
  SELECT 
    c.first_name, c.last_name
  FROM customer c
  JOIN address a USING (address_id)
  JOIN city ci USING (city_id)
  JOIN country co USING (country_id)
) v_customer

… which can be transformed into this (JOINs can be eliminated):

SELECT first_name, last_name
FROM (
  SELECT 
    c.first_name, c.last_name
  FROM customer c
) v_customer

… and finally (the subquery is not useful):

SELECT first_name, last_name
FROM customer

The view is even very useful for this particular query, thanks to JOIN elimination!

Note, the SQL transformations exposed above are simply educational. Actual optimisers may perform transformations in an entirely differently, or in a different order. This is just to show what’s possible, and what kind of stuff is being done.

Cool, So Can My Database Do It?

Perhaps! Let’s look at the three different types of JOIN elimination in the context of these databases:

  • DB2 LUW 10.5
  • MySQL 8.0.2
  • Oracle 12.2.0.1
  • PostgreSQL 9.6
  • SQL Server 2014

INNER JOIN Elimination

Remember, this depends on the presence (and usefulness) of a FOREIGN KEY constraint. The SQL statement we’re using here is:

SELECT first_name, last_name
FROM customer c
JOIN address a ON c.address_id = a.address_id

We’re hoping to get:

SELECT first_name, last_name
FROM customer c

DB2 LUW

The following execution plan (created with Markus Winand’s cool utility) shows that this works in DB2, there’s no access to the ADDRESS table:

Explain Plan                                                           |
-----------------------------------------------------------------------|
ID | Operation                           |                 Rows | Cost |
 1 | RETURN                              |                      |   61 |
 2 |  FETCH CUSTOMER                     | 599 of 599 (100.00%) |   61 |
 3 |   IXSCAN IDX_CUSTOMER_FK_ADDRESS_ID | 599 of 599 (100.00%) |   20 |

MySQL

MySQL 8, apart from finally introducing CTE and window functions (yay), has a lot of new optimiser features, read Morgan Tocker’s useful optimiser guide for details. Unfortunately, INNER JOIN elimination is not implemented:

ID  TABLE  TYPE    REF                  ROWS   EXTRA
1   c      ALL                          599    
1   a      eq_ref  sakila.c.address_id  1      Using index

Not only is the JOIN executed, but it is executed using a nested loop with 599 index lookups, as MySQL still only supports NESTED LOOP JOINs, not HASH JOINs.

Bummer.

Oracle

No problem at all for Oracle:

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |   599 | 28752 |     5   (0)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| CUSTOMER |   599 | 28752 |     5   (0)| 00:00:01 |
------------------------------------------------------------------------------

… the JOIN is eliminated as expected.

PostgreSQL

Unfortunately, PostgreSQL cannot eliminate INNER JOIN:

Hash Join  (cost=19.57..42.79 rows=599 width=13)
  Hash Cond: (c.address_id = a.address_id)
  ->  Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=15)
  ->  Hash  (cost=12.03..12.03 rows=603 width=4)
        ->  Seq Scan on address a  (cost=0.00..12.03 rows=603 width=4)

Not as bad as in MySQL, though, as PostgreSQL chose to use a HASH JOIN to combine the two tables.

SQL Server

No problemo for SQL Server, the ADDRESS table access is gone!

  |--Table Scan(OBJECT:([sakila].[dbo].[customer] AS [c]))

Notice, however, that SQL Server can only eliminate INNER JOIN on NOT NULL FOREIGN KEYs!

Excellent. What about…

OUTER JOIN Elimination

This one is a bit easier to prove for the database, remember? We don’t rely on any FOREIGN KEY anymore. A UNIQUE key in the parent table is sufficient to eliminate an OUTER JOIN. We can safely expect that if the INNER JOIN could be eliminated (DB2, Oracle, SQL Server), then an OUTER JOIN can be eliminated, too.

Here’s the query:

SELECT first_name, last_name
FROM customer c
LEFT JOIN address a ON c.address_id = a.address_id

And the outcome:

DB2 LUW

Good

Explain Plan                                                           |
-----------------------------------------------------------------------|
ID | Operation                           |                 Rows | Cost |
 1 | RETURN                              |                      |   61 |
 2 |  FETCH CUSTOMER                     | 599 of 599 (100.00%) |   61 |
 3 |   IXSCAN IDX_CUSTOMER_FK_ADDRESS_ID | 599 of 599 (100.00%) |   20 |

MySQL

Still nope:

ID  TABLE  TYPE    REF                  ROWS   EXTRA
1   c      ALL                          599    
1   a      eq_ref  sakila.c.address_id  1      Using index

Oracle

Perfect:

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |   599 | 28752 |     5   (0)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| CUSTOMER |   599 | 28752 |     5   (0)| 00:00:01 |
------------------------------------------------------------------------------

PostgreSQL

Unlike INNER JOIN elimination, this works. Great!

Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=13)

SQL Server

As expected, good:

  |--Table Scan(OBJECT:([sakila].[dbo].[customer] AS [c]))

Finally…

OUTER JOIN Elimination with DISTINCT

Remember, the query here navigates a to-many relationship, producing duplicate records of the parent table, but then removes all those duplicates again by

  • Ignoring contributions from the child table
  • Removing duplicates with DISTINCT

It’s easy to prove that this:

SELECT DISTINCT first_name, last_name
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id

Is equivalent to this:

SELECT DISTINCT first_name, last_name
FROM actor a

Remember also, this only works with OUTER JOIN, not with INNER JOIN, as the latter might remove rows, so we have to execute it to see if it does.

DB2 LUW

Cool, this actually works!

Explain Plan                                                          |
----------------------------------------------------------------------|
ID | Operation        |                 Rows | Cost                   |
 1 | RETURN           |                      |   20                   |
 2 |  TBSCAN          | 200 of 200 (100.00%) |   20                   |
 3 |   SORT (UNIQUE)  | 200 of 200 (100.00%) |   20                   |
 4 |    TBSCAN ACTOR  | 200 of 200 (100.00%) |   20                   |

There’s no access to the FILM_ACTOR table, nor to its indexes. Very nice.

MySQL

As this is a more sophisticated transformation than the previous ones, we don’t have high hopes here.

ID  TABLE  TYPE    REF                  ROWS   EXTRA
1   a      ALL                          200    Using temporary
1   fa     ref     sakila.a.actor_id    27     Using index; Distinct

This has become a rather expensive query, again because of the lack of HASH JOIN support in MySQL!

Oracle

I’m very surprised to see that Oracle doesn’t support this optimisation, we’re executing the full query:

---------------------------------------------------------------
| Id  | Operation           | Name                    | Rows  |
---------------------------------------------------------------
|   0 | SELECT STATEMENT    |                         |  5462 |
|   1 |  HASH UNIQUE        |                         |  5462 |
|   2 |   NESTED LOOPS OUTER|                         |  5462 |
|   3 |    TABLE ACCESS FULL| ACTOR                   |   200 |
|*  4 |    INDEX RANGE SCAN | IDX_FK_FILM_ACTOR_ACTOR |    27 |
---------------------------------------------------------------

Curiously, Oracle also chose a NESTED LOOP JOIN in this case, even if we could have loaded the entire index on the FILM_ACTOR table into memory and then HASH JOINed it to ACTOR. Note that the cardinality estimate of the resulting query is quite off, too, despite the DISTINCT operation. This can lead to significant effects in an upstream query, which selects from this query (e.g. if stored as a view) – which is what happened to our customer.

PostgreSQL

PostgreSQL also doesn’t support this elimination, but at least gets cardinality estimates much more accurately and chooses a HASH JOIN operation:

HashAggregate  (cost=193.53..194.81 rows=128 width=13)
  Group Key: a.first_name, a.last_name
  ->  Hash Right Join  (cost=6.50..166.22 rows=5462 width=13)
        Hash Cond: (fa.actor_id = a.actor_id)
        ->  Seq Scan on film_actor fa  (cost=0.00..84.62 rows=5462 width=2)
        ->  Hash  (cost=4.00..4.00 rows=200 width=17)
              ->  Seq Scan on actor a  (cost=0.00..4.00 rows=200 width=17)

SQL Server

The pleasant surprise came from SQL Server, which does support this optimisation too:

  |--Sort(DISTINCT ORDER BY:([a].[first_name] ASC, [a].[last_name] ASC))
       |--Table Scan(OBJECT:([sakila].[dbo].[actor] AS [a]))

As you can see, no access to any FILM_ACTOR related objects!

Summary

Here’s a summary of what databases can eliminate:

Database INNER JOIN:
to-one
INNER JOIN nullable:
to-one
OUTER JOIN:
to-one
OUTER JOIN DISTINCT:
to-many
DB2 LUW 10.5 Yep Yep Yep Yep
MySQL 8.0.2 Nope Nope Nope Nope
Oracle 12.2.0.1 Yep Yep Yep Nope
PostgreSQL 9.6 Nope Nope Yep Nope
SQL Server 2014 Yep Nope Yep Yep

Conclusion

JOIN elimination is a very simple to understand, yet incredibly powerful feature that modern databases support to help developers build and maintain complex SQL queries without worrying too much about performance side effects.

It is possible because SQL is a 4GL (fourth-generation programming language), i.e. a declarative language whose parsed expression trees can “easily” be transformed into equivalent, simpler, and faster to execute expression trees. This work is constantly done by the optimiser behind the scenes, often without you even noticing (see example where unnecessary ACTOR and FILM tables were removed).

Currenly, DB2 and SQL Server are the leaders here with Oracle being a close runner-up, at least in my investigation. There’s hope for PostgreSQL, and a bit less hope for MySQL right now. There are tons of other interesting SQL transformations, which I’ll blog about in future blog posts, which may make the difference in other kinds of complex queries.

If you were intrigued by this sort of functionality, do have a look at my most recent SQL talk, which helps developers understand the real value of SQL in the context of the optimiser:

jOOQ 3.10 Supports Exciting MySQL 8.0 Features

In recent months, there had been some really exciting news from the MySQL team:

These two SQL standard language features are among the most powerful SQL features that are available from most other databases. I frequently include them in conference talks about SQL (see my article about 10 SQL Tricks That You Didn’t Think Were Possible), and as well in the Data Geekery SQL Masterclass. With MySQL 8.0 now supporting these exciting features, the masterclass will be including MySQL as well (along with Oracle, SQL Server, PostgreSQL, and DB2). And, of course, these features are now supported in the upcoming jOOQ 3.10 as well.

Want to try it out yourself? Just run:

docker pull mysql:8.0.2
docker run --name MYSQL802 --net=host -p 3306:3306 -e MYSQL_ROOT_PASSWORD=test -d mysql:8.0.2

Then, connect to this instance and run this nice little query in it:

WITH RECURSIVE t(a, b) AS (
  SELECT 1, CAST('a' AS CHAR(15))
  UNION ALL
  SELECT t.a + 1, CONCAT(t.b, 'a')
  FROM t
  WHERE t.a < 10
)
SELECT a, SUM(a) OVER (ORDER BY a) AS ∑, b
FROM t

And get this result:

a       ∑       b
--------------------------
1       1       a
2       3       aa
3       6       aaa
4       10      aaaa
5       15      aaaaa
6       21      aaaaaa
7       28      aaaaaaa
8       36      aaaaaaaa
9       45      aaaaaaaaa
10      55      aaaaaaaaaa

Would you believe this is MySQL?

Bonus

A nice “hidden” feature is the support of new pessimistic locking clauses, in particular FOR UPDATE SKIP LOCKED. This has been available in Oracle for ages and since recently in PostgreSQL as well, and now in MySQL. A very useful feature when implementing simple message queues or reservation systems. More details in this article here:

http://mysqlserverteam.com/mysql-8-0-1-using-skip-locked-and-nowait-to-handle-hot-rows/

Of course, SKIP LOCKED (and NOWAIT) will be supported in jOOQ 3.10 as well.

How to Execute SQL Batches With JDBC and jOOQ

Some databases (in particular MySQL and T-SQL databases like SQL Server and Sybase) support a very nice feature: They allow for running a “batch” of statements in a single statement. For instance, in SQL Server, you can do something like this:

-- Statement #1
DECLARE @table AS TABLE (id INT);

-- Statement #2
SELECT * FROM @table;

-- Statement #3
INSERT INTO @table VALUES (1),(2),(3);

-- Statement #4
SELECT * FROM @table;

This is a batch of 4 statements, and it can be executed as a single statement both with JDBC and with jOOQ. Let’s see how:

Executing a batch with JDBC

Unfortunately, the term “batch” has several meanings, and in this case, I don’t mean the JDBC Statement.addBatch() method, which is actually a bit clumsy as it doesn’t allow for fetching mixed update counts and result sets.

Instead, what I’ll be doing is this:

String sql =
    "\n  -- Statement #1                              "
  + "\n  DECLARE @table AS TABLE (id INT);            "
  + "\n                                               "
  + "\n  -- Statement #2                              "
  + "\n  SELECT * FROM @table;                        "
  + "\n                                               "
  + "\n  -- Statement #3                              "
  + "\n  INSERT INTO @table VALUES (1),(2),(3);       "
  + "\n                                               "
  + "\n  -- Statement #4                              "
  + "\n  SELECT * FROM @table;                        ";

try (PreparedStatement s = c.prepareStatement(sql)) {
    fetchLoop:
    for (int i = 0, updateCount = 0;; i++) {
        boolean result = (i == 0)
            ? s.execute()
            : s.getMoreResults();

        if (result)
            try (ResultSet rs = s.getResultSet()) {
                System.out.println("\nResult:");

                while (rs.next())
                    System.out.println("  " + rs.getInt(1));
            }
        else if ((updateCount = s.getUpdateCount()) != -1)
            System.out.println("\nUpdate Count: " + updateCount);
        else
            break fetchLoop;
    }
}

The output of the above program being:

Result:

Update Count: 3

Result:
  1
  2
  3

The above API usage is a somewhat “hidden” – or at least not every day usage of the JDBC API. Mostly, you’ll be using Statement.executeQuery() when you’re expecting a ResultSet, or Statement.executeUpdate() otherwise.

But in our case, we don’t really know what’s happening. We’re going to discover the result types on the fly, when executing the statement. Here are the main JDBC API features that we’re using, along with an explanation:

  • Statement.execute(): This method should be used if we don’t know the result type. The method returns a boolean, which is true when the first statement in the batch produced a ResultSet and false otherwise.
  • Statement.getMoreResults(): This method returns the same kind of boolean value as the previous Statement.execute() method, but it does so for the next statement in the batch (i.e. for every statement except the first).
  • If the current result is a ResultSet (the boolean was true), then we’ll obtain that ResultSet through Statement.getResultSet() (we can obviously no longer call the usual Statement.executeQuery() to obtain the ResultSet).
  • If the current result is not a ResultSet (the boolean was true), then we’ll check the update count value through Statement.getUpdateCount().
  • If the update count is -1, then we’ve reached the end of the batch.

What a nice state machine!

The nice thing about this is that a batch may be completely nondeterministic. E.g. there may be triggers, T-SQL blocks (e.g. an IF statement), stored procedures, and many other things that contribute result sets and/or update counts. In some cases, we simply don’t know what we’ll get.

Executing a batch with jOOQ

It’s great that the JDBC API designers have thought of this exotic API usage on a rather low level. This makes JDBC extremely powerful. But who remembers the exact algorithm all the time? After all, the above minimalistic version required around 20 lines of code for something as simple as that.

Compare this to the following jOOQ API usage:

System.out.println(
    DSL.using(c).fetchMany(sql)
);

The result being:

Result set:
+----+
|  id|
+----+
Update count: 3
Result set:
+----+
|  id|
+----+
|   1|
|   2|
|   3|
+----+

Huh! Couldn’t get much simpler than that! Let’s walk through what happens:

The DSLContext.fetchMany() method is intended for use when users know there will be many result sets and/or update counts. Unlike JDBC which reuses ordinary JDBC API, jOOQ has a different API here to clearly distinguish between behaviours. The method then eagerly fetches all the results and update counts in one go (lazy fetching is on the roadmap with issue #4503).

The resulting type is org.jooq.Results, a type that extends List<Result>, which allows for iterating over the results only, for convenience. If a mix of results or update counts need to be consumed, the Results.resultsOrRows() method can be used.

A note on warnings / errors

Note that if your batch raises errors, then the above JDBC algorithm is incomplete. Read more about this in this follow-up post.

The 10 Most Popular DB Engines (SQL and NoSQL) in 2015

About two years ago, we’ve published this post about the 10 most popular DB engines, where we analyzed the data published by Solid IT on their DB Ranking website.

In the meantime, the Solid IT measurement system has found to be a credible source, such that the website has also been cited at Gartner, InfoWorld, and many other sources. For more details about how this is measured, check out the relevant website on db-engines.com:
http://db-engines.com/en/ranking_definition

Comparing the top 10 list, we can see that the players have shifted, heavily:

Reproduced with permission of DB-Engines.com

Reproduced with permission of DB-Engines.com

The top 3 elefants are still Oracle, MySQL and Microsoft SQL Server, but the runner-ups have changed. While PostgreSQL is still gaining traction, it has lost grounds compared to MongoDB.

Also, Cassandra and Redis have pushed out Sybase and Teradata from the top 10!

When 2 years ago, there had been only a single non RDBMS in this top 10 list, there are now 3, all of which “schema-less”, and they’re gaining additional momentum.

Clearly, vendors of RDBMS will need to move quickly to accommodate the needs of document storage and key-value storage the way their new competitors do.

For us on the jOOQ blog, a much more interesting perspective is to see where our supported databases currently are in this ranking:

Reproduced with permission of DB-Engines.com

Reproduced with permission of DB-Engines.com

With the recent release of jOOQ 3.7, we’ve added another three databases to our list of now 21 supported RDBMS. Compared to last year’s ranking, almost all of these RDBMS are gaining traction as well, apart from SQL Server.

One thing is certain: We still have very exciting times ahead. Stay tuned for more updates, and check out the current ranking here:

http://db-engines.com/en/ranking

Use MySQL’s Strict Mode on all new Projects!

MySQL is a database that has been bending the SQL standard in ways that make it hard to move off MySQL. What may appear to be a clever technique for vendor lockin (or maybe just oversight of the standard) can be quite annoying in understanding the real meaning of the SQL language.

One such example is MySQL’s interpretation of how GROUP BY works. In MySQL, unlike any other database, you can put arbitrary expressions into your SELECT clause, even if they do not have a formal dependency on the GROUP BY expression. For instance:

SELECT employer, first_name, last_name
FROM employees
GROUP BY employer

This will work in MySQL, but what does it mean? If we only have one resulting record per employer, which one of the employees will be returned? The semantics of the above query is really this one:

SELECT employer, ARBITRARY(first_name), ARBITRARY(last_name)
FROM employees
GROUP BY employer

If we assume that there is such an aggregation function as ARBITRARY(). Some may claim that this can be used for some clever performance “optimisation”. I say: Don’t. This is so weakly specified, it is not even clear if the two references of this pseudo-ARBITRARY() aggregate function will produce values from the same record.

Just look at the number of Stack Overflow questions that evolve around the “not a GROUP BY expression” error:

I’m sure that parts of this damage that has been caused to a generation of SQL developers is due to the fact that this works in some databases.

ONLY_FULL_GROUP_BY

But there is a flag in MySQL called ONLY_FULL_GROUP_BY, and Morgan Tocker, the MySQL community manager suggests eventually turning it on by default.

MySQL community members tend to agree that this is a good decision in the long run.

While it is certainly very hard to turn this flag on for a legacy application, all new applications built on top of MySQL should make sure to turn on this flag. In fact, new applications should even consider turning on “strict SQL mode” entirely, to make sure they get a better, more modern SQL experience.

For more information about MySQL server modes, please consider the manual:

http://dev.mysql.com/doc/refman/5.7/en/sql-mode.html

Don’t Migrate to MariaDB just yet. MySQL is Back!

Now that I have your attention, I’d like to invite you to a critical review of where we’re at in the MySQL vs. MariaDB debate. Around one month ago, I visited Oracle Open World 2014, and I’ve met with Morgan Tocker, the MySQL community manager at Oracle to learn about where MySQL is heading.

Who “is” MySQL

An interesting learning for myself is the fact that according to Morgan, there had been quite a few former MySQL AB employees that stayed with MySQL when Sun acquired the database, and that are still there now (perhaps after a short break), now that Oracle owns Sun and thus MySQL. In fact, right now as we speak, Oracle is pushing hard to get even more people on board of the MySQL team. When this article was written, there were 21 open MySQL jobs on this blog post dating February 25, 2014.

More details about Oracle’s plans to promote and push MySQL can be seen in this presentation by Morten Andersen:

oracle-mysql

So, if you’re still contemplating a migration to MariaDB, maybe give this all a second thought. MySQL is still the database that is being used by most large companies, including Facebook and LinkedIn. MySQL won’t go away and it will get much better in the next couple of years – maybe even to a point where it has a similar amount of features as PostgreSQL? We can only hope.

Need another convincing argument? You can now use Oracle Enterprise Manager for both Oracle and MySQL databases. How awesome is that!

enterprise-manager

Stay tuned for a couple of additional blog posts on this blog, about what’s new and what’s on the MySQL roadmap in the next years. See all of Morten Andersen’s slides here:

Frightening Facts about MySQL

So you might’ve seen Destroy all Software’s talk about JavaScript:

Here’s a similar talk (less funny more scary) about MySQL: