JOIN Elimination: An Essential Optimiser Feature for Advanced SQL Usage

The SQL language has one great advantage over procedural, object oriented, and “ordinary” functional programming languages. The fact that it is truly declarative (i.e. a 4GL / fourth generation programming language) means that a sophisticated optimiser can easily transform one SQL expression into another, equivalent SQL expression, which might be faster to execute.

How does this work?

Expression Tree Equivalence

Here’s an introduction to equivalent expression trees from my SQL Masterclass: Let’s assume we want to evaluate the following expression:

A + (A - B)

Now in maths, it can be proven trivially that by the laws of associativity, the above expression is really the same as this one:

(A + A) - B

We didn’t really win anything yet, but we can equally trivially turn the above addition into a multiplication that can be proven to be exactly equivalent:

(2 * A) - B

Now, imagine that A is an extremely “expensive” value, e.g. a value that we have to fetch from the disk, or worse, from the network. The cost of accessing A is thus very high and if we can avoid one access to A by the above transformation, then the resulting expression is much faster to evaluate than the original one, even if mathematically speaking, it does not matter at all.

That’s what optimisers do all the time. They transform expressions into equivalent expressions which are faster to execute. And they constantly adapt, so if the DBA chooses to move A from a remote server to the local server, thus reducing the cost of access to A, perhaps, suddenly, the original plan will be better again, because of the cost of multiplication (just an example).

A SQL Example

An equally trivial SQL example from my SQL Masterclass shows that it really doesn’t matter mathematically if we run this query:

SELECT first_name, last_name
FROM customer
WHERE first_name = 'JAMIE'

Or this one:

SELECT *
FROM (
  SELECT first_name, last_name
  FROM customer
)
WHERE first_name = 'JAMIE'

With the SQL language, it may be a bit harder to see that these are exactly equivalent SQL statements, but if we translate the above queries to relational algebra, it may become more visible:

Selection before projection

… or WHERE before SELECT:

Projection then selection

… or SELECT before WHERE:

Don’t be fooled by relational algebra‘s term “selection”. It does not correspond to the SELECT clause, but to the WHERE clause!

We can prove (let’s leave the exercise to the reader), that both expressions are exactly equivalent, so optimisers can pick whichever they have a more efficient matching algorithm for:

  • Ordinary row stores will probably apply the selection first, before projecting, as the expensive operation is accessing rows and if we can filter rows early (e.g. through indexes) then the whole query is less expensive)
  • A column store might (just a hypothesis) apply the projection first, as the expensive operation is accessing the columns. We then might have to traverse the entire column anyway to filter out the rows

Let’s Talk About JOINs (and Their Elimination)

JOIN elimination is one of the most basic and yet most powerful SQL transformations, which is implemented by most modern databases in one way or another. It is so powerful, because we’re writing (potentially useless) JOINs all the time, when writing a bit more complex queries. See also our article about JOINs for a general overview.

Now, consider the following simplified schema, taken from the Sakila database:

CREATE TABLE address (
  address_id INT NOT NULL,
  address VARCHAR(50) NOT NULL,
  CONSTRAINT pk_address PRIMARY KEY (address_id)
);

CREATE TABLE customer (
  customer_id INT NOT NULL,
  first_name VARCHAR(45) NOT NULL,
  last_name VARCHAR(45) NOT NULL,
  address_id INT NOT NULL,
  CONSTRAINT pk_customer PRIMARY KEY  (customer_id),
  CONSTRAINT fk_customer_address FOREIGN KEY (address_id) 
    REFERENCES address(address_id)
);

Let’s ignore indexing and other useful features for this example.

INNER JOIN Elimination

The following query shows a common JOIN use-case, joining a parent table (ADDRESS) to a child table (CUSTOMER) in a to-one relationship:

SELECT c.*
FROM customer c
JOIN address a 
ON c.address_id = a.address_id

We intended to fetch all customers and their addresses. But observe: We project only columns from the CUSTOMER table and we don’t have any predicates at all, specifically not predicates using the ADDRESS table. So, we’re completely ignoring any contributions from the ADDRESS table. We never really needed the JOIN in the first place!

And in fact, the optimiser can prove this too, because of the FOREIGN KEY constraint on C.ADDRESS_ID, which guarantees that every CUSTOMER record has exactly one corresponding ADDRESS record. The JOIN does not duplicate, nor remove any CUSTOMER rows, so it is unneeded and thus eliminated (by some, not all databases, will list each database at the end of the article).

So, the database can rewrite the SQL statement to the following, equivalent SQL statement in the presence of said FOREIGN KEY:

SELECT *
FROM customer c

Now, quite obviously, this query will be faster than the previous one, if the entire JOIN can be avoided, and thus the entire access to the ADDRESS table. Neat, huh? Who would have thought that FOREIGN KEYs can be so useful in terms of performance.

The above works if there’s also a NOT NULL constraint on the FOREIGN KEY. If there isn’t, e.g. as in this query:

SELECT title
FROM film f
JOIN language l ON f.original_language_id = l.language_id

The JOIN can still be eliminated, but there needs to be a replacement NOT NULL predicate, as such:

SELECT title
FROM film
WHERE original_language_id IS NOT NULL

OUTER JOIN Elimination

A LEFT [ OUTER ] JOIN will JOIN the right table to the left table but keep rows from the left table if there is no match (again, an explanation of joins can be seen here). When we apply LEFT JOIN to the previous query…

SELECT c.*
FROM customer c
LEFT JOIN address a 
ON c.address_id = a.address_id

… then we’ll fetch all rows from CUSTOMER regardless if that customer has any ADDRESS. This is useful if the FOREIGN KEY is optional (nullable), or completely absent, e.g. through:

ALTER TABLE customer DROP CONSTRAINT fk_customer_address

OUTER JOIN is even easier to eliminate, as it doesn’t require any FOREIGN KEY constraint for the database to prove that it is unneeded. A UNIQUE constraint on the parent table (here: ADDRESS.ADDRESS_ID) is sufficient to show that for every CUSTOMER there can be at most one ADDRESS, so the LEFT JOIN won’t duplicate any CUSTOMER rows (Unlike INNER JOIN, OUTER JOIN never remove rows).

Hence, the above query can again be rewritten to the more optimal:

SELECT *
FROM customer c

OUTER JOIN Elimination with DISTINCT

Another interesting case of OUTER JOIN elimination is the following one, which unfortunately didn’t work on Oracle for a customer of ours, recently, in a complex query that ran rogue. Let’s look at some other tables of the Sakila database, which expose a to-many relationship:

CREATE TABLE actor (
  actor_id numeric NOT NULL ,
  first_name VARCHAR(45) NOT NULL,
  last_name VARCHAR(45) NOT NULL,
  CONSTRAINT pk_actor PRIMARY KEY (actor_id)
);

CREATE TABLE film (
  film_id int NOT NULL,
  title VARCHAR(255) NOT NULL,
  CONSTRAINT pk_film PRIMARY KEY (film_id)
);

CREATE TABLE film_actor (
  actor_id INT NOT NULL,
  film_id  INT NOT NULL,
  CONSTRAINT pk_film_actor PRIMARY KEY (actor_id, film_id),
  CONSTRAINT fk_film_actor_actor FOREIGN KEY (actor_id)
    REFERENCES actor (actor_id),
  CONSTRAINT fk_film_actor_film FOREIGN KEY (film_id)
    REFERENCES film (film_id)
);

Now, consider this query:

SELECT DISTINCT first_name, last_name
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id

We’re looking for all actors and their films, but then we project only distinct actors. Again, the JOIN to FILM_ACTOR doesn’t contribute anything to the result, but because we’re joining a to-many relationship here (from parent table ACTOR to child table FILM_ACTOR), the JOIN is producing duplicate rows. Without DISTINCT, we’d get something like:

FIRST_NAME   LAST_NAME
----------------------
...
PENELOPE     GUINESS
PENELOPE     GUINESS
PENELOPE     GUINESS
NICK         WAHLBERG
NICK         WAHLBERG
...

But thanks to the DISTINCT keyword, the result is (provably) no different from the result of this much simpler query:

SELECT DISTINCT first_name, last_name
FROM actor a

(Note, DISTINCT cannot be eliminated, unless we already have a UNIQUE constraint on (FIRST_NAME, LAST_NAME)).

Why not Just Refactor the SQL Manually?

Of course, all of this shouldn’t be needed if our SQL were perfect. In the above trivial examples, the SQL can (and should) be re-written manually to improve quality. But note that:

  • Developers make mistakes, and those mistakes may be very subtle when queries get more complex. I’ll show an example below.
  • The presence of this feature actually encourages writing more complex SQL, especially when using reusable views. I’ll show another example below.
  • Finally, I’ve previously advocated avoiding needless, mandatory work, like SELECT *. Such work is mandatory because the optimiser cannot prove its needlessness. In the case of these JOINs, the optimiser can prove the needlessness, so the work is no longer mandatory. It can be eliminated.

Here are some complex examples as promised, where this optimiser feature really shines:

Subtle Mistakes

Let’s consider the following query (in PostgreSQL syntax):

SELECT c.name, count(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
JOIN film f USING (film_id)
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name
ORDER BY count(*) DESC

What does it do? For ACTOR_ID = 1 (Penelope Guiness), we’re looking for all the different film categories she played in, and the number of films per category. This is easier to understand when we look at the result:

NAME         COUNT
Horror       3
Family       2
New          2
Classics     2
Games        2
Music        1
Sci-Fi       1
Animation    1
Sports       1
Children     1
Comedy       1
Documentary  1
Foreign      1

Now, can you spot the unneeded JOINs? In fact, we never needed ACTOR, nor did we need FILM

SELECT c.name, count(*)
FROM film_actor fa
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name
ORDER BY count(*) DESC

Cool, eh? The JOINs can be eliminated (again, in some databases, see below) and our “mistake” is no longer relevant to the query. The mistake could have also snuck (or sneaked?) in from a previous query version, which may have looked like this, projecting also the actor information and the list of films per category, in case of which the additional JOIN are needed:

SELECT 
  c.name, count(*), 
  a.first_name, a.last_name, 
  array_agg(f.title ORDER BY f.title)
FROM actor a
JOIN film_actor fa USING (actor_id)
JOIN film f USING (film_id)
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name, a.first_name, a.last_name
ORDER BY count(*) DESC

The result being:

NAME         COUNT  FIRST_NAME  LAST_NAME  FILMS
Horror       3      PENELOPE    GUINESS    {"ELEPHANT TROJAN","LADY STAGE","RULES HUMAN"}
Family       2      PENELOPE    GUINESS    {"KING EVOLUTION","SPLASH GUMP"}
New          2      PENELOPE    GUINESS    {"ANGELS LIFE","OKLAHOMA JUMANJI"}
Classics     2      PENELOPE    GUINESS    {"COLOR PHILADELPHIA","WESTWARD SEABISCUIT"}
Games        2      PENELOPE    GUINESS    {"BULWORTH COMMANDMENTS","HUMAN GRAFFITI"}
Music        1      PENELOPE    GUINESS    {"WIZARD COLDBLOODED"}
Sci-Fi       1      PENELOPE    GUINESS    {"CHEAPER CLYDE"}
Animation    1      PENELOPE    GUINESS    {"ANACONDA CONFESSIONS"}
Sports       1      PENELOPE    GUINESS    {"GLEAMING JAWBREAKER"}
Children     1      PENELOPE    GUINESS    {"LANGUAGE COWBOY"}
Comedy       1      PENELOPE    GUINESS    {"VERTIGO NORTHWEST"}
Documentary  1      PENELOPE    GUINESS    {"ACADEMY DINOSAUR"}
Foreign      1      PENELOPE    GUINESS    {"MULHOLLAND BEAST"}

As you can see, this optimisation can be very useful on your legacy SQL, because if we maintain a complex query, we might not always be able to see all the JOINs that are really needed.

Reusable Views

Sometimes, we simply add additional JOINs for convenience, when building complex queries from simpler ones, e.g. by using views (which is a completely underrated RDBMS feature! You should all write more views).

Consider this view:

CREATE VIEW v_customer AS
SELECT 
  c.first_name, c.last_name, 
  a.address, ci.city, co.country
FROM customer c
JOIN address a USING (address_id)
JOIN city ci USING (city_id)
JOIN country co USING (country_id)

It’s not unlikely that we will write a view like this, simply because we’re incredibly bored to constantly join all these tables all the time. Every time we do something with customers and addresses, we need the CITY and COUNTRY table as well.

From now on (with this view), we can simply select from the view and “forget” about how it came to be. Now, let’s consider we completely forget about the underlying table, because the view was so useful all the time. We could think about doing this:

SELECT first_name, last_name
FROM v_customer

What do you think will happen? Exactly. JOIN elimination. A view isn’t really anything special, just a “macro” of some stored SQL (beware of some databases, where this isn’t always the case, e.g. MySQL, which Bill Karwin was kind enough to hint me at). So the above statement will be transformed into:

SELECT first_name, last_name
FROM (
  SELECT 
    c.first_name, c.last_name, 
    a.address, ci.city, co.country
  FROM customer c
  JOIN address a USING (address_id)
  JOIN city ci USING (city_id)
  JOIN country co USING (country_id)
) v_customer

… which can be transformed into this (we don’t need all columns in the nested select):

SELECT first_name, last_name
FROM (
  SELECT 
    c.first_name, c.last_name
  FROM customer c
  JOIN address a USING (address_id)
  JOIN city ci USING (city_id)
  JOIN country co USING (country_id)
) v_customer

… which can be transformed into this (JOINs can be eliminated):

SELECT first_name, last_name
FROM (
  SELECT 
    c.first_name, c.last_name
  FROM customer c
) v_customer

… and finally (the subquery is not useful):

SELECT first_name, last_name
FROM customer

The view is even very useful for this particular query, thanks to JOIN elimination!

Note, the SQL transformations exposed above are simply educational. Actual optimisers may perform transformations in an entirely differently, or in a different order. This is just to show what’s possible, and what kind of stuff is being done.

Cool, So Can My Database Do It?

Perhaps! Let’s look at the three different types of JOIN elimination in the context of these databases:

  • DB2 LUW 10.5
  • MySQL 8.0.2
  • Oracle 12.2.0.1
  • PostgreSQL 9.6
  • SQL Server 2014

INNER JOIN Elimination

Remember, this depends on the presence (and usefulness) of a FOREIGN KEY constraint. The SQL statement we’re using here is:

SELECT first_name, last_name
FROM customer c
JOIN address a ON c.address_id = a.address_id

We’re hoping to get:

SELECT first_name, last_name
FROM customer c

DB2 LUW

The following execution plan (created with Markus Winand’s cool utility) shows that this works in DB2, there’s no access to the ADDRESS table:

Explain Plan                                                           |
-----------------------------------------------------------------------|
ID | Operation                           |                 Rows | Cost |
 1 | RETURN                              |                      |   61 |
 2 |  FETCH CUSTOMER                     | 599 of 599 (100.00%) |   61 |
 3 |   IXSCAN IDX_CUSTOMER_FK_ADDRESS_ID | 599 of 599 (100.00%) |   20 |

MySQL

MySQL 8, apart from finally introducing CTE and window functions (yay), has a lot of new optimiser features, read Morgan Tocker’s useful optimiser guide for details. Unfortunately, INNER JOIN elimination is not implemented:

ID  TABLE  TYPE    REF                  ROWS   EXTRA
1   c      ALL                          599    
1   a      eq_ref  sakila.c.address_id  1      Using index

Not only is the JOIN executed, but it is executed using a nested loop with 599 index lookups, as MySQL still only supports NESTED LOOP JOINs, not HASH JOINs.

Bummer.

Oracle

No problem at all for Oracle:

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |   599 | 28752 |     5   (0)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| CUSTOMER |   599 | 28752 |     5   (0)| 00:00:01 |
------------------------------------------------------------------------------

… the JOIN is eliminated as expected.

PostgreSQL

Unfortunately, PostgreSQL cannot eliminate INNER JOIN:

Hash Join  (cost=19.57..42.79 rows=599 width=13)
  Hash Cond: (c.address_id = a.address_id)
  ->  Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=15)
  ->  Hash  (cost=12.03..12.03 rows=603 width=4)
        ->  Seq Scan on address a  (cost=0.00..12.03 rows=603 width=4)

Not as bad as in MySQL, though, as PostgreSQL chose to use a HASH JOIN to combine the two tables.

SQL Server

No problemo for SQL Server, the ADDRESS table access is gone!

  |--Table Scan(OBJECT:([sakila].[dbo].[customer] AS [c]))

Notice, however, that SQL Server can only eliminate INNER JOIN on NOT NULL FOREIGN KEYs!

Excellent. What about…

OUTER JOIN Elimination

This one is a bit easier to prove for the database, remember? We don’t rely on any FOREIGN KEY anymore. A UNIQUE key in the parent table is sufficient to eliminate an OUTER JOIN. We can safely expect that if the INNER JOIN could be eliminated (DB2, Oracle, SQL Server), then an OUTER JOIN can be eliminated, too.

Here’s the query:

SELECT first_name, last_name
FROM customer c
LEFT JOIN address a ON c.address_id = a.address_id

And the outcome:

DB2 LUW

Good

Explain Plan                                                           |
-----------------------------------------------------------------------|
ID | Operation                           |                 Rows | Cost |
 1 | RETURN                              |                      |   61 |
 2 |  FETCH CUSTOMER                     | 599 of 599 (100.00%) |   61 |
 3 |   IXSCAN IDX_CUSTOMER_FK_ADDRESS_ID | 599 of 599 (100.00%) |   20 |

MySQL

Still nope:

ID  TABLE  TYPE    REF                  ROWS   EXTRA
1   c      ALL                          599    
1   a      eq_ref  sakila.c.address_id  1      Using index

Oracle

Perfect:

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |   599 | 28752 |     5   (0)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| CUSTOMER |   599 | 28752 |     5   (0)| 00:00:01 |
------------------------------------------------------------------------------

PostgreSQL

Unlike INNER JOIN elimination, this works. Great!

Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=13)

SQL Server

As expected, good:

  |--Table Scan(OBJECT:([sakila].[dbo].[customer] AS [c]))

Finally…

OUTER JOIN Elimination with DISTINCT

Remember, the query here navigates a to-many relationship, producing duplicate records of the parent table, but then removes all those duplicates again by

  • Ignoring contributions from the child table
  • Removing duplicates with DISTINCT

It’s easy to prove that this:

SELECT DISTINCT first_name, last_name
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id

Is equivalent to this:

SELECT DISTINCT first_name, last_name
FROM actor a

Remember also, this only works with OUTER JOIN, not with INNER JOIN, as the latter might remove rows, so we have to execute it to see if it does.

DB2 LUW

Cool, this actually works!

Explain Plan                                                          |
----------------------------------------------------------------------|
ID | Operation        |                 Rows | Cost                   |
 1 | RETURN           |                      |   20                   |
 2 |  TBSCAN          | 200 of 200 (100.00%) |   20                   |
 3 |   SORT (UNIQUE)  | 200 of 200 (100.00%) |   20                   |
 4 |    TBSCAN ACTOR  | 200 of 200 (100.00%) |   20                   |

There’s no access to the FILM_ACTOR table, nor to its indexes. Very nice.

MySQL

As this is a more sophisticated transformation than the previous ones, we don’t have high hopes here.

ID  TABLE  TYPE    REF                  ROWS   EXTRA
1   a      ALL                          200    Using temporary
1   fa     ref     sakila.a.actor_id    27     Using index; Distinct

This has become a rather expensive query, again because of the lack of HASH JOIN support in MySQL!

Oracle

I’m very surprised to see that Oracle doesn’t support this optimisation, we’re executing the full query:

---------------------------------------------------------------
| Id  | Operation           | Name                    | Rows  |
---------------------------------------------------------------
|   0 | SELECT STATEMENT    |                         |  5462 |
|   1 |  HASH UNIQUE        |                         |  5462 |
|   2 |   NESTED LOOPS OUTER|                         |  5462 |
|   3 |    TABLE ACCESS FULL| ACTOR                   |   200 |
|*  4 |    INDEX RANGE SCAN | IDX_FK_FILM_ACTOR_ACTOR |    27 |
---------------------------------------------------------------

Curiously, Oracle also chose a NESTED LOOP JOIN in this case, even if we could have loaded the entire index on the FILM_ACTOR table into memory and then HASH JOINed it to ACTOR. Note that the cardinality estimate of the resulting query is quite off, too, despite the DISTINCT operation. This can lead to significant effects in an upstream query, which selects from this query (e.g. if stored as a view) – which is what happened to our customer.

PostgreSQL

PostgreSQL also doesn’t support this elimination, but at least gets cardinality estimates much more accurately and chooses a HASH JOIN operation:

HashAggregate  (cost=193.53..194.81 rows=128 width=13)
  Group Key: a.first_name, a.last_name
  ->  Hash Right Join  (cost=6.50..166.22 rows=5462 width=13)
        Hash Cond: (fa.actor_id = a.actor_id)
        ->  Seq Scan on film_actor fa  (cost=0.00..84.62 rows=5462 width=2)
        ->  Hash  (cost=4.00..4.00 rows=200 width=17)
              ->  Seq Scan on actor a  (cost=0.00..4.00 rows=200 width=17)

SQL Server

The pleasant surprise came from SQL Server, which does support this optimisation too:

  |--Sort(DISTINCT ORDER BY:([a].[first_name] ASC, [a].[last_name] ASC))
       |--Table Scan(OBJECT:([sakila].[dbo].[actor] AS [a]))

As you can see, no access to any FILM_ACTOR related objects!

Summary

Here’s a summary of what databases can eliminate:

Database INNER JOIN:
to-one
INNER JOIN nullable:
to-one
OUTER JOIN:
to-one
OUTER JOIN DISTINCT:
to-many
DB2 LUW 10.5 Yep Yep Yep Yep
MySQL 8.0.2 Nope Nope Nope Nope
Oracle 12.2.0.1 Yep Yep Yep Nope
PostgreSQL 9.6 Nope Nope Yep Nope
SQL Server 2014 Yep Nope Yep Yep

Conclusion

JOIN elimination is a very simple to understand, yet incredibly powerful feature that modern databases support to help developers build and maintain complex SQL queries without worrying too much about performance side effects.

It is possible because SQL is a 4GL (fourth-generation programming language), i.e. a declarative language whose parsed expression trees can “easily” be transformed into equivalent, simpler, and faster to execute expression trees. This work is constantly done by the optimiser behind the scenes, often without you even noticing (see example where unnecessary ACTOR and FILM tables were removed).

Currenly, DB2 and SQL Server are the leaders here with Oracle being a close runner-up, at least in my investigation. There’s hope for PostgreSQL, and a bit less hope for MySQL right now. There are tons of other interesting SQL transformations, which I’ll blog about in future blog posts, which may make the difference in other kinds of complex queries.

If you were intrigued by this sort of functionality, do have a look at my most recent SQL talk, which helps developers understand the real value of SQL in the context of the optimiser:

jOOQ Tuesdays: Glenn Paulley Gives Insight into SQL’s History

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

glenn-paulley

I’m very excited to feature today Glenn Paulley who has been working with and on SQL for several decades.

Glenn, you have been a part of the database ecosystem since the very early days, having been Director of Engineering at Sybase and representing SAP with the SQL Standard committee. What is it that fascinates you so much about databases?

Data management technology has been my favourite subject within Computer Science since I was an undergraduate student at the University of Manitoba, and so it has remained throughout much of my career. I was privileged to be part of the team that implemented the first online IBM DB2 application – using IMS/TM as the transaction monitor – in Canada at Great-West Life Assurance in the late 1980’s, and we hosted some of the DB2 development team from IBM’s San Jose lab – including Don Haderle – to celebrate that achievement. So yes, I guess you could say that I’ve been around for a while.

Much of my personal expertise lies in the realm of query processing and optimization, having had Paul Larson and Frank Wm. Tompa as my Ph.D. supervisors at the University of Waterloo. But I am interested in many related database subjects: query languages certainly, multidatabase systems, and information retrieval. Two closely-related topics within database systems are of particular interest to me.  One is scale. Companies and organizations have a lot of data; billion-row tables in a relational database are fairly routine today in many companies. Advances in other technologies, such as the Internet of Things, are going to dramatically increase the data management requirements for firms wanting to take advantage of these technologies. Implementing solutions to achieve the scale required is difficult, and is of great interest to me. The related issue, quite naturally, is processing queries over such vast collections and getting the execution time down to something reasonable. So query optimization remains a favourite topic.

We’ve had a very interesting E-Mail conversation about SQL’s experiments related to Object Orientation in the late 1990’s. Today, we have ORDBMS like Oracle, Informix, PostgreSQL. Why didn’t ORDBMS succeed?

ORDBMS implementations were not as successful as their developers expected, to be sure, though I would note that many SQL implementations now contain “object” features even though objects are specifically omitted from the ISO SQL standard. Oracle 12c is a good example – Oracle’s PL/SQL object implementation supports a good selection of object-oriented programming features, such as polymorphism and single inheritance, that when coupled with collection types provide a very rich data model that can handle very complex applications. There are others, too, of course: InterSystems’ Caché product, for example, is still available.

So, while object support in relational systems is present, in many instances, to me the significant issues are (1) the performance of object constructions on larger database instances and (2) the question of where do you want objects to exist in the application stack: in the server proper, or in another tier, implemented in a true object-oriented language? The latter issue is the premise behind object-relational mapping tools, though I think that their usage often causes as many problems as they solve.

How do you see the future of the SQL language – e.g. with respect to Big Data or alternative models like document stores (which have N1QL) or graph stores (which have Open Cypher)?

My personal view is that SQL will continue to evolve; having an independent query language that permits one to query or manipulate a database but avoid writing a “program” in the traditional sense is a good idea. Over time that language might evolve to something different from today’s SQL, but it is likely that we will still call it S-Q-L. I do not expect a revolutionary approach to be successful; there is simply far too much invested in current applications and infrastructure. Don’t get me wrong – I’m not saying SQL doesn’t need improvements. It does.

Unlike the work of e.g. the JCP or w3c, which are public and open, SQL still seems to be a more closed environment. How do you see the interaction between SQL and the end user? Can “ordinary” folks participate in the future of SQL?

The SQL standard is published by the International Standards Organization (ISO), whose member countries contribute to changes to the standard and vote on them on a regular basis. Member countries that contribute to the ISO SQL standard include (at least) the US, Canada, Germany, the United Kingdom, Japan, and Korea. Participation in the standards process requires individuals or companies to belong to these “national bodies” in order to view drafts of the standard and vote on proposed changes. Usually those meetings are held in-person – at least in Canada and the United States – so there is a real cost to participation in the process.

While having SQL as an international ISO standard is, I think, worthwhile, the ISO’s business model is based on activities such as collecting revenue from the sale of standards documents. Unless governments or other benefactors sponsor the development of those standards, then it is difficult to see how the standard can be made more freely available. That issue is a political one, rather than a technical one.

After a brief stop at Conestoga College, you’re heading back to SAP. What will you be working on in the next few years?

I will be continuing to focus on database technology now that I’ve returned to SAP. So far I’ve had a fantastic experience being back at the SAP Waterloo lab, and I have every expectation that will continue into the future.

You’ve spent a lot of time at SAP (and before SAP) with Sybase. How do the different Sybase distributions relate to SAP’s new flagship database HANA?

The various SAP database systems (SQL Anywhere, IQ, ASE) all contain database technology pertinent to the HANA platform.

Last question: What do SQL and Curling have in common? 🙂

As a Level 3 curling coach, one of my tasks as a High Performance Coach for the Ontario Curling Council is to collect performance data on the athletes in our various programs. Naturally I use a database system to store that data, and perform various analyses on it, so the two are not as unrelated as you might think!

 

GROUP BY ROLLUP / CUBE

Every now and then, you come across a requirement that will bring you to your SQL limits. Many of us probably give up early and calculate stuff in Java / [or your language]. Instead, it might’ve been so easy and fast to do with SQL. If you’re working with an advanced database, such as DB2, Oracle, SQL Server, Sybase SQL Anywhere, (and MySQL in this case, which supports the WITH ROLLUP clause), you can take advantage of the ROLLUP / CUBE / GROUPING SETS grouping functions.

Lets have a look at my fictional salary progression compared to that of a fictional friend, who has chosen a different career path (observe the salary boost in 2011):

select 'Lukas'      as employee, 
       'SoftSkills' as company, 
	   80000        as salary, 
	   2007         as year 
from dual
union all select 'Lukas', 'SoftSkills', 80000,  2008 from dual
union all select 'Lukas', 'SmartSoft',  90000,  2009 from dual
union all select 'Lukas', 'SmartSoft',  95000,  2010 from dual
union all select 'Lukas', 'jOOQ',       200000, 2011 from dual
union all select 'Lukas', 'jOOQ',       250000, 2012 from dual
union all select 'Tom',   'SoftSkills', 89000,  2007 from dual
union all select 'Tom',   'SoftSkills', 90000,  2008 from dual
union all select 'Tom',   'SoftSkills', 91000,  2009 from dual
union all select 'Tom',   'SmartSoft',  92000,  2010 from dual
union all select 'Tom',   'SmartSoft',  93000,  2011 from dual
union all select 'Tom',   'SmartSoft',  94000,  2012 from dual

Now we’re used to gathering statistics using simple grouping and simple aggregate functions. For instance, let’s calculate how much Lukas and Tom earned on average over the past few years:

with data as ([above select])
select employee, avg(salary)
from data
group by employee

This will show that Lukas has earned more:

+--------+-----------+
|EMPLOYEE|AVG(SALARY)|
+--------+-----------+
|Lukas   |     132500|
|Tom     |      91500|
+--------+-----------+

So it’s probably interesting to find out what they have earned on average in which company:

with data as (...)
select company, employee, avg(salary)
from data
group by company, employee
order by company, employee

And immediately, it becomes clear where the big bucks are and that Tom has made a bad decision 😉

+----------+--------+-----------+
|COMPANY   |EMPLOYEE|AVG(SALARY)|
+----------+--------+-----------+
|jOOQ      |Lukas   |     225000|
|SmartSoft |Lukas   |      92500|
|SmartSoft |Tom     |      93000|
|SoftSkills|Lukas   |      80000|
|SoftSkills|Tom     |      90000|
+----------+--------+-----------+

ROLLUP

By adding grouping fields, we “lose” some aggregation information. In the above examples, the overall average salary per employee is no longer available directly from the result. That’s obvious, considering the grouping algorithm. But in nice-looking reports, we often want to display those grouping headers as well. This is where ROLLUP, CUBE (and GROUPING SETS) come into play. Consider the following query:

with data as (...)
select company, employee, avg(salary)
from data
group by rollup(company), employee

The above rollup function will now add additional rows to the grouping result set, holding useful aggregated values. In this case, when we “roll up the salaries of the company”, we will get the average of the remaining grouping fields, i.e. the average per employee:

+----------+--------+-----------+
|COMPANY   |EMPLOYEE|AVG(SALARY)|
+----------+--------+-----------+
|SmartSoft |Tom     |      93000|
|SoftSkills|Tom     |      90000|
|{null}    |Tom     |      91500|
|jOOQ      |Lukas   |     225000|
|SmartSoft |Lukas   |      92500|
|SoftSkills|Lukas   |      80000|
|{null}    |Lukas   |     132500|
+----------+--------+-----------+

Note how these rows hold the same information as the ones from the first query, where we were only grouping by employee… This becomes even more interesting, when we put more grouping fields into the rollup function:

with data as (...)
select company, employee, avg(salary)
from data
group by rollup(employee, company)

As you can see, the order of grouping fields is important in the rollup function. The result from this query now also adds the overall average salary paid to all employees in all companies

+----------+--------+-----------+
|COMPANY   |EMPLOYEE|AVG(SALARY)|
+----------+--------+-----------+
|SmartSoft |Tom     |      93000|
|SoftSkills|Tom     |      90000|
|{null}    |Tom     |      91500|
|jOOQ      |Lukas   |     225000|
|SmartSoft |Lukas   |      92500|
|SoftSkills|Lukas   |      80000|
|{null}    |Lukas   |     132500|
|{null}    |{null}  |     112000|
+----------+--------+-----------+

In order to identify the totals rows for reporting, you can use the GROUPING() function in DB2, Oracle, SQL Server and Sybase SQL Anywhere. In Oracle and SQL Server, there’s the even more useful GROUPING_ID() function:

with data as (...)
select grouping_id(employee, company) id, company, employee, avg(salary)
from data
group by rollup(employee, company)

It documents on what “grouping level” of the rollup function the current row was produced:

+----+----------+--------+-----------+
|  ID|COMPANY   |EMPLOYEE|AVG(SALARY)|
+----+----------+--------+-----------+
|   0|SmartSoft |Tom     |      93000|
|   0|SoftSkills|Tom     |      90000|
|   1|{null}    |Tom     |      91500|
|   0|jOOQ      |Lukas   |     225000|
|   0|SmartSoft |Lukas   |      92500|
|   0|SoftSkills|Lukas   |      80000|
|   1|{null}    |Lukas   |     132500|
|   3|{null}    |{null}  |     112000|
+----+----------+--------+-----------+

CUBE

The cube function works similar, except that the order of cube grouping fields becomes irrelevant, as all combinations of grouping are combined. This is a bit tricky to put in words, so lets put it in action:

with data as (...)
select grouping_id(employee, company) id, company, employee, avg(salary)
from data
group by cube(employee, company)

In the following result, you will get:

  • GROUPING_ID() = 0: Average per company and employee. This is the normal grouping result
  • GROUPING_ID() = 1: Average per employee
  • GROUPING_ID() = 2: Average per company
  • GROUPING_ID() = 3: Overall average
+----+----------+--------+-----------+
|  ID|COMPANY   |EMPLOYEE|AVG(SALARY)|
+----+----------+--------+-----------+
|   3|{null}    |{null}  |     112000|
|   2|jOOQ      |{null}  |     225000|
|   2|SmartSoft |{null}  |      92800|
|   2|SoftSkills|{null}  |      86000|
|   1|{null}    |Tom     |      91500|
|   0|SmartSoft |Tom     |      93000|
|   0|SoftSkills|Tom     |      90000|
|   1|{null}    |Lukas   |     132500|
|   0|jOOQ      |Lukas   |     225000|
|   0|SmartSoft |Lukas   |      92500|
|   0|SoftSkills|Lukas   |      80000|
+----+----------+--------+-----------+

In other words, using the CUBE() function, you will get grouping results for every possible combination of the grouping fields supplied to the CUBE() function, which results in 2^n GROUPING_ID()’s for n “cubed” grouping fields

Support in jOOQ

jOOQ 2.0 introduces support for these functions. If you want to translate the last select into jOOQ, you’d roughly get this Java code:

// assuming that DATA is an actual table...
create.select(
         groupingId(DATA.EMPLOYEE, DATA.COMPANY).as("id"),
         DATA.COMPANY, DATA.EMPLOYEE, avg(SALARY))
      .from(DATA)
      .groupBy(cube(DATA.EMPLOYEE, DATA.COMPANY));

With this powerful tool, you’re ready for all of those fancy reports and data overviews. For more details, read on about ROLLUP(), CUBE(), and GROUPING SETS() functions on the SQL Server documentation page, which explains it quite nicely:

http://msdn.microsoft.com/en-us/library/bb522495.aspx