Expression Tree Equivalence
Here’s an introduction to equivalent expression trees from my SQL Masterclass: Let’s assume we want to evaluate the following expression:A + (A - B)Now in maths, it can be proven trivially that by the laws of associativity, the above expression is really the same as this one:
(A + A) - BWe didn’t really win anything yet, but we can equally trivially turn the above addition into a multiplication that can be proven to be exactly equivalent:
(2 * A) - BNow, imagine that
A
is an extremely “expensive” value, e.g. a value that we have to fetch from the disk, or worse, from the network. The cost of accessing A is thus very high and if we can avoid one access to A by the above transformation, then the resulting expression is much faster to evaluate than the original one, even if mathematically speaking, it does not matter at all.
That’s what optimisers do all the time. They transform expressions into equivalent expressions which are faster to execute. And they constantly adapt, so if the DBA chooses to move A from a remote server to the local server, thus reducing the cost of access to A, perhaps, suddenly, the original plan will be better again, because of the cost of multiplication (just an example).
A SQL Example
An equally trivial SQL example from my SQL Masterclass shows that it really doesn’t matter mathematically if we run this query:
SELECT first_name, last_name
FROM customer
WHERE first_name = 'JAMIE'
SELECT *
FROM (
SELECT first_name, last_name
FROM customer
)
WHERE first_name = 'JAMIE'
WHERE
before SELECT
:

SELECT
before WHERE
:

SELECT
clause, but to the WHERE
clause!
We can prove (let’s leave the exercise to the reader), that both expressions are exactly equivalent, so optimisers can pick whichever they have a more efficient matching algorithm for:
- Ordinary row stores will probably apply the selection first, before projecting, as the expensive operation is accessing rows and if we can filter rows early (e.g. through indexes) then the whole query is less expensive)
- A column store might (just a hypothesis) apply the projection first, as the expensive operation is accessing the columns. We then might have to traverse the entire column anyway to filter out the rows
Let’s Talk About JOINs (and Their Elimination)
JOIN elimination is one of the most basic and yet most powerful SQL transformations, which is implemented by most modern databases in one way or another. It is so powerful, because we’re writing (potentially useless) JOINs all the time, when writing a bit more complex queries. See also our article about JOINs for a general overview. Now, consider the following simplified schema, taken from the Sakila database:
CREATE TABLE address (
address_id INT NOT NULL,
address VARCHAR(50) NOT NULL,
CONSTRAINT pk_address PRIMARY KEY (address_id)
);
CREATE TABLE customer (
customer_id INT NOT NULL,
first_name VARCHAR(45) NOT NULL,
last_name VARCHAR(45) NOT NULL,
address_id INT NOT NULL,
CONSTRAINT pk_customer PRIMARY KEY (customer_id),
CONSTRAINT fk_customer_address FOREIGN KEY (address_id)
REFERENCES address(address_id)
);
ADDRESS
) to a child table (CUSTOMER
) in a to-one relationship:
SELECT c.*
FROM customer c
JOIN address a
ON c.address_id = a.address_id
CUSTOMER
table and we don’t have any predicates at all, specifically not predicates using the ADDRESS
table. So, we’re completely ignoring any contributions from the ADDRESS
table. We never really needed the JOIN in the first place!
And in fact, the optimiser can prove this too, because of the FOREIGN KEY
constraint on C.ADDRESS_ID
, which guarantees that every CUSTOMER
record has exactly one corresponding ADDRESS
record. The JOIN does not duplicate, nor remove any CUSTOMER
rows, so it is unneeded and thus eliminated (by some, not all databases, will list each database at the end of the article).
So, the database can rewrite the SQL statement to the following, equivalent SQL statement in the presence of said FOREIGN KEY:
SELECT *
FROM customer c
ADDRESS
table. Neat, huh? Who would have thought that FOREIGN KEY
s can be so useful in terms of performance.
The above works if there’s also a NOT NULL
constraint on the FOREIGN KEY
. If there isn’t, e.g. as in this query:
SELECT title
FROM film f
JOIN language l ON f.original_language_id = l.language_id
NOT NULL
predicate, as such:
SELECT title
FROM film
WHERE original_language_id IS NOT NULL
LEFT [ OUTER ] JOIN
will JOIN the right table to the left table but keep rows from the left table if there is no match (again, an explanation of joins can be seen here). When we apply LEFT JOIN
to the previous query…
SELECT c.*
FROM customer c
LEFT JOIN address a
ON c.address_id = a.address_id
CUSTOMER
regardless if that customer has any ADDRESS
. This is useful if the FOREIGN KEY
is optional (nullable), or completely absent, e.g. through:
ALTER TABLE customer DROP CONSTRAINT fk_customer_address
OUTER JOIN
is even easier to eliminate, as it doesn’t require any FOREIGN KEY
constraint for the database to prove that it is unneeded. A UNIQUE
constraint on the parent table (here: ADDRESS.ADDRESS_ID
) is sufficient to show that for every CUSTOMER
there can be at most one ADDRESS
, so the LEFT JOIN
won’t duplicate any CUSTOMER
rows (Unlike INNER JOIN
, OUTER JOIN
never remove rows).
Hence, the above query can again be rewritten to the more optimal:
SELECT *
FROM customer c
OUTER JOIN
elimination is the following one, which unfortunately didn’t work on Oracle for a customer of ours, recently, in a complex query that ran rogue. Let’s look at some other tables of the Sakila database, which expose a to-many relationship:
CREATE TABLE actor (
actor_id numeric NOT NULL ,
first_name VARCHAR(45) NOT NULL,
last_name VARCHAR(45) NOT NULL,
CONSTRAINT pk_actor PRIMARY KEY (actor_id)
);
CREATE TABLE film (
film_id int NOT NULL,
title VARCHAR(255) NOT NULL,
CONSTRAINT pk_film PRIMARY KEY (film_id)
);
CREATE TABLE film_actor (
actor_id INT NOT NULL,
film_id INT NOT NULL,
CONSTRAINT pk_film_actor PRIMARY KEY (actor_id, film_id),
CONSTRAINT fk_film_actor_actor FOREIGN KEY (actor_id)
REFERENCES actor (actor_id),
CONSTRAINT fk_film_actor_film FOREIGN KEY (film_id)
REFERENCES film (film_id)
);
SELECT DISTINCT first_name, last_name
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN
to FILM_ACTOR
doesn’t contribute anything to the result, but because we’re joining a to-many relationship here (from parent table ACTOR
to child table FILM_ACTOR
), the JOIN
is producing duplicate rows. Without DISTINCT
, we’d get something like:
FIRST_NAME LAST_NAME ---------------------- ... PENELOPE GUINESS PENELOPE GUINESS PENELOPE GUINESS NICK WAHLBERG NICK WAHLBERG ...But thanks to the
DISTINCT
keyword, the result is (provably) no different from the result of this much simpler query:
SELECT DISTINCT first_name, last_name
FROM actor a
DISTINCT
cannot be eliminated, unless we already have a UNIQUE
constraint on (FIRST_NAME, LAST_NAME)
).
Why not Just Refactor the SQL Manually?
Of course, all of this shouldn’t be needed if our SQL were perfect. In the above trivial examples, the SQL can (and should) be re-written manually to improve quality. But note that:- Developers make mistakes, and those mistakes may be very subtle when queries get more complex. I’ll show an example below.
- The presence of this feature actually encourages writing more complex SQL, especially when using reusable views. I’ll show another example below.
- Finally, I’ve previously advocated avoiding needless, mandatory work, like
SELECT *
. Such work is mandatory because the optimiser cannot prove its needlessness. In the case of theseJOINs
, the optimiser can prove the needlessness, so the work is no longer mandatory. It can be eliminated.
SELECT c.name, count(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
JOIN film f USING (film_id)
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name
ORDER BY count(*) DESC
ACTOR_ID = 1
(Penelope Guiness), we’re looking for all the different film categories she played in, and the number of films per category. This is easier to understand when we look at the result:
NAME COUNT Horror 3 Family 2 New 2 Classics 2 Games 2 Music 1 Sci-Fi 1 Animation 1 Sports 1 Children 1 Comedy 1 Documentary 1 Foreign 1Now, can you spot the unneeded JOINs? In fact, we never needed
ACTOR
, nor did we need FILM
SELECT c.name, count(*)
FROM film_actor fa
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name
ORDER BY count(*) DESC
JOIN
are needed:
SELECT
c.name, count(*),
a.first_name, a.last_name,
array_agg(f.title ORDER BY f.title)
FROM actor a
JOIN film_actor fa USING (actor_id)
JOIN film f USING (film_id)
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name, a.first_name, a.last_name
ORDER BY count(*) DESC
NAME COUNT FIRST_NAME LAST_NAME FILMS Horror 3 PENELOPE GUINESS {"ELEPHANT TROJAN","LADY STAGE","RULES HUMAN"} Family 2 PENELOPE GUINESS {"KING EVOLUTION","SPLASH GUMP"} New 2 PENELOPE GUINESS {"ANGELS LIFE","OKLAHOMA JUMANJI"} Classics 2 PENELOPE GUINESS {"COLOR PHILADELPHIA","WESTWARD SEABISCUIT"} Games 2 PENELOPE GUINESS {"BULWORTH COMMANDMENTS","HUMAN GRAFFITI"} Music 1 PENELOPE GUINESS {"WIZARD COLDBLOODED"} Sci-Fi 1 PENELOPE GUINESS {"CHEAPER CLYDE"} Animation 1 PENELOPE GUINESS {"ANACONDA CONFESSIONS"} Sports 1 PENELOPE GUINESS {"GLEAMING JAWBREAKER"} Children 1 PENELOPE GUINESS {"LANGUAGE COWBOY"} Comedy 1 PENELOPE GUINESS {"VERTIGO NORTHWEST"} Documentary 1 PENELOPE GUINESS {"ACADEMY DINOSAUR"} Foreign 1 PENELOPE GUINESS {"MULHOLLAND BEAST"}As you can see, this optimisation can be very useful on your legacy SQL, because if we maintain a complex query, we might not always be able to see all the JOINs that are really needed. Reusable Views Sometimes, we simply add additional
JOINs
for convenience, when building complex queries from simpler ones, e.g. by using views (which is a completely underrated RDBMS feature! You should all write more views).
Consider this view:
CREATE VIEW v_customer AS
SELECT
c.first_name, c.last_name,
a.address, ci.city, co.country
FROM customer c
JOIN address a USING (address_id)
JOIN city ci USING (city_id)
JOIN country co USING (country_id)
CITY
and COUNTRY
table as well.
From now on (with this view), we can simply select from the view and “forget” about how it came to be. Now, let’s consider we completely forget about the underlying table, because the view was so useful all the time. We could think about doing this:
SELECT first_name, last_name
FROM v_customer
JOIN
elimination. A view isn’t really anything special, just a “macro” of some stored SQL (beware of some databases, where this isn’t always the case, e.g. MySQL, which Bill Karwin was kind enough to hint me at). So the above statement will be transformed into:
SELECT first_name, last_name
FROM (
SELECT
c.first_name, c.last_name,
a.address, ci.city, co.country
FROM customer c
JOIN address a USING (address_id)
JOIN city ci USING (city_id)
JOIN country co USING (country_id)
) v_customer
SELECT first_name, last_name
FROM (
SELECT
c.first_name, c.last_name
FROM customer c
JOIN address a USING (address_id)
JOIN city ci USING (city_id)
JOIN country co USING (country_id)
) v_customer
SELECT first_name, last_name
FROM (
SELECT
c.first_name, c.last_name
FROM customer c
) v_customer
SELECT first_name, last_name
FROM customer
Cool, So Can My Database Do It?
Perhaps! Let’s look at the three different types ofJOIN
elimination in the context of these databases:
- DB2 LUW 10.5
- MySQL 8.0.2
- Oracle 12.2.0.1
- PostgreSQL 9.6
- SQL Server 2014
INNER JOIN Elimination
Remember, this depends on the presence (and usefulness) of aFOREIGN KEY
constraint. The SQL statement we’re using here is:
SELECT first_name, last_name
FROM customer c
JOIN address a ON c.address_id = a.address_id
SELECT first_name, last_name
FROM customer c
ADDRESS
table:
Explain Plan | -----------------------------------------------------------------------| ID | Operation | Rows | Cost | 1 | RETURN | | 61 | 2 | FETCH CUSTOMER | 599 of 599 (100.00%) | 61 | 3 | IXSCAN IDX_CUSTOMER_FK_ADDRESS_ID | 599 of 599 (100.00%) | 20 |MySQL MySQL 8, apart from finally introducing CTE and window functions (yay), has a lot of new optimiser features, read Morgan Tocker’s useful optimiser guide for details. Unfortunately,
INNER JOIN
elimination is not implemented:
ID TABLE TYPE REF ROWS EXTRA 1 c ALL 599 1 a eq_ref sakila.c.address_id 1 Using indexNot only is the
JOIN
executed, but it is executed using a nested loop with 599 index lookups, as MySQL still only supports NESTED LOOP JOINs, not HASH JOINs.
Bummer.
Oracle
No problem at all for Oracle:
------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 599 | 28752 | 5 (0)| 00:00:01 | | 1 | TABLE ACCESS FULL| CUSTOMER | 599 | 28752 | 5 (0)| 00:00:01 | ------------------------------------------------------------------------------… the
JOIN
is eliminated as expected.
PostgreSQL
Unfortunately, PostgreSQL cannot eliminate INNER JOIN:
Hash Join (cost=19.57..42.79 rows=599 width=13) Hash Cond: (c.address_id = a.address_id) -> Seq Scan on customer c (cost=0.00..14.99 rows=599 width=15) -> Hash (cost=12.03..12.03 rows=603 width=4) -> Seq Scan on address a (cost=0.00..12.03 rows=603 width=4)Not as bad as in MySQL, though, as PostgreSQL chose to use a HASH JOIN to combine the two tables. SQL Server No problemo for SQL Server, the
ADDRESS
table access is gone!
|--Table Scan(OBJECT:([sakila].[dbo].[customer] AS ))Notice, however, that SQL Server can only eliminate
INNER JOIN
on NOT NULL
FOREIGN KEYs!
Excellent. What about…
OUTER JOIN Elimination
This one is a bit easier to prove for the database, remember? We don’t rely on anyFOREIGN KEY
anymore. A UNIQUE
key in the parent table is sufficient to eliminate an OUTER JOIN
. We can safely expect that if the INNER JOIN
could be eliminated (DB2, Oracle, SQL Server), then an OUTER JOIN
can be eliminated, too.
Here’s the query:
SELECT first_name, last_name
FROM customer c
LEFT JOIN address a ON c.address_id = a.address_id
Explain Plan | -----------------------------------------------------------------------| ID | Operation | Rows | Cost | 1 | RETURN | | 61 | 2 | FETCH CUSTOMER | 599 of 599 (100.00%) | 61 | 3 | IXSCAN IDX_CUSTOMER_FK_ADDRESS_ID | 599 of 599 (100.00%) | 20 |MySQL Still nope:
ID TABLE TYPE REF ROWS EXTRA 1 c ALL 599 1 a eq_ref sakila.c.address_id 1 Using indexOracle Perfect:
------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 599 | 28752 | 5 (0)| 00:00:01 | | 1 | TABLE ACCESS FULL| CUSTOMER | 599 | 28752 | 5 (0)| 00:00:01 | ------------------------------------------------------------------------------PostgreSQL Unlike
INNER JOIN
elimination, this works. Great!
Seq Scan on customer c (cost=0.00..14.99 rows=599 width=13)SQL Server As expected, good:
|--Table Scan(OBJECT:([sakila].[dbo].[customer] AS ))Finally…
OUTER JOIN Elimination with DISTINCT
Remember, the query here navigates a to-many relationship, producing duplicate records of the parent table, but then removes all those duplicates again by- Ignoring contributions from the child table
- Removing duplicates with
DISTINCT
SELECT DISTINCT first_name, last_name
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
SELECT DISTINCT first_name, last_name
FROM actor a
OUTER JOIN
, not with INNER JOIN
, as the latter might remove rows, so we have to execute it to see if it does.
DB2 LUW
Cool, this actually works!
Explain Plan | ----------------------------------------------------------------------| ID | Operation | Rows | Cost | 1 | RETURN | | 20 | 2 | TBSCAN | 200 of 200 (100.00%) | 20 | 3 | SORT (UNIQUE) | 200 of 200 (100.00%) | 20 | 4 | TBSCAN ACTOR | 200 of 200 (100.00%) | 20 |There’s no access to the
FILM_ACTOR
table, nor to its indexes. Very nice.
MySQL
As this is a more sophisticated transformation than the previous ones, we don’t have high hopes here.
ID TABLE TYPE REF ROWS EXTRA 1 a ALL 200 Using temporary 1 fa ref sakila.a.actor_id 27 Using index; DistinctThis has become a rather expensive query, again because of the lack of HASH JOIN support in MySQL! Oracle I’m very surprised to see that Oracle doesn’t support this optimisation, we’re executing the full query:
--------------------------------------------------------------- | Id | Operation | Name | Rows | --------------------------------------------------------------- | 0 | SELECT STATEMENT | | 5462 | | 1 | HASH UNIQUE | | 5462 | | 2 | NESTED LOOPS OUTER| | 5462 | | 3 | TABLE ACCESS FULL| ACTOR | 200 | |* 4 | INDEX RANGE SCAN | IDX_FK_FILM_ACTOR_ACTOR | 27 | ---------------------------------------------------------------Curiously, Oracle also chose a NESTED LOOP JOIN in this case, even if we could have loaded the entire index on the
FILM_ACTOR
table into memory and then HASH JOINed it to ACTOR
. Note that the cardinality estimate of the resulting query is quite off, too, despite the DISTINCT
operation. This can lead to significant effects in an upstream query, which selects from this query (e.g. if stored as a view) – which is what happened to our customer.
PostgreSQL
PostgreSQL also doesn’t support this elimination, but at least gets cardinality estimates much more accurately and chooses a HASH JOIN operation:
HashAggregate (cost=193.53..194.81 rows=128 width=13) Group Key: a.first_name, a.last_name -> Hash Right Join (cost=6.50..166.22 rows=5462 width=13) Hash Cond: (fa.actor_id = a.actor_id) -> Seq Scan on film_actor fa (cost=0.00..84.62 rows=5462 width=2) -> Hash (cost=4.00..4.00 rows=200 width=17) -> Seq Scan on actor a (cost=0.00..4.00 rows=200 width=17)SQL Server The pleasant surprise came from SQL Server, which does support this optimisation too:
|--Sort(DISTINCT ORDER BY:([a].[first_name] ASC, [a].[last_name] ASC)) |--Table Scan(OBJECT:([sakila].[dbo].[actor] AS [a]))As you can see, no access to any
FILM_ACTOR
related objects!
Summary
Here’s a summary of what databases can eliminate:Database | INNER JOIN: to-one | INNER JOIN nullable: to-one | OUTER JOIN: to-one | OUTER JOIN DISTINCT: to-many |
---|---|---|---|---|
DB2 LUW 10.5 | Yep | Yep | Yep | Yep |
MySQL 8.0.2 | Nope | Nope | Nope | Nope |
Oracle 12.2.0.1 | Yep | Yep | Yep | Nope |
PostgreSQL 9.6 | Nope | Nope | Yep | Nope |
SQL Server 2014 | Yep | Nope | Yep | Yep |
Conclusion
JOIN
elimination is a very simple to understand, yet incredibly powerful feature that modern databases support to help developers build and maintain complex SQL queries without worrying too much about performance side effects.
It is possible because SQL is a 4GL (fourth-generation programming language), i.e. a declarative language whose parsed expression trees can “easily” be transformed into equivalent, simpler, and faster to execute expression trees. This work is constantly done by the optimiser behind the scenes, often without you even noticing (see example where unnecessary ACTOR
and FILM
tables were removed).
Currenly, DB2 and SQL Server are the leaders here with Oracle being a close runner-up, at least in my investigation. There’s hope for PostgreSQL, and a bit less hope for MySQL right now. There are tons of other interesting SQL transformations, which I’ll blog about in future blog posts, which may make the difference in other kinds of complex queries.
If you were intrigued by this sort of functionality, do have a look at my most recent SQL talk, which helps developers understand the real value of SQL in the context of the optimiser:
Dear Lukas,
please check for Oracle 12.2 – we’ve expanded the join elimination capabilities.
Jonathan Lewis reported about a relevant example at https://jonathanlewis.wordpress.com/2017/01/10/join-elimination-12-2/
At times you want to control Oracle (many people do it) – here is how.
Another waw to look at this: use OPTIMIZER_FEATURES_ENABLE to steer
the optimizer in Oracle – https://docs.oracle.com/database/122/REFRN/OPTIMIZER_FEATURES_ENABLE.htm#REFRN10141
Kind regards
Thomas
You’re right, my mistake. I was actually using 12.2.0.1, not 12.1. Your turn :)
Nice example by Jonathan Lewis, btw. Those correspond to my examples 1/2 where the “to-one” join to the parent table is eliminated. My last example (which doesn’t work in Oracle 12.2) should remove the “to-many” join to a child table.
Yep – I’ll notify our optimzer folks and check our BUGDB. If you have a customer case, please notify me directly by email.
I don’t have a customer case. We rewrote the query manually – it wasn’t too difficult.
Note that MariaDB supports join elimination (although not all the three use cases) since the version 5.1 (end of 2009): https://mariadb.com/kb/en/table-elimination/
Thanks for the hint, nice to know!
I’m surprised that MariaDB 10 was left out of the comparison, even with limited join elimination.
Why are you surprised? :) Many RDBMS could have been considered, but only the “top 5” made it in this comparison:
https://db-engines.com/en/ranking
Hi,
The data warehouse modeling technique AnchorModeling depends on table elimination:
http://www.anchormodeling.com/?page_id=22
They use it in view and in table functions in SQL Server.
Regards,
Juan-José