Relational algebra nicely describes the various operations that we know in SQL as well from a more abstract, formal perspective. One of the most common relational JOIN operations is the “equi-join” or SQL INNER JOIN
.
The above example “equi-joins” the ACTOR
, FILM_ACTOR
, and FILM
tables from the Sakila database, in order to produce a new relation consisting of all the actors and all their associated films.
Relational operators without equivalent SQL syntax
In most cases, SQL is much much more powerful than relational algebra. However, there are three operators in relational algebra, that have no exact representation in SQL, and can only be expressed through “workarounds”. These operators are:
- Semi join
- Anti join
- Division (see also our previous article on division)
We’ll be looking only at the first two in this article.
The Wikipedia article on relational algebra nicely explains semi join and anti join visually:
Semi join
As you can see, the semi join relation Employee ⋉ Dept
only contains attributes from the Employee relation, not from the Dept relation. “Semi” means that we don’t really join the right hand side, we only check if a join would yield results for any given tuple.
In SQL, we would write the same relation using IN
or EXISTS
:
-- IN SELECT * FROM Employee WHERE DeptName IN ( SELECT DeptName FROM Dept ) -- EXISTS SELECT * FROM Employee WHERE EXISTS ( SELECT 1 FROM Dept WHERE Employee.DeptName = Dept.DeptName )
Anti join
As you can see, the anti join relaion Employee ▷ Dept
only contains attributes from the Employee relation, not from the Dept relation. “Anti” means that we don’t really join the right hand side, we only check if a join would NOT yield results for any given tuple.
In SQL, we would write the same relation using NOT IN
or NOT EXISTS
(although, in the case of NOT IN
, we need to be extra careful with NULLs
):
-- NOT IN SELECT * FROM Employee WHERE DeptName NOT IN ( SELECT DeptName FROM Dept ) -- NOT EXISTS SELECT * FROM Employee WHERE NOT EXISTS ( SELECT 1 FROM Dept WHERE Employee.DeptName = Dept.DeptName )
A better SQL with native SEMI JOIN / ANTI JOIN
While the above IN / NOT IN
and EXISTS / NOT EXISTS
predicates are useful, they are not at all as expressive as native SEMI JOIN
or ANTI JOIN
support would be. Imagine, we could write the above statements like this, instead:
Semi join
-- Natural semi join SELECT * FROM Employee NATURAL LEFT SEMI JOIN Dept -- Semi join with USING clause SELECT * FROM Employee LEFT SEMI JOIN Dept USING (DeptName) -- Semi join with ON clause SELECT * FROM Employee e LEFT SEMI JOIN Dept d ON e.DeptName = d.DeptName
Anti join
-- Natural anti join SELECT * FROM Employee NATURAL LEFT ANTI JOIN Dept -- Anti join with USING clause SELECT * FROM Employee LEFT ANTI JOIN Dept USING (DeptName) -- Anti join with ON clause SELECT * FROM Employee e LEFT ANTI JOIN Dept d ON e.DeptName = d.DeptName
With all of the above options, SQL would be a much more concise language for those cases where we’d like to quickly semi/anti join two relations. In fact, many developers accidentally use INNER JOIN
instead, because INNER JOIN
can implement a SEMI JOIN
when joining a 1:1 or a M:1 relationship. But when they get used to abusing INNER JOIN
, they’ll do so as well for 1:N and M:N relationships, ending up with duplicates and removing those again with DISTINCT
(see item #6 on this list of 10 common SQL mistakes)
Interestingly enough, Cloudera Impala’s SQL dialect supports these JOIN syntaxes:
SELECT select_list FROM table_or_subquery1 [INNER] JOIN table_or_subquery2 | table_or_subquery1 {LEFT [OUTER] | RIGHT [OUTER] | FULL [OUTER]} JOIN table_or_subquery2 | table_or_subquery1 {LEFT | RIGHT} SEMI JOIN table_or_subquery2 | table_or_subquery1 {LEFT | RIGHT} ANTI JOIN table_or_subquery2 | [ ON col1 = col2 [AND col3 = col4 ...] | USING (col1 [, col2 ...]) ] [other_join_clause ...] [ WHERE where_clauses ]
And so will jOOQ 3.7
With jOOQ 3.7, you can now write exactly this useful short form:
Semi join
ctx.select() .from(Employee) .leftSemiJoin(Dept) .on(Employee.DeptName.eq(Dept.DeptName)) .fetch();
Anti join
ctx.select() .from(Employee) .leftAntiJoin(Dept) .on(Employee.DeptName.eq(Dept.DeptName)) .fetch();
jOOQ will make sure that the generated SQL correctly renders an equivalent [ NOT ] EXISTS
predicate, regardless of how many JOIN
expressions you choose to write.
Conclusion
SQL is still a moving target. Many many years after relational algebra has been made usefully accessible to our industry via SQL, however, we still do not have native support for all relational operators. Semi join and anti join are two of them, division is a third.
Cloudera Impala has shown how easy this syntax could be in an actual DBMS. We follow suit and added support as well.
Dear RDBMS vendors: Please add native SEMI JOIN
and ANTI JOIN
to your databases. Thank you.
Nice article and request for vendors :)
IIUC, to do a semi join you don’t really need to introduce a new syntax; short of using the nested SELECT, you could also just:
SELECT Employee.* FROM Employee INNER JOIN Dept ON Employee.dept=Dept.name;
(Not sure if I am relying on a MySQL extension here)
Oh no, inner join and semi join are decidedly not the same thing, although by “coincidence” they may be in some situations, namely when joining a to-one relationship.
According to the Wiki page, the left semi join is the restriction of returned columns of the natural join, in notation: R ⋉ S = πa1,..,an(R ⋈ S). I think @bohay71 is correct.
If you look at it closely, a left semi join projects only the attributes from R, and given that relational algebra is about sets (not multisets), where there are no duplicates, this means that a left semi join does not produce the cartesian product an inner join would produce.
They’re really not the same thing. @bohay71’s claim would have been somewhat more correct if they added DISTINCT to the query, but that would have been risky in the event of the absence of a candidate key in the resulting projection, in case of which (at least in SQL, not in relational algebra), DISTINCT could have removed too many rows from the result.
There is also U-SQL: https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/semijoin-u-sql so we have two :)
It would be really nice if more vendors support these keywords. Especially SQL Server which under the hood uses SEMI/ANTISEMI JOIN operator: https://sqlperformance.com/wp-content/uploads/2018/02/pw-rg2-11.png
Hah, cool. I forgot about U-SQL.
All modern databases (including DB2, Oracle, PostgreSQL, SQL Server) parse EXISTS (or IN) / NOT EXISTS (or in some cases NOT IN) to semi / anti joins rather than running inefficient correlated subqueries. Seems like a very reasonable optimisation to do, given the optimisation potential.
Yes, I know that. What I meant was if DBs internally perform SEMIJOIN they could expose syntax to end users, so we can avoid writing `EXISTS` stuff.
Sure. Sorry, didn’t mean to educate you :) Just put this info for future readers. Yes, I fully agree!