Semi Join and Anti Join Should Have Their Own Syntax in SQL

Relational algebra describes the operations we know from SQL from a more abstract, formal perspective. One of the most common relational JOIN operations is the “equi-join”, or SQL INNER JOIN.

[Figure: equi-join example]

The above example “equi-joins” the ACTOR, FILM_ACTOR, and FILM tables from the Sakila database, in order to produce a new relation consisting of all the actors and all their associated films.

Relational operators without equivalent SQL syntax

In most cases, SQL is much more powerful than relational algebra. However, three operators in relational algebra have no exact representation in SQL and can only be expressed through “workarounds”. These operators are semi join, anti join, and division.

We’ll be looking only at the first two in this article.

The Wikipedia article on relational algebra nicely explains semi join and anti join visually:

Semi join

[Figure: semi join example from Wikipedia]

As you can see, the semi join relation Employee ⋉ Dept only contains attributes from the Employee relation, not from the Dept relation. “Semi” means that we don’t really join the right-hand side; we only check whether a join would yield results for any given tuple.

In SQL, we would write the same relation using IN or EXISTS:

-- IN
SELECT *
FROM Employee
WHERE DeptName IN (
  SELECT DeptName
  FROM Dept
)

-- EXISTS
SELECT *
FROM Employee
WHERE EXISTS (
  SELECT 1
  FROM Dept
  WHERE Employee.DeptName = Dept.DeptName
)
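To see both forms side by side, here is a minimal sketch that runs them against an in-memory SQLite database. The table contents are hypothetical sample rows loosely modelled on the Wikipedia illustration, not data from this article.

```python
import sqlite3

# Hypothetical sample data (an assumption, loosely modelled on Wikipedia's example).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (Name TEXT, EmpId INTEGER, DeptName TEXT);
    CREATE TABLE Dept (DeptName TEXT, Manager TEXT);
    INSERT INTO Employee VALUES
        ('Harry', 3415, 'Finance'), ('Sally', 2241, 'Sales'),
        ('George', 3401, 'Finance'), ('Harriet', 2202, 'Production');
    INSERT INTO Dept VALUES ('Sales', 'Harriet'), ('Production', 'Charles');
""")

# Semi join via IN: keep employees whose department appears in Dept.
semi_in = conn.execute("""
    SELECT Name FROM Employee
    WHERE DeptName IN (SELECT DeptName FROM Dept)
    ORDER BY Name
""").fetchall()

# Semi join via EXISTS: same condition, written as a correlated subquery.
semi_exists = conn.execute("""
    SELECT Name FROM Employee
    WHERE EXISTS (
        SELECT 1 FROM Dept WHERE Employee.DeptName = Dept.DeptName
    )
    ORDER BY Name
""").fetchall()

print(semi_in)
print(semi_exists)
```

Both queries return only Employee columns, and each matching employee appears exactly once, no matter how many Dept rows match.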

Anti join

[Figure: anti join example from Wikipedia]

As you can see, the anti join relation Employee ▷ Dept only contains attributes from the Employee relation, not from the Dept relation. “Anti” means that we don’t really join the right-hand side; we only check whether a join would NOT yield results for any given tuple.

In SQL, we would write the same relation using NOT IN or NOT EXISTS (although, in the case of NOT IN, we need to be extra careful with NULLs):

-- NOT IN
SELECT *
FROM Employee
WHERE DeptName NOT IN (
  SELECT DeptName
  FROM Dept
)

-- NOT EXISTS
SELECT *
FROM Employee
WHERE NOT EXISTS (
  SELECT 1
  FROM Dept
  WHERE Employee.DeptName = Dept.DeptName
)
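The NULL caveat is easy to underestimate, so here is a small sketch of it using an in-memory SQLite database. The rows are hypothetical; the only thing that matters is that the subquery returns one NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (Name TEXT, DeptName TEXT);
    CREATE TABLE Dept (DeptName TEXT);
    INSERT INTO Employee VALUES ('Harry', 'Finance'), ('Sally', 'Sales');
    -- Hypothetical data: one department row has no name.
    INSERT INTO Dept VALUES ('Sales'), (NULL);
""")

# 'Finance' NOT IN ('Sales', NULL) evaluates to NULL, not TRUE,
# so the WHERE clause rejects Harry as well.
not_in = conn.execute("""
    SELECT Name FROM Employee
    WHERE DeptName NOT IN (SELECT DeptName FROM Dept)
""").fetchall()

# NOT EXISTS only asks whether a matching row exists, so the NULL row
# simply never matches and Harry is returned.
not_exists = conn.execute("""
    SELECT Name FROM Employee
    WHERE NOT EXISTS (
        SELECT 1 FROM Dept WHERE Employee.DeptName = Dept.DeptName
    )
""").fetchall()

print(not_in)
print(not_exists)
```

With a single NULL in the subquery result, NOT IN returns no rows at all, while NOT EXISTS still behaves like an anti join.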

A better SQL with native SEMI JOIN / ANTI JOIN

While the above IN / NOT IN and EXISTS / NOT EXISTS predicates are useful, they are not nearly as expressive as native SEMI JOIN or ANTI JOIN support would be. Imagine if we could write the above statements like this instead:

Semi join

-- Natural semi join
SELECT *
FROM Employee
NATURAL LEFT SEMI JOIN Dept

-- Semi join with USING clause
SELECT *
FROM Employee
LEFT SEMI JOIN Dept USING (DeptName)

-- Semi join with ON clause
SELECT *
FROM Employee e
LEFT SEMI JOIN Dept d ON e.DeptName = d.DeptName

Anti join

-- Natural anti join
SELECT *
FROM Employee
NATURAL LEFT ANTI JOIN Dept

-- Anti join with USING clause
SELECT *
FROM Employee
LEFT ANTI JOIN Dept USING (DeptName)

-- Anti join with ON clause
SELECT *
FROM Employee e
LEFT ANTI JOIN Dept d ON e.DeptName = d.DeptName

With all of the above options, SQL would be a much more concise language for those cases where we’d like to quickly semi/anti join two relations. In fact, many developers accidentally use INNER JOIN instead, because INNER JOIN can implement a SEMI JOIN when joining a 1:1 or an M:1 relationship. But once they get used to abusing INNER JOIN, they’ll do the same for 1:N and M:N relationships, ending up with duplicates and removing those again with DISTINCT (see item #6 on this list of 10 common SQL mistakes).
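The duplicate problem can be sketched in a few lines. The tables below borrow the Sakila names actor and film_actor, but the rows are made up for illustration, and SQLite stands in for whatever database you use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor (actor_id INTEGER, last_name TEXT);
    CREATE TABLE film_actor (actor_id INTEGER, film_id INTEGER);
    -- Hypothetical rows: actor 1 has three films, actor 2 has one, actor 3 none.
    INSERT INTO actor VALUES (1, 'GUINESS'), (2, 'WAHLBERG'), (3, 'CHASE');
    INSERT INTO film_actor VALUES (1, 10), (1, 20), (1, 30), (2, 10);
""")

# INNER JOIN over a 1:N relationship: one output row per matching film.
inner_join = conn.execute("""
    SELECT a.last_name FROM actor a
    JOIN film_actor fa ON a.actor_id = fa.actor_id
    ORDER BY a.actor_id
""").fetchall()

# Semi join via EXISTS: one output row per actor with at least one film.
semi_join = conn.execute("""
    SELECT a.last_name FROM actor a
    WHERE EXISTS (
        SELECT 1 FROM film_actor fa WHERE fa.actor_id = a.actor_id
    )
    ORDER BY a.actor_id
""").fetchall()

print(inner_join)
print(semi_join)
```

The INNER JOIN repeats GUINESS once per film, which is what tempts people to add DISTINCT; the EXISTS version never produces the duplicates in the first place.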

Interestingly enough, Cloudera Impala’s SQL dialect supports these JOIN syntaxes:

SELECT select_list FROM
  table_or_subquery1 [INNER] JOIN table_or_subquery2 |
  table_or_subquery1 
    {LEFT [OUTER] | RIGHT [OUTER] | FULL [OUTER]} 
      JOIN table_or_subquery2 |
  table_or_subquery1 
    {LEFT | RIGHT} SEMI JOIN table_or_subquery2 |
  table_or_subquery1 
    {LEFT | RIGHT} ANTI JOIN table_or_subquery2 |
    [ ON col1 = col2 [AND col3 = col4 ...] |
      USING (col1 [, col2 ...]) ]
  [other_join_clause ...]
[ WHERE where_clauses ]

And so will jOOQ 3.7

With jOOQ 3.7, you can now write exactly this useful short form:

Semi join

ctx.select()
   .from(Employee)
   .leftSemiJoin(Dept)
   .on(Employee.DeptName.eq(Dept.DeptName))
   .fetch();

Anti join

ctx.select()
   .from(Employee)
   .leftAntiJoin(Dept)
   .on(Employee.DeptName.eq(Dept.DeptName))
   .fetch();

jOOQ will make sure that the generated SQL correctly renders an equivalent [ NOT ] EXISTS predicate, regardless of how many JOIN expressions you choose to write.

Conclusion

SQL is still a moving target. Many years after relational algebra was made usefully accessible to our industry via SQL, we still do not have native support for all relational operators. Semi join and anti join are two of them; division is a third.

Cloudera Impala has shown how easy this syntax could be in an actual DBMS. We have followed suit and added support in jOOQ as well.

Dear RDBMS vendors: Please add native SEMI JOIN and ANTI JOIN to your databases. Thank you.

NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL

When you’re spoiled with Oracle’s fabulous query transformation capabilities and its really well-done cost-based optimiser, then you might forget how difficult SQL query tuning used to be in the “old days” or with those less sophisticated databases. Here’s a really nice explanation of the various means of implementing an ANTI-JOIN in MySQL:

http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/
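For readers who haven’t seen the third variant, here is a rough sketch of all three anti join spellings. It runs on an in-memory SQLite database with hypothetical rows rather than on MySQL, so it only shows that the results agree (in the absence of NULLs); it says nothing about how MySQL executes each form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (Name TEXT, DeptName TEXT);
    CREATE TABLE Dept (DeptName TEXT);
    -- Hypothetical sample rows without NULLs.
    INSERT INTO Employee VALUES
        ('Harry', 'Finance'), ('Sally', 'Sales'),
        ('George', 'Finance'), ('Harriet', 'Production');
    INSERT INTO Dept VALUES ('Sales'), ('Production');
""")

queries = {
    "NOT IN": """
        SELECT Name FROM Employee
        WHERE DeptName NOT IN (SELECT DeptName FROM Dept)
    """,
    "NOT EXISTS": """
        SELECT Name FROM Employee e
        WHERE NOT EXISTS (
            SELECT 1 FROM Dept d WHERE d.DeptName = e.DeptName
        )
    """,
    # Outer join, then keep only the rows that found no partner.
    "LEFT JOIN / IS NULL": """
        SELECT e.Name FROM Employee e
        LEFT JOIN Dept d ON d.DeptName = e.DeptName
        WHERE d.DeptName IS NULL
    """,
}

results = {label: sorted(conn.execute(sql).fetchall())
           for label, sql in queries.items()}

for label, rows in results.items():
    print(label, rows)
```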

Advanced SQL: Relational division in jOOQ

Relational algebra has its treats. One of the most academic features is the relational division. It is hardly ever used, but comes in handy every now and then. And when you need it, you’ll probably hate yourself for having slept during the relevant classes at the university.

What is relational division?

Relational division is the inverse of a cross join operation. The following is an approximate definition of a relational division:

Assume the following cross join / cartesian product 
C = A × B 

Then it can be said that 
A = C ÷ B 
B = C ÷ A
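The approximate definition above can be checked with plain Python sets. This is only a sketch: divide is a hypothetical helper written for this illustration, and it treats a relation as a set of (left, right) pairs.

```python
from itertools import product


def divide(c, b):
    """Return every left value that is paired in c with all values of b."""
    candidates = {left for left, _ in c}
    return {left for left in candidates
            if all((left, right) in c for right in b)}


# Hypothetical relations with a single attribute each.
A = {"Fred", "Sarah"}
B = {"Database1", "Database2"}

# C = A x B
C = set(product(A, B))

# A = C / B
quotient_by_b = divide(C, B)

# B = C / A: divide() matches on the right component, so flip the pairs first.
quotient_by_a = divide({(right, left) for left, right in C}, A)

print(quotient_by_b)
print(quotient_by_a)
```

Dividing the cross product by one factor gives back the other factor, which is the sense in which division is the inverse of a cross join.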

What does it mean, typically?

Let’s have a look at the sample provided on Wikipedia:

[Figure: Wikipedia example of a relational division]

This looks sensible. Dividing Completed by DBProject yields the list of students who have completed all projects.

Now how to phrase that in SQL?

That’s not as simple as it looks. The most commonly documented solution involves a doubly-nested select statement using anti-joins. In human language (using a double negative), it is like Fred and Sarah saying “there is no DBProject that we have not Completed”. Or in SQL:

SELECT DISTINCT "c1".Student FROM Completed "c1"
WHERE NOT EXISTS (
  SELECT 1 FROM DBProject
  WHERE NOT EXISTS (
    SELECT 1 FROM Completed "c2"
    WHERE "c2".Student = "c1".Student
    AND "c2".Task = DBProject.Task
  )
)
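To convince ourselves that the double NOT EXISTS really computes the division, we can run it against an in-memory SQLite database. The rows below are an assumption, modelled on the Wikipedia sample in which Fred and Sarah have completed both database tasks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Completed (Student TEXT, Task TEXT);
    CREATE TABLE DBProject (Task TEXT);
    -- Assumed sample rows, modelled on the Wikipedia example.
    INSERT INTO Completed VALUES
        ('Fred', 'Database1'), ('Fred', 'Database2'), ('Fred', 'Compiler1'),
        ('Eugene', 'Database1'), ('Eugene', 'Compiler1'),
        ('Sarah', 'Database1'), ('Sarah', 'Database2');
    INSERT INTO DBProject VALUES ('Database1'), ('Database2');
""")

# "There is no DBProject task that this student has not completed."
students = conn.execute('''
    SELECT DISTINCT "c1".Student FROM Completed "c1"
    WHERE NOT EXISTS (
      SELECT 1 FROM DBProject
      WHERE NOT EXISTS (
        SELECT 1 FROM Completed "c2"
        WHERE "c2".Student = "c1".Student
        AND "c2".Task = DBProject.Task
      )
    )
    ORDER BY 1
''').fetchall()

print(students)
```

Eugene drops out because he is missing Database2, so the inner NOT EXISTS finds a task he has not completed.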

Now, no one sane wants to remember this just for the 1-2 times in a SQL developer’s life that they actually need it. So they use jOOQ, which wraps up the above monster in a concise syntax:

create.select().from(
  Completed.divideBy(DBProject)
           .on(Completed.Task.equal(DBProject.Task))
           .returning(Completed.Student)
);

Note that from the above SQL statement, it is immediately clear that proper indexing is of the essence. Be sure to have indexes on all columns referenced from the on(…) and returning(…) clauses.

More information

For more information about relational division and some nice, real-life examples, see