Tag Archive | relational algebra

# Semi Join and Anti Join Should Have its own Syntax in SQL

Relational algebra nicely describes the various operations that we know in SQL as well from a more abstract, formal perspective. One of the most common relational JOIN operations is the “equi-join” or SQL `INNER JOIN`.

The above example “equi-joins” the `ACTOR`, `FILM_ACTOR`, and `FILM` tables from the Sakila database, in order to produce a new relation consisting of all the actors and all their associated films.

## Relational operators without equivalent SQL syntax

In most cases, SQL is much much more powerful than relational algebra. However, there are three operators in relational algebra, that have no exact representation in SQL, and can only be expressed through “workarounds”. These operators are:

The Wikipedia article on relational algebra nicely explains semi join and anti join visually:

Semi join

As you can see, the semi join relation `Employee ⋉ Dept` only contains attributes from the Employee relation, not from the Dept relation. “Semi” means that we don’t really join the right hand side, we only check if a join would yield results for any given tuple.

In SQL, we would write the same relation using `IN` or `EXISTS`:

```-- IN
SELECT *
FROM Employee
WHERE DeptName IN (
SELECT DeptName
FROM Dept
)

-- EXISTS
SELECT *
FROM Employee
WHERE EXISTS (
SELECT 1
FROM Dept
WHERE Employee.DeptName = Dept.DeptName
)
```

Anti join

As you can see, the anti join relaion `Employee ▷ Dept` only contains attributes from the Employee relation, not from the Dept relation. “Anti” means that we don’t really join the right hand side, we only check if a join would NOT yield results for any given tuple.

In SQL, we would write the same relation using `NOT IN` or `NOT EXISTS` (although, in the case of `NOT IN`, we need to be extra careful with `NULLs`):

```-- NOT IN
SELECT *
FROM Employee
WHERE DeptName NOT IN (
SELECT DeptName
FROM Dept
)

-- NOT EXISTS
SELECT *
FROM Employee
WHERE NOT EXISTS (
SELECT 1
FROM Dept
WHERE Employee.DeptName = Dept.DeptName
)
```

## A better SQL with native SEMI JOIN / ANTI JOIN

While the above `IN / NOT IN` and `EXISTS / NOT EXISTS` predicates are useful, they are not at all as expressive as native `SEMI JOIN` or `ANTI JOIN` support would be. Imagine, we could write the above statements like this, instead:

Semi join

```-- Natural semi join
SELECT *
FROM Employee
NATURAL LEFT SEMI JOIN Dept

-- Semi join with USING clause
SELECT *
FROM Employee
LEFT SEMI JOIN Dept USING (DeptName)

-- Semi join with ON clause
SELECT *
FROM Employee e
LEFT SEMI JOIN Dept d ON e.DeptName = d.DeptName
```

Anti join

```-- Natural semi join
SELECT *
FROM Employee
NATURAL LEFT ANTI JOIN Dept

-- Semi join with USING clause
SELECT *
FROM Employee
LEFT ANTI JOIN Dept USING (DeptName)

-- Semi join with ON clause
SELECT *
FROM Employee e
LEFT ANTI JOIN Dept d ON e.DeptName = d.DeptName
```

With all of the above options, SQL would be a much more concise language for those cases where we’d like to quickly semi/anti join two relations. In fact, many developers accidentally use `INNER JOIN` instead, because `INNER JOIN` can implement a `SEMI JOIN` when joining a 1:1 or a M:1 relationship. But when they get used to abusing `INNER JOIN`, they’ll do so as well for 1:N and M:N relationships, ending up with duplicates and removing those again with `DISTINCT` (see item #6 on this list of 10 common SQL mistakes)

Interestingly enough, Cloudera Impala’s SQL dialect supports these JOIN syntaxes:

```SELECT select_list FROM
table_or_subquery1 [INNER] JOIN table_or_subquery2 |
table_or_subquery1
{LEFT [OUTER] | RIGHT [OUTER] | FULL [OUTER]}
JOIN table_or_subquery2 |
table_or_subquery1
{LEFT | RIGHT} SEMI JOIN table_or_subquery2 |
table_or_subquery1
{LEFT | RIGHT} ANTI JOIN table_or_subquery2 |
[ ON col1 = col2 [AND col3 = col4 ...] |
USING (col1 [, col2 ...]) ]
[other_join_clause ...]
[ WHERE where_clauses ]
```

And so will jOOQ 3.7

With jOOQ 3.7, you can now write exactly this useful short form:

Semi join

```ctx.select()
.from(Employee)
.leftSemiJoin(Dept)
.on(Employee.DeptName.eq(Dept.DeptName))
.fetch();
```

Anti join

```ctx.select()
.from(Employee)
.leftAntiJoin(Dept)
.on(Employee.DeptName.eq(Dept.DeptName))
.fetch();
```

jOOQ will make sure that the generated SQL correctly renders an equivalent `[ NOT ] EXISTS` predicate, regardless of how many `JOIN` expressions you choose to write.

## Conclusion

SQL is still a moving target. Many many years after relational algebra has been made usefully accessible to our industry via SQL, however, we still do not have native support for all relational operators. Semi join and anti join are two of them, division is a third.

Cloudera Impala has shown how easy this syntax could be in an actual DBMS. We follow suit and added support as well.

Dear RDBMS vendors: Please add native `SEMI JOIN` and `ANTI JOIN` to your databases. Thank you.

# There is no Such Thing as Object-Relational Impedance Mismatch

Much of the ORM criticism of the last decade missed the point, being inaccurate. By the end of this article, we will conclude with the following:

There is no significant difference between the relational (data) model and object oriented models

How to come to this conclusion? Read on!

### How we came to believe in this fallacy

Many popular bloggers and opinion leaders have missed no chance to bash ORMs for their “obvious” impedance mismatch with the relational world. N+1, inefficient queries, library complexity, leaky abstractions, all sorts of buzzwords have been employed to dismiss ORMs – often containing a lot of truth, albeit without providing a viable alternative.

### But are these articles really criticising the right thing?

Few of the above articles recognise a central fact, which has been elicited eloquently and humorously by Erik Meijer and Gavin Bierman in his very interesting paper “A co-Relational Model of Data for Large Shared Data Banks“, subtitled:

Contrary to popular belief, SQL and noSQL are really just two sides of the same coin.

Or in other words: The “hierarchical” object world and the “relational” database world model the exact same thing. The only difference is the direction of the arrows that you draw in your diagrams.

Let this sink in.

• In the relational model, children point to their parent.
• In the hierarchical model, parents point to their children.

That’s all there is to it.

### What is an ORM?

ORMs fill the bridge between the two worlds. They’re the inverters of arrows, if you will. They will make sure that every “relation” in your RDBMS can be materialised as an “aggregation” or “composition” in your “hierarchical” world (this works for objects, XML, JSON, and any other format). They make sure that such materialisation is properly transacted. That changes to individual attributes or to relational (aggregational, compositional) attributes are properly tracked and purged back into the master model, the database – where the model is persisted. Individual ORMs differ in terms of offered features and in how much mapping logic they offer in addition to mapping individual entities to individual types.

• Some may focus merely on a 1:1 mapping between these classes and tables

But all ORMs do one very simple thing. Ultimately, they take rows from your tables and materialise them as objects in your class model and vice-versa.

### Tables and classes are the same thing

Give or take 1-2 implementation details, an RDBMS’s table and an OO language’s class is the same thing. A specification of a set of grouped attributes, each with their associated type. Consider the following example, using SQL and Java:

SQL

```CREATE TABLE author (
first_name VARCHAR(50),
last_name VARCHAR(50)
);
```

Java

```class Author {
String firstName;
String lastName;
}
```

There is absolutely no conceptual difference between the two – the mapping is straightforward. The mapping is even straightforward when you consider “relations” / “compositions” between different entities / types:

SQL (let’s leave away constraints for simplicity)

```CREATE TABLE author (
id BIGINT,
first_name VARCHAR(50),
last_name VARCHAR(50)
);

CREATE TABLE book (
id BIGINT,
author_id BIGINT,
title VARCHAR(50),
);
```

Java

```class Author {
Long id;
String firstName;
String lastName;
Set<Book> books;
}

class Book {
Long id;
Author author;
String title;
}
```

The implementation details are omitted (and probably account for half of the criticism). But omitting further details allows for straight-forward 1:1 mapping of individual rows from your database to your Java model, without any surprises. Most ORMs – in the Java ecosystem Hibernate in particular – have managed to implement the above idea very well, hiding away all the technical details of actually doing such a model transfer between the RDBMS and Java.

In other words:

There is absolutely nothing wrong with this mapping approach!

### Yet: There *IS* an impedance mismatch, somewhere

The “problems” that many bloggers criticise arise not from the non-existing mismatch between the two model representations (“relational” vs. “hierarchical”). The problems arise from SQL, which is a decent implementation of relational algebra.

In fact, the very same mismatch that everyone criticises is also present between:

Relational algebra has been defined in order to be able to query relations and to form new ad-hoc relations as an output of such queries. Depending on the operations and transformations that are applied, the resulting tuples may have absolutely nothing to do with the individual entities involved in a query. In other, ORM-y words: The product of relational algebra, and in particular of SQL has no use, as it can no longer be further processed by the ORM, let alone persisted back into the database.

To make things “worse”, SQL today is a large super-set of the features offered by relational algebra. It has gotten much more useful than when it was conceived.

### Why this mismatch still affects modern ORMs

The previous paragraphs outlined the single main reason why ORMs are really criticised, even if such criticism often doesn’t mention this exact reason:

SQL / relational algebra is not really appropriate to partially materialise relations into a client / store changes back into the database. Yet, most RDBMS offer only SQL for that job.

Back to the author / book example. When you want to load and display an author and their books to a web application’s user, you’d like to simply fetch that author and their books, call simple methods like `author.add(book)` as well as `author.remove(book)` and let some magic flush your data back into the storage system.

Thinking about the amount of SQL code to be written for such a simple CRUD task makes everyone squeal.

Life’s too short to spend time on CRUD

Perhaps QUEL might have been a better language for CRUD, but that ship has sailed. And unfortunately, because of SQL being an inappropriate language for this job, you cannot ignore that “magic” but have to know well what happens behind the scenes, e.g. by tweaking Hibernate’s fetching strategies.

Translated to SQL, this may be implemented in several ways:

1. Fetching with JOIN

Using outer joins, all the involved entities can be queried in one go:

```SELECT author.*, book.*
FROM author
LEFT JOIN book ON author.id = book.author_id
WHERE author.id = ?
```

• A single query can be issued and all the data can be transferred at once

• The author attributes are repeated in every tuple. The client (ORM) has to de-duplicate authors first, before populating the author-book relationship. This can be particularly bad when you have many nested relations that should be fetched at once.

2. Fetching with SELECT

A single query is issued for each entity:

```SELECT *
FROM author
WHERE id = ?

SELECT *
FROM book
WHERE author_id = ?
```

• The amount of data to be transferred is minimal: Each row is transferred exactly once.

• The amount of queries that are issued may explode into the well-known N+1 problem.

Hibernate in particular knows other types of fetch strategies, although they are essentially a variant / optimisation of one of the above.

### Why not use SQL MULTISET?

The ideal way to fetch all data in this case using advanced SQL would be by using `MULTISET`:

```SELECT author.*, MULTISET (
SELECT book.*
FROM book
WHERE book.author_id = author.id
) AS books
FROM author
WHERE id = ?
```

The above will essentially create a nested collection for each author:

```first_name  last_name   books (nested collection)
--------------------------------------------------

Leonard     Cohen       title
--------------------------
Book of Mercy
Stranger Music
Book of Longing

Ernest      Hemingway   title
--------------------------
For Whom the Bell Tolls
The Old Man and the Sea
```

If you add another nested entity, it is easy to see how another `MULTISET` could allow for additionally nested data:

```SELECT author.*, MULTISET (
SELECT book.*, MULTISET (
SELECT c.*
FROM language AS t
JOIN book_language AS bl
ON c.id = bc.language_id
AND book.id = bc.book_id
) AS languages
FROM book
WHERE book.author_id = author.id
) AS books
FROM author
WHERE id = ?
```

The outcome would now be along the lines of:

```first_name  last_name   books
-----------------------------------------------------

Leonard     Cohen       title            languages
-----------------------------
Book of Mercy    language
------------
en

Stranger Music   language
------------
en
de

Book of Longing  language
------------
en
fr
es
```

• A single query can materialise all eager-loaded rows with minimal bandwidth usage.

• None.

### Unfortunately, MULTISET is poorly supported by RDBMS.

`MULTISET` (as well as arrays and other collection types) have been introduced formally into the SQL standard as of SQL:2003, as a part of an initiative to embed OO features into the SQL language. Oracle, for instance, has implemented much of it, much like Informix did, or the lesser-known CUBRID (although using vendor-specific syntax).

Other databases like PostgreSQL allow for aggregating nested rows into typed arrays, which works the same way although with a bit more syntactic effort.

`MULTISET` and other ORDBMS SQL features are the perfect compromise, allowing for combining the best of the “relational” model with the best of the “hierarchical” model. Allowing for combining CRUD operations with querying in one go, removing the need for sophisticated ORMs, as the SQL language can be used directly to map all your data from your (relational) database to your (hierarchical) client representation with no friction.

### Conclusion and call to action!

We’re living through exciting times in our industry. The elephant (SQL) in the room is still here, learning new tricks all the time. The relational model has served us well, and has been enriched with hierarchical models in various implementations. Functional programming is gaining traction, complementing object orientation in very useful ways.

Think of the glue, putting all these great technological concepts together, allowing for:

• Storing data in the relational model
• Materialising data in the hierarchical model
• Processing data using functional programming

That awesome combination of techniques is hard to beat – we’ve shown how SQL and functional programming can work with jOOQ. All that’s missing – in our opinion – is better support for `MULTISET` and other ORDBMS features from RDBMS vendors.

Thus, we urge you, PostgreSQL developers: You’re creating one of the most innovative databases out there. Oracle is ahead of you in this area – but their implementation is too strongly tied to PL/SQL, which makes it clumsy. Yet, you’re missing out on one of the most awesome SQL feature sets. The ability to construct nested collections (not just arrays), and to query them efficiently. If you lead the way, other RDBMS will follow.

And we can finally stop wasting time talking about the object-relational impedance non-mismatch.

# SQL and NoSQL are Really Just Two Sides of the Same Coin

In a recent debate about NoSQL vs. SQL on Hackernews, I was made aware of a quite amusing paper by Erik Meijer and Gavin Bierman. Remember, Erik Meijer has brought LINQ to the .NET universe, a formidable unified query DSL whose main purpose was to unify typesafe querying against XML, SQL, and object-oriented data structures.

It is important to note that LINQ surfaced shortly before the NoSQL hype really caught on, so NoSQL data structures (e.g. key / value stores, document stores, graph databases) were not yet in scope for LINQ, and LINQ providers might have had to tweak the odd implementation detail to fully match the LINQ API.

What Erik Meijer and Gavin Bierman are claiming in their article (which was also discussed intensively on Hackernews) is the fact that SQL and NoSQL are duals of each other, i.e. two sides of the same coin. In their quest to unify all query languages, the LINQ people would obviously love to simplify things to such a level. To us, who are focusing on SQL only via jOOQ, this seems more like a plea for LINQ than anything else. It would be all too nice if things were as easy as a simple duality, specifically given the fact that Erik Meijer has now also created a company called Applied Duality Inc… What we found very interesting, though, is Erik’s and Gavin’s hint about the relational model and other models inversing arrows between relationships. When child records point to parent records in the relational model, parent objects point to child objects in object-oriented design, or in XML.

But history will teach us where these things go. I currently don’t see a second E.F. Codd to solve the complexity introduced with the new abundance of NoSQL data stores – yet. But maybe, we’ll eventually remember Erik Meijer as the new E.F. Codd, bringing NoSQL to full circle.

# Advanced SQL: Relational division in jOOQ

Relational algebra has its treats. One of the most academic features is the relational division. It is hardly ever used, but comes in handy every now and then. And when you need it, you’ll probably hate yourself for having slept during the relevant classes at the university.

### What is relational division?

Relational division is the inverse of a cross join operation. The following is an approximate definition of a relational division:

```Assume the following cross join / cartesian product
C = A × B

Then it can be said that
A = C ÷ B
B = C ÷ A```

### What does it mean, typically?

Let’s have a look at the sample provided on Wikipedia:

Wikipedia example of a relational division

This looks sensible. The division of Completed ÷ DBProject leads to a list of students that have completed all projects.

### Now how to phrase that in SQL??

That’s not so simple as it looks. The most commonly documented solution involves a doubly-nested select statement using anti-joins. In human language (using double negative), it is like Fred and Sarah saying “there is no DBProject that we have not Completed“. Or in SQL:

```SELECT DISTINCT "c1".Student FROM Completed "c1"
WHERE NOT EXISTS (
SELECT 1 FROM DBProject
WHERE NOT EXISTS (
SELECT 1 FROM Completed "c2"
WHERE "c2".Student = "c1".Student
)
)
```

Now, no one sane wants to remember this just for the 1-2 times in a SQL developer’s life that they actually need it. So they use jOOQ, which wraps up the above monster in a concise syntax:

```create.select().from(
Completed.divideBy(DBProject)