In recent times, there have been a couple of tremendously popular blog posts explaining JOINs using Venn Diagrams. After all, relational algebra and SQL are set oriented theories and languages, so it only makes sense to illustrate set operations like JOINs using Venn Diagrams. Right?

Google seems to say so:

Everyone uses Venn Diagrams to explain JOINs. But that’s…

## PLAIN WRONG!

Venn Diagrams are perfect to illustrate … actual set operations! SQL knows three of them:

- UNION
- INTERSECT
- EXCEPT

And they can be explained as such:

(all of these slides are taken from our Data Geekery SQL Training, do check it out!)

Most of you use `UNION`

occasionally. `INTERSECT`

and `EXCEPT`

are more exotic, but do come in handy every now and then.

The point here is: these set operations operate on sets of elements (tuples), which are all of the *same type*. As in the examples above, all elements are people with first and last names. This is also why `INTERSECT`

and `EXCEPT`

are more exotic, because they’re usually not very useful. `JOIN`

is much more useful. For instance, you want to combine the set of actors with their corresponding set of films.

A `JOIN`

is really a cartesian product (also cross product) with a filter. Here’s a nice illustration of a cartesian product:

## So, what’s a better way to illustrate JOIN operations?

`JOIN`

diagrams! Let’s look at `CROSS JOIN`

first, because all other `JOIN`

types can be derived from `CROSS JOIN`

:

Remember, in a cross join (in SQL also written with a comma separated table list, historically) is just taking every item on the left side, and combines it with every item on the right side. When you `CROSS JOIN`

a table of 3 rows with a table of 4 rows, you will get 3×4=12 result rows. See, I’m using an “x” character to write the multiplication. I.e. a “cross”.

## INNER JOIN

All other joins are still based on cross joins, but with additional filters, and perhaps unions. Here’s an explanation of each individual `JOIN`

type.

In plain text, an `INNER JOIN`

is a `CROSS JOIN`

in which only those combinations are retained which fulfil a given predicate. For instance:

-- "Classic" ANSI JOIN syntax SELECT * FROM author a JOIN book b ON a.author_id = b.author_id -- "Nice" ANSI JOIN syntax SELECT * FROM author a JOIN book b USING (author_id) -- "Old" syntax using a "CROSS JOIN" SELECT * FROM author a, book b WHERE a.author_id = b.author_id

## OUTER JOIN

`OUTER JOIN`

types help where we want to retain those rows from either the `LEFT`

side or the `RIGHT`

or both (`FULL`

) sides, for which there was no matching row where the predicate yielded true.

A `LEFT OUTER JOIN`

in relational algebra is defined as such:

Or more verbosely in SQL:

SELECT * FROM author a LEFT JOIN book b USING (author_id)

This will produce all the authors and their books, but if an author doesn’t have any book, we still want to get the author with NULL as their only book value. So, it’s the same as writing:

SELECT * FROM author a JOIN book b USING (author_id) UNION SELECT a.*, NULL, NULL, NULL, ..., NULL FROM ( SELECT a.* FROM author a EXCEPT SELECT a.* FROM author a JOIN book b USING (author_id) ) a

But no one wants to write that much SQL, so `OUTER JOIN`

was implemented.

## Conclusion: Say NO to Venn Diagrams

`JOIN`

s are relatively easy to understand intuitively. And they’re relatively easy to explain using Venn Diagrams. But whenever you do that, remember, that you’re making a wrong analogy. A `JOIN`

is not strictly a set operation that can be described with Venn Diagrams. A `JOIN`

is always a cross product with a predicate, and possibly a `UNION`

to add additional rows to the `OUTER JOIN`

result.

So, if in doubt, please use `JOIN`

diagrams rather than Venn Diagrams. They’re more accurate and visually more useful.

Psst – about 8% of us guys are colorblind, and those diagrams you suggest don’t work for us. May try adding patterns or symbols.

Oops, thanks for the hint! Haven’t thought of that. And apparently, neither have the creators of Microsoft Powerpoint (where I took the screenshots from) :-/

… will add 1-2-3 / A-B-C letters to the boxes. That’ll resolve all ambiguities. Thanks again for the hint!

Great post, especially that set notation for the outer join :-)

I don’t think though that the labels are helpful. You are really implying a natural join here, and the alphabetic characters matched against numbers tends to negate one of your main points about a join just being a Cartesian products with a selection predicate acting as a “filter”.

How to best do a diagram, right? I added the labels after an interesting comment about accessibility here: https://blog.jooq.org/2016/07/05/say-no-to-venn-diagrams-when-explaining-joins/#comment-145200.

Except the labels don’t accomplish what you’re trying to achieve here. Still only the colors communicate that there is a match criteria. If you want to add labels, then they should be the same cardinality…

Table 1: A, B, C

Table 2: B, C, D

Why? Consider the labels to be “ROWIDs”, then it makes sense. The colours are already sufficient in terms of matching criteria. I could have added actual columns to each row, maybe that would have made it clearer. Then again, I could have displayed actual tables and not used an analogy…

There’s no relationship described between (1,2,3) and (A,B,C). The colors show the relationship, but not if you’re colorblind. You should use some

Thanks, Dave. I had chosen the colors to have very different shades using https://colororacle.org. Apparently, that’s not good enough. I will fix the identifiers to be (A1, B1, C1) and (A2, B2, C2) to make that more clear.

Yes, colours in the second diagram seem to represent subrow values for equality tests of eqiJOIN ON, JOIN USING or NATURAL JOIN but not arbitrary JOINs. Given that limitation, rows would be better illustrated as boxes partitioned into don’t-care non-match column subrows and coloured match column subrows. Preferably with non-match column subrows of different lengths, and match column subrows of the same length, in left & right inputs. As to illustrating arbitrary match conditions, that requires indicating *matches* in CROSS JOIN output rows, not colouring of subrow values. Regardless of whether you illustrate arbitrary JOINs: For the colourblind, something should be 1:1 with colours, and right now it seems to be distance of numerals/letters from 1/A, and it’s not clear why both numerals & letters are used. Using numerals to identify colours & dropping letters, the first diagram’s ABC would be be 453 and the second’s 234. Although the first diagram’s inputs should be the same as the second diagram’s, to illustrate that non-CROSS JOINs are filtered from intermediate CROSS JOINs. Although the inputs should be something like 123 & 233 to also illustrate JOIN involving duplicate rows. Although the inputs should be something like 12N(ULL) & 22N to also illustrate JOIN involving rows with NULL(s) and NULL not matching anything. And probably all-NULL subrows shouldn’t be represented by *nothing*. Lukas: Try writing the *key* and *legend* for your diagrams. Currently the key involves things like, numerals and letters identify (left & right?) row values, and distances of numerals/letters from 1/A identify colours, while the legends limit the illustrations to equi/USING/NATURAL JOINs. Complicated.

How about you? Are you colorblind? :)

The point of using a Venn Diagram is for illustrative purposes. If I’m using a Venn Diagram when I’m showing someone Joins, its probably their first exposure to joins (or nearly first). As soon as I say cartesian product I’ve lost them, and the graphic is Orders of magnitude harder to under stand than the venn diagram. While it may be imperfect, for someone new it is a lot clearer and less likely to push them away or scare them off from SQL.

That inaccuracy causes so much damage later on. I’ll follow up on this post with an explanation why so many people misuse JOIN when they should be using SEMI JOIN. It’s precisely because of the lack of understanding of JOIN being a “fancy cartesian product”.

lukaseder – Sometimes, we can take a middle ground here. Introduce beginners to Cartesian product & show a simple join very quickly. Then, show them joins by using Venn diagrams if that helps them to remember concepts easily. IMO, in this case, it is more important to remember the joins.

Yes, there’s a middle ground. Venn diagrams appeal to intuition, but if you look at the screenshot of the google image search,

everyoneuses Venn diagrams, andno oneuses a more accurate description.But Venn diagrams *do not illustrate join*. You will see this for yourself if you actually try to give a clear *legend* for what the various parts of the diagram denote. See my comment on the main post.

If you want to dig into the results a bit more, I can see where this is helpful. The nice thing about the Venn diagram is the uniformity of the illustration. I don’t have to connect boxes to get a feel for what is included in the result. Interesting thought.

But this way of thinking works for equi join only. An

`INNER JOIN`

, for instance, may still yield a cartesian product, if not joining by a primary key / foreign key relationship. This will then be completely unexpected, when following only the Venn Diagram approach.Interesting, while your diagrams are more accurate, I have found that the Venn diagrams are an huge help when trying to explain joins.

Here is what I use when explaining how joins work, yes Venn diagrams, and it works for people new to SQL to get the concept.

http://stevestedman.com/2015/05/tsql-join-types-poster-version-4-1/

Perhaps, we can just display both to new users…

I also hate using Venn diagrams for explaning SQL joins. The set-bag discrepancy is one of the reasons. The other one is multiple joins. Venn diagrams completely break for anything above 3 sets/tables (2 joins) and may confuse you more than actually explain things.

Yes, indeed. It would be interesting to display an M:N relationship using a join table with the different diagram notations.

The INNER JOIN visual results are all incorrect; each source row appears only once.

Thanks, you’re right. There was a different diagram, originally (https://lukaseder.files.wordpress.com/2016/07/venn-join.png), which was adapted to improve colours and remove the duplicates for simplicity. The text wasn’t adapted, though.

I don’t disagree, but I have one of the Venn diagrams on my desk for the quick reference. I know I can’t treat is as law, but it helps me out when I can’t remember if I want a left or right join.

You probably want a left join. (Almost) No one uses right join.

Disagree, the Venn diagrams are less confusing and more descriptive visually to this visually-oriented thinker, at least. Maybe if you’re a math expert it’s both clearer and preferable to you, but I’m not by any stretch, so I find the Venn diagrams easier to digest in describing joins.

How about using

bothas in this comment on redditI face palm when I see all the replies saying um yeah what you said but Venn diagrams are much easier to understand!… First of all, who are you explaining joins to.. Some developer right? Is it too much to ask that people who are paid professionals building systems for $$$ should understand how these tools exactly work?

I guess it’s because the article was (intentionally) absolute in the way it rejects the status quo and suggests an alternative. A more constructive solution can be seen in this comment on reddit

Nice article, venn diagrams never really made me visualize the result completely, although they helped. But Join diagrams make more sense, always good for a quick reference on syntax.

Wow. Been using SQL for 15 years, and didn’t even know about Except and Intersect!

You made a good point, and are right about JOIN being cartesian product with filters.

But you should also notice that in all of your examples, the join conditions (filters, in your terms) are based on author_id. We are joining heterogeneous tuples based on identifiers of the exactly same type – the foreign key.

And that’s where the Venn diagram comes in.

What it represents is how joins work on two sets of foreign keys, and considering that we almost always want to join over foreign keys (and how exotic unfiltered CROSS JOINs are), it is an acceptable generalization for me.

I think your diagrams are very good but can be improved.

Please add an arrow between the joined “rows” (I guess colored boxes in your diagram)

A two headed arrow for an inner join.

A left headed arrow for a left join.

A right headed arrow for a right join

No arrow for full join.

Not only does this make the diagram clearer it ALSO explains the names.

Interesting idea, thanks for suggesting. I guess it is a matter of style. But how about this: You copy the image, adapt it accordingly, and re-publish it as a blog post with comments on your improvements. This way, we can “beat” Venn diagrams on Google searches :)

Although Venn diagrams are unsuitable for explaining inner join, *one* diagram under a suitable interpretation is useful for comparing inner, left outer, right outer & full outer join: a left and right circle that are the tuples returned from left and right outer joins respectively. See my comments on various questions & answers at http://stackoverflow.com/questions/38549/difference-between-inner-and-outer-joins and http://stackoverflow.com/questions/17759687/cross-join-vs-inner-join-in-sql-server-2008/25957600#25957600 . PS I came here via http://www.dbdebunk.com/2016/07/this-week_10.html#more .

Thanks for your comments, and for the link to dbdebunk.com. Looks like a very nice resource!

PS 1 And of course Venn diagrams are appropriate for UNION/INTERSECT/EXCEPT.

PS 2 And of course Venn diagrams only work for SQL tables that are *sets*. (The way to give SQL semantics–*bag* semantics–for JOINs correctly & clearly (visually or not) is via INNER JOIN output rows as 1:1 with matches of a row from the left input & a row from the right input, and via OUTER JOIN additional output rows as 1:1 with NULL-extended unmatched rows from left/right/both inputs per LEFT/RIGHT/FULL.)

PS 3 The wikipedia entries for Venn diagram & Euler diagram address their use & abuse.

PS TL;DR Those who think that their Venn (which are often not Venn but Euler) diagrams illustrate how a JOIN’s output is based on its inputs (for sets or bags) should (try to) write the *key* by which to read them and mechanically convert between a diagram and its code. Those people will find that their conversions & diagrams are not straightforward and not illustrating.

another different approach to showing why Venn diagrams are weak and showing a different graphic to show what is happening

http://datavirtualizer.com/sql-joins-visualized-in-a-surprising-way

Indeed, thanks for linking. Those arrows do help visualise what’s going on.

True. Set theory = UNION, INTERSECT, EXCEPT. Joins = cartesian and predicate

INTERSECT is useful in GIS, but otherwise this was very insightful- thanks for posting!

Interesting, what would be a nice example for INTERSECT with GIS?

Your statement is a bit limited. It is true the UNION, INTERSECT, EXCEPT operates as set operation on rows, but all these operation can also be viewed as filters applied on the universe set. What is the UNION of A and B: it’s a filter on the universe set with the condition a is IN A plus b is in B but not in A. And as such are not very different from JOINs.

To derive all JOINs, one Cartesian product is not enough. You need the UNION (not perhaps) of the following Cartesian products: (A x {}) U (A x B) U ({} x B) U ({} x {}), where {} is the empty set. Here, the universe set is made of tuples (a,b) where a is in A or is empty and b is in B or is empty.

So the use of the Venn diagram, I think is perfectly legitimate although it is done on different universe sets.

Thanks for chiming in. Would you mind explaining those additional cartesian products. (A x {}) is an empty set. Did you mean to write e.g. (A x {(ω, ω, …, ω)}) to describe a left outer join?

Anyway, indeed Venn diagrams can be used to describe joins on a different level, but most people who do that will not use Venn diagrams this way, they use them in an “intuitive” way, which hides the fact that there is a cartesian product in there somewhere.

In my SQL training, I always ask people to count all the actors that have the same first and last names (e.g. two actors called “John Doe”). A lot of trainees will inner join actor with itself on (a1.first_name, a1.last_name) = (a2.first_name, a2.last_name) and a1.id <> a2.id, not noticing their accidental cartesian product. It accidentally works for 2 John Does, but breaks for 3 or more.

Once the cartesian product is understood, it can be interesting to get back to reasoning about set theory and Venn diagrams with respect to joins. But until then, I think they are part of the confusion around joins with a lot of people who aren’t writing SQL every day.

In ZFC set theory, a 2tuple set where (a,b) is an element of Ax{} is an empty set, but because all the domains of types in SQL include the null value, the free logic set theory must be used where a 2tuple set where (a,b) is an element of Ax{} is not an empty set. So, for example, if a, b and c are the non-null elements of A, A would be define by {null,a,b,c} and the empty set (or empty domain) is define {null} and the Cartesian product of A and {} is {(null,null),(a,null),(b.null),(c.null)}.

I see, thanks for the explanation!

Hi, how would you define a predicate? I am not familiar with this term. I see that it has a general mathematic definition as a boolean function, and it seems in this context that it might mean the boolean function that is true when the specific join key is equal in each row. Is this true?

I see a cross join is a join en masse with nothing held back. It must be, then, that the addition of a predicate function turns the cross join into the other types of joins, because they are more specific and do not join willy nilly?

Thanks for your response and a great article,

Cheers.

In SQL, a predicate is any boolean expression, such as A = B or A IN (1, 2, 3) or EXISTS (SELECT 1 FROM x), etc. I.e. anything you can put in the WHERE or ON clauses.

Yes, the optimiser should be able to turn a cross join with appropriate join predicate in the WHERE clause into an inner join.

YES! Dude, I’ve thought this same thing for years. I absolutely abhor when joins are explained with Venns.

“The point here is: these set operations operate on sets of elements (tuples), which are all of the same type.”

That’s why people usually apply those Venn diagrams to the fields, not to the records which (as you well argued) would make no sense at all.

Be honest, the Venn diagram is way much clear than what you put here, sorry. Your explanation is excellent unless someone understand the math language you are using. Most people specially those think Scala is better than Java or node.js should be the tech stack used in backend….

Sure, I’m aware of this problem. Joins aren’t as simple as Venn diagrams make them look, hence the explanation of actual joins (via cartesian products) is also a bit less simple.

Most people appreciate the Venn diagrams because they help them remember what left and right mean in case of an outer join, and that’s fine. But in my SQL training, I ask delegates to write as simple query, to find the number of actors who have another actor by the same name (e.g. 2 Pierce Brosnan’s). ~40% of delegates will inner join using (first_name, last_name), and inadvertently produce a cartesian product, which leads to a wrong result!

Understanding why this happens is

much moreimportant than merely remembering the silly order of left/right join.Of course, people can get quite far with SQL

withoutunderstanding the underlying relational algebra. But as with all things that are based on some fundamental theory, understanding the theory helps. Ignoring the theory means making the same mistakes over and over again.The main target audience of this article are not just beginners, but also folk who understand SQL, understand relational algebra,

and yet, they resort to using Venn diagrams to explain SQL to others. I just think that’s not helpful.