Say NO to Venn Diagrams When Explaining JOINs

In recent times, there have been a couple of tremendously popular blog posts explaining JOINs using Venn Diagrams. After all, relational algebra and SQL are set-oriented theories and languages, so it only makes sense to illustrate set operations like JOINs using Venn Diagrams. Right?

Google seems to say so:

venn-google

Everyone uses Venn Diagrams to explain JOINs. But that’s…

PLAIN WRONG!

Venn Diagrams are perfect to illustrate … actual set operations! SQL knows three of them:

  • UNION
  • INTERSECT
  • EXCEPT

And they can be explained as such:

venn-union
venn-intersection
venn-difference

(all of these slides are taken from our Data Geekery SQL Training, do check it out!)

Most of you use UNION occasionally. INTERSECT and EXCEPT are more exotic, but do come in handy every now and then.
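The three set operations can be tried out on two small tables of people. The following is only a sketch: it runs the SQL through Python's built-in sqlite3 module, and the customer and staff tables with their rows are made up for illustration.

```python
import sqlite3

# Hypothetical tables: both hold the same type of tuple (first_name, last_name).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (first_name TEXT, last_name TEXT);
    CREATE TABLE staff (first_name TEXT, last_name TEXT);
    INSERT INTO customer VALUES ('Ann', 'Lee'), ('Bob', 'Ray');
    INSERT INTO staff VALUES ('Bob', 'Ray'), ('Cy', 'Poe');
""")

def run(set_operator):
    """Combine customer and staff with UNION, INTERSECT or EXCEPT."""
    sql = f"""
        SELECT first_name, last_name FROM customer
        {set_operator}
        SELECT first_name, last_name FROM staff
        ORDER BY first_name
    """
    return conn.execute(sql).fetchall()

union_rows = run("UNION")          # people in either table, without duplicates
intersect_rows = run("INTERSECT")  # people in both tables
except_rows = run("EXCEPT")        # customers who are not staff

print(union_rows)
print(intersect_rows)
print(except_rows)
```

Every result is again a set of (first_name, last_name) tuples, which is exactly the situation a Venn Diagram describes.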

The point here is: these set operations operate on sets of elements (tuples), which are all of the same type. As in the examples above, all elements are people with first and last names. This is also why INTERSECT and EXCEPT are more exotic: combining two sets of the same type is rarely what you need. JOIN is much more useful, because it combines elements of different types. For instance, you may want to combine the set of actors with their corresponding set of films.

A JOIN is really a cartesian product (also cross product) with a filter. Here’s a nice illustration of a cartesian product:

venn-cross-product

So, what’s a better way to illustrate JOIN operations?

JOIN diagrams! Let’s look at CROSS JOIN first, because all other JOIN types can be derived from CROSS JOIN:

venn-cross-join

Remember, a cross join (in SQL historically also written as a comma-separated table list) just takes every item on the left side and combines it with every item on the right side. When you CROSS JOIN a table of 3 rows with a table of 4 rows, you will get 3×4=12 result rows. See, I’m using an “x” character to write the multiplication. I.e. a “cross”.
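The 3×4=12 arithmetic is easy to check. A minimal sketch, again using Python's sqlite3 module; the digits and letters tables are hypothetical stand-ins for the left and right side of the diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE digits (n INTEGER);   -- left side: 3 rows
    CREATE TABLE letters (c TEXT);     -- right side: 4 rows
    INSERT INTO digits VALUES (1), (2), (3);
    INSERT INTO letters VALUES ('A'), ('B'), ('C'), ('D');
""")

# Explicit CROSS JOIN syntax
explicit = conn.execute(
    "SELECT n, c FROM digits CROSS JOIN letters ORDER BY n, c"
).fetchall()

# Historic comma-separated table list, same result
comma = conn.execute(
    "SELECT n, c FROM digits, letters ORDER BY n, c"
).fetchall()

print(len(explicit))      # number of combinations
print(explicit == comma)  # both syntaxes produce the same rows
```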

INNER JOIN

All other joins are still based on cross joins, but with additional filters, and perhaps unions. Here’s an explanation of each individual JOIN type.

venn-join

In plain text, an INNER JOIN is a CROSS JOIN in which only those combinations are retained which fulfil a given predicate. For instance:

-- "Classic" ANSI JOIN syntax
SELECT *
FROM author a
JOIN book b ON a.author_id = b.author_id

-- "Nice" ANSI JOIN syntax
SELECT *
FROM author a
JOIN book b USING (author_id)

-- "Old" syntax using a "CROSS JOIN"
SELECT *
FROM author a, book b
WHERE a.author_id = b.author_id
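To see that the three syntaxes really describe the same filtered cross product, they can be run side by side. This is a sketch using Python's sqlite3 module; the author and book rows are invented, and explicit columns are selected instead of * so that the results are directly comparable (with USING, the join column appears only once under SELECT *).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (book_id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    -- Made-up sample data: author 3 has no book
    INSERT INTO author VALUES (1, 'Author A'), (2, 'Author B'), (3, 'Author C');
    INSERT INTO book VALUES (10, 1, 'Book X'), (11, 1, 'Book Y'), (12, 2, 'Book Z');
""")

def rows(from_clause):
    sql = f"SELECT a.name, b.title {from_clause} ORDER BY b.book_id"
    return conn.execute(sql).fetchall()

on_rows = rows("FROM author a JOIN book b ON a.author_id = b.author_id")
using_rows = rows("FROM author a JOIN book b USING (author_id)")
where_rows = rows("FROM author a, book b WHERE a.author_id = b.author_id")

# The unfiltered cross product, for comparison: 3 authors x 3 books
cross_count = conn.execute("SELECT count(*) FROM author a, book b").fetchone()[0]

print(on_rows)
print(on_rows == using_rows == where_rows)
print(cross_count, len(on_rows))
```

Of the 9 combinations in the cross product, only the 3 that fulfil the predicate are retained.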

OUTER JOIN

OUTER JOIN types help where we want to retain rows from the LEFT side, the RIGHT side, or both (FULL) sides, even when no row on the other side made the predicate true.

A LEFT OUTER JOIN in relational algebra is defined as such:

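$$R \mathbin{⟕} S = (R \bowtie S) \cup \big( (R - \pi_{r_1, r_2, \ldots, r_n}(R \bowtie S)) \times \{(\omega, \ldots, \omega)\} \big)$$

(where r_1, …, r_n are the attributes of R, and ω stands for NULL)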

Or more verbosely in SQL:

SELECT *
FROM author a
LEFT JOIN book b USING (author_id)

This will produce all the authors and their books, but if an author doesn’t have any book, we still want to get the author with NULL as their only book value. So, it’s the same as writing:

SELECT *
FROM author a
JOIN book b USING (author_id)

UNION

SELECT a.*, NULL, NULL, NULL, ..., NULL
FROM (
  SELECT a.*
  FROM author a
  
  EXCEPT
  
  SELECT a.*
  FROM author a
  JOIN book b USING (author_id)
) a

But no one wants to write that much SQL, so OUTER JOIN was implemented.
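The equivalence can be checked on a small data set. The sketch below uses Python's sqlite3 module with invented author and book rows, and it spells out the columns instead of using * and the "NULL, ..., NULL" list. Note that UNION removes duplicates, so the two queries only agree as long as the joined rows are distinct, which they are here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (book_id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    -- Made-up sample data: author 3 has no book
    INSERT INTO author VALUES (1, 'Author A'), (2, 'Author B'), (3, 'Author C');
    INSERT INTO book VALUES (10, 1, 'Book X'), (11, 1, 'Book Y'), (12, 2, 'Book Z');
""")

left_join = conn.execute("""
    SELECT a.author_id, a.name, b.title
    FROM author a
    LEFT JOIN book b USING (author_id)
""").fetchall()

emulated = conn.execute("""
    -- Part 1: the plain INNER JOIN
    SELECT a.author_id, a.name, b.title
    FROM author a
    JOIN book b USING (author_id)

    UNION

    -- Part 2: authors without a match, padded with NULL
    SELECT a.author_id, a.name, NULL
    FROM (
      SELECT author_id, name
      FROM author

      EXCEPT

      SELECT a.author_id, a.name
      FROM author a
      JOIN book b USING (author_id)
    ) a
""").fetchall()

print(set(left_join) == set(emulated))
print((3, "Author C", None) in left_join)
```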

Conclusion: Say NO to Venn Diagrams

JOINs are relatively easy to understand intuitively. And they’re relatively easy to explain using Venn Diagrams. But whenever you do that, remember that you’re making a wrong analogy. A JOIN is not strictly a set operation that can be described with Venn Diagrams. A JOIN is always a cross product with a predicate, and possibly a UNION to add additional rows to the OUTER JOIN result.

So, if in doubt, please use JOIN diagrams rather than Venn Diagrams. They’re more accurate and visually more useful.

venn-google-say-no

(Remember, all of these slides are taken from our Data Geekery SQL Training, do get in touch, if you’re interested)

58 thoughts on “Say NO to Venn Diagrams When Explaining JOINs”

  1. Psst – about 8% of us guys are colorblind, and those diagrams you suggest don’t work for us. May try adding patterns or symbols.

    1. Oops, thanks for the hint! Haven’t thought of that. And apparently, neither have the creators of Microsoft Powerpoint (where I took the screenshots from) :-/

      1. Great post, especially that set notation for the outer join :-)

        I don’t think though that the labels are helpful. You are really implying a natural join here, and the alphabetic characters matched against numbers tend to negate one of your main points about a join just being a Cartesian product with a selection predicate acting as a “filter”.

          1. Except the labels don’t accomplish what you’re trying to achieve here. Still only the colors communicate that there is a match criteria. If you want to add labels, then they should be the same cardinality…

            Table 1: A, B, C
            Table 2: B, C, D

            1. Why? Consider the labels to be “ROWIDs”, then it makes sense. The colours are already sufficient in terms of matching criteria. I could have added actual columns to each row, maybe that would have made it clearer. Then again, I could have displayed actual tables and not used an analogy…

              1. There’s no relationship described between (1,2,3) and (A,B,C). The colors show the relationship, but not if you’re colorblind. You should use some

        1. Yes, colours in the second diagram seem to represent subrow values for equality tests of equi-JOIN ON, JOIN USING or NATURAL JOIN but not arbitrary JOINs. Given that limitation, rows would be better illustrated as boxes partitioned into don’t-care non-match column subrows and coloured match column subrows. Preferably with non-match column subrows of different lengths, and match column subrows of the same length, in left & right inputs. As to illustrating arbitrary match conditions, that requires indicating *matches* in CROSS JOIN output rows, not colouring of subrow values.

          Regardless of whether you illustrate arbitrary JOINs: For the colourblind, something should be 1:1 with colours, and right now it seems to be distance of numerals/letters from 1/A, and it’s not clear why both numerals & letters are used. Using numerals to identify colours & dropping letters, the first diagram’s ABC would be 453 and the second’s 234.

          Although the first diagram’s inputs should be the same as the second diagram’s, to illustrate that non-CROSS JOINs are filtered from intermediate CROSS JOINs. Although the inputs should be something like 123 & 233 to also illustrate JOIN involving duplicate rows. Although the inputs should be something like 12N(ULL) & 22N to also illustrate JOIN involving rows with NULL(s) and NULL not matching anything. And probably all-NULL subrows shouldn’t be represented by *nothing*.

          Lukas: Try writing the *key* and *legend* for your diagrams. Currently the key involves things like, numerals and letters identify (left & right?) row values, and distances of numerals/letters from 1/A identify colours, while the legends limit the illustrations to equi/USING/NATURAL JOINs. Complicated.

  2. The point of using a Venn Diagram is for illustrative purposes. If I’m using a Venn Diagram when I’m showing someone Joins, it’s probably their first exposure to joins (or nearly first). As soon as I say cartesian product I’ve lost them, and the graphic is orders of magnitude harder to understand than the Venn diagram. While it may be imperfect, for someone new it is a lot clearer and less likely to push them away or scare them off from SQL.

    1. That inaccuracy causes so much damage later on. I’ll follow up on this post with an explanation why so many people misuse JOIN when they should be using SEMI JOIN. It’s precisely because of the lack of understanding of JOIN being a “fancy cartesian product”.

      1. lukaseder – Sometimes, we can take a middle ground here. Introduce beginners to Cartesian product & show a simple join very quickly. Then, show them joins by using Venn diagrams if that helps them to remember concepts easily. IMO, in this case, it is more important to remember the joins.

        1. Yes, there’s a middle ground. Venn diagrams appeal to intuition, but if you look at the screenshot of the google image search, everyone uses Venn diagrams, and no one uses a more accurate description.

    2. But Venn diagrams *do not illustrate join*. You will see this for yourself if you actually try to give a clear *legend* for what the various parts of the diagram denote. See my comment on the main post.

  3. If you want to dig into the results a bit more, I can see where this is helpful. The nice thing about the Venn diagram is the uniformity of the illustration. I don’t have to connect boxes to get a feel for what is included in the result. Interesting thought.

    1. But this way of thinking works for equi join only. An INNER JOIN, for instance, may still yield a cartesian product, if not joining by a primary key / foreign key relationship. This will then be completely unexpected, when following only the Venn Diagram approach.

  4. I also hate using Venn diagrams for explaining SQL joins. The set-bag discrepancy is one of the reasons. The other one is multiple joins. Venn diagrams completely break for anything above 3 sets/tables (2 joins) and may confuse you more than actually explain things.

    1. Yes, indeed. It would be interesting to display an M:N relationship using a join table with the different diagram notations.

  5. I don’t disagree, but I have one of the Venn diagrams on my desk for quick reference. I know I can’t treat it as law, but it helps me out when I can’t remember if I want a left or right join.

  6. Disagree, the Venn diagrams are less confusing and more descriptive visually to this visually-oriented thinker, at least. Maybe if you’re a math expert it’s both clearer and preferable to you, but I’m not by any stretch, so I find the Venn diagrams easier to digest in describing joins.

  7. I face palm when I see all the replies saying um yeah what you said but Venn diagrams are much easier to understand!… First of all, who are you explaining joins to.. Some developer right? Is it too much to ask that people who are paid professionals building systems for $$$ should understand how these tools exactly work?

  8. Nice article, venn diagrams never really made me visualize the result completely, although they helped. But Join diagrams make more sense, always good for a quick reference on syntax.

  9. You made a good point, and are right about JOIN being cartesian product with filters.

    But you should also notice that in all of your examples, the join conditions (filters, in your terms) are based on author_id. We are joining heterogeneous tuples based on identifiers of the exactly same type – the foreign key.

    And that’s where the Venn diagram comes in.

    What it represents is how joins work on two sets of foreign keys, and considering that we almost always want to join over foreign keys (and how exotic unfiltered CROSS JOINs are), it is an acceptable generalization for me.

  10. I think your diagrams are very good but can be improved.

    Please add an arrow between the joined “rows” (I guess colored boxes in your diagram)

    A two headed arrow for an inner join.

    A left headed arrow for a left join.

    A right headed arrow for a right join

    No arrow for full join.

    Not only does this make the diagram clearer it ALSO explains the names.

    1. Interesting idea, thanks for suggesting. I guess it is a matter of style. But how about this: You copy the image, adapt it accordingly, and re-publish it as a blog post with comments on your improvements. This way, we can “beat” Venn diagrams on Google searches :)

  11. Although Venn diagrams are unsuitable for explaining inner join, *one* diagram under a suitable interpretation is useful for comparing inner, left outer, right outer & full outer join: a left and right circle that are the tuples returned from left and right outer joins respectively. See my comments on various questions & answers at https://stackoverflow.com/questions/38549/difference-between-inner-and-outer-joins and https://stackoverflow.com/questions/17759687/cross-join-vs-inner-join-in-sql-server-2008/25957600#25957600 . PS I came here via http://www.dbdebunk.com/2016/07/this-week_10.html#more .

  12. PS 1 And of course Venn diagrams are appropriate for UNION/INTERSECT/EXCEPT.
    PS 2 And of course Venn diagrams only work for SQL tables that are *sets*. (The way to give SQL semantics–*bag* semantics–for JOINs correctly & clearly (visually or not) is via INNER JOIN output rows as 1:1 with matches of a row from the left input & a row from the right input, and via OUTER JOIN additional output rows as 1:1 with NULL-extended unmatched rows from left/right/both inputs per LEFT/RIGHT/FULL.)
    PS 3 The wikipedia entries for Venn diagram & Euler diagram address their use & abuse.
    PS TL;DR Those who think that their Venn (which are often not Venn but Euler) diagrams illustrate how a JOIN’s output is based on its inputs (for sets or bags) should (try to) write the *key* by which to read them and mechanically convert between a diagram and its code. Those people will find that their conversions & diagrams are not straightforward and not illustrating.

  13. Your statement is a bit limited. It is true that UNION, INTERSECT, and EXCEPT operate as set operations on rows, but all these operations can also be viewed as filters applied on the universe set. What is the UNION of A and B? It’s a filter on the universe set with the condition a is in A, plus b is in B but not in A. And as such, they are not very different from JOINs.

    To derive all JOINs, one Cartesian product is not enough. You need the UNION (not perhaps) of the following Cartesian products: (A x {}) U (A x B) U ({} x B) U ({} x {}), where {} is the empty set. Here, the universe set is made of tuples (a,b) where a is in A or is empty and b is in B or is empty.

    So the use of the Venn diagram, I think is perfectly legitimate although it is done on different universe sets.

    1. Thanks for chiming in. Would you mind explaining those additional cartesian products. (A x {}) is an empty set. Did you mean to write e.g. (A x {(ω, ω, …, ω)}) to describe a left outer join?

      Anyway, indeed Venn diagrams can be used to describe joins on a different level, but most people who do that will not use Venn diagrams this way, they use them in an “intuitive” way, which hides the fact that there is a cartesian product in there somewhere.

      In my SQL training, I always ask people to count all the actors that have the same first and last names (e.g. two actors called “John Doe”). A lot of trainees will inner join actor with itself on (a1.first_name, a1.last_name) = (a2.first_name, a2.last_name) and a1.id <> a2.id, not noticing their accidental cartesian product. It accidentally works for 2 John Does, but breaks for 3 or more.

      Once the cartesian product is understood, it can be interesting to get back to reasoning about set theory and Venn diagrams with respect to joins. But until then, I think they are part of the confusion around joins with a lot of people who aren’t writing SQL every day.
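The self-join pitfall from the training exercise above can be reproduced in a few lines. This is only a sketch using Python's sqlite3 module: the actor table layout and the sample names are assumptions, the row value comparison is spelled out with AND, and the EXISTS query is one possible semi join formulation rather than the only correct answer.

```python
import sqlite3

def count_same_name_actors(n_john_does):
    """Return (naive self-join count, semi join count) for n actors named John Doe."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE actor (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
    )
    people = [("John", "Doe")] * n_john_does + [("Jane", "Roe")]
    conn.executemany(
        "INSERT INTO actor (first_name, last_name) VALUES (?, ?)", people
    )

    # Naive approach: INNER JOIN the table with itself.
    # Each actor is paired with EVERY namesake, so rows multiply.
    naive = conn.execute("""
        SELECT count(*)
        FROM actor a1
        JOIN actor a2
          ON a1.first_name = a2.first_name
         AND a1.last_name = a2.last_name
         AND a1.id <> a2.id
    """).fetchone()[0]

    # Semi join: each actor is counted at most once, if a namesake exists.
    semi = conn.execute("""
        SELECT count(*)
        FROM actor a1
        WHERE EXISTS (
          SELECT 1
          FROM actor a2
          WHERE a2.first_name = a1.first_name
            AND a2.last_name = a1.last_name
            AND a2.id <> a1.id
        )
    """).fetchone()[0]
    return naive, semi

two = count_same_name_actors(2)
three = count_same_name_actors(3)
print(two)    # both counts agree for 2 John Does
print(three)  # the naive count overshoots for 3 John Does
```

With n namesakes, the naive join produces n×(n-1) rows, which only happens to equal n when n = 2.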

  14. In ZFC set theory, a 2tuple set where (a,b) is an element of Ax{} is an empty set, but because all the domains of types in SQL include the null value, the free logic set theory must be used where a 2tuple set where (a,b) is an element of Ax{} is not an empty set. So, for example, if a, b and c are the non-null elements of A, A would be defined by {null,a,b,c} and the empty set (or empty domain) is defined as {null}, and the Cartesian product of A and {} is {(null,null),(a,null),(b,null),(c,null)}.

  15. Hi, how would you define a predicate? I am not familiar with this term. I see that it has a general mathematic definition as a boolean function, and it seems in this context that it might mean the boolean function that is true when the specific join key is equal in each row. Is this true?

    I see a cross join is a join en masse with nothing held back. It must be, then, that the addition of a predicate function turns the cross join into the other types of joins, because they are more specific and do not join willy nilly?

    Thanks for your response and a great article,
    Cheers.

    1. “Hi, how would you define a predicate?”

      In SQL, a predicate is any boolean expression, such as A = B or A IN (1, 2, 3) or EXISTS (SELECT 1 FROM x), etc. I.e. anything you can put in the WHERE or ON clauses.

      “I see a cross join is a join en masse with nothing held back. It must be, then, that the addition of a predicate function turns the cross join into the other types of joins, because they are more specific and do not join willy nilly?”

      Yes, the optimiser should be able to turn a cross join with appropriate join predicate in the WHERE clause into an inner join.

  16. YES! Dude, I’ve thought this same thing for years. I absolutely abhor when joins are explained with Venns.

  17. “The point here is: these set operations operate on sets of elements (tuples), which are all of the same type.”

    That’s why people usually apply those Venn diagrams to the fields, not to the records which (as you well argued) would make no sense at all.

  18. To be honest, the Venn diagram is much clearer than what you put here, sorry. Your explanation is excellent only if someone understands the math language you are using. Most people, especially those who think Scala is better than Java or that node.js should be the tech stack used in the backend….

    1. Sure, I’m aware of this problem. Joins aren’t as simple as Venn diagrams make them look, hence the explanation of actual joins (via cartesian products) is also a bit less simple.

      Most people appreciate the Venn diagrams because they help them remember what left and right mean in case of an outer join, and that’s fine. But in my SQL training, I ask delegates to write a simple query to find the number of actors who have another actor by the same name (e.g. 2 Pierce Brosnans). ~40% of delegates will inner join using (first_name, last_name), and inadvertently produce a cartesian product, which leads to a wrong result!

      Understanding why this happens is much more important than merely remembering the silly order of left/right join.

      Of course, people can get quite far with SQL without understanding the underlying relational algebra. But as with all things that are based on some fundamental theory, understanding the theory helps. Ignoring the theory means making the same mistakes over and over again.

      The main target audience of this article are not just beginners, but also folk who understand SQL, understand relational algebra, and yet, they resort to using Venn diagrams to explain SQL to others. I just think that’s not helpful.

  19. This is a fantastic article and I periodically link it, but I’d like to point out an inaccurate detail. Union, intersect and except are not just set operations. They are also relational algebra primitive operations. But I agree that explaining them with Venn diagrams is fine.

      1. Maybe the article? :) It calls them “set operations”, but they aren’t.

        Set operations and relational operations may share the same names, but they are different operations, dealing with different objects and different branches of mathematics.

        Again, the article’s point is absolutely clear and correct. I remember why I wrote this comment, but I understand that without a boring, useless explanation it may sound like a smartarse correction.

  20. Very bright idea of visualizing joins. Issues about color blindness etc can be addressed easily. Take home message here is that Venn diagrams are not good for illustrating joins especially for beginners.
