SQL JOIN or EXISTS? Chances Are, You’re Doing it Wrong

I’ve noticed this very consistently with a lot of customers, and also with participants of our Data Geekery SQL Workshop (which I highly recommend to everyone, if you excuse the advertising): A lot of developers get the distinction between JOIN and SEMI-JOIN wrong. Let me explain…

What are JOIN and SEMI-JOIN

A little bit of relational algebra first. What is an (INNER) JOIN? An JOIN is nothing but a filtered cartesian product. And what is a cartesian product? Wikipedia explains this very nicely:

for sets A and B, the Cartesian product A × B is the set of all ordered pairs (a, b) where a ∈ A and b ∈ B.

That was the technical way of putting it. A more understandable way might be the following:

ranks = {A, K, Q, J, 10, 9, 8, 7, 6, 5, 4, 3, 2}
suits = {♠, ♥, ♦, ♣}

so, ranks × suits =
{(A, ♠), (A, ♥), (A, ♦), (A, ♣), (K, ♠),
…,
(3, ♣), (2, ♠), (2, ♥), (2, ♦), (2, ♣)}

Or, as an image:

By Trainler - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=7104281
By Trainler – Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=7104281

The above cartesian product models the combination of each rank with each suite. Simple, right?

In SQL, a cartesian product can be written as either a CROSS JOIN, or a table list in the FROM clause. The following query combines every customer with every staff member:

-- CROSS JOIN
SELECT c.last_name, s.last_name
FROM customer AS c
CROSS JOIN staff AS s

-- Table list
SELECT c.last_name, s.last_name
FROM customer AS c,
     staff AS s

Now, as I mentioned before, an (INNER) JOIN is nothing but a filtered CROSS JOIN, where the filter is applied in a dedicated USING or ON clause.

-- INNER JOIN with USING
SELECT c.last_name, s.last_name
FROM customer AS c
INNER JOIN staff AS s 
  USING (last_name)

-- INNER JOIN with ON
SELECT c.last_name, s.last_name
FROM customer AS c
INNER JOIN staff AS s 
  ON c.last_name = s.last_name

The above query will match only those customers with those users whose last_name are the same. As I’ve told you before, an (INNER) JOIN is just a filtered CROSS JOIN, so the below queries will be semantically equivalent to the above:

-- CROSS JOIN
SELECT c.last_name, s.last_name
FROM customer AS c
CROSS JOIN staff AS s
WHERE c.last_name = s.last_name

-- Table list
SELECT c.last_name, s.last_name
FROM customer AS c,
     staff AS s
WHERE c.last_name = s.last_name

Specifically the last version is still used in many SQL codebases, which have not yet migrated to the ANSI JOIN syntax (even if ANSI joins should be preferred for readability reasons).

But that might be wrong

Unfortunately, I’m seeing this mistake all the time, as I’ve mentioned before. JOIN might appear like a useful tool to match rows between tables. But remember one thing, and I’m starting to repeat myself:

(INNER) JOIN is just a filtered CROSS JOIN

This means that if you choose INNER JOIN to find those customers for which there are matching staff, you will create a cartesian product between customer and staff, and then apply a filter. Why is that a problem? Let’s assume the following:

Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| Alice      | Miller    |
| Max        | Doe       |
+------------+-----------+

Staff:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| Alice      | Peterson  |
| Jane       | Doe       |
+------------+-----------+

What happens when you run the above queries that use (INNER) JOIN to match customers with staff? Exactly. You’ll form a cartesian product first:

{ (John Doe, John Doe),
  (John Doe, Alice Peterson),
  (John Doe, Jane Doe),
  (Alice Miller, John Doe),
  (Alice Miller, Alice Peterson),
  (Alice Miller, Jane Doe),
  (Max Doe, John Doe),
  (Max Doe, Alice Peterson),
  (Max Doe, Jane Doe) }

… and then filter out the tuples that shouldn’t be in the result, i.e. the ones that don’t have matching last names (of course, the database might choose to optimise this and not materialise the entire cross product):

{ (John Doe, John Doe),
  (John Doe, Jane Doe),
  (Max Doe, John Doe),
  (Max Doe, Jane Doe) }

We’re now left with 4 tuples. That’s great, if that’s what you were after in the first place. A combination of all customers with all staff, for which the combination shares the same last name. But maybe you were asking yourself something else, namely:

Do we have any customers who are staff family members?

Use-case: Exclude such customers from a raffle (let’s assume that last names are a sufficient criteria here).

In that case, we’ll get “duplicate” records. Because the query that some of you might’ve written would have been:

-- INNER JOIN with USING
SELECT c.*
FROM customer AS c
INNER JOIN staff AS s 
  USING (last_name)

Yielding:

Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| John       | Doe       |
| Max        | Doe       |
| Max        | Doe       |
+------------+-----------+

Bummer. How to remove duplicates? With DISTINCT you might think:

-- INNER JOIN with USING
SELECT DISTINCT c.*
FROM customer AS c
INNER JOIN staff AS s 
  USING (last_name)

Yielding:

Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| Max        | Doe       |
+------------+-----------+

What’s wrong with DISTINCT?

Using DISTINCT in this situation is a big mistake. Why?

  • Your accidental cartesian product loads too many records from disk, and produces too many records in memory, which have to be removed again
  • DISTINCT can be expensive in some databases, that implement it via sorting, rather than via hashing
  • DISTINCT may change the semantics of your SELECT clause, with nasty side-effects
  • In order to prevent those side-effects, you might even resort to wrapping this DISTINCT query in a subselect, making performance even worse

That’s horrible. See also this list of common SQL mistakes:
https://blog.jooq.org/2013/07/30/10-common-mistakes-java-developers-make-when-writing-sql

How to do it right?

By using a SEMI-JOIN. It is called semi join (i.e. “half” join) in relational algebra, because we only care about one side of the JOIN operation in the results, not the other side. In this example, we only care about customers in the result. We don’t want to have any staff records. The relational algebra notation would be

Customer ⋉ Staff

Unfortunately, SQL doesn’t have SEMI JOIN keywords, so the following isn’t possible:

SELECT *
FROM customer AS c
LEFT SEMI JOIN staff AS s 
  USING (last_name)

The SQL way to express a SEMI JOIN is by using EXISTS () or IN (). The following two are equivalent:

-- Using EXISTS
SELECT *
FROM customer AS c
WHERE EXISTS (
  SELECT *
  FROM staff AS s
  WHERE c.last_name = s.last_name
)

-- Using IN
SELECT *
FROM customer
WHERE last_name IN (
  SELECT last_name
  FROM staff
)

(Note, that NOT EXISTS and NOT IN are NOT equivalent)

Not only are these queries more correct, they are also much faster in most SQL databases for a simple reason. The database can stop searching for staff as soon as it has encountered at least one staff for which there is a matching customer. This is also nicely explained in Dan Martensen’s article SQL Performance of JOIN and WHERE EXISTS. And we’ve blogged about a related topic here: SQL Tip of the Day: Be Wary of SELECT COUNT(*).

Semi Join and Anti Join in jOOQ

We believe that these useful relational operators should be first class citizens in SQL as we have stated in our blog post:
https://blog.jooq.org/2015/10/13/semi-join-and-anti-join-should-have-its-own-syntax-in-sql

Semi join

ctx.select()
   .from(Employee)
   .leftSemiJoin(Dept)
   .on(Employee.DeptName.eq(Dept.DeptName))
   .fetch();

Anti join

ctx.select()
   .from(Employee)
   .leftAntiJoin(Dept)
   .on(Employee.DeptName.eq(Dept.DeptName))
   .fetch();

The above is much easier to write, and will transform into the corresponding (NOT) EXISTS predicate.

Exception

There are some databases that may unfortunately show worse performance for some of these semi join / anti join operators. See, for instance this outdated article on MySQL performance:
http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql

Do measure first, before you believe any of these articles, though!

Another exception is when you have a primary key / foreign key relationship that guarantees that an (INNER) JOIN produces no duplicate values, i.e. when you’re joining a one-to-one or many-to-one relationship, then JOIN is a correct solution, but it is usually equally fast, so semi join will still be more readable.

Conclusion

If you need to check whether you have any matches between a table A and a table B, but you only really care about the results from table A, do make sure you’re using a SEMI-JOIN (i.e. an EXISTS or IN predicate), not an (INNER) JOIN.

Further reading:

UI Developers! Choose Sensible Default Ordering!

Good decisions come from experience. Experience comes from making bad decisions.

― Mark Twain

Today, let’s look at one piece of experience and how we can turn that into good decisions when implementing UI logic. Please, all UI developers read this.

The bad decision

When UI developers display tabular data, it is very common for the table to offer sorting on each column. This is extremely useful, as it helps the user extract basic insight from the data by just performing a single click. Let’s look at an example where the choice of sorting default seemed to be correct at first, but was wrong later on:

Bing Webmaster tools for the jOOQ blog. When I reach the page traffic website to see the Bing traffic for the last month, I can see this:

bing-default

The default table ordering is applied to one of the obvious columns: “Appeared in search”, descendingly. Personally, I might have preferred it to be applied to “Clicks from search” per default, but what’s important: The column is ordered descendingly. I really only care about our top 5 best blog posts. Not about the worst.

So, this is good. Let’s see what happens if I do click on “Clicks from search”, however:

bing-worst

I get an overview of our worst-performing blog posts for that week. Yes, there are some posts that don’t attract any audience for an entire month from Bing. Bummer. (Let’s blame it on Bing’s popularity, not on our blog’s). But that’s not what I cared about. I wanted to see the inverse: The top performing blog posts. In order to see those, I have to click again on the column, to invert the sort order.

The experience

I see this behaviour all the time. UI developers mindlessly defaulting to the technical natural sort order on UI table widgets. In many cases, as a user, this is a frustrating experience, because:

  1. Heck, what did I do? Why is it displaying this data
  2. Aaah, it is sorting ascendingly
  3. Crap, I have to click again

The cognitive dissoncance between steps 1) and 2) shouldn’t be underestimated. Depending on the complexity of the task, or the data that is being displayed, a user might first be confused before they realise that the wanted behaviour is 2 clicks away, not 1. While it should be a technical detail that there are things like ascending and descending orderings, in UIs there is a third notion: That of natural ordering.

Why do developers get this wrong so often? Simple! Because there is also a technical natural ordering, and that’s almost always the ascending order. For instance, in Java, when you do this:

TreeSet<Integer> set = new TreeSet<>();
set.add(1);
set.add(12);
set.add(3);
System.out.println(set);

You’ll get the nicely sorted data as such:

[1, 3, 12]

The technical natural ordering depends on the “raw” data type only. In the case of Integer, this is simply the natural ascending integer number ordering.

The UI natural ordering, however, depends on the context of a data type. While a meaningless integer might still be sorted ascendingly, the previous count value (also an integer!) should be sorted descendingly by default.

The good decision

So, are there any rules with respect to the UI natural ordering? Intuition, yes! But also the following more concrete (and far from exhaustive) list of hints:

Data types and contexts in favour of ascending natural ordering

  • names: All sorts of names of things like people, cities, countries, etc. should be ordered ascendingly in their alphabetical (case-insensitive) order. That is how people skim phone books, that’s how they expect names to be ordered
  • phone numbers: (and other similar numbers) should be ordered like integers: ascendingly. But beware of their specific formatting. It is very likely that the special characters in US formats (like (555) 123-4567) shouldn’t matter when comparing numbers (e.g. with +1-800-1112222), or with numbers from other countries
  • row numbers: This is an obvious candidate for ascending ordering because the row number itself already yields an implicit order, by which it was calculated (see also our article on SQL ROW_NUMBER())
  • dates in the future: If you know your dates are in the future, then you should order them ascendingly, as users want to see the closest date first. Think of a calendar, for instance. Do you really want to display a date in the year 8375, just because you happen to celebrate your 6394th birthday? Probably not. But if in doubt, with dates, better sort them descendingly (see below), as you usually have most dates in the past and only few dates in the future.
  • aggregations: There are few aggregations that should be sorted ascendingly by default. One of them is SQL’s MIN() aggregation. If you’re really looking for the lowest value, the lowest value should be on top, followed by higher values. Other aggregations that are OK to default to ascending order are percentiles (e.g. the MEDIAN()), or standard deviations, or linear regression functions, because it is not clear whether the user cares about the highest or the lowest value. In this case, it is OK to simply default to the technical natural ordering. Most other aggregations, however, should be sorted descendingly (see below)
  • members of a period: A period is something like a year, month, week, day. Periods come with a finite, discrete number of members, such as day of year, week of year, month (for year), day of month (for month), weekday (for week), hour, minute, second (for day). The default ordering here is obvious: always ascendingly, in order of period traversal
  • money (price): No one wants to buy the most expensive flight! Obvious, right? But be careful. Prices are expressed in money, and money isn’t always best sorted ascendingly. If in doubt, order money owed to someone ascendingly, and money owned descendingly. What a difference a little "n" makes!

Data types and contexts in favour of descending natural ordering

  • dates: This is a tricky one, but there are few occasions where dates really should be sorted ascendingly, so default to sorting them descendingly, if most dates lie in he past. The reason for this rationale is the fact that users want to see the closest date first, e.g. the most recent date of a bank account transaction.
  • aggregations: When you run SQL COUNT(*) or SUM() or AVG() or MAX() aggregations, users will really care about the highest values only, as we saw in our Bing Webmaster tools example. Please do sort these aggregations descendingly by default!
  • changes: If the change between your current value and e.g. last week’s value is the thing of interest (e.g. stock market, or again Bing Webmaster tools), then both orderings are interesting. The biggest winners / losers are both useful pieces of information. However, let’s stay positive here and order stuff descendingly by default in order to display the biggest winners first. We don’t want to be negative by default. Whether the change in percent or the absolute change is of interest is another story and depends on the domain.
  • file sizes: Probably, the user is looking for the biggest files – e.g. to see what to delete to save most disk space. Order descendingly by default. If in doubt, think of sizes as the COUNT(*) value of any content. And that should clearly be ordered descendingly.
  • booleans: When ordering a column that contains true/false information (e.g. E-Mail does or does not have any attachments), then the true information is usually more interesting. Since true = 1 and false = 0, order these columns descendingly by default.
  • money (balance): Unlike prices (money owed to someone), balances (money owned) should be ordered descendingly. We want to know how many billions we have. No one cares about their worst assets.

Data types without ordering

There are some data types that simply shouldn’t be ordered. Don’t offer ordering by default on them, it might confuse the user and it might kill your server! These include:

  • URLs: In the case of Bing Webmaster tools, there is really no point in ordering URLs. I mean, the natural order would be http vs. https first, then the domain (but not from top level domain down to the irrelevant sharding identifier), then possibly the port (completely useless piece of information for the user), then the path (probably ordered by date in blogs, but pretty random otherwise). Ordering by URLs doesn’t add value, so don’t offer it. Caveat: If you display only a URL part (e.g. the domain name), ordering might make sense.
  • text: Now, plain text (e.g. E-Mail content) is really the very last thing you want to order. Most SQL databases don’t even allow for ordering CLOB content. This should be obvious, just don’t do it.
  • composite data: If data points are structured (like age and sex in one data point, in case the combination matters for your domain), they’re very hard to order correctly. Specifically, sex doesn’t have any non-technical order. If in doubt, better don’t offer ordering, or decompose the data point.

Data types where sorting challenges mean that tables are the wrong tool

Some data types are tricky to sort by default. Mostly, this is because we’re dealing with discrete or continuous values that go in both directions of a “zero” value. E.g. numbers, percentages, dates (where zero=today):

  • dates: As we’ve seen above, dates are a bit of a tricky data type to sort in tables, as the user experience depends on whether dates are mostly in the past (like bank account transactions) or mostly in the future (like appointments), or both (like calendar entries). A much better UI widget to display timelines that expand both into the past and the future are .. well .. timelines, which cannot be sorted by the user. They’re always ordered by date, displaying today’s date by default
  • percentages: If percentages are the most interesting data point in a data set (e.g. stock option changes), then chances are, that the value 0.00% is the center of your data, e.g. in a winners/losers display widget. While they’re the center of your data, they’re not the center of interest. The most interesting values will still be the top winners and the top losers. This is hard to display with sorting only. Filtering (or pagination) will need to apply in order to remove the stocks that are in the middle
  • (approximate) search results: You don’t see any means of ordering Google search results, right? That’s because Google searches are approximate, i.e. their results are already ordered in terms of relevance. You usually don’t want to offer your users to re-order these results (at least not on the relevance scale). One exception might be ordering of exact search results by date (or something else), but this is really hard to get right from a UX perspective, as you risk displaying lots of irrelevant results based on their freshness.

Situations where the above is not true

Now, the above are useful advice for making the right decision in the case of simple and homogeneous tables, like the one exposed in Bing’s Webmaster Tools (all columns are either unsortable (URL) or aggregations). If you display arbitrary data, then it might not be wise to apply these rules as it will confuse the user if one column defaults to descending ordering, and another defaults to ascending order. In that case, revert to sorting all columns ascendingly. The user will understand.

Conclusion

If you’re a UI developer, make natural ordering flags first class citizens of your software design. Pretty much every data type ships with an intuitive, and obvious value for default ordering, i.e. one of:

  • Ascending
  • Descending
  • No ordering

Every time you design a table, please do think of the above. It’s that little extra effort that will make your user interface much more meaningful. And, beware. This is really what you as a UI developer need to do. The backend developers operating on the database cannot specify this, because:

  • Databases contain raw, context-free data (e.g. of type NUMBER or VARCHAR)
  • UI ordering is not necessarily the same as SQL ordering

How to Support Java 6, 8, 9 in a Single API

With jOOQ 3.7, we have finally added formal support for Java 8 features. This opened the door to a lot of nice improvements, such as:

Creating result streams

try (Stream<Record2<String, String>> stream =
     DSL.using(configuration)
        .select(FIRST_NAME, LAST_NAME)
        .from(PERSON)
        .stream()) {

    List<String> people =
    stream.map(p -> p.value1() + " " + p.value2())
          .collect(Collectors.toList());
}

Calling statements asynchronously (jOOQ 3.8+)

CompletionStage<Record> result =
DSL.using(configuration)
   .select(...)
   .from(COMPLEX_TABLE)
   .fetchAsync();

result.thenComposing(r -> ...);

But obviously, we didn’t want to disappoint our paying customers who are stuck with Java 6 because of their using an older application server, etc.

How to support several Java versions in a single API

This is why we continue publishing a Java 6 version of jOOQ for our commercial customers. How did we do it? Very easily. Our commercial code base (which is our main code base) contains tons of “flags” as in the following example:

public interface Query 
extends 
    QueryPart, 
    Attachable 
    /* [java-8] */, AutoCloseable /* [/java-8] */ 
{

    int execute() throws DataAccessException;

    /* [java-8] */
    CompletionStage<Integer> executeAsync();
    CompletionStage<Integer> executeAsync(Executor executor);
    /* [/java-8] */

}

(Sure, AutoCloseable was available already in Java 7, but we don’t have a Java 7 version).

When we build jOOQ, we build it several times after using a preprocessor to strip logic from the source files:

  • The commercial Java 8 version is built first as is
  • The commercial Java 6 version is built second by stripping all the code between [java-8] and [/java-8] markers
  • The commercial free trial version is built by adding some code to the commercial version
  • The open source version is built third by stripping all the code between [pro] and [/pro] markers

Advantages of this approach

There are several advantages of this approach compared to others:

  • We only have a single source of truth, the original commercial source code.
  • The line numbers are the same in all different versions
  • The APIs are compatible to a certain extent
  • No magic is involved via class loading or reflection

The disadvantages are:

  • Committing to repositories is a bit slower as we have several repositories.
  • Publishing releases takes longer as the different versions need to be built and integration tested several times
  • Sometimes, we simply forget adding a marker and have to re-build again when the Java-6 nightly build crashes
  • We still cannot use lambda expressions in ordinary code that is contained in the Java 6 version (most code)

In our opinion, the advantages outweigh clearly. It’s OK if we can’t implement top-notch Java features as long as our customers can, and as long as those customers who are stuck with old versions can still upgrade to the latest jOOQ version.

We’re looking forward to supporting JDK 9 features, like modularity and the new Flow API without any compromise to existing users.

What about you?

How do you approach cross JDK version compatibility?