Don’t Even Use COUNT(*) For Primary Key Existence Checks

In a recent blog post, I advocated against the use of COUNT(*) in SQL when a simple EXISTS() would suffice. This is important stuff: I keep tuning production queries where a customer runs a COUNT(*) query like so:

SELECT count(*)
INTO v_any_wahlbergs
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'

… after which they discard the exact count, only to check for existence:

IF v_any_wahlbergs = 0 THEN
  something();
ELSE
  something_else();
END IF;

It doesn’t matter whether the client logic is written in PL/SQL (as above) or in any other language like Java: the overhead is significant compared to the following, much simpler EXISTS() query:

SELECT CASE WHEN EXISTS (
  SELECT * FROM actor a
  JOIN film_actor fa USING (actor_id)
  WHERE a.last_name = 'WAHLBERG'
) THEN 1 ELSE 0 END
INTO v_any_wahlbergs
FROM dual

Clearly, you can see the effect is the same, and the database can optimise the existence check by taking a short cut once it has found at least one result (instead of going through the entire result set to get the exact count, which wasn’t needed in the first place).

This is true also for primary key checks

The above is obvious. But then, during a recent execution of my SQL Masterclass training, one of the delegates asked me a very interesting question.

Is this also true for primary key checks?

Now, I personally always prefer the EXISTS() syntax, because it clearly communicates that I’m after an existence check, not an actual count query. But in principle, the following two queries are exactly the same:

-- Can be 0 or 1
SELECT count(*) FROM film WHERE film_id = 1;

-- Can also be 0 or 1
SELECT CASE WHEN EXISTS (
  SELECT * FROM film WHERE film_id = 1
) THEN 1 ELSE 0 END
FROM dual;

I always say: Never guess, always measure.

Comparing techniques in Oracle

Here’s a benchmark in Oracle 11g XE running against the Sakila database:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 1000000;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*) FROM film WHERE film_id = 1
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT CASE WHEN EXISTS (
        SELECT * FROM film WHERE film_id = 1
      ) THEN 1 ELSE 0 END
      FROM dual
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
END;
/

The result is (only qualitative numbers, not actual seconds, because benchmark results comparing Oracle with other databases must not be published, per Oracle’s licence 😦 )

Statement 1 : 0.0000012 slurbs
Statement 2 : 0.0000011 slurbs

As you can see, the EXISTS() query still slightly outperforms the COUNT(*) query in Oracle.

Comparing techniques in PostgreSQL

Let’s repeat the same in PostgreSQL 9.5:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 1000000;
  rec RECORD;
BEGIN
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*) FROM film WHERE film_id = 1
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 1: %', (clock_timestamp() - v_ts); 
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT EXISTS (
        SELECT * FROM film WHERE film_id = 1
      )
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 2: %', (clock_timestamp() - v_ts); 
END$$;

This time, with actual seconds because yay, Open Source can be benchmarked freely:

INFO:  Statement 1: 00:00:13.32556
INFO:  Statement 2: 00:00:08.350491

Oh wow! That’s a massive improvement in PostgreSQL! OK, we know that COUNT(*) is slow in PostgreSQL, and there are plenty of well-documented reasons why.

Comparing techniques in SQL Server

Let’s repeat this on SQL Server 2014. And please, observe the beautiful procedural language called T-SQL, which doesn’t even support implicit cursor loops the way Oracle and PostgreSQL do:

USE sakila
DECLARE @ts DATETIME;
DECLARE @repeat INT = 1000000;
DECLARE @i INT;
DECLARE @dummy VARCHAR;

DECLARE @s1 CURSOR;
DECLARE @s2 CURSOR;

SET @s1 = CURSOR FOR 
SELECT count(*) FROM film WHERE film_id = 1;

SET @s2 = CURSOR FOR 
SELECT CASE WHEN EXISTS (
  SELECT * FROM film WHERE film_id = 1
) THEN 1 ELSE 0 END;


SET @ts = current_timestamp;
SET @i = 0;
WHILE @i < @repeat
BEGIN
  SET @i = @i + 1

  OPEN @s1;
  FETCH NEXT FROM @s1 INTO @dummy;
  WHILE @@FETCH_STATUS = 0
  BEGIN
    FETCH NEXT FROM @s1 INTO @dummy;
  END;

  CLOSE @s1;
END;

DEALLOCATE @s1;
PRINT 'Statement 1: ' + CAST(DATEDIFF(ms, @ts, current_timestamp) AS VARCHAR) + 'ms';

SET @ts = current_timestamp;
SET @i = 0;
WHILE @i < @repeat
BEGIN
  SET @i = @i + 1

  OPEN @s2;
  FETCH NEXT FROM @s2 INTO @dummy;
  WHILE @@FETCH_STATUS = 0
  BEGIN
    FETCH NEXT FROM @s2 INTO @dummy;
  END;

  CLOSE @s2;
END;

DEALLOCATE @s2;
PRINT 'Statement 2: ' + CAST(DATEDIFF(ms, @ts, current_timestamp) AS VARCHAR) + 'ms';

Again, I must not publish actual results, because benchmark results must not be published for SQL Server either, so let’s anonymise them by introducing another new unit of measurement:

Statement 1: OVER 9000
Statement 2: OVER 8999

Oh excellent! No measurable difference in SQL Server, this time!

Conclusion

While the difference between COUNT(*) and EXISTS() queries is drastic for ordinary queries that may possibly return thousands of rows (and thus counts of > 1000), the difference for primary key checks is:

  • Marginal but still worth improving in Oracle
  • Significant in PostgreSQL
  • Non existent in SQL Server

So, there’s nothing wrong with consistently applying EXISTS() in all of your queries.

If you’re using jOOQ, getting it right is even easier. Just run:

boolean exists = ctx.fetchExists(
  select()
  .from(ACTOR)
  .join(FILM_ACTOR).using(ACTOR.ACTOR_ID)
  .where(ACTOR.LAST_NAME.eq("WAHLBERG"))
);

And jOOQ will wrap your query in that EXISTS() block for you.

Does Your Database Really Use Your Index?

Adding the right index to speed up your queries is essential. But after a while, as your system grows, you may find yourself with tons of indexes, which all slow down writing to the database – as with each write to the table, the index needs to be updated as well in the same transaction.

Perhaps, 5 years later, your database (and your queries) have evolved in such a way that some indexes are no longer needed. For instance, there are some obvious cases when two indexes are redundant:

-- Original design
CREATE INDEX ON customer (first_name);

-- 5 years later
CREATE INDEX ON customer (first_name, last_name);

But in many other cases, things aren’t so obvious. For instance…

  • You may have added an index on a foreign key, but as the table grew, your database started running more hash joins rather than nested loop joins, in which case indexes are not used.
  • Or, you’ve just entirely stopped querying for first names / last names.
  • Or you’ve started using a predicate that is way more selective than actual names.
  • Or your customers are suddenly all called Smith.

Everyone is called Smith – tough luck for indexing!

If your indexes are no longer used, you can (and should) delete them.

But how do you find unused indexes?

If you’re using Oracle database and you have access to the production system, there’s actually a very nice way to query the diagnostics tables in order to know whether there is any query in the cursor cache that is currently using your index. Just run:

SELECT sql_fulltext
FROM v$sql
WHERE sql_id IN (
  SELECT sql_id
  FROM v$sql_plan
  WHERE (object_owner, object_name)
     = (('OWNER', 'IDX_CUSTOMER_FIRST_NAME'))
)
ORDER BY sql_text;

What does this query do? It runs through all the SQL statements in the cursor cache (v$sql) and for each one of them, checks if there is any execution plan element in the cursor cache (v$sql_plan) which accesses the index. Done.

Of course, if this doesn’t return any results, it doesn’t mean that no one uses your index. There might still be a very rare query that runs only once a year and has long since been purged from the cursor cache.

But if you run this as a job over a longer period of time, you can conclude that if the query never returns any rows, your index is probably no longer needed.
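
By way of illustration, such a job could be sketched with DBMS_SCHEDULER like this (the log table and job name are made up, and the job owner needs the appropriate privileges on the V$ views):

CREATE TABLE index_usage_log (
  captured_at  TIMESTAMP,
  sql_id       VARCHAR2(13),
  object_owner VARCHAR2(128),
  object_name  VARCHAR2(128)
);

BEGIN
  dbms_scheduler.create_job(
    job_name        => 'LOG_INDEX_USAGE',
    job_type        => 'PLSQL_BLOCK',
    job_action      => q'[
      BEGIN
        -- take an hourly snapshot of plans in the cursor cache using the index
        INSERT INTO index_usage_log
        SELECT systimestamp, sql_id, object_owner, object_name
        FROM v$sql_plan
        WHERE object_owner = 'OWNER'
        AND object_name = 'IDX_CUSTOMER_FIRST_NAME';
        COMMIT;
      END;]',
    repeat_interval => 'FREQ=HOURLY',
    enabled         => TRUE
  );
END;
/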

Can I discover unneeded indexes?

Sure! Run a similar query that lists all the indexes that are not referenced from the v$sql_plan table:

SELECT owner, index_name
FROM all_indexes
WHERE owner = 'OWNER'
AND (owner, index_name) NOT IN (
  SELECT object_owner, object_name
  FROM v$sql_plan
  WHERE object_owner IS NOT NULL
  AND object_name IS NOT NULL
)
ORDER BY 1, 2

Again, this doesn’t say that your indexes will never be used; it just says that they haven’t been used recently.

Now, I won’t actually show you the query that would use the above statement, run across its result in a PL/SQL loop and drop all the indexes using EXECUTE IMMEDIATE, because you might just actually do that to try it in production. But just in case you do want to try, here’s a hint 🙂

BEGIN
  FOR i IN (/* above query here */) LOOP
    EXECUTE IMMEDIATE 
     'DR0P INDEX "' || i.owner || '"."' || i.index_name || '"';
  END LOOP;
END;
/

But as I said. Don’t actually do this!

Update: New Oracle 12cR2 feature: DBA_INDEX_USAGE

As you can see in the comments section, Oracle 12cR2 now has this exciting new DBA_INDEX_USAGE view, which tracks index usage for you.
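
A minimal sketch of how you could query it (column names such as NAME, OWNER, TOTAL_ACCESS_COUNT and LAST_USED are taken from the documentation, so double-check them on your version):

SELECT owner, name, total_access_count, last_used
FROM dba_index_usage
WHERE owner = 'OWNER'
ORDER BY last_used NULLS FIRST;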

Thanks, Dan McGhan for mentioning this!

Avoid Using COUNT() in SQL When You Could Use EXISTS()

A while ago, I blogged about the importance of avoiding unnecessary COUNT(*) queries:
https://blog.jooq.org/2014/08/08/sql-tip-of-the-day-be-wary-of-select-count

… and how to replace them with equivalent EXISTS queries

As I’m updating the jOOQ SQL Masterclass to also show PostgreSQL performance characteristics in addition to Oracle’s, I really have to reiterate this topic. Please repeat after me:

Thou shalt not use COUNT(*) when EXISTS sufficeth thy needs

The rationale is simple

COUNT(*) needs to return the exact number of rows. EXISTS only needs to answer a question like:

  “Are there any rows at all?”

In other words, EXISTS can short-circuit after having found the first matching row. If your client code (e.g. written in Java or in PL/SQL, or any other client language) needs to know something like:

  “Did actors called “Wahlberg” play in any films at all?”

Then you have two options to write that query:

Very very bad: Use COUNT(*)

Using PostgreSQL syntax:

SELECT count(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'

The above query will return a number > 0 if any Wahlberg played in a film, or 0 if not. Notice that we don’t care how many films all the Wahlbergs played in, yet we ask the database to calculate the precise number.

Let’s run the above query against the Sakila database. The execution plans for the above query in Oracle:

Oracle Execution Plan for a Query with COUNT(*)

And in PostgreSQL:

PostgreSQL Execution Plan for a Query with COUNT(*)

Much much better: Use EXISTS()

Using PostgreSQL syntax:

SELECT EXISTS (
  SELECT * FROM actor a
  JOIN film_actor fa USING (actor_id)
  WHERE a.last_name = 'WAHLBERG'
)

The execution plans for the above query in Oracle:

Oracle Execution Plan for a Query with EXISTS()

And in PostgreSQL:

PostgreSQL Execution Plan for a Query with EXISTS()

How to read this?

As you can see from the above execution plans, the cost in Oracle is slightly better (going from 3 to 2) when using EXISTS than when using COUNT(*), because of a much better cardinality estimate in the middle of the execution plan. In other words, Oracle “knows” that we’re looking for only one record and as soon as that record has been found, we can stop looking.

In PostgreSQL, things are more drastic (going from 123 to 3.4). The EXISTS version has an associated cost that is more than 30x lower than the version that uses COUNT(*) for the same result.

You can gaze at the plan for a while and figure out what the exact difference is, or you can believe me that this is true:

It is obviously much faster to check for existence rather than to count all results, if what you’re really trying to do is check for existence

Duh.

Does this apply to me?

Yes. I’m taking bets. Many, many code bases out there get this wrong all the time. Checking whether a size is zero is just too convenient. Not only in SQL, but also in Java. Consider this. Which one is better?

Collection<?> collection = ...

// EXISTS
if (!collection.isEmpty())
    doSomething();

// COUNT(*)
if (collection.size() != 0)
    doSomething();

Sometimes, this doesn’t really matter, e.g. in ArrayList, whose isEmpty() method reads:

public boolean isEmpty() {
    return size == 0;
}

But what if your collection is a lazy loaded Hibernate collection? Not all collections cache this size value, and even if they do, they may still produce overhead in the source system in order to calculate the exact size. In fact, they might even run a completely unnecessary query fetching all the child entities from the database just to check for existence.

Bonus exercise for my Hibernate-aficionado readers out there: Do the exercise with Hibernate. Because at this point, I for one would say: Just use SQL™

OK, costs. But what do they mean?

Let’s benchmark these two statements in Oracle and PostgreSQL.

Oracle

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 10000;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT CASE WHEN EXISTS (
        SELECT * FROM actor a
        JOIN film_actor fa USING (actor_id)
        WHERE a.last_name = 'WAHLBERG'
      ) THEN 1 ELSE 0 END
      FROM dual
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM actor a
      JOIN film_actor fa USING (actor_id)
      WHERE a.last_name = 'WAHLBERG'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
END;
/

We get a slight but significant performance improvement of factor 1.3x:

Statement 1 : 3
Statement 2 : 4

(not actual times, because thank you Oracle legal for prohibiting all sorts of stuff). But you can check out the Sakila database yourself and run the above benchmark on your machine.

PostgreSQL

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 1000;
  rec RECORD;
BEGIN
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT CASE WHEN EXISTS (
        SELECT * FROM actor a
        JOIN film_actor fa USING (actor_id)
        WHERE a.last_name = 'WAHLBERG'
      ) THEN 1 ELSE 0 END
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 1: %', (clock_timestamp() - v_ts); 
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM actor a
      JOIN film_actor fa USING (actor_id)
      WHERE a.last_name = 'WAHLBERG'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 2: %', (clock_timestamp() - v_ts); 
END$$;

A whopping factor of more than 30x in terms of wallclock time gain!

INFO:  Statement 1: 00:00:00.023656
INFO:  Statement 2: 00:00:00.7944

Let me repeat this:

More than 30x on PostgreSQL

That’s something! It looks as though COUNT(*) is much better optimised on Oracle (e.g. by counting leaf nodes in an index) than on PostgreSQL, but in any case, the amount of extra work is prohibitive in both databases.

Conclusion

I’m repeating myself, but this is important. Print it out and put it on your office wall:

Thou shalt not use COUNT(*) when EXISTS sufficeth thy needs

Thank you.

jOOQ Tuesdays: Chris Saxon Explains the 3 Things Every Developer Should Know About SQL

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month (today, exceptionally on a Wednesday because of technical issues) where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

I’m very excited to feature Chris Saxon today, who has worked with Oracle forever, and who is one of the brains behind the famous Ask Tom website.

Chris, you’re part of the famous Ask Tom team. Everyone working with Oracle has wound up on Ask Tom’s website at least once. You’ve answered an incredible amount of questions already. What’s it like to work for such a big community as Oracle’s?

It’s an amazing experience! My first real job was as a PL/SQL developer. My only knowledge of SQL was a couple of vaguely remembered lectures at university. Ask Tom was the main place I learned about SQL and Oracle Database. So it’s a huge honor to be on the other side, helping others get the best out of the technology.

The best part has to be the positive comments when you help someone solve a problem that’s been troubling them for days. That’s why we’re here. To help developers learn more about Oracle and improve their SQL skills. When you use the database to its full extent, you can write better, faster applications with less code!

What were your three most interesting questions, so far?

Any question that has a clear definition and a complete test case is interesting! 😉 Personally, what I enjoy most is using SQL to solve complex problems. So the first two do just that:

1. Finding the previous row in a different group

The poster had a series of transactions. These were grouped into two types. For each row, they wanted to show the id of the previous transaction from the other group.

At first this sounds like a problem you can solve using LAG or LEAD. But these only search for values within the same group. So you need a different method.

I provided a solution using the model clause. Using this, you can generate columns based on complex, spreadsheet-like formulas. Rows in your table are effectively cells in the sheet. You identify them by defining dimensions which can be other columns or expressions. By setting the transaction type as a dimension, you can then easily reference – and assign – values from one type to the other.

This worked well. But commenters were quick to provide solutions using window functions and 12c’s match_recognize clause. Both of which were faster than my approach!

I like this because it shows the flexibility of SQL. And it shows the value of an engaged community. No one knows everything. By sharing our knowledge and working together we can all become better developers.
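
To sketch the kind of window function solution the commenters suggested (the transactions table and its columns here are made up for illustration; the original question’s schema differs):

SELECT txn_id, txn_type, txn_date,
       -- for each row, pick the most recent earlier txn_id of the *other* type
       CASE txn_type
         WHEN 'A' THEN last_value(
           CASE WHEN txn_type = 'B' THEN txn_id END IGNORE NULLS)
           OVER (ORDER BY txn_date
                 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
         ELSE last_value(
           CASE WHEN txn_type = 'A' THEN txn_id END IGNORE NULLS)
           OVER (ORDER BY txn_date
                 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
       END AS prev_id_other_type
FROM transactions;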

2. Improving SQL that deliberately generates cartesian product

The poster had a set of abbreviations for words. For example, Saint could also be “St.” or “St”. They wanted to take text containing these words. Then generate all combinations of strings using these abbreviations.

The “obvious” solution the poster had was to split the text into words. Then, for each word, join the abbreviation table, replacing the string as needed. So for a five word string, you have five joins.

There are a couple of problems with this method. The number of joins limits the number of words. So if you have a string with seven words, but only six table joins you won’t abbreviate the final word.

The other issue is performance. Joining the same table N times increases the work you do. If you have long sentences and/or a large number of abbreviations, the query could take a long time to run.

To overcome these you need to ask: “how can I join to the abbreviation table just once?”

The solution to do this starts the same as the original. Split the sentence into a table of words. Then join this to the abbreviations to give a row for each replacement needed.

You can then recursively walk down these rows using CTEs. These build up the sentence again, replacing words with their abbreviations as needed. A scalable solution that only needs a single pass of each table!

The final question relates to performance. Tom Kyte’s mantra was always “if you can do it in SQL, do it in SQL”. The reason is that a pure SQL solution is normally faster than one which combines SQL and other code. Yet a question came in that cast doubt on this:

3. Difference in performance SQL vs PL/SQL

The poster was updating a table. The new values came from another table. He was surprised that PL/SQL using bulk processing came out faster than the pure SQL approach.

The query in question was in the form:

update table1 t1
set col1 = (select t2.col2 from table2 t2 where t1.code = t2.code);

It turned out the reason was “missing” indexes. Oracle executes the subquery once for every row in table1. Unless there’s an index on table2 (code), this will full scan table2 once for every row in table1!

The PL/SQL only approach avoided this problem by reading the whole of table2 into an array. So there was only one full scan of table2.
 

The problem here is there was no index on the join condition (t1.code = t2.code). With this in place Oracle does an index lookup of table2 for each row in table1. A massive performance improvement!
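
In concrete terms, the fix is a single index on the join column (using the hypothetical table names from the question):

CREATE INDEX table2_code_idx ON table2 (code);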
 

The moral being: if your SQL is “slow”, particularly compared to a combined SQL + other-language approach, it’s likely you have a missing index (or two).

This question again showed the strength and value of the Oracle community. Shortly after I posted the explanation, a reviewer was quick to point out the following SQL solution:

merge into table1 t1
using  table2 t2
on     (t1.code = t2.code)
when matched then
  update set t1.col1 = t2.col2;

This came out significantly faster than both the original update and PL/SQL – without needing any extra indexes!

You’re running a YouTube channel called “The Magic of SQL”. Are SQL developers magicians?

Of course they are! In fact, I’d say that all developers are magicians. As Arthur C Clarke said:

“Any sufficiently advanced technology is indistinguishable from magic”

The amount of computing power you carry around in your phone today is mind blowing. Just ask your grandparents!

I think SQL developers have a special kind of magic though :). The ability to answer hard questions with a few lines of SQL is amazing. And for it to adapt to changes in the underlying data to give great performance without you changing it is astounding.

Your Twitter account has a pinned tweet about window functions. I frequently talk to Java developers at conferences, and few of them know about window functions, even though window functions have been available in databases like Oracle for a very long time. Why do you think they’re still so “obscure”?

Oracle Database has had window functions since the nineties, but many other RDBMSes have only fully supported them recently. So a combination of writing “database independent” code and people using other databases is certainly a factor.

Use of tools which hide SQL from developers is also a problem. If you’re not actively using SQL, it’s easy to overlook many of its features.

Fortunately I think this is changing. As more and more developers are realizing, SQL is a powerful language. Learning how to use it effectively is a key skill for all developers. Window functions and other SQL features mean you can write better performing applications with less code. Who doesn’t want that? 😉
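
For instance, a running total per customer is a one-liner with a window function (a sketch against the Sakila payment table), something that would otherwise need a self-join or a correlated subquery:

SELECT customer_id, payment_date, amount,
       SUM(amount) OVER (
         PARTITION BY customer_id
         ORDER BY payment_date
       ) AS running_total
FROM payment;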

What are three things that every developer should know about SQL?

1. Understand set based processing

If you find yourself writing a cursor loop (select … from … loop), and inside that loop you run more SQL, you’re doing it wrong.

Think about it. Do you eat your cornflakes by placing one flake in your bowl, adding the milk, and eating that one piece? Then doing the same for the next. And the next. And so on? Or do you pour yourself a big bowl and eat all the flakes at once?

If you have a cursor loop with more SQL within the loop, you’re effectively doing this. There’s a lot of overhead in executing each SQL statement. This will slow you down if you have a large number of statements that each process one row. Instead you want few statements that process lots of rows where possible.

It’s also possible to do this by accident. As developers we’re taught that code reuse is A Good Thing. So if there’s an API available we’ll often use it. For example, say you’re building a batch process. This finds the unshipped orders, places them on shipments and marks them as sent.

If a ship_order function exists, you could write something like:

for ord in (select order_id from unshipped_orders) loop
  ship_order ( ord.order_id );
end loop;

The problem here is that ship_order almost certainly contains SQL. SQL you’ll be executing once for every order awaiting postage. If it’s only a few, this may be fine. But if there are hundreds or thousands, this process could take a long time to run.

The way to make this faster is to process all the orders in one go. You can do this with SQL like:

insert into shipments
  select … from unshipped_orders;

update unshipped_orders
set  shipment_date = sysdate;

You may counter that there’s other, non-SQL processing you need to do, such as sending emails. So you still need a query to find the order ids.

But you can overcome this! With update’s returning clause, you can get values from all the changed rows:

update unshipped_orders
set  shipment_date = sysdate
returning order_id bulk collect into order_array;

This gives you all the order ids to use as you need.
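
For completeness, here’s a minimal sketch of the surrounding PL/SQL block (the collection type and the email-sending step are assumptions):

DECLARE
  TYPE order_id_t IS TABLE OF unshipped_orders.order_id%TYPE;
  order_array order_id_t;
BEGIN
  UPDATE unshipped_orders
  SET    shipment_date = sysdate
  RETURNING order_id BULK COLLECT INTO order_array;

  -- the non-SQL work (e.g. sending emails) now loops over ids in memory
  FOR i IN 1 .. order_array.COUNT LOOP
    NULL; -- send_shipping_email(order_array(i));
  END LOOP;
END;
/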

2. Learn what an execution plan is and how to create and read one

“How can I make my SQL faster” is one of the most common classes of questions posted on Ask Tom. The trouble is there’s scant one-size-fits-all advice when it comes to SQL performance. To help we need to know what your query is, what the tables and indexes are and details about the data. But most importantly we need to know what the query is actually doing!

For example, say you want me to help you figure out a faster route to work. To do this I need to know which route you currently use and how long each part of it takes. We can then compare this against other routes, checking how far they are, expected traffic and predicted speeds. But we need the data to do this!

So when answering performance questions, the first thing we look for is an execution plan. Many confuse this with an explain plan. An explain plan is just a prediction. Often it’s wrong. And even when it’s right, we still don’t know how much work each step does.

An execution plan shows exactly what the database did. It also gives stats about how much work, how often and how long it took to process each step. Combine this with a basic understanding of indexes and join methods and you can often solve your own performance problems.
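
For example, in Oracle you can capture an actual execution plan with the GATHER_PLAN_STATISTICS hint and DBMS_XPLAN (a quick sketch; run both statements in the same session):

SELECT /*+ GATHER_PLAN_STATISTICS */ count(*)
FROM film
WHERE film_id = 1;

SELECT *
FROM TABLE(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));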

3. Use bind variables

Sadly, data breaches are all too common. Hardly a week goes by without news of a major company leaking sensitive data. And the root cause of these attacks is often SQL injection.

This is a simple, well known attack vector. If you write vulnerable SQL on a web enabled application, eventually you’ll be attacked.

And this isn’t something you can avoid by using NoSQL databases. SQL injection like attacks are possible there too!

Fortunately the solution is easy: use bind variables. Not only do these secure your application, they can improve performance too.

Make sure your company is not tomorrow’s data leak headline. Use bind variables!
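
To make this concrete, here’s a PL/SQL sketch using dynamic SQL against a hypothetical customer table; the concatenated version is the vulnerable one:

DECLARE
  v_count NUMBER;
  p_name  VARCHAR2(100) := 'SMITH';
BEGIN
  -- vulnerable to SQL injection: input concatenated into the statement
  EXECUTE IMMEDIATE
    'SELECT count(*) FROM customer WHERE last_name = ''' || p_name || ''''
    INTO v_count;

  -- safe and cursor-cache friendly: the value is passed as a bind variable
  EXECUTE IMMEDIATE
    'SELECT count(*) FROM customer WHERE last_name = :name'
    INTO v_count
    USING p_name;
END;
/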

Last but not least: When will Oracle have a BOOLEAN type? 🙂

We have a BOOLEAN type! It’s just only in PL/SQL ;P

There’s currently a push in the community for us to add a SQL BOOLEAN type. If this is a feature you’d like to see, you can vote for it on the Database Ideas forum. The more support there is, the more likely we are to implement it! But no promises 😉

How Adding a UNIQUE Constraint on a OneToOne Relationship Helps Performance

A lot of people use SQL constraints mainly to enforce data integrity, and that’s already a very good thing. A UNIQUE constraint, for instance, makes sure that there is at most one instance of any possible value (or tuple, in the case of a composite constraint) in a table. For example:

CREATE TABLE x (
  a NUMBER(10),
  UNIQUE (a)
);

-- This works:
INSERT INTO x VALUES (1);

-- This fails:
INSERT INTO x VALUES (1);

Constraints are also good for (SELECT) performance

One thing that people often do not think about, though, is the fact that constraints can also be used as very valuable meta information by the database optimiser to make a better decision when finding the most optimal execution plan. It makes a big difference for an optimiser…

  • To be unaware of how many different values there are in any given column (worst case)
  • To be able to estimate the number of different values in any given column (statistics are present)
  • To know that each value can appear at most once

In the last case, a UNIQUE (or PRIMARY KEY) constraint tells the optimiser exactly that. Once the optimiser knows that a certain operation returns at most one row (and not 10, 100, or even 10M rows), a nested loop suddenly becomes extremely cheap, as it can abort as soon as one row was found.

A Java analogy

Compare this with the following Java code:

// This is much "harder" ...
List<Object> objects = unknownSize();
for (Object object : objects) {
    doSomethingWith(object);
}

// ... than this, where we "know" the list
// only contains one value
for (Object object : Collections.singletonList("abc")) {
    doSomethingWith(object);
}

If Java had such optimisation capabilities (bonus question: can the JIT do it?), then the second loop could be optimised as such:

// The loop can be avoided:
doSomethingWith("abc");

Let’s look at Oracle

Let’s look at a concrete example with Oracle:

CREATE TABLE x1 (
  a NUMBER(10) NOT NULL,
  data VARCHAR2(100),
  CONSTRAINT pk_y1 PRIMARY KEY (a)
);

CREATE TABLE y1 (
  a NUMBER(10) NOT NULL, 
  optional_data VARCHAR2(100),
  CONSTRAINT fk_y1 FOREIGN KEY (a) REFERENCES x1(a)
);

CREATE INDEX i1 ON y1(a);
CREATE INDEX i1data ON x1(data);

INSERT INTO x1
SELECT level, dbms_random.string('a', 100)
FROM dual
CONNECT BY level <= 10000;

INSERT INTO y1
SELECT level, dbms_random.string('a', 100)
FROM dual
CONNECT BY level <= 5000;

This is a typical one-to-many relationship between x1 and y1. With the constraints in place, there can be between 0 and N rows in y1 for each row in x1. As a user, we know that there is only one value in y1 for any value in x1, but we don’t enforce this knowledge with a constraint.

Let’s look at the following query:

SELECT count(*)
FROM x1
JOIN y1 USING (a)
WHERE data LIKE 'a%';

What we’re doing here is selecting all values in x1 whose data starts with the letter ‘a’, and for which we also have any optional_data. The execution plan is:

-----------------------------------------------------
| Operation                | Name   | Rows  | Cost  |
-----------------------------------------------------
| SELECT STATEMENT         |        |       |    29 |
|  SORT AGGREGATE          |        |     1 |       |
|   HASH JOIN              |        |   176 |    29 |
|    VIEW                  |        |   176 |    24 |
|     HASH JOIN            |        |       |       |
|      INDEX RANGE SCAN    | I1DATA |   176 |     4 |
|      INDEX FAST FULL SCAN| PK_Y1  |   176 |    24 |
|    INDEX FAST FULL SCAN  | I1     |  5000 |     5 |
-----------------------------------------------------

As you can see, Oracle chooses to run a hash join operation, which means that all the values from x1 starting with ‘a’ are fetched (around 176), and joined in a hashmap with the entire set of values in y1, fetched from the index i1 (5000 values).

How does this compare with using a UNIQUE constraint?

We’ll create almost the exact same schema as follows:

CREATE TABLE x2 (
  a NUMBER(10) NOT NULL,
  data VARCHAR2(100),
  CONSTRAINT pk_x2 PRIMARY KEY (a)
);

CREATE TABLE y2 (
  a NUMBER(10) NOT NULL, 
  optional_data VARCHAR2(100),
  CONSTRAINT uk_y2 UNIQUE (a),
  CONSTRAINT fk_y2 FOREIGN KEY (a) REFERENCES x2(a)
);

CREATE INDEX i2data ON x2(data);

INSERT INTO x2
SELECT * FROM x1;

INSERT INTO y2
SELECT * FROM y1;

BEGIN
  dbms_stats.gather_table_stats('TEST', 'X2');
  dbms_stats.gather_table_stats('TEST', 'Y2');
END;
/

The data is exactly the same, but now we enforce a UNIQUE constraint on y2’s foreign key, making this effectively a one-to-one relationship. Check out what happens when we run the exact same query…

SELECT count(*)
FROM x2
JOIN y2 USING (a)
WHERE data LIKE 'a%';

Its execution plan is now:

-----------------------------------------------------
| Operation                | Name   | Rows  | Cost  |
-----------------------------------------------------
| SELECT STATEMENT         |        |       |    25 |
|  SORT AGGREGATE          |        |     1 |       |
|   NESTED LOOPS           |        |   176 |    25 |
|    VIEW                  |        |   176 |    25 |
|     HASH JOIN            |        |       |       |
|      INDEX RANGE SCAN    | I2DATA |   176 |     5 |
|      INDEX FAST FULL SCAN| PK_X2  |   176 |    24 |
|    INDEX UNIQUE SCAN     | UK_Y2  |     1 |     0 |
-----------------------------------------------------

As you can see, the overall cost has decreased from 29 to 25 as we’re no longer using a hash join, but a nested loop join operation, which is probably faster if our statistics are not way off, as we only have to look up the single value in y2 corresponding to x2 for each of x2’s estimated 176 rows that start with the letter ‘a’.

But let’s not trust the execution plan, let’s benchmark things (as discussed in a previous article). Run the following code in SQL Developer, for instance:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 1000;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM x1
      JOIN y1 USING (a)
      WHERE data LIKE 'a%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Without UNIQUE constraint: '
    || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM x2
      JOIN y2 USING (a)
      WHERE data LIKE 'a%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('With    UNIQUE constraint: '
    || (SYSTIMESTAMP - v_ts));
END;
/

The above benchmark repeats each individual query 1000 times. The results speak for themselves:

Without UNIQUE constraint: +000000000 00:00:04.250000000
With    UNIQUE constraint: +000000000 00:00:02.847000000

Remark

The lack of a UNIQUE constraint may happen in situations where you prefer using a surrogate primary key in the referencing table (which I omitted in the examples for brevity). If you’re “sharing” the primary key or using natural keys, this issue won’t happen to you, of course.
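
For illustration, a referencing table with a surrogate primary key could look like this hypothetical y3 table; the point is to still declare the UNIQUE constraint on the foreign key column:

CREATE TABLE y3 (
  id            NUMBER(10) NOT NULL,
  a             NUMBER(10) NOT NULL,
  optional_data VARCHAR2(100),
  CONSTRAINT pk_y3 PRIMARY KEY (id),
  CONSTRAINT uk_y3 UNIQUE (a),
  CONSTRAINT fk_y3 FOREIGN KEY (a) REFERENCES x2(a)
);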

Conclusion

The execution planner can make a more informed decision if it has formal knowledge about your data via an additional UNIQUE constraint. This formal knowledge is by far more powerful than any statistics that might indicate the same thing. In the absence of a formal UNIQUE constraint, the database will always have to make sure there is not another row once it has found one. With a formal UNIQUE constraint, it can stop looking as soon as that unique row was found. This can drastically speed up queries. As we’ve seen in the above example, this improves things by a factor of 1.5, so the second query is 50% faster!

Always tell your database as much as you can. Your SELECT performance will greatly increase, at the small cost of a little overhead when inserting data.

Correlated Subqueries are Evil and Slow. Or are They?

A common myth in SQL is the idea that correlated subqueries are evil and slow. For example, this query here:

SELECT 
  first_name, last_name,
  (SELECT count(*) 
   FROM film_actor fa 
   WHERE fa.actor_id = a.actor_id)
FROM actor a

It “forces” the database engine to run a nested loop of the form (in pseudo code):

for (Actor a : actor) {
  output(
    a.first_name,
    a.last_name,
    film_actor.where(fa -> fa.actor_id == a.actor_id)
              .size()
  );
}

So, for every actor, collect all the corresponding film_actors and count them. This will produce the number of films each actor has played in.

It appears that it would be much better to run this query in “bulk”, i.e. to run:

SELECT 
  first_name, last_name, count(fa.actor_id)
FROM actor a
LEFT JOIN film_actor fa USING (actor_id)
GROUP BY actor_id, first_name, last_name

But is it really faster? And if so, why would you expect that?

Bulk aggregation vs nested loops

Bulk aggregation in this case really just means that we’re collecting all actors and all film_actors and we then merge them in memory for the group by operation. The execution plan (in Oracle) looks like this:

-------------------------------------------------------------------
| Id  | Operation              | Name                    | A-Rows |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                         |    200 |
|   1 |  HASH GROUP BY         |                         |    200 |
|*  2 |   HASH JOIN (OUTER)    |                         |   5462 |
|   3 |    TABLE ACCESS FULL   | ACTOR                   |    200 |
|   4 |    INDEX FAST FULL SCAN| IDX_FK_FILM_ACTOR_ACTOR |   5462 |
-------------------------------------------------------------------

There are 5462 rows in our film_actor table, and for each actor, we join and group and aggregate them all to get to the results. Let’s compare this to the nested loop’s plan:

-----------------------------------------------------------------------
| Id  | Operation         | Name                    | Starts | A-Rows |
-----------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                         |      1 |    200 |
|   1 |  SORT AGGREGATE   |                         |    200 |    200 |
|*  2 |   INDEX RANGE SCAN| IDX_FK_FILM_ACTOR_ACTOR |    200 |   5462 |
|   3 |  TABLE ACCESS FULL| ACTOR                   |      1 |    200 |
-----------------------------------------------------------------------

I’ve now included the “Starts” column to illustrate that after having collected all 200 actors, we start the subquery 200 times to get the count for each actor. But this time, with a range scan.

From the plan only, can we tell which one is faster? Bulk aggregation will need more memory (to load all the data), but has a lower algorithmic complexity (linear). Nested looping will need less memory (all the required info is available from the index directly) but seems to have a higher algorithmic complexity (quadratic).

Fact is, this isn’t exactly true.

The bulk aggregation is indeed linear, but according to O(M + N) where M = number of actors and N = number of film_actors, whereas nested looping isn’t quadratic, it’s O(M log N). We don’t need to traverse the entire index to get the count.
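
To put rough numbers on it for our data set (M = 200 actors, N = 5462 film_actor rows, ignoring constant factors): the bulk aggregation does on the order of M + N = 5662 units of work, whereas the nested loop does on the order of M · log₂(N) ≈ 200 · 12.4 ≈ 2500, which is why the correlated subquery can come out ahead here.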

At some point, the higher complexity is worse, but with this little amount of data, it’s not:

On the x-axis is the size of N and on the y-axis is the “complexity value”, e.g. how much time (or memory) is used by an algorithm.

Effects of algorithmic complexity for large N

Effects of algorithmic complexity for “small” N

Here’s a disclaimer about the above:

Algorithmic complexity for “small N” = “works on my machine” 😉

There’s nothing better than proving things with measurements. Let’s run the following PL/SQL program:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 1000;
BEGIN
  v_ts := SYSTIMESTAMP;
  
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        first_name, last_name,
        (SELECT count(*) 
         FROM film_actor fa 
         WHERE fa.actor_id = a.actor_id)
      FROM actor a
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
  
  dbms_output.put_line('Nested select: ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
  
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        first_name, last_name, count(film_id)
      FROM actor a
      LEFT JOIN film_actor fa USING (actor_id)
      GROUP BY actor_id, first_name, last_name
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
  
  dbms_output.put_line('Group by     : ' || (SYSTIMESTAMP - v_ts));
END;
/

After three runs, and against our standard Sakila database (get it here: https://www.jooq.org/sakila) with 200 actors and 5462 film_actors, we can see that the nested select consistently outperforms the bulk group by:

Nested select: +000000000 00:00:01.122000000
Group by     : +000000000 00:00:03.191000000

Nested select: +000000000 00:00:01.116000000
Group by     : +000000000 00:00:03.104000000

Nested select: +000000000 00:00:01.122000000
Group by     : +000000000 00:00:03.228000000

Helping the optimiser

Some interesting feedback by Markus Winand (author of http://sql-performance-explained.com) was given on Twitter:

There is a third option: Nesting the GROUP BY operation in a derived table:

SELECT
  first_name, last_name, c
FROM actor a
JOIN (
  SELECT actor_id, count(*) c
  FROM film_actor
  GROUP BY actor_id
) USING (actor_id)

which produces a slightly better plan than the “ordinary” group by query:

--------------------------------------------------------------------
| Id  | Operation               | Name                    | A-Rows |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT        |                         |    200 |
|*  1 |  HASH JOIN              |                         |    200 |
|   2 |   TABLE ACCESS FULL     | ACTOR                   |    200 |
|   3 |   VIEW                  |                         |    200 |
|   4 |    HASH GROUP BY        |                         |    200 |
|   5 |     INDEX FAST FULL SCAN| IDX_FK_FILM_ACTOR_ACTOR |   5462 |
--------------------------------------------------------------------

Adding it to the benchmark as such:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 1000;
BEGIN
  v_ts := SYSTIMESTAMP;
   
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT
        first_name, last_name,
        (SELECT count(*) 
         FROM film_actor fa 
         WHERE fa.actor_id = a.actor_id)
      FROM actor a
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
   
  dbms_output.put_line('Nested select            : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
   
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT
        first_name, last_name, count(film_id)
      FROM actor a
      LEFT JOIN film_actor fa USING (actor_id)
      GROUP BY actor_id, first_name, last_name
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
   
  dbms_output.put_line('Group by                 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
  
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        first_name, last_name, c
      FROM actor a
      JOIN (
        SELECT actor_id, count(*) c
        FROM film_actor
        GROUP BY actor_id
      ) USING (actor_id)
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Group by in derived table: ' || (SYSTIMESTAMP - v_ts));
END;
/

… shows that it’s already very close to the nested select option:

Nested select            : +000000000 00:00:01.371000000
Group by                 : +000000000 00:00:03.417000000
Group by in derived table: +000000000 00:00:01.592000000

Nested select            : +000000000 00:00:01.236000000
Group by                 : +000000000 00:00:03.329000000
Group by in derived table: +000000000 00:00:01.595000000

Conclusion

We have shown that under some circumstances, correlated subqueries can be better than bulk aggregation. In Oracle. With small to medium-sized data sets. In other cases, that’s not true: as M and N, our two algorithmic complexity variables, increase, O(M log N) will be much worse than O(M + N).

The conclusion here is: Don’t trust any initial judgement. Measure. When you run such a query a lot of times, 3x slower can make a big difference. But don’t go replace all your bulk aggregations either. 🙂

Liked this article? The content is part of our Data Geekery SQL performance training

SQL JOIN or EXISTS? Chances Are, You’re Doing it Wrong

I’ve noticed this very consistently with a lot of customers, and also with participants of our Data Geekery SQL Workshop (which I highly recommend to everyone, if you excuse the advertising): A lot of developers get the distinction between JOIN and SEMI-JOIN wrong. Let me explain…

What are JOIN and SEMI-JOIN

A little bit of relational algebra first. What is an (INNER) JOIN? A JOIN is nothing but a filtered cartesian product. And what is a cartesian product? Wikipedia explains this very nicely:

for sets A and B, the Cartesian product A × B is the set of all ordered pairs (a, b) where a ∈ A and b ∈ B.

That was the technical way of putting it. A more understandable way might be the following:

ranks = {A, K, Q, J, 10, 9, 8, 7, 6, 5, 4, 3, 2}
suits = {♠, ♥, ♦, ♣}

so, ranks × suits =
{(A, ♠), (A, ♥), (A, ♦), (A, ♣), (K, ♠),
…,
(3, ♣), (2, ♠), (2, ♥), (2, ♦), (2, ♣)}

Or, as an image:

By Trainler – Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=7104281

The above cartesian product models the combination of each rank with each suit. Simple, right?

In SQL, a cartesian product can be written as either a CROSS JOIN, or a table list in the FROM clause. The following query combines every customer with every staff member:

-- CROSS JOIN
SELECT c.last_name, s.last_name
FROM customer AS c
CROSS JOIN staff AS s

-- Table list
SELECT c.last_name, s.last_name
FROM customer AS c,
     staff AS s

Now, as I mentioned before, an (INNER) JOIN is nothing but a filtered CROSS JOIN, where the filter is applied in a dedicated USING or ON clause.

-- INNER JOIN with USING
SELECT c.last_name, s.last_name
FROM customer AS c
INNER JOIN staff AS s 
  USING (last_name)

-- INNER JOIN with ON
SELECT c.last_name, s.last_name
FROM customer AS c
INNER JOIN staff AS s 
  ON c.last_name = s.last_name

The above query will match only those customers and staff members whose last_name values are the same. As I’ve told you before, an (INNER) JOIN is just a filtered CROSS JOIN, so the queries below are semantically equivalent to the above:

-- CROSS JOIN
SELECT c.last_name, s.last_name
FROM customer AS c
CROSS JOIN staff AS s
WHERE c.last_name = s.last_name

-- Table list
SELECT c.last_name, s.last_name
FROM customer AS c,
     staff AS s
WHERE c.last_name = s.last_name

Specifically, the last version is still used in many SQL codebases that have not yet migrated to the ANSI JOIN syntax (even though ANSI joins should be preferred for readability reasons).

But that might be wrong

Unfortunately, I’m seeing this mistake all the time, as I’ve mentioned before. JOIN might appear like a useful tool to match rows between tables. But remember one thing, and I’m starting to repeat myself:

(INNER) JOIN is just a filtered CROSS JOIN

This means that if you choose INNER JOIN to find those customers for which there are matching staff, you will create a cartesian product between customer and staff, and then apply a filter. Why is that a problem? Let’s assume the following:

Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| Alice      | Miller    |
| Max        | Doe       |
+------------+-----------+

Staff:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| Alice      | Peterson  |
| Jane       | Doe       |
+------------+-----------+

What happens when you run the above queries that use (INNER) JOIN to match customers with staff? Exactly. You’ll form a cartesian product first:

{ (John Doe, John Doe),
  (John Doe, Alice Peterson),
  (John Doe, Jane Doe),
  (Alice Miller, John Doe),
  (Alice Miller, Alice Peterson),
  (Alice Miller, Jane Doe),
  (Max Doe, John Doe),
  (Max Doe, Alice Peterson),
  (Max Doe, Jane Doe) }

… and then filter out the tuples that shouldn’t be in the result, i.e. the ones that don’t have matching last names (of course, the database might choose to optimise this and not materialise the entire cross product):

{ (John Doe, John Doe),
  (John Doe, Jane Doe),
  (Max Doe, John Doe),
  (Max Doe, Jane Doe) }

We’re now left with 4 tuples. That’s great, if that’s what you were after in the first place. A combination of all customers with all staff, for which the combination shares the same last name. But maybe you were asking yourself something else, namely:

Do we have any customers who are staff family members?

Use-case: Exclude such customers from a raffle (let’s assume that the last name is a sufficient criterion here).

In that case, we’ll get “duplicate” records, because the query that some of you might have written would have been:

-- INNER JOIN with USING
SELECT c.*
FROM customer AS c
INNER JOIN staff AS s 
  USING (last_name)

Yielding:

Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| John       | Doe       |
| Max        | Doe       |
| Max        | Doe       |
+------------+-----------+

Bummer. How to remove duplicates? With DISTINCT you might think:

-- INNER JOIN with USING
SELECT DISTINCT c.*
FROM customer AS c
INNER JOIN staff AS s 
  USING (last_name)

Yielding:

Customer:
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| John       | Doe       |
| Max        | Doe       |
+------------+-----------+

What’s wrong with DISTINCT?

Using DISTINCT in this situation is a big mistake. Why?

  • Your accidental cartesian product loads too many records from disk, and produces too many records in memory, which have to be removed again
  • DISTINCT can be expensive in some databases that implement it via sorting rather than via hashing
  • DISTINCT may change the semantics of your SELECT clause, with nasty side-effects
  • In order to prevent those side-effects, you might even resort to wrapping this DISTINCT query in a subselect, making performance even worse

That’s horrible. See also this list of common SQL mistakes:
https://blog.jooq.org/2013/07/30/10-common-mistakes-java-developers-make-when-writing-sql

How to do it right?

By using a SEMI-JOIN. It is called semi join (i.e. “half” join) in relational algebra, because we only care about one side of the JOIN operation in the results, not the other side. In this example, we only care about customers in the result. We don’t want to have any staff records. The relational algebra notation would be

Customer ⋉ Staff

Unfortunately, SQL doesn’t have SEMI JOIN keywords, so the following isn’t possible:

SELECT *
FROM customer AS c
LEFT SEMI JOIN staff AS s 
  USING (last_name)

The SQL way to express a SEMI JOIN is by using EXISTS () or IN (). The following two are equivalent:

-- Using EXISTS
SELECT *
FROM customer AS c
WHERE EXISTS (
  SELECT *
  FROM staff AS s
  WHERE c.last_name = s.last_name
)

-- Using IN
SELECT *
FROM customer
WHERE last_name IN (
  SELECT last_name
  FROM staff
)

(Note that NOT EXISTS and NOT IN are NOT equivalent.)
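
The main culprit is NULL handling: if the subquery can produce a NULL, NOT IN yields no rows at all, whereas NOT EXISTS behaves as you’d probably expect. A quick sketch using the same tables:

-- If any staff.last_name is NULL, this returns no rows at all:
SELECT *
FROM customer
WHERE last_name NOT IN (
  SELECT last_name
  FROM staff
)

-- NOT EXISTS is unaffected, NULL values simply never match:
SELECT *
FROM customer AS c
WHERE NOT EXISTS (
  SELECT *
  FROM staff AS s
  WHERE c.last_name = s.last_name
)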

Not only are these queries more correct, they are also much faster in most SQL databases, for a simple reason: the database can stop searching the staff table as soon as it has encountered at least one staff member matching the current customer. This is also nicely explained in Dan Martensen’s article SQL Performance of JOIN and WHERE EXISTS. And we’ve blogged about a related topic here: SQL Tip of the Day: Be Wary of SELECT COUNT(*).

Semi Join and Anti Join in jOOQ

We believe that these useful relational operators should be first class citizens in SQL as we have stated in our blog post:
https://blog.jooq.org/2015/10/13/semi-join-and-anti-join-should-have-its-own-syntax-in-sql

Semi join

ctx.select()
   .from(Employee)
   .leftSemiJoin(Dept)
   .on(Employee.DeptName.eq(Dept.DeptName))
   .fetch();

Anti join

ctx.select()
   .from(Employee)
   .leftAntiJoin(Dept)
   .on(Employee.DeptName.eq(Dept.DeptName))
   .fetch();

The above is much easier to write, and will transform into the corresponding (NOT) EXISTS predicate.

Exception

There are some databases that may unfortunately show worse performance for some of these semi join / anti join operators. See, for instance this outdated article on MySQL performance:
http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql

Do measure first, before you believe any of these articles, though!

Another exception is when you have a primary key / foreign key relationship that guarantees that an (INNER) JOIN produces no duplicate values, i.e. when you’re joining along a one-to-one or many-to-one relationship. In that case, JOIN is also a correct solution, but it is usually not any faster, so the semi join will still be more readable.
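
For example, in the Sakila database each customer references exactly one address, so the following JOIN cannot produce duplicate customers:

SELECT c.first_name, c.last_name, a.district
FROM customer AS c
JOIN address AS a
  ON c.address_id = a.address_id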

Conclusion

If you need to check whether you have any matches between a table A and a table B, but you only really care about the results from table A, do make sure you’re using a SEMI-JOIN (i.e. an EXISTS or IN predicate), not an (INNER) JOIN.

Further reading: