Correlated Subqueries are Evil and Slow. Or are They?

A common myth in SQL is the idea that correlated subqueries are evil and slow. For example, this query here:

SELECT 
  first_name, last_name,
  (SELECT count(*) 
   FROM film_actor fa 
   WHERE fa.actor_id = a.actor_id)
FROM actor a

It “forces” the database engine to run a nested loop of the form (in pseudo code):

for (Actor a : actor) {
  output(
    a.first_name,
    a.last_name,
    film_actor.where(fa -> fa.actor_id == a.actor_id)
              .size()
  );
}

So, for every actor, we collect all the corresponding film_actor rows and count them. This produces the number of films each actor has played in.

It appears that it would be much better to run this query in “bulk”, i.e. to run:

SELECT 
  first_name, last_name, count(fa.film_id)
FROM actor a
LEFT JOIN film_actor fa USING (actor_id)
GROUP BY actor_id, first_name, last_name
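(Note that we group by actor_id as well, not just by the names: first and last names aren’t necessarily unique, so grouping by the primary key keeps distinct actors with the same name apart.)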

But is it really faster? And if so, why would you expect that?

Bulk aggregation vs nested loops

Bulk aggregation in this case really just means that we collect all actors and all film_actor rows and then merge them in memory for the group by operation. The execution plan (in Oracle) looks like this:

-------------------------------------------------------------------
| Id  | Operation              | Name                    | A-Rows |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                         |    200 |
|   1 |  HASH GROUP BY         |                         |    200 |
|*  2 |   HASH JOIN (OUTER)    |                         |   5462 |
|   3 |    TABLE ACCESS FULL   | ACTOR                   |    200 |
|   4 |    INDEX FAST FULL SCAN| IDX_FK_FILM_ACTOR_ACTOR |   5462 |
-------------------------------------------------------------------

There are 5462 rows in our film_actor table; we join them all with the 200 actors, then group and aggregate to get to the result. Let’s compare this to the nested loop’s plan:

-----------------------------------------------------------------------
| Id  | Operation         | Name                    | Starts | A-Rows |
-----------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                         |      1 |    200 |
|   1 |  SORT AGGREGATE   |                         |    200 |    200 |
|*  2 |   INDEX RANGE SCAN| IDX_FK_FILM_ACTOR_ACTOR |    200 |   5462 |
|   3 |  TABLE ACCESS FULL| ACTOR                   |      1 |    200 |
-----------------------------------------------------------------------

I’ve now included the “Starts” column to illustrate that after having collected all 200 actors, we start the subquery 200 times to get the count for each actor. But this time, with an index range scan.
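In case you want to reproduce such a plan yourself, one way to collect the actual “Starts” and “A-Rows” statistics in Oracle is to run the query with the GATHER_PLAN_STATISTICS hint and then fetch the cursor’s plan via DBMS_XPLAN.DISPLAY_CURSOR (a sketch; the exact output depends on your FORMAT options):

-- Run the query with runtime statistics collection enabled
SELECT /*+ GATHER_PLAN_STATISTICS */
  first_name, last_name,
  (SELECT count(*)
   FROM film_actor fa
   WHERE fa.actor_id = a.actor_id)
FROM actor a;

-- Then display the actual execution plan of the last statement
-- run in this session, including the Starts and A-Rows columns
SELECT *
FROM TABLE(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));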

From the plans alone, can we tell which one is faster? Bulk aggregation needs more memory (to load all the data), but has a lower algorithmic complexity (linear). Nested looping needs less memory (all the required info is available directly from the index), but seems to have a higher algorithmic complexity (quadratic).

Fact is, this isn’t exactly true.

The bulk aggregation is indeed linear, but according to O(M + N), where M = number of actors and N = number of film_actors, whereas nested looping isn’t quadratic: it’s O(M log N), because we don’t need to traverse the entire index to get each count.
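A quick back-of-the-envelope check with our numbers (using base-2 logarithms and ignoring all constant factors): with M = 200 and N = 5462, nested looping performs roughly M · log2 N ≈ 200 · 12.4 ≈ 2480 index operations, whereas bulk aggregation touches M + N = 5662 rows. The two figures are in the same ballpark, so constant factors and memory footprint decide the race here.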

At some point, the higher complexity is worse, but with this small amount of data, it’s not:

On the x-axis is the size of N and on the y-axis is the “complexity value”, e.g. how much time (or memory) is used by an algorithm.

[Figure: Effects of algorithmic complexity for large N]
[Figure: Effects of algorithmic complexity for “small” N]

Here’s a disclaimer about the above:

Algorithmic complexity for “small N” = “works on my machine” ;)

There’s nothing better than proving things with measurements. Let’s run the following PL/SQL program:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 1000;
BEGIN
  v_ts := SYSTIMESTAMP;
  
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        first_name, last_name,
        (SELECT count(*) 
         FROM film_actor fa 
         WHERE fa.actor_id = a.actor_id)
      FROM actor a
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
  
  dbms_output.put_line('Nested select: ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
  
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        first_name, last_name, count(film_id)
      FROM actor a
      LEFT JOIN film_actor fa USING (actor_id)
      GROUP BY actor_id, first_name, last_name
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
  
  dbms_output.put_line('Group by     : ' || (SYSTIMESTAMP - v_ts));
END;
/

Running it three times against our standard Sakila database (get it here: https://www.jooq.org/sakila), with its 200 actors and 5462 film_actor rows, we can see that the nested select consistently outperforms the bulk group by:

Nested select: +000000000 00:00:01.122000000
Group by     : +000000000 00:00:03.191000000

Nested select: +000000000 00:00:01.116000000
Group by     : +000000000 00:00:03.104000000

Nested select: +000000000 00:00:01.122000000
Group by     : +000000000 00:00:03.228000000

Helping the optimiser

Markus Winand (author of http://sql-performance-explained.com) provided some interesting feedback on Twitter:

There is a third option: Nesting the GROUP BY operation in a derived table:

SELECT
  first_name, last_name, c
FROM actor a
JOIN (
  SELECT actor_id, count(*) c
  FROM film_actor
  GROUP BY actor_id
) USING (actor_id)

which produces a slightly better plan than the “ordinary” group by query:

--------------------------------------------------------------------
| Id  | Operation               | Name                    | A-Rows |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT        |                         |    200 |
|*  1 |  HASH JOIN              |                         |    200 |
|   2 |   TABLE ACCESS FULL     | ACTOR                   |    200 |
|   3 |   VIEW                  |                         |    200 |
|   4 |    HASH GROUP BY        |                         |    200 |
|   5 |     INDEX FAST FULL SCAN| IDX_FK_FILM_ACTOR_ACTOR |   5462 |
--------------------------------------------------------------------

Adding it to the benchmark like this:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 1000;
BEGIN
  v_ts := SYSTIMESTAMP;
   
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT
        first_name, last_name,
        (SELECT count(*) 
         FROM film_actor fa 
         WHERE fa.actor_id = a.actor_id)
      FROM actor a
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
   
  dbms_output.put_line('Nested select            : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
   
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT
        first_name, last_name, count(film_id)
      FROM actor a
      LEFT JOIN film_actor fa USING (actor_id)
      GROUP BY actor_id, first_name, last_name
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
   
  dbms_output.put_line('Group by                 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
  
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        first_name, last_name, c
      FROM actor a
      JOIN (
        SELECT actor_id, count(*) c
        FROM film_actor
        GROUP BY actor_id
      ) USING (actor_id)
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Group by in derived table: ' || (SYSTIMESTAMP - v_ts));
END;
/

… shows that it’s already very close to the nested select option:

Nested select            : +000000000 00:00:01.371000000
Group by                 : +000000000 00:00:03.417000000
Group by in derived table: +000000000 00:00:01.592000000

Nested select            : +000000000 00:00:01.236000000
Group by                 : +000000000 00:00:03.329000000
Group by in derived table: +000000000 00:00:01.595000000

Conclusion

We have shown that under some circumstances, correlated subqueries can be better than bulk aggregation. In Oracle. With small to medium sized data sets. In other cases, that’s not true: as M and N, our two algorithmic complexity variables, increase, O(M log N) will eventually be much worse than O(M + N).

The conclusion here is: don’t trust any initial judgement. Measure. When you run such a query a lot of times, 3x slower can make a big difference. But don’t go replacing all your bulk aggregations either. :)


CROSS JOIN, a nice example for a rarely used operation

In 95% of the cases, cartesian products originate from accidental cross join operations and cause unnecessarily high load on a database. Maybe the results aren’t even wrong, as someone may have applied a UNION or a DISTINCT keyword to remove the unwanted duplicates.

But there are those 5% of SQL queries where the cartesian product is intentional, even desirable. Here’s a nice example of such a query. Let’s assume we have these two tables (for simplicity, constraints are omitted):

create table content_hits (
  content_id int,
  subscriber_id int
);

create table content_tags (
  content_id int,
  tag_id int
);

The above tables model some blog content, where content_hits records which subscribers have seen a given blog entry, and content_tags records which tags are applied to a given blog entry. So let’s say we want some statistics about how many times any given subscriber/tag combination was actually seen. We cannot just join the two tables and group by subscriber_id, tag_id if we want the complete list of combinations (including the ones that have a zero count). So a cross join comes to the rescue!

SELECT combinations.tag_id,
       combinations.subscriber_id,

-- correlated subquery to count the actual hits by tag/subscriber when
-- joining the two tables using content_id

       (SELECT count(*)
        FROM content_hits AS h
        JOIN content_tags AS t ON h.content_id = t.content_id
        WHERE h.subscriber_id = combinations.subscriber_id
        AND t.tag_id = combinations.tag_id) AS cnt

-- Create all combinations of tag/subscribers first, before counting
-- anything. This will be necessary to have "zero-counts" for any
-- combination of tag/subscriber

FROM (
  SELECT DISTINCT tag_id, subscriber_id
  FROM content_tags
  CROSS JOIN content_hits
) AS combinations
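
To see those zero counts in action, here’s a minimal, hypothetical data set (made up purely for illustration), along with the result you’d expect from the query above:

-- Two tagged blog entries, two subscribers, but only three actual hits
INSERT INTO content_tags VALUES (1, 100); -- content 1 carries tag 100
INSERT INTO content_tags VALUES (2, 200); -- content 2 carries tag 200
INSERT INTO content_hits VALUES (1, 10);  -- subscriber 10 saw content 1
INSERT INTO content_hits VALUES (2, 10);  -- subscriber 10 saw content 2
INSERT INTO content_hits VALUES (1, 20);  -- subscriber 20 saw content 1

The query then yields:

| TAG_ID | SUBSCRIBER_ID | CNT |
|--------|---------------|-----|
|    100 |            10 |   1 |
|    100 |            20 |   1 |
|    200 |            10 |   1 |
|    200 |            20 |   0 |

The last row is the “zero count” that a plain join and group by would have lost.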

Instead of using a correlated subquery, it might be more interesting for performance reasons (depending on the database) to left outer join the joined hit/tag relation to the “combinations” relation, then group by subscriber_id, tag_id and take count(content_id) for every group, resulting in this query:

SELECT combinations.tag_id,
       combinations.subscriber_id,
       count(content_relation.content_id)

-- Create all combinations of tag/subscribers first, before counting
-- anything. This will be necessary to have "zero-counts" for any
-- combination of tag/subscriber

FROM (
  SELECT DISTINCT tag_id, subscriber_id
  FROM content_tags
  CROSS JOIN content_hits
) AS combinations

-- Now, outer join every expected combination to its 
-- actual relation joined on content_id

LEFT OUTER JOIN (
  SELECT h.content_id, 
         h.subscriber_id,
         t.tag_id
  FROM content_hits AS h
  JOIN content_tags AS t ON h.content_id = t.content_id
) AS content_relation
ON combinations.tag_id = content_relation.tag_id
AND combinations.subscriber_id = content_relation.subscriber_id

-- And now, finally, group, in order to get the count(content_id)
GROUP BY combinations.tag_id,
         combinations.subscriber_id
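
Note that we count content_relation.content_id rather than count(*): for combinations without any actual hits, the outer join fills the content_relation columns with NULLs, and count() skips NULLs, which is exactly what produces the desired zero counts.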

The original Stack Overflow question can be seen here:

http://stackoverflow.com/questions/9924433/mysql-select-query-using-count