Don’t Use the String Concatenation “Trick” in SQL Predicates

In SQL, quite often, we want to compare several values with each other. For instance, when we’re looking for a specific user by their first and last names, we’ll write a query like this one:

SELECT *
FROM customer
WHERE first_name = 'SUSAN'
AND last_name = 'WILSON';

We’re getting:

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          8   SUSAN        WILSON

Surely, everyone agrees that this is correct and perfectly fine as we probably have an index on these two columns (or on at least one of them) to speed up such queries:

CREATE INDEX idx_customer_name ON customer (last_name, first_name);

The execution plan is thus optimal, e.g. with Oracle:

-------------------------------------------------------------------------
| Id  | Operation                           | Name              | Rows  |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                   |       |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER          |     1 |
|*  2 |   INDEX RANGE SCAN                  | IDX_CUSTOMER_NAME |     1 |
-------------------------------------------------------------------------
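Plans like the one above can be reproduced in Oracle along the following lines. This is only a minimal sketch, assuming the standard EXPLAIN PLAN statement and the DBMS_XPLAN package. It is not necessarily how the plan shown here was produced, and the default output contains more columns (such as cost and bytes) than the trimmed listing above.

```sql
-- Store the estimated plan for the query in the plan table
EXPLAIN PLAN FOR
SELECT *
FROM customer
WHERE first_name = 'SUSAN'
AND last_name = 'WILSON';

-- Format and display the most recently explained plan
SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY);
```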

But sometimes, we cannot use AND to connect two predicates. In particular, that’s not possible with an IN predicate, so people sometimes resort to using string concatenation, because that seems to work and make sense.

For instance, let’s find all customers whose first and last names match those of an actor (as always, using the Sakila database):

SELECT *
FROM customer
WHERE first_name || last_name IN (
  SELECT first_name || last_name
  FROM actor
)

And yes indeed, what we’re getting here is the correct answer:

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          6   JENNIFER     DAVIS

But that answer is only accidentally correct!

Because we weren’t looking for customers called

first_name = 'JENNIFER' AND last_name = 'DAVIS'

We were looking for customers called

first_name || last_name = 'JENNIFERDAVIS'

Want proof? Let’s add a new customer:

INSERT INTO customer (customer_id, first_name, last_name )
VALUES               (600        , 'JENNI'   , 'FERDAVIS');

Yeah right? No one is called FERDAVIS. Or are they? As good programmers, we closely observe Murphy’s Law (i.e. always look both left and right when crossing a street).

In any case, let’s run our query again:

SELECT *
FROM customer
WHERE first_name || last_name IN (
  SELECT first_name || last_name
  FROM actor
)

And observe the result!

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          6   JENNIFER     DAVIS
        600   JENNI        FERDAVIS

Of course, because our predicate was really looking for customers called

first_name || last_name = 'JENNIFERDAVIS'

Which matches in both cases:

-- What we expected
first_name || last_name = 'JENNIFER' || 'DAVIS'

-- What we got
first_name || last_name = 'JENNI' || 'FERDAVIS'

Notice that I only added this customer to the customer table, not to the actor table. There’s no actor by the name FERDAVIS, so the result is clearly wrong.

AHA! Let’s use an “impossible” separator

So, we might try to fix this as follows:

SELECT *
FROM customer
WHERE first_name || '###' || last_name IN (
  SELECT first_name || '###' || last_name
  FROM actor
)

And now, the result is again correct. We get only JENNIFER DAVIS because we were looking for:

first_name || '###' || last_name = 'JENNIFER###DAVIS'

This works quite well for a while, as the separator is quite “impossible” (i.e. improbable) to encounter in actual data. But we shouldn’t trust our judgement, because… Murphy’s Law. So you might think: better use an even rarer separator, e.g. (if your database supports proper character sets):

SELECT *
FROM customer
WHERE first_name || '🙈🙉🙊' || last_name IN (
  SELECT first_name || '🙈🙉🙊' || last_name
  FROM actor
)

The use of emojis should indicate what my opinion of this approach is.
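To make the weakness of any separator concrete, here is a small self-contained sketch. The tables demo_actor and demo_customer and their rows are hypothetical, made up purely for illustration; they are not part of Sakila. As soon as the separator itself appears in the data, the false match is back:

```sql
-- Hypothetical demo tables (not part of the Sakila database)
CREATE TABLE demo_actor (
  first_name VARCHAR(50),
  last_name  VARCHAR(50)
);

CREATE TABLE demo_customer (
  first_name VARCHAR(50),
  last_name  VARCHAR(50)
);

-- The actor's first name happens to end with the separator
INSERT INTO demo_actor    (first_name, last_name) VALUES ('ANN###', 'LEE');

-- A different person, whose last name happens to start with it
INSERT INTO demo_customer (first_name, last_name) VALUES ('ANN', '###LEE');

-- Both sides concatenate to 'ANN######LEE', so the customer is
-- returned even though no actor has this first and last name
SELECT *
FROM demo_customer
WHERE first_name || '###' || last_name IN (
  SELECT first_name || '###' || last_name
  FROM demo_actor
);
```

A rarer separator only makes such a collision less likely. It never makes it impossible.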

Too bad for performance, though

Remember that index we’ve created? Fact is, we also have such an index on the ACTOR table:

CREATE INDEX idx_actor_name ON actor (last_name, first_name);

And now, let’s assume our query is a bit different. We’ll be looking only for customers whose address_id is 10:

SELECT *
FROM customer
WHERE address_id = 10
AND first_name || '🙈🙉🙊' || last_name IN (
  SELECT first_name || '🙈🙉🙊' || last_name
  FROM actor
)

Now, our querymoji is using the index indeed, but for an INDEX FULL SCAN, so it’s only slightly faster than scanning the entire actor table:

-----------------------------------------------------------------------------------
| Id  | Operation                            | Name                       | Rows  |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                            |       |
|*  1 |  HASH JOIN SEMI                      |                            |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER                   |     1 |
|*  3 |    INDEX RANGE SCAN                  | IDX_CUSTOMER_FK_ADDRESS_ID |     1 |
|   4 |   INDEX FULL SCAN                    | IDX_ACTOR_NAME             |     2 |
-----------------------------------------------------------------------------------

And what’s worse, even if all the cardinality estimates correctly indicate only 1-2 rows, we’ll perform a HASH JOIN and load the full index for it! We should be running a NESTED LOOP instead.

Is there a better way? Yes! Use row constructors to compare several values at once:

SELECT *
FROM customer
WHERE address_id = 10
AND (first_name, last_name) IN (
  SELECT first_name, last_name
  FROM actor
);

Or, if your database doesn’t support this syntax (luckily, Oracle and PostgreSQL do, for instance), then you can resort to an equivalent EXISTS predicate:

SELECT *
FROM customer c
WHERE address_id = 10
AND EXISTS (
  SELECT 1
  FROM actor a
  WHERE c.first_name = a.first_name
  AND c.last_name = a.last_name
);

Both of these queries are exactly equivalent. They result in a nested loop semi join rather than the previous hash join, and a nested loop is perfectly reasonable for these small tables. We can now use IDX_ACTOR_NAME for a quick INDEX RANGE SCAN operation:

-----------------------------------------------------------------------------------
| Id  | Operation                            | Name                       | Rows  |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                            |       |
|   1 |  NESTED LOOPS SEMI                   |                            |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER                   |     1 |
|*  3 |    INDEX RANGE SCAN                  | IDX_CUSTOMER_FK_ADDRESS_ID |     1 |
|*  4 |   INDEX RANGE SCAN                   | IDX_ACTOR_NAME             |     1 |
-----------------------------------------------------------------------------------

But let’s not trust the estimated plans. Let’s benchmark (more info about benchmarking SQL here)

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 2500;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name
        FROM customer
        WHERE address_id = 10
        AND first_name || '###' || last_name IN (
          SELECT first_name || '###' || last_name
          FROM actor
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 1 : ' 
      || (SYSTIMESTAMP - v_ts));
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name
        FROM customer
        WHERE address_id = 10
        AND (first_name, last_name) IN (
          SELECT first_name, last_name
          FROM actor
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 2 : '
      || (SYSTIMESTAMP - v_ts));
  END LOOP;
END;
/

As can be seen here, the query using the row constructor is drastically faster, by a factor of roughly 4 to 6 in these runs, as it can properly use the index:

Run 1, Statement 1 : +000000000 00:00:00.374471000
Run 1, Statement 2 : +000000000 00:00:00.062830000
Run 2, Statement 1 : +000000000 00:00:00.364168000
Run 2, Statement 2 : +000000000 00:00:00.066252000
Run 3, Statement 1 : +000000000 00:00:00.359559000
Run 3, Statement 2 : +000000000 00:00:00.063898000
Run 4, Statement 1 : +000000000 00:00:00.344775000
Run 4, Statement 2 : +000000000 00:00:00.086060000
Run 5, Statement 1 : +000000000 00:00:00.394163000
Run 5, Statement 2 : +000000000 00:00:00.063176000

Now, imagine we were running this against data sets much larger than the Sakila database.

Conclusion

If you’re ever thinking about concatenating two fields for a comparison, think again. There are two major caveats that should indicate you’re about to do something silly:

  • There’s a major risk of your query being subtly wrong (accidental matches between JENNIFER DAVIS and JENNI FERDAVIS)
  • There’s a major risk of your query being quite slow

So, as a rule of thumb, don’t use concatenation in predicates. There’s (almost) always a better way.
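As a reminder of what the better way can look like, here is a short sketch. It assumes a database that accepts a list of literal rows on the right-hand side of IN (Oracle and PostgreSQL do). The two name pairs are simply the ones that appeared earlier in this article:

```sql
-- Compare both columns as a pair against a list of literal pairs,
-- instead of concatenating them into a single string
SELECT *
FROM customer
WHERE (first_name, last_name) IN (
  ('SUSAN', 'WILSON'),
  ('JENNIFER', 'DAVIS')
);
```

Because each column is compared separately, a customer called JENNI FERDAVIS cannot match here.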

Read also: Why You Should (Sometimes) Avoid Expressions in SQL Predicates

6 thoughts on “Don’t Use the String Concatenation “Trick” in SQL Predicates”

  1. There’s a perfect solution for the first problem: Prefixing every field by its length and a separator. The last field doesn’t need it. So length(first_name) || ' ' || first_name || last_name should do. Some escaping would do, too, but it’s slower and more complicated.

    Anyway, there’s no solution to the second problem, so my comment doesn’t change anything.

  2. Remark:
    In MS SQL Server you could add a computed column (LastName + '###' + FirstName) and index it.
    Oracle supports indexes on expressions directly (without adding a computed column first).

    Another way to compare composite fields would be to use a hash function such as HASHBYTES('SHA2_256', CONCAT(LastName, '###', FirstName)) (for MS SQL). Of course, you would again need a fitting index on the hash column, and hashing would prevent any LIKE or BETWEEN searches.

    Problem three is a little riddle: Donald Trump has it, and it is short, Arnold Schwarzenegger has it and it is much longer, the pope has it, but he doesn’t use it. Madonna does not have it.
    Of course I’m talking about the last name – and it could be NULL. And depending on your ANSI settings, concatenating NULL with something else would result in NULL…

    • Of course you could do that, but that would add a lot of extra maintenance work to your server, given that there’s already a suitable index on the first_name / last_name columns.

      Nice remark about NULL last names 🙂 Reminds me of this article here: http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names.

      I’m spoiled with using Oracle, mostly, where NULL and '' are the same thing, thus concatenating has no “undesired” side-effects.

    • HASHBYTES('SHA2_256', CONCAT(LastName, '###', FirstName))

      I guess, with names you might be safe assuming they don’t contain three consecutive number signs, but in general, you’d need my “Level 1” solution. 😀

      last name – and it could be NULL

      Not in my database. I always use an empty string in such funny cases as there are NOT NULL columns but no NOT EMPTY columns. Semantically, it sounds good as it’s no missing data (would you say that Madonna has an empty surname?) and it doesn’t spoil every expression it occurs in.

      Performance-wise, I don’t know if it’s a good idea, what do you think?

      • Performance-wise, I don’t know if it’s a good idea, what do you think?

        Sure it can be a good idea. I’ve used some “empty token” in the past for this purpose. E.g. <empty/>. Usually I prefer NULLs, but sometimes, having an actual non-null token was better for performance in Oracle, because NULLs by default are not indexed.
