Faster SQL Through Occasionally Choosing Natural Keys Over Surrogate Keys

There are many many opinions out there regarding the old surrogate key vs. natural key debate. Most of the times, surrogate keys (e.g. sequence generated IDs) win because they’re much easier to design:

  • They’re easy to keep consistent across a schema (e.g. every table has an ID column, and that’s always the primary key)
  • They’re thus a no-brainer to add. When you create a new table, you don’t need to worry about any candidate keys
  • They’re guaranteed to be unique, because they have absolutely no business value, only a technical value

Great. So why bother using natural keys in the first place?

Well, there is a very compelling reason!

Performance!

Whenever you introduce surrogate keys, this means that your key data becomes completely meaningless. From a design perspective, that’s not too bad. You can easily join that other table to get the interesting, meaningful information that hides behind the surrogate foreign key value. For example, in our Sakila database

… we have a typical many-to-many relationship modelled with a relationship table between the FILM table and the CATEGORY table – state-of-the-art normalisation. But check out this interesting thing:

  • The FILM_CATEGORY relationship table doesn’t contain any interesting information at all. Just the relationships
  • The category table only contains a single useful column: The NAME column
  • The remaining columns (CATEGORY_ID and LAST_UPDATE) have absolutely no meaning

With this in mind, we could design a much simpler schema, where we use the category name as a natural key, and in fact, we don’t even need the CATEGORY table anymore, we can now remove it (that’s optional here. To ensure data correctness, we could keep it around, containing only the NAME column as a primary key). Check this out:

Now, if we run a query like the following one against our entire Sakila schema:

SELECT c.name, count(*)
FROM film_actor fa USING (actor_id)
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name
ORDER BY count(*) DESC

The query finds all categories a given actor played in, and the number of films that the given actor played in each category. For instance, this could be the result:

NAME       COUNT(*)
-------------------
Horror     3
Classics   2
Family     2
New        2
Games      2
Animation  1
Sports     1
Children   1
...

With an alternative schema where the category NAME has been moved to a new FILM_CATEGORY_NATURAL table, we could run this much simpler query:

SELECT fc.name, count(*)
FROM film_actor fa 
JOIN film_category_natural fc 
  USING (film_id)
WHERE actor_id = 1
GROUP BY fc.name
ORDER BY count(*) DESC

Notice how we can omit an entire JOIN.

The execution plans (here on Oracle) are quite different. Check this out:

Before:

After:

Unfortunately, the cost difference (8 vs 5) cannot be taken as a tool to compare actual costs between the two queries/plans. But the plans are otherwise very similar, except that we’re simply missing one table access (CATEGORY and an entire JOIN). That’s a significant improvement for something this simple. Imagine the improvement if we could roll out this kind of better query throughout the system?

We could look into more execution plan measurements (especially from the actual plan results), but what if we simply benchmark the two queries using the same silly benchmark, as always, repeating each statement 100 times:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT c.name, count(*)
      FROM film_actor fa USING (actor_id)
      JOIN film_category fc USING (film_id)
      JOIN category c USING (category_id)
      WHERE actor_id = 1
      GROUP BY c.name
      ORDER BY count(*) DESC
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT fc.name, count(*)
      FROM film_actor fa 
      JOIN film_category_natural fc 
        USING (film_id)
      WHERE actor_id = 1
      GROUP BY fc.name
      ORDER BY count(*) DESC
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
END;
/

The results are drastic:

Statement 1 : 00:00:00.122070000
Statement 2 : 00:00:00.051179000

A factor of 2.5x faster!

(Disclaimer: Benchmarks are an inaccurate way of measuring things because they suffer (or benefit) from “unfair” side-effects like heavy caching, but they’re a helpful tool to assess the order of magnitude of a difference between two queries).

The JOIN is completely unnecessary

The only reason why we joined the CATEGORY table each time is because we needed to display the meaningful business value of a category to the user, the CATEGORY.NAME. We could have avoided the JOIN if that was not a requirement, but displaying surrogate keys to the user (the CATEGORY_ID) would be rather harsh, wouldn’t it?

And we’re doing this all the time with all sorts of tables. We have:

  • Category tables (category name is a good candidate key)
  • Translation tables (label is a good candidate key)
  • Country tables (ISO 3166 country codes are a good candidate key)
  • Language tables (ISO 639 language codes are a good candidate key)
  • Currency tables (ISO 4217 currency codes are a good candidate key)
  • Stock symbol tables (ISIN security codes are a good candidate key)
  • … and many more

Working with natural keys can be quite cumbersome. But in some entities, the internationally standardised codes are really good candidate keys, and most of the time, they’re sufficient. What does the LANGUAGE_ID 47 even mean? It means nothing. An experienced DBA will remember, after a while, that it means “English”. But wouldn’t EN be a much better value?

You would have EN as a primary key AND foreign key value, so chances are, because everyone (including frontend developers who probably hard-code some translations anyway) knows what language EN is (but no one knows what 47 means), you will almost never need to join the language table again – except in those rare cases where you want to work with secondary language columns, such as, for instance, DESCRIPTION.

Now, imagine what happens if we search for English entries, or as in our previous example, for films of category “Classics”? Our entire JOIN graph would be simplified.

(Another place where we don’t need additional surrogate keys is the relationship table. In this particular case, there’s no such key anyway)

Caveats

Our category strings are quite short. If natural keys become longer, then the duplication itself can become a problem on a lower storage level, as you might need more pages and blocks to store the same amount of rows.

Please, do take this advice in this article with a grain of salt. The essence here is to not always follow strict rules that were established only as a good default. There are always tradeoffs!

Conclusion

This should be quite a straightforward refactoring for many applications. If your tables are extremely obvious picks for a natural key (like the above), then do use natural keys. Your queries will immediately be faster – not necessarily much faster, but probably you can speed up a significant number of queries, i.e. take load off your entire system. Plus: your database will be more user-friendly.

And all of this at the price of not using the identical table design everywhere. I mean – when was having an identical table design a real business case anyway, right?

Side-note

In some cases, you could take this even one step further and denormalise your schema by putting categories as arrays or XML or JSON data structures directly inside your films. You’ll lose the normalisation benefits, but you could further win in terms of performance. For more details, read this very interesting article (about PostgreSQL) here:

http://www.databasesoup.com/2015/01/tag-all-things.html

18 thoughts on “Faster SQL Through Occasionally Choosing Natural Keys Over Surrogate Keys

  1. I know the battle between surrogate key VS natural key, and can say the general rule is “use surrogate key” (and I agree with that). But it doesn’t mean a developer must ignore the details of a modeling its schema.

    I think your post advices that, it advices about paying attention and taking seriously when designing a schema. There’s some common cases where the general rule doesn’t apply and you just showed that very well.

    So I agree with you, there’s some common and exceptional cases where surrogate key is not the best choice even when it seems to be correct on design perspective.

    1. Exactly, thanks for the feedback. In fact, I need to update the post. I’ve seen surrogate keys in relationship tables, which are completely unnecessary as well…

      And yes, I’ve deliberately put the word “occasionally” in the title, in order to prevent a flame war :)

  2. Hi !
    looks like the first model allows you to express things like (1) a category regroups several films, and (2) a film can be categorized by several categories. how does the second model help express the latter? Looks like the semantics got changed on the way.

    1. Yes, the first model is a many to many relationship. That’s still the case in the second model. There’s no implied unique key on the category name in the new FILM_CATEGORY_NATURAL table. There is only a unique (primary) key on both (FILM_ID, NAME).

      I’ll write this into the article to avoid confusion (I’m aware that the MS Paint-edited picture got the PK tag wrong)

  3. To ensure data correctness you still need to restrict the film-category name to be one of a list of legal values. You could decide to implement this as a check constraint on a table, of course, and if you have a database that supports domains then category name is a candidate for a domain.

    In the absence of any other method of capturing this information in your graphic model, this suggests the need for retaining the CATEGORY entity with name as primary key even when the name is the only column.

    1. Thanks for your comment. You’re right, of course. That’s what was implied with the remark “that’s optional here”. I’ll add another remark to clarify this.

  4. I agree with you for simple codes that do not need translation. Imagine if you application is being used my people across the world, then you would need to translate and that would need category table along with language code.

    1. Yes, as mentioned in the article, the removal of the category table is optional. Probably, you want it to stay there because there can be:

      1. Categories with no films
      2. You may want some additional attributes on a category

      The point is, however, that if the NAME column was the category’s primary key, we wouldn’t need to join the category in many cases. And the other point is: It’s just an example. The ISO code examples are probably better,

  5. Another good reason for the usage of surrogate keys is the fact that very often people want to change the value of a natural key. Without a surrogate key this is hard to implement and very resource intensive to do.

    1. But if the value can change, then it’s not really a good natural key, is it? Could you provide an example where people would want to do that?

      1. Yes, we agree. However, that’s what I observe. In your example, it can be that the category name is initially entered as “SF” and later changed to “Science Fiction”. One can argue that for display purposes another column should be introduced, so the name can retain its value. In practice then the Name or Code column tend to become meaningless, which users don’t want to enter or maintain.

        1. OK, I see, but that’s such a fundamental change. I’d even argue that the category shouldn’t be renamed, but archived (and references migrated, perhaps). It’s like when a country is “renamed”. Technically, that would be come a new country (ISO code) and the old one would be archived – regardless if using surrogate keys or natural keys.

          1. Well, the meaning is still the same. Only the coding changed. In that way I don’t think it is a fundamental change. In your example with countries the re-coding only happens (as far as I know) if countries split or merge. But of course, it’s just an example. The migration of references, which is certainly needed without the surrogate key, is the resource costly part I referred to.

          2. You can’t get away with telling a business user that Categories can’t be renamed. Put in a Code column as the PK. Code values could be entered manually or generated using the initial Name, (E.g. Uppercase ASCII, no spaces, no punctuation). That gives a user the ability to rename Categories, to have overlapping names, and the ability to reference Category code values from program code.

  6. I agree with this blog, except that it talks up one side without giving credit to other situations.

    In the hundreds of table I have designed, I used a surrogate key about one-third of the time. Since I always weigh the tradeoffs, I believe that about 1/3 of the time surrogate “wins” and 2/3 of the time “natural” wins.

    Maybe someday I will blog on when to do which.

    Yes, CHAR(2) CHARSET ascii is “right” for country_code, even if you also need another table to provide the spelled-out name of the country.

    The example with category begins to fall apart when you need a billion rows with either a short id or a long name. This becomes one of many examples where space is the criteria for making the decision.

    1. I’m very curious about a schema where 2/3rd of the time the “natural” key wins – I couldn’t imagine such a schema myself :) So, hope you’ll link to the blog post from here, when you wrote it.

  7. I have dealt with the pain of a user base wanting to change natural keys used as primary keys. A pain that scales with database size. That painful experience has prompted me to almost always default to a surrogate key. Unless, I know for certain that the natural key will never be touched. i.e. a state table or something similar. I would rather deal with a database that had used a surrogate when a natural would have worked, rather than a database with a natural used as a primary key that the users want to change.

Leave a Reply to lukasederCancel reply