Say NO to Excessive Use of Surrogate Keys if Performance Really Matters to You

We programmers keep cargo culting these wrong ideas. Recently, we said “NO” to Venn diagrams. Today we’re going to say no to surrogate keys.

The surrogate keys vs. natural keys non-debate is one of the most overheated debates in data architecture, and I don’t get why everyone is so emotional. Both sides claim to hold the ultimate truth (just like in the tabs vs. spaces non-debate), and in all that fervour, they prevent value from being added where it really matters.

This article sheds some light on why you shouldn’t be so dogmatic all the time. By the end of the article, you’ll agree, promised.

What’s so great about surrogate keys?

In case your SQL-speak is a bit rusty, a surrogate key is an artificial identifier. Or, as Wikipedia puts it:

A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data, unlike a natural (or business) key which is derived from application data.

There are three clear advantages that surrogate keys have:

  1. You know every table has one, and it probably has the same name everywhere: ID.
  2. It is relatively small (e.g. a BIGINT).
  3. You don’t have to go through the “hassle” of designing your table for 30 seconds longer, trying to find a natural key.

There are more advantages (check out the Wikipedia article), and of course, there are disadvantages. Today, I’d like to talk about architectures where EVERY table has a surrogate key, even when it makes absolutely no sense.

When do surrogate keys make absolutely no sense?

I’m currently helping a customer improve performance on their queries that run against a log database. The log database essentially contains relevant information about sessions, and some transactions related to those sessions. (In case you’re jumping to the conclusion “hey, don’t use SQL for logs”: yes, they also use Splunk for unstructured logs. But these here are structured, and they run a lot of SQL-style analytics against them.)

Here’s what you can imagine is stored in that sessions table (using Oracle):

CREATE TABLE sessions (

  -- Useful values
  sess        VARCHAR2(50 CHAR) NOT NULL PRIMARY KEY,
  ip_address  VARCHAR2(15 CHAR) NOT NULL,
  tracking    VARCHAR2(50 CHAR) NOT NULL,
  
  -- and much more info here
  ...
);

Except that’s not how the schema was actually designed. It was designed like this:

CREATE TABLE sessions (

  -- Meaningless surrogate key
  id NUMBER(18) NOT NULL PRIMARY KEY,

  -- Useful values
  sess        VARCHAR2(50 CHAR) NOT NULL,
  
  -- Meaningless foreign keys
  ip_id       NUMBER(18) NOT NULL,
  tracking_id NUMBER(18) NOT NULL,
  ...
  
  FOREIGN KEY (ip_id) 
    REFERENCES ip_addresses (ip_id),
  FOREIGN KEY (tracking_id) 
    REFERENCES tracking (tracking_id),
  ...
  
  -- Non-primary UNIQUE key
  UNIQUE (sess)
);

So, this so-called “fact table” (it’s a star schema, after all) contains nothing useful, only a set of surrogate keys referencing the interesting values, which are all located in other tables. For instance, if you want to count all sessions for a given IP address, you already need to run a JOIN:

SELECT ip_address, count(*)
FROM ip_addresses
JOIN sessions USING (ip_id)
GROUP BY ip_address

After all, we need the join because what the user really wants to see is the IP address, not the surrogate key. Duh, right? Here’s the execution plan:

------------------------------------------------------
| Operation           | Name         | Rows  | Cost  |
------------------------------------------------------
| SELECT STATEMENT    |              |  9999 |   108 |
|  HASH GROUP BY      |              |  9999 |   108 |
|   HASH JOIN         |              | 99999 |   104 |
|    TABLE ACCESS FULL| IP_ADDRESSES |  9999 |     9 |
|    TABLE ACCESS FULL| SESSIONS     | 99999 |    95 |
------------------------------------------------------

Perfect hash join with two full table scans. In my example database, I have 100k sessions and 10k IP addresses.

Obviously, there’s an index on the IP_ADDRESS column, because I want to be able to filter by it. Something meaningful, like:

SELECT ip_address, count(*)
FROM ip_addresses
JOIN sessions USING (ip_id)
WHERE ip_address LIKE '192.168.%'
GROUP BY ip_address

Obviously, the plan is a bit better, because we’re returning less data. Here’s the result:

----------------------------------------------------------------
| Operation                     | Name         | Rows  | Cost  |
----------------------------------------------------------------
| SELECT STATEMENT              |              |     1 |    99 |
|  HASH GROUP BY                |              |     1 |    99 |
|   HASH JOIN                   |              |    25 |    98 |
|    TABLE ACCESS BY INDEX ROWID| IP_ADDRESSES |     1 |     3 |
|     INDEX RANGE SCAN          | I_IP         |     1 |     2 |
|    TABLE ACCESS FULL          | SESSIONS     | 99999 |    95 |
----------------------------------------------------------------

Interesting. We can now use our index statistics to estimate that our predicate will return only one row from the IP_ADDRESSES table. Yet we still get a hash join with significant cost, for what now appears to be a rather trivial query.

What would the world look like without surrogate keys?

Easy: we no longer need a join every time we need something IP-address-related from the sessions table.

Our two queries become, trivially:

-- All counts
SELECT ip_address, count(*)
FROM sessions2
GROUP BY ip_address

-- Filtered counts
SELECT ip_address, count(*)
FROM sessions2
WHERE ip_address LIKE '192.168.%'
GROUP BY ip_address

The first query yields a simpler execution plan, with around the same cost estimate.

--------------------------------------------------
| Operation          | Name      | Rows  | Cost  |
--------------------------------------------------
| SELECT STATEMENT   |           |   256 |   119 |
|  HASH GROUP BY     |           |   256 |   119 |
|   TABLE ACCESS FULL| SESSIONS2 | 99999 |   116 |
--------------------------------------------------

We don’t seem to gain that much here, but what happens with the filtered query?

------------------------------------------------
| Operation            | Name  | Rows  | Cost  |
------------------------------------------------
| SELECT STATEMENT     |       |     1 |     4 |
|  SORT GROUP BY NOSORT|       |     1 |     4 |
|   INDEX RANGE SCAN   | I_IP2 |   391 |     4 |
------------------------------------------------

OMG! Where did our costs go? Huh, this seems to be extremely fast! Let’s benchmark!

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100000;
BEGIN
  v_ts := SYSTIMESTAMP;
     
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT ip_address, count(*)
      FROM ip_addresses
      JOIN sessions USING (ip_id)
      WHERE ip_address LIKE '192.168.%'
      GROUP BY ip_address
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
     
  dbms_output.put_line('Surrogate: '
    || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
     
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT ip_address, count(*)
      FROM sessions2
      WHERE ip_address LIKE '192.168.%'
      GROUP BY ip_address
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
     
  dbms_output.put_line('Natural  : '
    || (SYSTIMESTAMP - v_ts));
END;
/
Surrogate: +000000000 00:00:03.453000000
Natural  : +000000000 00:00:01.227000000

The improvement is significant, and we don’t even have a lot of data here. Now, when you think about it, it is kind of obvious, no? For the semantically identical query, we either run a JOIN, or we don’t. And this is with very, very little data. The actual customer database has hundreds of millions of sessions, where all these JOINs waste valuable resources for nothing but an artificial surrogate key, which was introduced… because that’s how things were always done.

And, as always, don’t trust your execution plan costs. Measure. Benchmarking 100 iterations of the unfiltered query (the one that produced a hash join) yields:

Surrogate: +000000000 00:00:06.053000000
Natural  : +000000000 00:00:02.408000000

Still obvious, when you think of it.
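The effect is easy to reproduce outside of Oracle, too. Here’s a minimal, self-contained sketch using Python’s built-in sqlite3 module (table names mirror the article’s; the data sizes and the LIKE pattern are made up for the demo, and much smaller than the customer’s data), comparing the join-based and the join-free aggregation:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
c = conn.cursor()

# Surrogate-key design: the fact table references the IP by ip_id
c.execute("CREATE TABLE ip_addresses ("
          "ip_id INTEGER PRIMARY KEY, ip_address TEXT NOT NULL UNIQUE)")
c.execute("CREATE TABLE sessions ("
          "id INTEGER PRIMARY KEY, sess TEXT NOT NULL UNIQUE, "
          "ip_id INTEGER NOT NULL REFERENCES ip_addresses)")

# Natural-key design: the fact table stores the address itself
c.execute("CREATE TABLE sessions2 ("
          "sess TEXT NOT NULL PRIMARY KEY, ip_address TEXT NOT NULL)")
c.execute("CREATE INDEX i_ip2 ON sessions2 (ip_address)")

# 1,000 addresses x 10 sessions each
ips = [(i, "192.168.%d.%d" % (i // 256, i % 256)) for i in range(1000)]
c.executemany("INSERT INTO ip_addresses VALUES (?, ?)", ips)
c.executemany("INSERT INTO sessions VALUES (?, ?, ?)",
              [(i, "sess-%d" % i, i % 1000) for i in range(10000)])
c.executemany("INSERT INTO sessions2 VALUES (?, ?)",
              [("sess-%d" % i, ips[i % 1000][1]) for i in range(10000)])

def bench(sql, repeat=200):
    start = time.perf_counter()
    for _ in range(repeat):
        rows = c.execute(sql).fetchall()
    return time.perf_counter() - start, rows

t_join, r_join = bench(
    "SELECT ip_address, count(*) FROM ip_addresses "
    "JOIN sessions USING (ip_id) "
    "WHERE ip_address LIKE '192.168.2.%' GROUP BY ip_address")
t_flat, r_flat = bench(
    "SELECT ip_address, count(*) FROM sessions2 "
    "WHERE ip_address LIKE '192.168.2.%' GROUP BY ip_address")

print("Surrogate: %.3fs  Natural: %.3fs" % (t_join, t_flat))
assert sorted(r_join) == sorted(r_flat)  # same answer either way
```

The absolute numbers will differ from the Oracle benchmark, of course, but the shape of the comparison is the same: the two queries return identical results, and one of them simply never touches the lookup table.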

Do note that we’re not denormalising yet. There’s still an IP_ADDRESSES table, but now it contains the business key as its primary key (the address itself), not a surrogate key. In rarer query use-cases, we’ll still join that table in order to get other IP-address-related info (such as the country).
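To make that explicit (the article doesn’t show the reworked DDL, so this is an assumed sketch using Python’s built-in sqlite3; the `country` column and the sample values are invented): the lookup table keeps the business value as its primary key, and the fact table references it directly, so the foreign key still guards the “redundant” copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The lookup table still exists, but its primary key IS the business value
CREATE TABLE ip_addresses2 (
  ip_address TEXT NOT NULL PRIMARY KEY,
  country    TEXT  -- IP-related info we only rarely join for
);

-- The fact table stores the business value directly; the foreign key
-- still enforces the integrity of that copy
CREATE TABLE sessions2 (
  sess       TEXT NOT NULL PRIMARY KEY,
  ip_address TEXT NOT NULL REFERENCES ip_addresses2
);
CREATE INDEX i_ip2 ON sessions2 (ip_address);
""")

conn.execute("PRAGMA foreign_keys = ON")
conn.execute("INSERT INTO ip_addresses2 VALUES ('192.168.0.1', 'CH')")
conn.execute("INSERT INTO sessions2 VALUES ('abc123', '192.168.0.1')")

# Counting by IP needs no join...
(n,) = conn.execute(
    "SELECT count(*) FROM sessions2 "
    "WHERE ip_address LIKE '192.168.%'").fetchone()

# ...but the rare country lookup can still join
row = conn.execute("""
    SELECT country FROM sessions2
    JOIN ip_addresses2 USING (ip_address)
    WHERE sess = 'abc123'""").fetchone()
```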

Taking it to the extreme

The database at this customer site was designed by someone who appreciates purity, and I can certainly relate to that sometimes. But in this case, it clearly went wrong, because there were many queries that were actually filtering for a “natural key” (i.e. a business value) but needed to add yet another JOIN just to do that.

There were more of these cases, for instance:

  • ISO 639 language codes
  • ISIN security numbers
  • Account numbers, which had a formal identifier from an external system

and many more. Each and every time, there were dozens of JOINs required just to get that additional mapping between surrogate key and natural key.

Sometimes, surrogate keys are nice: they do use a little less disk space. Consider the following query (credits go to Stack Overflow user WW.):

SELECT
  owner,
  table_name,
  TRUNC(SUM(bytes)/1024) kb,
  ROUND(ratio_to_report(SUM(bytes)) OVER() * 100) percent
FROM (
  SELECT
    segment_name table_name,
    owner,
    bytes
  FROM dba_segments
  WHERE segment_type IN (
    'TABLE',
    'TABLE PARTITION',
    'TABLE SUBPARTITION'
  )
  UNION ALL
  SELECT
    i.table_name,
    i.owner,
    s.bytes
  FROM dba_indexes i,
    dba_segments s
  WHERE s.segment_name = i.index_name
  AND s.owner          = i.owner
  AND s.segment_type  IN (
    'INDEX',
    'INDEX PARTITION',
    'INDEX SUBPARTITION'
  )
)
WHERE owner = 'TEST'
AND table_name IN (
  'SESSIONS',
  'IP_ADDRESSES',
  'SESSIONS2'
)
GROUP BY table_name, owner
ORDER BY SUM(bytes) DESC;

This will show us the disk space used by each object:

TABLE_NAME                   KB    PERCENT
-------------------- ---------- ----------
SESSIONS2                 12288         58
SESSIONS                   8192         39
IP_ADDRESSES                768          4

Yes, we use a little more disk space, because the primary key in the sessions table is now a VARCHAR2(50) rather than a NUMBER(18). But disk space is extremely cheap, whereas wall-clock performance is essential. By removing just a little complexity, we’ve already greatly increased the performance of a simple query.

Conclusion

Surrogate keys or natural keys? I say: both. Surrogate keys do have advantages. You generally don’t want a 5-column composite primary key, and even less a 5-column composite foreign key. In these cases, using a surrogate key can add value by removing complexity (and probably increasing performance again). But choose wisely. Every so often, blindly using surrogate keys will go very wrong, and you’ll pay for it dearly when it comes to querying your data.
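To make that composite-key point concrete, here’s a hypothetical sketch (SQLite via Python; all table and column names are invented, not from the customer’s schema). With the composite natural key, every child row and every foreign key must repeat all four columns; with the surrogate, the child carries one small column but needs a join to recover the business values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Composite natural key: four columns, everywhere
CREATE TABLE quotes_nat (
  isin       TEXT NOT NULL,
  exchange   TEXT NOT NULL,
  currency   TEXT NOT NULL,
  quote_date TEXT NOT NULL,
  price      REAL NOT NULL,
  PRIMARY KEY (isin, exchange, currency, quote_date)
);
CREATE TABLE orders_nat (
  order_id   INTEGER PRIMARY KEY,
  isin       TEXT NOT NULL,
  exchange   TEXT NOT NULL,
  currency   TEXT NOT NULL,
  quote_date TEXT NOT NULL,
  FOREIGN KEY (isin, exchange, currency, quote_date)
    REFERENCES quotes_nat (isin, exchange, currency, quote_date)
);

-- Surrogate key: one small FK column, but a join to get business values
CREATE TABLE quotes_sur (
  quote_id   INTEGER PRIMARY KEY,
  isin       TEXT NOT NULL,
  exchange   TEXT NOT NULL,
  currency   TEXT NOT NULL,
  quote_date TEXT NOT NULL,
  price      REAL NOT NULL,
  UNIQUE (isin, exchange, currency, quote_date)
);
CREATE TABLE orders_sur (
  order_id INTEGER PRIMARY KEY,
  quote_id INTEGER NOT NULL REFERENCES quotes_sur
);
""")

conn.execute("INSERT INTO quotes_nat VALUES "
             "('US0378331005','XNYS','USD','2016-05-30',99.5)")
conn.execute("INSERT INTO orders_nat VALUES "
             "(1,'US0378331005','XNYS','USD','2016-05-30')")
conn.execute("INSERT INTO quotes_sur VALUES "
             "(1,'US0378331005','XNYS','USD','2016-05-30',99.5)")
conn.execute("INSERT INTO orders_sur VALUES (1, 1)")

# Natural key: the ISIN is right there on the order row
(isin_nat,) = conn.execute(
    "SELECT isin FROM orders_nat WHERE order_id = 1").fetchone()

# Surrogate key: the same question requires a join
(isin_sur,) = conn.execute("""
    SELECT isin FROM orders_sur
    JOIN quotes_sur USING (quote_id)
    WHERE order_id = 1""").fetchone()
```

Neither design is free: the natural-key child table is wider and its foreign key clumsier, while the surrogate-key child table forces a join for every business-value lookup. That tradeoff is exactly why “both, chosen wisely” beats dogma.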


Comments

  1. With the IP address, you’re lucky, as 1. it’s about as short as the surrogate key, and 2. it stands for itself, so we know it’ll never change (unlike an SSN representing a person, which may change, or an IBAN representing an account, etc.; probably also an ISIN).

    I might even dare to say that there’s no truly immutable natural key for an entity, unless the entity is an artificial construct standing for itself (as an IP string does).

    That said, lesson taken; I won’t number everything blindly.

    • We might claim that SSN, IBAN, and ISIN will never change either, although the value that they represent may be migrated to a new value. But those will be different values (new “persons”, new “books”, new “securities”) for accountability, audit, and history reasons. Specifically for the security, I’m pretty sure you will not just modify its ISIN and lose track of all historic data.

      I could take your argument to the extreme and say: Hey the IP address might change as well. The device for which it stands gets a new IP all the time 🙂

      Anyway. Good points. Natural keys are much harder to get right. I mentioned this in the article as well. By using a surrogate key, it is extremely easy to get row identity right from the beginning.

  2. For me, the discussion of surrogate keys vs. natural keys is from the old days, when storage was slow and expensive. Today, I would not recommend using natural keys. Not because I am emotional, but because there are good reasons for it. I have a similar discussion at my job about explicit and implicit joins, especially when it comes to the Oracle world. Well, humans are lazy. But let’s focus on this article…

    First of all, there are some minor syntactical errors in this article. For example:

    SELECT ip_address, count(*)
    FROM sessions USING — USING is wrong without the Join
    WHERE ip_address LIKE ‘192.168.%’
    GROUP BY ip_address

    Not a big issue, probably only a copy-and-paste mistake, but I wonder if the author tested the SQL commands posted here. And an explain plan with a table called “SESSIONS2” lets me guess there were some other tables around. Another issue in the example: I cannot see an index on the ip_address column in the sessions table (btw., it is good style to use the singular for object names). A reference without an index is a bad idea most of the time. It would not surprise me if there were no longer a full table scan on the SESSIONS table with a proper index on it. And if I had the choice of changing the data model to use natural keys or creating an index, well, it would be an easy choice.

    But that is not the main reason for surrogate and against natural keys. The killer argument is control. When we model a database, we make a “data copy” of a specific environment. And that environment can and will change; we are not in control of it. A famous person said: “The natural key should be immutable and sensible.” And exactly that is the problem: immutable natural keys do not exist. In the old days, a database had a very constant data layout without many changes, so you did not stumble over this very often. Today, development is much faster, and changes in the whole world are much faster. So using natural keys will make you lose control over your primary keys and relationships, and that is the killer argument.

    In general, I would never use natural keys. If there were ever an exception for me, there would have to be very, very, very big and heavy reasons to take natural keys. But so far, I cannot see that reason. I won’t list all the pros and cons here; there are enough documents about this topic on the internet already. Just the issue with performance, since it is mentioned here.

    The only reason why you can skip the join is not that you use natural keys; it is that you use redundancy. A surrogate key in a child table will only store the relationship, nothing more. A natural key will store the relationship AND the data of the parent table. We could do the same with surrogate keys, and you will see that we can skip the join as well. All you need to do is store the surrogate key in the child table and the natural unique key value from the parent table in an extra column in the child table (this time without any reference on the natural column). Of course, now we have two columns in the child table, and that would be bad design because of the redundancy, but I just wanted to point out that the performance gain of skipping one table is only possible because natural keys always use/need redundancy.

    If you model a database and you want to store the same information in only one place, you will have to join, period. In that case, joins are nothing bad, and with the proper index, they won’t hurt.

    a data modeler

    • Thanks for the syntax error hints. Fixed.

      I’m not sure why you got caught up with those indexes on foreign keys so much. If we’re aggregating the entire table, we’d definitely prefer hash joins over nested loop joins, where indexes on the foreign keys won’t help.

      You’re right about redundancy being the main reason for being able to skip the join, yet I don’t agree that your counterexample, where you introduced non-key redundancy, is equivalent, because the foreign key constraint will no longer enforce the integrity of the redundant data in your case. Sure, that can be fine, but with a natural key, in those rare cases where they make sense, we get that advantage, too.

      Of course, you can argue these things don’t matter for single queries / single joins, but if you look at an entire system, making the most-accessed data redundant does impact performance.

      • About the index on foreign keys: we have to specify whether we are modeling an OLTP or an OLAP system. In an OLTP system, you want an index, not only because of the performance of the queries, but also because deleting a parent record would be “hard work” for the RDBMS, which has to check (and lock) the child tables without any index on the foreign key columns. So in my opinion, I would index every foreign key column in OLTP systems. Or in other words: no matter whether we use surrogate or natural keys in an OLTP system, I want the index on a foreign key.

        OLAP systems are different; redundancy and DML are not as big issues as in OLTP systems. But let us look at your example, where you pointed out the extreme performance gain for natural keys. You don’t aggregate the entire table(s) there. You filter the ip_addresses table by the ip_address. And it looks like that filter in the WHERE clause is very selective in the parent table (WHERE fires before GROUP BY). If you added an index on the foreign key column in the sessions table and ran the same benchmark test as before, it would not surprise me if the performance were almost equal with surrogate keys. If you GROUP over the whole table(s), you don’t need an index, and a hash join would be a good plan. But that’s not what you did in your example with the “OMG!” performance gain. Create the index, run the same benchmark test, and let us compare.

        About the redundancy and the foreign key constraint: it is the same, believe it or not ^^ If you want, you could even add the foreign key constraint on the redundant column, because you can create it on unique keys; it doesn’t need to be a primary key. That means you could have two foreign key constraints in the child table sessions: one referring to the surrogate key, one to the natural key of the ip_addresses table. You can play around with it; you don’t need a foreign key constraint at all and will still be able to join. MySQL’s MyISAM engine is a good example of this. After all, it is “just” a constraint…

        So, time for a conclusion. If you want to be able to skip a join, you have to work with redundancy most of the time (yes, I know there are other technical solutions). I don’t need natural keys for redundancy; I can do the same with surrogate keys. And the disadvantages of natural keys are… destructive…
