The Cost of Useless Surrogate Keys in Relationship Tables

What’s a good natural key?

This is a very difficult question for most entities when you design your schema. In some rare cases, there seems to be an “obvious” candidate, such as a variety of ISO standards, including:

  • ISO 639 language codes
  • ISO 3166 country codes
  • ISO 4217 currency codes

But even in those cases, there might be exceptions and the worst thing that can happen is a key change. Most database designs play it safe and use surrogate keys instead. Nothing wrong with that. But…

Relationship tables

There is one exception where a surrogate key is never really required: relationship tables. For example, in the Sakila database, all relationship tables lack a surrogate key and use their respective foreign keys as a compound “natural” primary key instead.

So, the FILM_ACTOR table, for example, is defined as follows:

CREATE TABLE film_actor (
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,

  CONSTRAINT film_actor_pkey PRIMARY KEY (actor_id, film_id)
);

There is really no point in adding another column FILM_ACTOR_ID or ID to identify individual rows in this table, even though a lot of ORMs and non-ORM-defined schemas do this, simply for “consistency” reasons (and in a few cases, because they cannot handle compound keys).

Now, the presence or absence of such a surrogate key is usually not too relevant in everyday work with this table. If you’re using an ORM, it will likely make no difference to client code. If you’re using SQL, it definitely doesn’t: you just never use that additional column.

But in terms of performance, it might make a huge difference!

Clustered indexes

In many RDBMS, when creating a table, you get to choose whether to use a “clustered index” or a “non clustered index” table layout. The main difference is:

Clustered index

… is a primary key index that “clusters” together data that belongs together. In other words:

  • All the index column values are contained in the index tree structure
  • All the other column values are contained in the index leaf nodes

The benefit of this table layout is that primary key lookups can be much faster because your entire row is located in the index, which requires less disk I/O than the non clustered index for primary key lookups. The price for this is slower secondary index searches (e.g. searching for last names). The algorithmic complexities are:

  • O(log N) for primary key lookups
  • O(log N) for secondary key lookups plus O(M log N) for projections of non-secondary-key columns (quite a high price to pay)

… where

  • N is the size of the table
  • M is the number of rows that are searched in secondary keys

OLTP usage often profits from clustered indexes.

Non clustered index

… is a primary key index that resides “outside” of the table structure, which is a heap table. In other words:

  • All the index column values are contained in the index tree structure
  • All the index column values and other column values are contained in the heap table

The benefit of this table layout is that all lookups are equally fast, regardless of whether you’re using a primary key lookup or a secondary key search. There’s always one additional, constant-time heap table lookup. The algorithmic complexities are:

  • O(log N) for primary key lookups plus O(1) for projections of non-primary-key columns (a moderate price to pay)
  • O(log N) for secondary key lookups plus O(M) for projections of non-secondary-key columns (a moderate price to pay)

OLAP usage definitely profits from heap tables.
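Plugging concrete numbers into the two cost models above makes the difference tangible. This is only a back-of-the-envelope sketch: it treats one B-tree descent as costing log₂ N comparisons and one heap access by physical row pointer as a single step, which is of course a simplification of what any real RDBMS does.

```python
from math import log2

N = 1_000_000  # rows in the relationship table
M = 100        # rows matched by a secondary key search

# Clustered index: the secondary key search costs one index descent,
# then each matched row needs another full descent of the primary key
# index to fetch the non-indexed columns: O(log N) + O(M log N).
clustered_secondary = log2(N) + M * log2(N)

# Heap table: the same search, but each matched row is fetched via a
# constant-time physical row pointer: O(log N) + O(M).
heap_secondary = log2(N) + M * 1

print(round(clustered_secondary))  # 2013 comparisons
print(round(heap_secondary))       # 120 accesses
```

With a million rows and a hundred matches, the clustered layout pays for roughly a hundred extra index descents per secondary key search, which is exactly the kind of overhead the benchmark later in this article measures.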

Defaults

  • MySQL’s InnoDB offers clustered indexes only.
  • MySQL’s MyISAM offers heap tables only.
  • Oracle offers both and defaults to heap tables.
  • PostgreSQL offers both and defaults to heap tables.
  • SQL Server offers both and defaults to clustered indexes.

Note that Oracle calls clustered indexes “index organised tables”.
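If you want to experiment with the two layouts without installing any of the above, SQLite (bundled with Python) offers an analogous pair: an ordinary table is clustered on its hidden rowid, whereas a WITHOUT ROWID table is clustered on its declared primary key. The table names below are made up, and SQLite is only a stand-in for the engines discussed in this article:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Ordinary SQLite table: rows live in a hidden rowid B-tree, so the
# compound key needs its own secondary index (heap-like for our purposes).
conn.execute("""
    CREATE TABLE film_actor_rowid (
        actor_id INT NOT NULL,
        film_id  INT NOT NULL,
        payload  INT,
        UNIQUE (actor_id, film_id)
    )
""")

# WITHOUT ROWID table: the rows are stored inside the primary key
# B-tree itself, i.e. clustered on (actor_id, film_id).
conn.execute("""
    CREATE TABLE film_actor_clustered (
        actor_id INT NOT NULL,
        film_id  INT NOT NULL,
        payload  INT,
        PRIMARY KEY (actor_id, film_id)
    ) WITHOUT ROWID
""")

rows = [(a, f, a * f) for a in range(10) for f in range(10)]
conn.executemany("INSERT INTO film_actor_rowid VALUES (?, ?, ?)", rows)
conn.executemany("INSERT INTO film_actor_clustered VALUES (?, ?, ?)", rows)

def plan(sql):
    """Concatenate the EXPLAIN QUERY PLAN detail strings for a query."""
    return " ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Clustered variant: the primary key B-tree serves the search and
# already contains the payload column.
print(plan("SELECT payload FROM film_actor_clustered WHERE actor_id = 4"))

# Rowid variant: the automatic secondary index finds the rows, but the
# payload requires an extra hop into the main table B-tree.
print(plan("SELECT payload FROM film_actor_rowid WHERE actor_id = 4"))
```

The first plan searches the primary key structure directly; the second goes through the secondary index and then needs an extra lookup per row to fetch the payload.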

Performance

In this article, I’m measuring MySQL’s performance, as MySQL’s InnoDB doesn’t allow switching the table layout. Curiously, the problems shown below could not be reproduced on PostgreSQL, as shown by reddit user /u/ForeverAlot. Details here.

With the algorithmic complexities above, we can easily guess what I’m trying to hint at here. In the presence of a clustered index, we should avoid expensive secondary key searches when possible. Of course, these searches cannot always be avoided, but if we review the alternative design of these two tables:

CREATE TABLE film_actor_surrogate (
  id int NOT NULL,
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,

  CONSTRAINT film_actor_surrogate_pkey PRIMARY KEY (id)
);

CREATE TABLE film_actor_natural (
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,

  CONSTRAINT film_actor_pkey PRIMARY KEY (actor_id, film_id)
);

… we can see that if we’re using a clustered index here, the clustering will be based on either:

  • FILM_ACTOR_SURROGATE.ID, which is a very useless clustering
  • (FILM_ACTOR_NATURAL.ACTOR_ID, FILM_ACTOR_NATURAL.FILM_ID), which is a very useful clustering

In the latter case, whenever we look up an actor’s films, we can use the clustering index as a covering index, regardless of whether we project anything additional from that table.

In the former case, we have to rely on an additional secondary key index that contains (ACTOR_ID, FILM_ID), and chances are that secondary index is not covering if we have additional projections.

The surrogate key clustering is really useless, because we never use the table this way.
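To see what “covering” means in practice, here is a small sketch, again using SQLite as a hedged stand-in (its EXPLAIN QUERY PLAN conveniently labels covering index accesses). The table and index names are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE film_actor_surrogate (
        id       INTEGER PRIMARY KEY,  -- surrogate key, clustering target
        actor_id INT NOT NULL,
        film_id  INT NOT NULL,
        payload  INT
    )
""")
conn.executemany(
    "INSERT INTO film_actor_surrogate (actor_id, film_id, payload) "
    "VALUES (?, ?, ?)",
    [(a, f, a * f) for a in range(10) for f in range(10)])

def plan(sql):
    """Concatenate the EXPLAIN QUERY PLAN detail strings for a query."""
    return " ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Non-covering secondary index: projecting payload forces an extra
# lookup back into the table for every matched row.
conn.execute("CREATE UNIQUE INDEX fa_keys "
             "ON film_actor_surrogate (actor_id, film_id)")
print(plan("SELECT payload FROM film_actor_surrogate WHERE actor_id = 4"))

# Covering secondary index: payload is appended to the index, so the
# index alone answers the query (at the price of a wider index).
conn.execute("DROP INDEX fa_keys")
conn.execute("CREATE INDEX fa_cover "
             "ON film_actor_surrogate (actor_id, film_id, payload)")
print(plan("SELECT payload FROM film_actor_surrogate WHERE actor_id = 4"))
```

Widening the secondary index like this is a workable remedy, but it duplicates data and adds index maintenance cost, which the natural key design gets for free.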

Does it matter?

We can easily design a benchmark for this case. You can find the complete benchmark code here on GitHub, to validate the results in your environment. The benchmark uses this database design:

create table parent_1 (id int not null primary key);
create table parent_2 (id int not null primary key);

create table child_surrogate (
  id int auto_increment, 
  parent_1_id int not null references parent_1, 
  parent_2_id int not null references parent_2, 
  payload_1 int, 
  payload_2 int, 
  primary key (id), 
  unique (parent_1_id, parent_2_id)
) -- ENGINE = MyISAM /* uncomment to use MyISAM (heap tables) */
;

create table child_natural (
  parent_1_id int not null references parent_1, 
  parent_2_id int not null references parent_2, 
  payload_1 int, 
  payload_2 int, 
  primary key (parent_1_id, parent_2_id)
) -- ENGINE = MyISAM /* uncomment to use MyISAM (heap tables) */
;

Unlike in the Sakila database, we’re now adding some “payload” to the relationship table, which is not unlikely. Recent versions of MySQL will default to InnoDB, which only supports a clustered index layout. You can uncomment the ENGINE storage clause to see how this would perform with MyISAM, which only supports heap tables.

The benchmark adds:

  • 10 000 rows in PARENT_1
  • 100 rows in PARENT_2
  • 1 000 000 rows in both CHILD tables (just a cross join of the above)

And then, it runs 5 iterations of 10000 repetitions of the following two queries, following our standard SQL benchmark technique:

-- Query 1
SELECT c.payload_1 + c.payload_2 AS a 
FROM parent_1 AS p1 
JOIN child_surrogate AS c ON p1.id = c.parent_1_id 
WHERE p1.id = 4;

-- Query 2
SELECT c.payload_1 + c.payload_2 AS a 
FROM parent_1 AS p1 
JOIN child_natural AS c ON p1.id = c.parent_1_id 
WHERE p1.id = 4;
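As a sanity check that the two designs are logically interchangeable, the schema and queries can be reconstructed at a smaller scale in Python’s built-in SQLite. This is only a correctness sketch, not a performance measurement (the timings below were taken on MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent_1 (id INT NOT NULL PRIMARY KEY);

    CREATE TABLE child_surrogate (
        id INTEGER PRIMARY KEY,
        parent_1_id INT NOT NULL,
        parent_2_id INT NOT NULL,
        payload_1 INT,
        payload_2 INT,
        UNIQUE (parent_1_id, parent_2_id)
    );

    CREATE TABLE child_natural (
        parent_1_id INT NOT NULL,
        parent_2_id INT NOT NULL,
        payload_1 INT,
        payload_2 INT,
        PRIMARY KEY (parent_1_id, parent_2_id)
    ) WITHOUT ROWID;
""")

# 100 x 10 cross join instead of the benchmark's 10 000 x 100
rows = [(p1, p2, p1 + p2, p1 * p2) for p1 in range(100) for p2 in range(10)]
conn.executemany("INSERT INTO parent_1 VALUES (?)", [(i,) for i in range(100)])
conn.executemany(
    "INSERT INTO child_surrogate (parent_1_id, parent_2_id, payload_1, payload_2) "
    "VALUES (?, ?, ?, ?)", rows)
conn.executemany("INSERT INTO child_natural VALUES (?, ?, ?, ?)", rows)

query = """
    SELECT c.payload_1 + c.payload_2 AS a
    FROM parent_1 AS p1
    JOIN {} AS c ON p1.id = c.parent_1_id
    WHERE p1.id = 4
"""
surrogate = sorted(conn.execute(query.format("child_surrogate")))
natural = sorted(conn.execute(query.format("child_natural")))

assert surrogate == natural  # same result set, different access paths
print(len(natural))  # 10 matching child rows for parent 4
```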

Notice that MySQL does not implement join elimination; otherwise, the useless join to PARENT_1 would be eliminated. The benchmark results are very clear:

Using InnoDB (clustered indexes)

Run 0, Statement 1 : 3104
Run 0, Statement 2 : 1910
Run 1, Statement 1 : 3097
Run 1, Statement 2 : 1905
Run 2, Statement 1 : 3045
Run 2, Statement 2 : 2276
Run 3, Statement 1 : 3589
Run 3, Statement 2 : 1910
Run 4, Statement 1 : 2961
Run 4, Statement 2 : 1897

Using MyISAM (heap tables)

Run 0, Statement 1 : 3473
Run 0, Statement 2 : 3288
Run 1, Statement 1 : 3328
Run 1, Statement 2 : 3341
Run 2, Statement 1 : 3674
Run 2, Statement 2 : 3307
Run 3, Statement 1 : 3373
Run 3, Statement 2 : 3275
Run 4, Statement 1 : 3298
Run 4, Statement 2 : 3322

You shouldn’t read this as a comparison between InnoDB and MyISAM in general, but as a comparison of the two table structures within the boundaries of the same engine. Very obviously, the additional search complexity of the badly clustered index in CHILD_SURROGATE causes roughly 50 to 60% slower query execution on this type of query, without gaining anything.

In the case of the heap table, the additional surrogate key column did not have any significant effect.

Again, the full benchmark can be found here on GitHub, if you want to repeat it.

Conclusion

Not everyone agrees on what is generally better: clustered or non clustered indexes. Not everyone agrees on the utility of surrogate keys on every table. These are both quite opinionated discussions.

But this article clearly showed that on relationship tables, which have a very clear candidate key (namely the set of outgoing foreign keys that defines the many-to-many relationship), the surrogate key not only adds no value, but actively hurts your performance on a set of queries when your table uses a clustered index.

MySQL’s InnoDB and SQL Server use clustered indexes by default, so if you’re using either of those RDBMS, do check whether you have room for significant improvement by dropping your surrogate keys.

16 thoughts on “The Cost of Useless Surrogate Keys in Relationship Tables”

  1. A sort of minor note. Dr. Codd defined a “surrogate key” as being created by the SQL engine and completely hidden from the users. Think of a hash column, rather than an index that you actually have to create.

    1. Interesting, thanks for the pointer. I was not aware of that. This is not how we use the term today, though, right? After all, we often want to reference a row by precisely that surrogate key value, also from external systems.

  2. For write-heavy tables clustered indices are generally worse though, right? As in more work needing to be done to reorder the indices (even physical location of rows on some databases?) after an insert?

    1. In those cases, you might do well with some surrogate key or just a separate column that’s always-increasing. Cluster on that, unique index on the key values. That will make the writes relatively fast, but the reads should still be reasonable. Might also consider how much free space is left for those indexes to avoid page splits and such.

      We went through a write-heavy table that was clustered for read performance. The writes took 2x longer than when we switched to clustering on an ever-increasing value column. Reads were about the same once we had the right indexes for them.

    1. Well, you could do that of course to prevent this issue. You could also add a few secondary indexes and make sure to make them covering. There are many solutions to the problem exposed here.

      In your case, though, the question still remains. What’s the point of this surrogate PK in the relationship table? What are you gaining from it?

      1. So I can have generic application code that handles objects by id, like for example delete code that calls a REST API with ID in the URL and the row is removed from the database. This code is far more complex if it has to support composite keys.

        1. I see. Is that really a use case for relationship tables, though? When you expose that model via REST, you’re probably going to nest one collection in another, and not expose the relationship table directly.

          Anyway, if you do want to expose it, then sure, you could add an additional key (it doesn’t have to be the primary key).

          Of course, at some point, these workarounds become impractical, and you revert to the original design. This article is not trying to provide a true/false solution for all cases, just raise awareness that this particular issue can be a problem.

  3. If you want an IDENTITY ID on all tables (we do, but agree that it is Horses-for-Courses), why not just have a Unique Index/Constraint on that ID, and make the Clustered Index whatever is most appropriate for performance etc.? In this case, the Actor_ID/Film_ID combination.

  4. To me this issue is easily solved by additional covering indexes. If you’re using surrogate keys everywhere else, it’s easier to maintain if you stay consistent and predictable to the developers that will take over the code later. Databases in the real world are not static and things change all the time. I can’t count the number of times a relationship table ended up becoming an entity itself with its own additional properties as businesses grow and requirements change. Being consistent with the primary keys makes maintenance and updates that much easier. I appreciate the awareness this article brings, and I’m very aware of the religious war behind the scenes on this one, but I just wanted to point out that not all databases are built in a bubble. Code consistency is important when the design is fluid, so that significant amounts of rework are not required to support additional fields in tables.

    1. Yes, the article mainly aims at pointing out the default behaviour in various RDBMS, and how this can turn into a problem. Additional covering indexes are a viable remedy, but their maintenance has its own price.

      At some point, there’s always a tradeoff, and schema design consistency has to be sacrificed. It is not an immutable goal either, like everything else.

  5. When analyzing the algorithmic complexities for tables with non-clustered primary keys, wouldn’t the actual complexity be:

    O(log N) for [primary key / secondary key] lookups plus O(M*N) for projections of [non-primary-key / non-secondary-key] columns? As a binary heap is only O(1) in min-search / max-search (depending on the heap organization), and for each row searched on your non-clustered index a full search on the heap will be required (worst case) in order to find the required projections?

    1. I cannot comment for all heap implementations, but in Oracle, access to the heap table by ROWID (which is a pointer to the physical row address) is O(1), so M accesses to the heap table will be O(M). If it were O(M*N), databases would be quite unusable.

  6. Curiously, the problems shown below could not be reproduced on PostgreSQL
    Which is unsurprising, as this:
    PostgreSQL offers both and defaults to heap tables
    is flat out wrong — it offers heap tables only. CLUSTER is not “index-organized table” or “clustered index” — it just sorts heap entries according to an index once (i.e. the ordering is not automatically preserved). See: https://www.postgresql.org/docs/current/sql-cluster.html
