The Cost of Useless Surrogate Keys in Relationship Tables

What’s a good natural key?

This is a very difficult question for most entities when you design your schema. In some rare cases, there seems to be an “obvious” candidate, such as a variety of ISO standards, including:

But even in those cases, there might be exceptions and the worst thing that can happen is a key change. Most database designs play it safe and use surrogate keys instead. Nothing wrong with that. But…

Relationship tables

There is one exception where a surrogate key is never really required. Those are relationship tables. For example, in the Sakila database, all relationship tables lack a surrogate key and use their respective foreign keys as a compound “natural” primary key instead:

So, the FILM_ACTOR table, for example, is defined as such:

CREATE TABLE film_actor (
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,

  CONSTRAINT film_actor_pkey PRIMARY KEY (actor_id, film_id)
);

There is really no point in adding another column FILM_ACTOR_ID or ID for an individual row in this table, even if a lot of ORMs and non-ORM-defined schemas will do this, simply for “consistency” reasons (and in a few cases, because they cannot handle compound keys).

Now, the presence or absence of such a surrogate key is usually not too relevant in every day work with this table. If you’re using an ORM, it will likely make no difference to client code. If you’re using SQL, it definitely doesn’t. You just never use that additional column.

But in terms of performance, it might make a huge difference!

Clustered indexes

In many RDBMS, when creating a table, you get to choose whether to use a “clustered index” or a “non clustered index” table layout. The main difference is:

Clustered index

… is a primary key index that “clusters” data together, which belongs together. In other words:

  • All the index column values are contained in the index tree structure
  • All the other column values are contained in the index leaf nodes

The benefit of this table layout is that primary key lookups can be much faster because your entire row is located in the index, which requires less disk I/O than the non clustered index for primary key lookups. The price for this is slower secondary index searches (e.g. searching for last names). The algorithmic complexities are:

  • O(log N) for primary key lookups
  • O(log N) for secondary key lookups plus O(M log N) for projections of non-secondary-key columns (quite a high price to pay)

… where

  • N is the size of the table
  • M is the number of rows that are searched in secondary keys

OLTP usage often profits from clustered indexes.

Non clustered index

… is a primary key index that resides “outside” of the table structure, which is a heap table. In other words:

  • All the index column values are contained in the index tree structure
  • All the index column values and other column values are contained in the heap table

The benefit of this table layout is that all lookups are equally fast, regardless if you’re using a primary key lookup or a secondary key search. There’s always an additional, constant time heap table lookup. The algorithmic complexities are:

  • O(log N) for primary key lookups plus O(1) for projections of non-primary-key columns (a moderate price to pay)
  • O(log N) for secondary key lookups plus O(M) for projections of non-secondary-key columns (a moderate price to pay)

OLAP usage definitely profits from heap tables.

Defaults

  • MySQL’s InnoDB offers clustered indexes only.
  • MySQL’s MyISAM offers heap tables only.
  • Oracle offers both and defaults to heap tables
  • PostgreSQL offers both and defaults to heap tables
  • SQL Server offers both and defaults to clustered indexes

Note that Oracle calls clustered indexes “index organised tables”

Performance

In this article, I’m checking MySQL’s performance as MySQL’s InnoDB doesn’t offer to switch the table layout. Curiously, the problems shown below could not be reproduced on PostgreSQL as shown by reddit user /u/ForeverAlot. Details here.

With the algorithmic complexities above, we can easily guess what I’m trying to hint at here. In the presence of a clustered index, we should avoid expensive secondary key searches when possible. Of course, these searches cannot always be avoided, but if we review the alternative design of these two tables:

CREATE TABLE film_actor_surrogate (
  id int NOT NULL,
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,

  CONSTRAINT film_actor_surrogate_pkey PRIMARY KEY (id)
);

CREATE TABLE film_actor_natural (
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,

  CONSTRAINT film_actor_pkey PRIMARY KEY (actor_id, film_id)
);

… we can see that if we’re using a clustered index here, the clustering will be made based on either:

  • FILM_ACTOR_SURROGATE.ID, which is a very useless clustering
  • (FILM_ACTOR_NATURAL.ACTOR_ID, FILM_ACTOR_NATURAL.FILM_ID), which is a very useful clustering

In the latter case, whenever we look up an actor’s films, we can use the clustering index as a covering index, regardless if we project anything additional from that table or not.

In the former case, we have to rely on an additional secondary key index that contains (ACTOR_ID, FILM_ID), and chances are that secondary index is not covering if we have additional projections.

The surrogate key clustering is really useless, because we never use the table this way.

Does it matter?

We can easily design a benchmark for this case. You can find the complete benchmark code here on GitHub, to validate the results on your environment. The benchmark uses this database design:

create table parent_1 (id int not null primary key);
create table parent_2 (id int not null primary key);

create table child_surrogate (
  id int auto_increment, 
  parent_1_id int not null references parent_1, 
  parent_2_id int not null references parent_2, 
  payload_1 int, 
  payload_2 int, 
  primary key (id), 
  unique (parent_1_id, parent_2_id)
) -- ENGINE = MyISAM /* uncomment to use MyISAM (heap tables) */
;

create table child_natural (
  parent_1_id int not null references parent_1, 
  parent_2_id int not null references parent_2, 
  payload_1 int, 
  payload_2 int, 
  primary key (parent_1_id, parent_2_id)
) -- ENGINE = MyISAM /* uncomment to use MyISAM (heap tables) */
;

Unlike in the Sakila database, we’re now adding some “payload” to the relationship table, which is not unlikely. Recent versions of MySQL will default to InnoDB, which only supports a clustered index layout. You can uncomment the ENGINE storage clause to see how this would perform with MyISAM, which only supports heap tables.

The benchmark adds:

  • 10 000 rows in PARENT_1
  • 100 rows in PARENT_2
  • 1 000 000 rows in both CHILD tables (just a cross join of the above)

And then, it runs 5 iterations of 10000 repetitions of the following two queries, following our standard SQL benchmark technique:

-- Query 1
SELECT c.payload_1 + c.payload_2 AS a 
FROM parent_1 AS p1 
JOIN child_surrogate AS c ON p1.id = c.parent_1_id 
WHERE p1.id = 4;

-- Query 2
SELECT c.payload_1 + c.payload_2 AS a 
FROM parent_1 AS p1 
JOIN child_natural AS c ON p1.id = c.parent_1_id 
WHERE p1.id = 4;

Notice that MySQL does not implement join elimination, otherwise, the useless join to PARENT_1 would be eliminated. The benchmark results are very clear:

Using InnoDB (clustered indexes)

Run 0, Statement 1 : 3104
Run 0, Statement 2 : 1910
Run 1, Statement 1 : 3097
Run 1, Statement 2 : 1905
Run 2, Statement 1 : 3045
Run 2, Statement 2 : 2276
Run 3, Statement 1 : 3589
Run 3, Statement 2 : 1910
Run 4, Statement 1 : 2961
Run 4, Statement 2 : 1897

Using MyISAM (heap tables)

Run 0, Statement 1 : 3473
Run 0, Statement 2 : 3288
Run 1, Statement 1 : 3328
Run 1, Statement 2 : 3341
Run 2, Statement 1 : 3674
Run 2, Statement 2 : 3307
Run 3, Statement 1 : 3373
Run 3, Statement 2 : 3275
Run 4, Statement 1 : 3298
Run 4, Statement 2 : 3322

You shouldn’t read this as a comparison between InnoDB and MyISAM in general, but as a comparison of the different table structures within the boundaries of the same engine. Very obviously, the additional search complexity of the badly clustered index in CHILD_SURROGATE causes a 50% slower query execution on this type of query, without gaining anything.

In the case of the heap table, the additional surrogate key column did not have any significant effect.

Again, the full benchmark can be found here on GitHub, if you want to repeat it.

Conclusion

Not everyone agrees what is generally better: clustered or non clustered indexes. Not everyone agrees on the utility of surrogate keys on every table. These are both quite opinionated discussions.

But this article clearly showed that on relationship tables, which have a very clear candidate key, namely the set of outgoing foreign keys that defines the many-to-many relationship, the surrogate key not only doesn’t add value, but it actively hurts your performance on a set of queries when your table is using a clustered index.

MySQL’s InnoDB and SQL Server use clustered indexes by default, so if you’re using any of those RDBMS, do check if you have room for significant improvement by dropping your surrogate keys.

How to Quickly Rename all Primary Keys in Oracle

Are you working with someone else’s schema and they haven’t declared nice names for all their constraints?

Unfortunately, it is all too easy to create a table like this:

CREATE TABLE order1 (
  order_id NUMBER(18) NOT NULL PRIMARY KEY
);

Or like this:

CREATE TABLE order2 (
  order_id NUMBER(18) NOT NULL,

  PRIMARY KEY (order_id)
);

Sure, you get a little convenience when writing the table. But from now on, you’re stuck with weird, system generated names both for the constraint and for the backing index. For instance, when doing execution plan analyses:

EXPLAIN PLAN FOR
SELECT *
FROM order1
JOIN order2 USING (order_id)
WHERE order_id = 1;

SELECT * FROM TABLE (dbms_xplan.display);

The simplified execution plan (output of the above queries) is this:

-------------------------------------------
| Id  | Operation          | Name         |
-------------------------------------------
|   0 | SELECT STATEMENT   |              |
|   1 |  NESTED LOOPS      |              |
|*  2 |   INDEX UNIQUE SCAN| SYS_C0042007 |
|*  3 |   INDEX UNIQUE SCAN| SYS_C0042005 |
-------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("ORDER2"."ORDER_ID"=1)
   3 - access("ORDER1"."ORDER_ID"=1)

So, I got these system generated index names called SYS_C0042007 and SYS_C0042005. What do they mean? I can derive the actual meaning perhaps from the predicate information, as SYS_C0042007 is accessed in operation #2, which uses an access predicate on ORDER2. Fine. But do I really need to look these things up all the time?

Just name your constraints. Always!

Don’t be fooled into this convenience. It’ll hurt you time and again, not just when doing analyses. You might not be able to easily import / export your schema to some other database, because another database might already occupy these generated names.

So, do this instead:

CREATE TABLE order2 (
  order_id NUMBER(18) NOT NULL,

  CONSTRAINT pk_order2 PRIMARY KEY (order_id)
);

Find a naming schema (any naming scheme), like for instance PK_[table name]. If you’re cleaning up an existing database, this might help:

SET SERVEROUTPUT ON
BEGIN
  FOR stmts IN (
    SELECT 
      'ALTER TABLE ' || table_name || 
      ' RENAME CONSTRAINT ' || constraint_name || 
      ' TO PK_' || table_name AS stmt
    FROM user_constraints
    WHERE constraint_name LIKE 'SYS%'
    AND constraint_type = 'P'
  ) LOOP
    dbms_output.put_line(stmts.stmt);
    EXECUTE IMMEDIATE stmts.stmt;
  END LOOP;
  
  FOR stmts IN (
    SELECT 
      'ALTER INDEX ' || index_name || 
      ' RENAME TO PK_' || table_name AS stmt
    FROM user_constraints
    WHERE index_name LIKE 'SYS%'
    AND constraint_type = 'P'
  ) LOOP
    dbms_output.put_line(stmts.stmt);
    EXECUTE IMMEDIATE stmts.stmt;
  END LOOP;
END;
/

The above yields (and runs)

ALTER TABLE ORDER1 RENAME CONSTRAINT SYS_C0042005 TO PK_ORDER1
ALTER TABLE ORDER2 RENAME CONSTRAINT SYS_C0042007 TO PK_ORDER2
ALTER INDEX SYS_C0042005 RENAME TO PK_ORDER1
ALTER INDEX SYS_C0042007 RENAME TO PK_ORDER2

You can of course repeat the exercise for unique constraints, etc. I omit the example here because the naming scheme might be a bit more complicated there. Now re-calculate the execution plan and check this out:

EXPLAIN PLAN FOR
SELECT *
FROM order1
JOIN order2 USING (order_id)
WHERE order_id = 1;

SELECT * FROM TABLE (dbms_xplan.display);

The simplified execution plan (output of the above queries) is this:

----------------------------------------
| Id  | Operation          | Name      |
----------------------------------------
|   0 | SELECT STATEMENT   |           |
|   1 |  NESTED LOOPS      |           |
|*  2 |   INDEX UNIQUE SCAN| PK_ORDER2 |
|*  3 |   INDEX UNIQUE SCAN| PK_ORDER1 |
----------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("ORDER2"."ORDER_ID"=1)
   3 - access("ORDER1"."ORDER_ID"=1)

There! That’s much more like it.

SQL GROUP BY and Functional Dependencies: A Very Useful Feature

Relational databases define the term “Functional Dependency” as such (from Wikipedia):

In relational database theory, a functional dependency is a constraint between two sets of attributes in a relation from a database. In other words, functional dependency is a constraint that describes the relationship between attributes in a relation.

In SQL, functional dependencies appear whenever there is a unique constraint (e.g. a primary key constraint). Let’s assume the following:

CREATE TABLE actor (
  actor_id BIGINT NOT NULL PRIMARY KEY,
  first_name VARCHAR(50) NOT NULL,
  last_name VARCHAR(50) NOT NULL
);

It can be said that both FIRST_NAME and LAST_NAME each have a functional dependency on the ACTOR_ID column.

Nice. So what?

This isn’t just some mathematical statement that can be applied to unique constraints. It’s extremely useful for SQL. It means that for every ACTOR_ID value, there can be only one (functionally dependent) FIRST_NAME and LAST_NAME value. The other way round, this isn’t true. For any given FIRST_NAME and/or LAST_NAME value, we can have multiple ACTOR_ID values, as we can have multiple actors by the same names.

Because there can be only one corresponding FIRST_NAME and LAST_NAME value for any given ACTOR_ID value, we can omit those columns in the GROUP BY clause. Let’s assume also:

CREATE TABLE film_actor (
  actor_id BIGINT NOT NULL,
  film_id BIGINT NOT NULL,
  
  PRIMARY KEY (actor_id, film_id),
  FOREIGN KEY (actor_id) REFERENCS actor (actor_id),
  FOREIGN KEY (film_id) REFERENCS film (film_id)
);

Now, if we want to count the number of films per actor, we can write:

SELECT
  actor_id, first_name, last_name, COUNT(*)
FROM actor
JOIN film_actor USING (actor_id)
GROUP BY actor_id
ORDER BY COUNT(*) DESC

This is extremely useful as it saves us from a lot of typing. In fact, the way GROUP BY semantics is defined, we can put all sorts of column references in the SELECT clause, which are any of:

  • Column expressions that appear in the GROUP BY clause
  • Column expressions that are functionally dependent on the set of column expressions in the GROUP BY clause
  • Aggregate functions

Unfortunately, not everyone supports this

If you’re using Oracle, for instance, you can’t make use of the above. You’ll need to write the classic, equivalent version where all the non-aggregate column expressions appearing in the SELECT clause must also appear in the GROUP BY clause

SELECT
  actor_id, first_name, last_name, COUNT(*)
FROM actor
JOIN film_actor USING (actor_id)
GROUP BY actor_id, first_name, last_name
--                 ^^^^^^^^^^  ^^^^^^^^^ unnecessary
ORDER BY COUNT(*) DESC

Further reading:

Subtle SQL differences: IDENTITY columns

As I’m mostly using Oracle, IDENTITY columns were not so important to me up until I started to support them in jOOQ. Then, I found that yet again, there are many differences between various databases in how they handle IDENTITY columns in DDL and in DML.

In SQL, there are essentially three orthogonal concepts of how to identify a record (Please correct me if I may be missing more vendor-specific concepts):

  1. Primary Key: This is the most well-known concept. A primary key is the “main unique key” of a table, which means it is a constraint imposed on at least one column. It can be imposed on several columns, too. The primary key has its origin in the relational model
  2. Identity: Identities / identity columns are attributed to at most one column per table. They generate a unique ID from a sequence at record insertion – unlike primary keys, which are expected to be inserted correctly by client DML. The identity often coincides with the primary key, but it does not have to. It does not always have to be unique, either
  3. ROW ID / OID: Many databases know an internal ID that is usually not used in the logical representation of the database schema. This row id has mostly technical purposes and is also present in tables without primary key or identity columns

SQL 2008 standard

While primary key constraints are defined pretty much in the same way in every SQL dialect, and ROWID’s are very vendor-specific things, identities are something relatively new. The SQL 1992 standard does not yet define identity columns. They were formally introduced in SQL:2003, only. Here’s a part of the specification:

<column definition> ::=
  <column name> [ <data type or domain name> ]
  [ <default clause> | <identity column specification> ]
  [ ... ]

<identity column specification> ::=
  GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY
  [ <left paren> <common sequence generator options> <right paren> ]

 

This looks neat. In fact, in addition to column constraints, columns can specify an identity generation expression based on a sequence. What does reality look like?

Reality

DB2, Derby, HSQLDB, Ingres

These SQL dialects implement the standard very neatly.


id INTEGER GENERATED BY DEFAULT AS IDENTITY
id INTEGER GENERATED BY DEFAULT AS IDENTITY (START WITH 1)

H2, MySQL, Postgres, SQL Server, Sybase ASE, Sybase SQL Anywhere

These SQL dialects implement identites, but the DDL syntax doesn’t follow the standard


-- H2 mimicks MySQL's and SQL Server's syntax
ID INTEGER IDENTITY(1,1)
ID INTEGER AUTO_INCREMENT
-- MySQL
ID INTEGER NOT NULL AUTO_INCREMENT

-- Postgres serials implicitly create a sequence
-- Postgres also allows for selecting from custom sequences
-- That way, sequences can be shared among tables
id SERIAL NOT NULL

-- SQL Server
ID INTEGER IDENTITY(1,1) NOT NULL
-- Sybase ASE
id INTEGER IDENTITY NOT NULL
-- Sybase SQL Anywhere
id INTEGER NOT NULL IDENTITY

Oracle

Oracle does not know any identity columns at all. Instead, you will have to use a trigger and update the ID column yourself, using a custom sequence. Something along these lines:

CREATE OR REPLACE TRIGGER my_trigger
BEFORE INSERT
ON my_table
REFERENCING NEW AS new
FOR EACH ROW
BEGIN
  SELECT my_sequence.nextval
  INTO :new.id
  FROM dual;
END my_trigger;

Note, that this approach can be employed in most databases supporting sequences and triggers! It is a lot more flexible than standard identities

SQLite

SQLite does not know any identity columns either. It seems to have MySQL’s AUTO_INCREMENT clause, but in fact this just sets some rules on the ROWID generation, which is not a true identity column according to the SQL standard

Conclusion

As many things in SQL, identities are a major pain to get correctly when switching databases, or when writing SQL DDL and DML that works correctly across databases.

jOOQ supports DML executed against tables with identity columns and/or trigger-generated ID values. See a previous blog post that further elaborates on the issue:

https://lukaseder.wordpress.com/2011/08/29/postgres-insert-returning-clause-and-how-this-can-be-simulated-in-other-rdbms/

Subtle SQL differences: Constraint names

The various SQL product vendors implement subtle differences in the way they interpret SQL. In this case, I’ve been examining the reuse of constraint names within a schema / database (which is yet another story: what’s a schema, what’s a database?). Here’s the summary:

Constraint names are unique within a schema

  • Derby
  • H2
  • HSQLDB
  • Ingres
  • MySQL
  • Oracle
  • SQL Server

Constraint names are unique within a table

  • DB2
  • SQLite
  • Sybase SQL Anywhere

The “weird ones”

  • Postgres: Foreign Key names can be reused. Unique / Primary Key names cannot
  • Sybase ASE: Unique / Primary Key names can be reused. Foreign Key names cannot

For most compatibility across databases, it is never a good idea to re-use names. Keep your constraint names unique across a schema.