Using UNPIVOT to Traverse a Configuration Table’s Rows and Columns

Imagine you have a configuration table like the following:

CREATE TABLE rule (
  name     VARCHAR2(50)         NOT NULL PRIMARY KEY,
  enabled  NUMBER(1)  DEFAULT 1 NOT NULL CHECK (enabled IN (0,1)),
  priority NUMBER(10) DEFAULT 0 NOT NULL,
  flag1    NUMBER(3)  DEFAULT 0 NOT NULL,
  flag2    NUMBER(3)  DEFAULT 0 NOT NULL,
  flag3    NUMBER(3)  DEFAULT 0 NOT NULL,
  flag4    NUMBER(3)  DEFAULT 0 NOT NULL,
  flag5    NUMBER(3)  DEFAULT 0 NOT NULL
);

It specifies a set of rules that

  • Can be enabled / disabled
  • Can be given a priority among themselves
  • Include a set of flags which correspond to the thing you want to configure (e.g. some check to execute)
  • Those flags can be ordered as well

So, given the following data:

INSERT INTO rule (name, priority, flag1, flag5) 
  VALUES ('RULE 1', 1, 1, 2);
INSERT INTO rule (name, priority, flag2, flag5) 
  VALUES ('RULE 2', 2, 2, 1);
INSERT INTO rule (name, priority, flag3, flag4, flag5) 
  VALUES ('RULE 3', 3, 3, 1, 2);
INSERT INTO rule (name, priority, flag3) 
  VALUES ('RULE 4', 4, 1);

SELECT * FROM rule;

We’ll get our configuration “spreadsheet”:

NAME    ENABLED  PRIORITY  FLAG1  FLAG2  FLAG3  FLAG4  FLAG5
------------------------------------------------------------
RULE 1  1        1         1      0      0      0      2
RULE 2  1        2         0      2      0      0      1
RULE 3  1        3         0      0      3      1      2
RULE 4  1        4         0      0      1      0      0

This form is really useful for editing the configuration. If we want to activate FLAG2 in RULE 1, we just go to that cell in a SQL tool like Oracle SQL Developer and change the value.
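Or, doing the same in SQL (remember that the value doubles as the flag’s position within the rule, so we pick one that isn’t in use yet):

UPDATE rule
SET flag2 = 3
WHERE name = 'RULE 1';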

But reading the configuration is a bit different. FLAG1 through FLAG5 are not nicely normalised. How can we read the data as though it were normalised?

Using UNPIVOT

In Oracle and SQL Server, we can use UNPIVOT for this use case. I’m using Oracle syntax in this blog post; SQL Server’s is just slightly different (a sketch follows below). Consider the following query:

SELECT name, flag, value
FROM rule
UNPIVOT (
  value FOR flag IN (
    flag1,  
    flag2,  
    flag3,  
    flag4,  
    flag5
  )
)
WHERE enabled = 1
AND value > 0
ORDER BY priority, value;

This query yields the following result set:

NAME    FLAG    VALUE
---------------------
RULE 1  FLAG1   1
RULE 1  FLAG5   2
RULE 2  FLAG5   1
RULE 2  FLAG2   2
RULE 3  FLAG4   1
RULE 3  FLAG5   2
RULE 3  FLAG3   3
RULE 4  FLAG3   1

In this representation, the rules are ordered by priority, and the flags are ordered by their respective value within a rule. The flags that are not turned on (value 0) are simply omitted. This form is much easier to traverse procedurally, when “consuming” the configuration.
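As an aside, the SQL Server version mentioned earlier differs mainly in requiring a table alias after the UNPIVOT clause. A sketch (assuming an equivalent table definition):

SELECT name, flag, value
FROM rule
UNPIVOT (
  value FOR flag IN (flag1, flag2, flag3, flag4, flag5)
) AS u
WHERE enabled = 1
AND value > 0
ORDER BY priority, value;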

How does it work?

In principle, UNPIVOT is just syntactic sugar for a bunch of UNION ALL subqueries. We could have written our query like this instead:

SELECT name, flag, value
FROM (
  SELECT rule.*, 'FLAG1' AS flag, FLAG1 AS value FROM rule
  UNION ALL
  SELECT rule.*, 'FLAG2' AS flag, FLAG2 AS value FROM rule
  UNION ALL
  SELECT rule.*, 'FLAG3' AS flag, FLAG3 AS value FROM rule
  UNION ALL
  SELECT rule.*, 'FLAG4' AS flag, FLAG4 AS value FROM rule
  UNION ALL
  SELECT rule.*, 'FLAG5' AS flag, FLAG5 AS value FROM rule
) rule
WHERE enabled = 1
AND value > 0
ORDER BY priority, value;

Which is decidedly more code. It’s also more work for the database. The execution plans are different (I’m using Oracle 12.2.0.1.0):

UNPIVOT version – single table access

---------------------------------------------
| Id  | Operation            | Name | Rows  |
---------------------------------------------
|   0 | SELECT STATEMENT     |      |       |
|   1 |  SORT ORDER BY       |      |     5 |
|*  2 |   VIEW               |      |     5 |
|   3 |    UNPIVOT           |      |       |
|*  4 |     TABLE ACCESS FULL| RULE |     1 |
---------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(("unpivot_view_005"."VALUE">0 AND 
              "unpivot_view_005"."ENABLED"=1))
   4 - filter("RULE"."ENABLED"=1)

UNION ALL version – multi table access

---------------------------------------------
| Id  | Operation            | Name | Rows  |
---------------------------------------------
|   0 | SELECT STATEMENT     |      |       |
|   1 |  SORT ORDER BY       |      |     8 |
|   2 |   VIEW               |      |     8 |
|   3 |    UNION-ALL         |      |       |
|*  4 |     TABLE ACCESS FULL| RULE |     1 |
|*  5 |     TABLE ACCESS FULL| RULE |     1 |
|*  6 |     TABLE ACCESS FULL| RULE |     2 |
|*  7 |     TABLE ACCESS FULL| RULE |     1 |
|*  8 |     TABLE ACCESS FULL| RULE |     3 |
---------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   4 - filter(("RULE"."ENABLED"=1 AND "FLAG1">0))
   5 - filter(("RULE"."ENABLED"=1 AND "FLAG2">0))
   6 - filter(("RULE"."ENABLED"=1 AND "FLAG3">0))
   7 - filter(("RULE"."ENABLED"=1 AND "FLAG4">0))
   8 - filter(("RULE"."ENABLED"=1 AND "FLAG5">0))

We can also measure the time it takes to execute these queries thousands of times. The following shows resulting times relative to the fastest execution (1):

Run 1, Statement 1 : 1.155
Run 1, Statement 2 : 1.88056

Run 2, Statement 1 : 1.04333
Run 2, Statement 2 : 1.95148

Run 3, Statement 1 : 1.02185
Run 3, Statement 2 : 1.86074

Run 4, Statement 1 : 1
Run 4, Statement 2 : 1.85241

Run 5, Statement 1 : 1.0263
Run 5, Statement 2 : 1.82944

The UNION ALL version is consistently about 2x slower on this very small data set. This is significant in the use case presented here, as a configuration table is probably read many times per day.
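If you want to reproduce the measurement, a PL/SQL harness along the following lines will do (a sketch, not the exact script behind the numbers above; it follows the same pattern as the benchmark further down in this post):

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 10000;
BEGIN
  -- Statement 1: UNPIVOT
  v_ts := SYSTIMESTAMP;
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT name, flag, value
      FROM rule
      UNPIVOT (value FOR flag IN (flag1, flag2, flag3, flag4, flag5))
      WHERE enabled = 1 AND value > 0
      ORDER BY priority, value
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
  dbms_output.put_line('Statement 1: ' || (SYSTIMESTAMP - v_ts));

  -- Statement 2: UNION ALL
  v_ts := SYSTIMESTAMP;
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT name, flag, value
      FROM (
        SELECT rule.*, 'FLAG1' AS flag, flag1 AS value FROM rule UNION ALL
        SELECT rule.*, 'FLAG2' AS flag, flag2 AS value FROM rule UNION ALL
        SELECT rule.*, 'FLAG3' AS flag, flag3 AS value FROM rule UNION ALL
        SELECT rule.*, 'FLAG4' AS flag, flag4 AS value FROM rule UNION ALL
        SELECT rule.*, 'FLAG5' AS flag, flag5 AS value FROM rule
      ) rule
      WHERE enabled = 1 AND value > 0
      ORDER BY priority, value
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
  dbms_output.put_line('Statement 2: ' || (SYSTIMESTAMP - v_ts));
END;
/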

Knowing when a rule starts and when it ends

The real-world use case behind this blog post also needed to know when a rule started and when it ended, i.e. which flag entry was the first and which was the last within a rule. This was easy in the non-normalised representation, where each rule was a single row.

In the normalised version, we can use LEAD() and LAG().

Using this query:

SELECT 
  CASE WHEN lag(name, 1, 'NULL') 
            OVER (ORDER BY priority, value) != name 
       THEN 1 ELSE 0 END rule_begin,
  CASE WHEN lead(name, 1, 'NULL') 
            OVER (ORDER BY priority, value) != name 
       THEN 1 ELSE 0 END rule_end,
  name, flag, value
FROM rule
UNPIVOT (
  value FOR flag IN (
    flag1,  
    flag2,  
    flag3,  
    flag4,  
    flag5
  )
)
WHERE enabled = 1
AND value > 0
ORDER BY priority, value;

We’re now getting (with some visual emphasis):

RULE_BEGIN  RULE_END  NAME    FLAG    VALUE
-------------------------------------------
1           0         RULE 1  FLAG1   1
0           1         RULE 1  FLAG5   2

1           0         RULE 2  FLAG5   1
0           1         RULE 2  FLAG2   2

1           0         RULE 3  FLAG4   1
0           0         RULE 3  FLAG5   2
0           1         RULE 3  FLAG3   3

1           1         RULE 4  FLAG3   1

LEAD() looks ahead one row to see if the rule name there is different from the rule name on the current row.

LAG() looks behind one row to see if the rule name there is different from the rule name on the current row.
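To show how this is consumed procedurally, here’s a minimal PL/SQL sketch (the loop body is a placeholder for whatever the flags actually configure):

SET SERVEROUTPUT ON
BEGIN
  FOR rec IN (
    SELECT
      CASE WHEN lag(name, 1, 'NULL')
                OVER (ORDER BY priority, value) != name
           THEN 1 ELSE 0 END rule_begin,
      CASE WHEN lead(name, 1, 'NULL')
                OVER (ORDER BY priority, value) != name
           THEN 1 ELSE 0 END rule_end,
      name, flag, value
    FROM rule
    UNPIVOT (value FOR flag IN (flag1, flag2, flag3, flag4, flag5))
    WHERE enabled = 1 AND value > 0
    ORDER BY priority, value
  ) LOOP
    IF rec.rule_begin = 1 THEN
      dbms_output.put_line('Starting ' || rec.name);
    END IF;

    -- Placeholder: execute whatever check FLAGn stands for
    dbms_output.put_line('  Executing ' || rec.flag);

    IF rec.rule_end = 1 THEN
      dbms_output.put_line('Finished ' || rec.name);
    END IF;
  END LOOP;
END;
/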

That’s it – very simple. The window function part of this example is also covered in my 10 SQL Tricks talk, which I highly recommend you watch.

Say NO to Excessive Use of Surrogate Keys if Performance Really Matters to You

We programmers keep cargo-culting these wrong ideas. Recently, we said “NO” to Venn diagrams. Today we’re going to say no to surrogate keys.

The surrogate keys vs. natural keys non-debate is one of the most overheated debates in data architecture, and I don’t get why everyone is so emotional. Both sides claim to hold the ultimate truth (just like in the tabs vs. spaces non-debate), and this dogmatism prevents adding value where it really matters.

This article sheds some light into why you shouldn’t be so dogmatic all the time. By the end of the article, you’ll agree, promised.

What’s so great about surrogate keys?

In case your SQL-speak is a bit rusty, a surrogate key is an artificial identifier. Or, as Wikipedia puts it:

A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data, unlike a natural (or business) key which is derived from application data.
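To illustrate the distinction with a hypothetical example (using the ISO 639 language codes that will reappear further down):

-- Surrogate key: a generated, meaningless identifier
CREATE TABLE language (
  id   NUMBER(18) GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  code VARCHAR2(3 CHAR) NOT NULL UNIQUE -- 'eng', 'deu', ...
);

-- Natural key: the business value itself identifies the row
CREATE TABLE language2 (
  code VARCHAR2(3 CHAR) NOT NULL PRIMARY KEY
);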

There are three clear advantages that surrogate keys have:

  1. You know every table has one, and it probably has the same name everywhere: ID.
  2. It is relatively small (e.g. a BIGINT).
  3. You don’t have to go through the “hassle” of designing your table for 30 seconds longer, trying to find a natural key.

There are more advantages (check out the Wikipedia article), and of course, there are disadvantages. Today, I’d like to talk about architectures where EVERY table has a surrogate key, even when it makes absolutely no sense.

When do surrogate keys make absolutely no sense?

I’m currently helping a customer improve performance on their queries that run against a log database. The log database essentially contains relevant information about sessions, and some transactions related to those sessions. (In case you’re jumping to the conclusion “hey don’t use SQL for logs”: Yes, they also use Splunk for unstructured logs. But these here are structured, and they run a lot of SQL style analytics against them).

Here’s what you can imagine is stored in that sessions table (using Oracle):

CREATE TABLE sessions (

  -- Useful values
  sess        VARCHAR2(50 CHAR) NOT NULL PRIMARY KEY,
  ip_address  VARCHAR2(15 CHAR) NOT NULL,
  tracking    VARCHAR2(50 CHAR) NOT NULL,
  
  -- and much more info here
  ...
);

Except, though, that’s not how the schema was designed. It was designed like this:

CREATE TABLE sessions (

  -- Meaningless surrogate key
  id NUMBER(18) NOT NULL PRIMARY KEY,

  -- Useful values
  sess        VARCHAR2(50 CHAR) NOT NULL,
  
  -- Meaningless foreign keys
  ip_id       NUMBER(18) NOT NULL,
  tracking_id NUMBER(18) NOT NULL,
  ...
  
  FOREIGN KEY (ip_id) 
    REFERENCES ip_addresses (ip_id),
  FOREIGN KEY (tracking_id) 
    REFERENCES tracking (tracking_id),
  ...
  
  -- Non-primary UNIQUE key
  UNIQUE (sess)
);

So, this so-called “fact table” (it’s a star schema, after all) contains nothing useful, only a set of surrogate keys that reference the interesting values, which are all located in other tables. For instance, if you want to count all sessions for a given IP address, you already need to run a JOIN:

SELECT ip_address, count(*)
FROM ip_addresses
JOIN sessions USING (ip_id)
GROUP BY ip_address

After all, we need the join because what the user really wants to see is the IP address, not the surrogate key. Duh, right? Here’s the execution plan:

------------------------------------------------------
| Operation           | Name         | Rows  | Cost  |
------------------------------------------------------
| SELECT STATEMENT    |              |  9999 |   108 |
|  HASH GROUP BY      |              |  9999 |   108 |
|   HASH JOIN         |              | 99999 |   104 |
|    TABLE ACCESS FULL| IP_ADDRESSES |  9999 |     9 |
|    TABLE ACCESS FULL| SESSIONS     | 99999 |    95 |
------------------------------------------------------

Perfect hash join with two full table scans. In my example database, I have 100k sessions and 10k IP addresses.
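In case you want to reproduce this, comparable test data could be generated roughly as follows (a hypothetical setup script; the real one isn’t shown here, and the TRACKING content is elided):

-- 10k random IP addresses (collisions are unlikely but possible;
-- a real script would enforce uniqueness)
INSERT INTO ip_addresses (ip_id, ip_address)
SELECT level,
       TRUNC(dbms_random.value(1, 224)) || '.' ||
       TRUNC(dbms_random.value(0, 256)) || '.' ||
       TRUNC(dbms_random.value(0, 256)) || '.' ||
       TRUNC(dbms_random.value(0, 256))
FROM dual
CONNECT BY level <= 10000;

-- 100k sessions, distributed over those addresses
-- (assumes a single TRACKING row with tracking_id = 1)
INSERT INTO sessions (id, sess, ip_id, tracking_id)
SELECT level, 'sess-' || level, MOD(level - 1, 10000) + 1, 1
FROM dual
CONNECT BY level <= 100000;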

Obviously, there’s an index on the IP_ADDRESS column, because I want to be able to filter by it. Something meaningful, like:

SELECT ip_address, count(*)
FROM ip_addresses
JOIN sessions USING (ip_id)
WHERE ip_address LIKE '192.168.%'
GROUP BY ip_address

Obviously, the plan is a bit better because we’re returning less data. Here’s the result:

----------------------------------------------------------------
| Operation                     | Name         | Rows  | Cost  |
----------------------------------------------------------------
| SELECT STATEMENT              |              |     1 |    99 |
|  HASH GROUP BY                |              |     1 |    99 |
|   HASH JOIN                   |              |    25 |    98 |
|    TABLE ACCESS BY INDEX ROWID| IP_ADDRESSES |     1 |     3 |
|     INDEX RANGE SCAN          | I_IP         |     1 |     2 |
|    TABLE ACCESS FULL          | SESSIONS     | 99999 |    95 |
----------------------------------------------------------------

Interesting. The index statistics now let the optimiser estimate that our predicate will return only one row from the IP_ADDRESSES table. Yet we still get a hash join with significant cost for what now appears to be a rather trivial query.

What would the world look like without surrogate keys?

Easy. We no longer need a join every time we need something IP-address-related from the sessions table.

Our two queries become, trivially:

-- All counts
SELECT ip_address, count(*)
FROM sessions2
GROUP BY ip_address

-- Filtered counts
SELECT ip_address, count(*)
FROM sessions2
WHERE ip_address LIKE '192.168.%'
GROUP BY ip_address

The first query yields a simpler execution plan, with around the same cost estimate.

--------------------------------------------------
| Operation          | Name      | Rows  | Cost  |
--------------------------------------------------
| SELECT STATEMENT   |           |   256 |   119 |
|  HASH GROUP BY     |           |   256 |   119 |
|   TABLE ACCESS FULL| SESSIONS2 | 99999 |   116 |
--------------------------------------------------

We don’t seem to gain that much here, but what happens with the filtered query?

------------------------------------------------
| Operation            | Name  | Rows  | Cost  |
------------------------------------------------
| SELECT STATEMENT     |       |     1 |     4 |
|  SORT GROUP BY NOSORT|       |     1 |     4 |
|   INDEX RANGE SCAN   | I_IP2 |   391 |     4 |
------------------------------------------------

OMG! Where did our costs go? Huh, this seems to be extremely fast! Let’s benchmark!

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100000;
BEGIN
  v_ts := SYSTIMESTAMP;
     
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT ip_address, count(*)
      FROM ip_addresses
      JOIN sessions USING (ip_id)
      WHERE ip_address LIKE '192.168.%'
      GROUP BY ip_address
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
     
  dbms_output.put_line('Surrogate: '
    || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
     
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT ip_address, count(*)
      FROM sessions2
      WHERE ip_address LIKE '192.168.%'
      GROUP BY ip_address
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
     
  dbms_output.put_line('Natural  : '
    || (SYSTIMESTAMP - v_ts));
END;
/
Surrogate: +000000000 00:00:03.453000000
Natural  : +000000000 00:00:01.227000000

The improvement is significant, and we don’t even have a lot of data here. Now, when you think about it, it is kind of obvious, no? For the semantically identical query, we either run a JOIN, or we don’t. And this is using very, very little data. The actual customer database has hundreds of millions of sessions, where all these JOINs waste valuable resources for nothing but an artificial surrogate key, which was introduced… because that’s how things were always done.

And, as always, don’t trust your execution plan costs. Measure. Benchmarking 100 iterations of the unfiltered query (the one that produced a hash join) yields:

Surrogate: +000000000 00:00:06.053000000
Natural  : +000000000 00:00:02.408000000

Still obvious, when you think of it.

Do note that we’re not denormalising yet. There’s still an IP_ADDRESSES table, but now it contains the business key as its primary key (the address itself), not a surrogate key. In rarer querying use cases, we’ll still join that table to get IP-address-related info (such as the country).
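In DDL terms, the natural-key variant queried above as SESSIONS2 might look like this (a sketch; the post doesn’t show the exact column lists):

CREATE TABLE ip_addresses2 (
  ip_address VARCHAR2(15 CHAR) NOT NULL PRIMARY KEY,
  country    VARCHAR2(2 CHAR)
  -- more IP-address related info here
);

CREATE TABLE sessions2 (
  sess        VARCHAR2(50 CHAR) NOT NULL PRIMARY KEY,
  ip_address  VARCHAR2(15 CHAR) NOT NULL
    REFERENCES ip_addresses2 (ip_address),
  tracking    VARCHAR2(50 CHAR) NOT NULL
  -- and much more info here
);

-- The index behind the I_IP2 range scan seen earlier
CREATE INDEX i_ip2 ON sessions2 (ip_address);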

Taking it to the extreme

The database at this customer site was designed by someone who appreciates purity, and I can certainly relate to that sometimes. But in this case, it clearly went wrong, because there were many of these queries that were actually filtering for a “natural key” (i.e. a business value), but needed to add yet another JOIN just to do that.

There were more of these cases, for instance:

  • ISO 639 language codes
  • ISIN security numbers
  • Account numbers, which had a formal identifier from an external system

and many more. Each and every time, there were dozens of JOINs required just to get that additional mapping between surrogate key and natural key.
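Each of these mappings followed the same pattern. With hypothetical names (the actual schema isn’t shown here):

-- Surrogate keys: filtering on the business value needs a mapping join
SELECT s.sess
FROM sessions s
JOIN languages l ON l.language_id = s.language_id
WHERE l.iso_code = 'eng';

-- Natural key inlined: the filter applies directly
SELECT sess
FROM sessions2
WHERE language_code = 'eng';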

Sometimes, surrogate keys are nice. They do use a little less disk space. Consider the following query (credits go to Stack Overflow user WW):

SELECT 
  owner,
  table_name,
  TRUNC(SUM(bytes)/1024) kb,
  ROUND(ratio_to_report(SUM(bytes)) OVER() * 100) Percent
FROM (
  SELECT 
    segment_name table_name,
    owner,
    bytes
  FROM dba_segments
  WHERE segment_type IN (
    'TABLE',
    'TABLE PARTITION',
    'TABLE SUBPARTITION'
  )
  UNION ALL
  SELECT 
    i.table_name,
    i.owner,
    s.bytes
  FROM dba_indexes i,
    dba_segments s
  WHERE s.segment_name = i.index_name
  AND s.owner          = i.owner
  AND s.segment_type IN (
    'INDEX',
    'INDEX PARTITION',
    'INDEX SUBPARTITION'
  )
)
WHERE owner = 'TEST'
AND table_name IN (
  'SESSIONS', 
  'IP_ADDRESSES', 
  'SESSIONS2'
 )
GROUP BY table_name, owner
ORDER BY SUM(bytes) DESC;

This will show us the disk space used by each object:

TABLE_NAME                   KB    PERCENT
-------------------- ---------- ----------
SESSIONS2                 12288         58
SESSIONS                   8192         39
IP_ADDRESSES                768          4

Yes. We use a little more disk space, because now our primary key in the sessions table is a VARCHAR2(50) rather than a NUMBER(18). But disk space is extremely cheap, whereas wall clock time performance is essential. By removing just a little complexity, we’ve greatly increased performance already for a simple query.

Conclusion

Surrogate keys or natural keys? I say both. Surrogate keys do have advantages. You generally don’t want a 5-column composite primary key, and you want a 5-column composite foreign key even less. In these cases, using a surrogate key can add value by removing complexity (and probably increasing performance again). But choose wisely. Every so often, blindly using surrogate keys will go very wrong, and you’ll pay for it dearly when it comes to querying your data.
