Say NO to Excessive Use of Surrogate Keys if Performance Really Matters to You


We programmers keep cargo culting these wrong ideas. Recently, we said “NO” to Venn diagrams. Today we’re going to say no to surrogate keys.

The surrogate keys vs. natural keys non-debate is one of the most overheated debates in data architecture, and I don’t get why everyone is so emotional. Both sides claim to hold the ultimate truth (just like in the tabs vs. spaces non-debate), which prevents adding value where it really matters.

This article sheds some light on why you shouldn’t be so dogmatic all the time. By the end of the article, you’ll agree, promised.

What’s so great about surrogate keys?

In case your SQL-speak is a bit rusty, a surrogate key is an artificial identifier. Or, as Wikipedia puts it:

A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data, unlike a natural (or business) key which is derived from application data.
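
To make that concrete, here’s a minimal sketch contrasting the two approaches (assuming Oracle 12c or later for the identity column; the customer table and its columns are made up for illustration, the two statements are alternatives, not meant to run in the same schema):

-- Variant 1: surrogate key, a meaningless auto-generated identifier
CREATE TABLE customer (
  id    NUMBER(18) GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  email VARCHAR2(100 CHAR) NOT NULL UNIQUE
);

-- Variant 2: natural key, the business value itself identifies the row
CREATE TABLE customer (
  email VARCHAR2(100 CHAR) NOT NULL PRIMARY KEY
);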

There are three clear advantages that surrogate keys have:

  1. You know every table has one, and it probably has the same name everywhere: ID.
  2. It is relatively small (e.g. a BIGINT)
  3. You don’t have to go through the “hassle” of designing your table for 30 seconds longer, trying to find a natural key

There are more advantages (check out the Wikipedia article), and of course, there are disadvantages. Today, I’d like to talk about architectures where EVERY table has a surrogate key, even when it makes absolutely no sense.

When do surrogate keys make absolutely no sense?

I’m currently helping a customer improve performance on their queries that run against a log database. The log database essentially contains relevant information about sessions, and some transactions related to those sessions. (In case you’re jumping to the conclusion “hey don’t use SQL for logs”: Yes, they also use Splunk for unstructured logs. But these here are structured, and they run a lot of SQL style analytics against them).

Here’s what you can imagine is stored in that sessions table (using Oracle):

CREATE TABLE sessions (

  -- Useful values
  sess        VARCHAR2(50 CHAR) NOT NULL PRIMARY KEY,
  ip_address  VARCHAR2(15 CHAR) NOT NULL,
  tracking    VARCHAR2(50 CHAR) NOT NULL,
  
  -- and much more info here
  ...
)

Except that’s not how the schema was designed. It was designed like this:

CREATE TABLE sessions (

  -- Meaningless surrogate key
  id NUMBER(18) NOT NULL PRIMARY KEY,

  -- Useful values
  sess        VARCHAR2(50 CHAR) NOT NULL,
  
  -- Meaningless foreign keys
  ip_id       NUMBER(18) NOT NULL,
  tracking_id NUMBER(18) NOT NULL,
  ...
  
  FOREIGN KEY (ip_id) 
    REFERENCES ip_addresses (ip_id),
  FOREIGN KEY (tracking_id) 
    REFERENCES tracking (tracking_id),
  ...
  
  -- Non-primary UNIQUE key
  UNIQUE (sess)
)

So, this so-called “fact table” (it’s a star schema after all) contains nothing useful, only a set of surrogate keys that contain references to the interesting values, which are all located in other tables. For instance, if you want to count all sessions for a given IP address, you already need to run a JOIN:

SELECT ip_address, count(*)
FROM ip_addresses
JOIN sessions USING (ip_id)
GROUP BY ip_address

After all, we need the join because what the user really wants to see is the IP address, not the surrogate key. Duh, right? Here’s the execution plan:

------------------------------------------------------
| Operation           | Name         | Rows  | Cost  |
------------------------------------------------------
| SELECT STATEMENT    |              |  9999 |   108 |
|  HASH GROUP BY      |              |  9999 |   108 |
|   HASH JOIN         |              | 99999 |   104 |
|    TABLE ACCESS FULL| IP_ADDRESSES |  9999 |     9 |
|    TABLE ACCESS FULL| SESSIONS     | 99999 |    95 |
------------------------------------------------------

Perfect hash join with two full table scans. In my example database, I have 100k sessions and 10k IP addresses.

Obviously, there’s an index on the IP_ADDRESS column, because I want to be able to filter by it. Something meaningful, like:

SELECT ip_address, count(*)
FROM ip_addresses
JOIN sessions USING (ip_id)
WHERE ip_address LIKE '192.168.%'
GROUP BY ip_address

Obviously, the plan is a bit better, because we’re returning less data. Here’s the result:

----------------------------------------------------------------
| Operation                     | Name         | Rows  | Cost  |
----------------------------------------------------------------
| SELECT STATEMENT              |              |     1 |    99 |
|  HASH GROUP BY                |              |     1 |    99 |
|   HASH JOIN                   |              |    25 |    98 |
|    TABLE ACCESS BY INDEX ROWID| IP_ADDRESSES |     1 |     3 |
|     INDEX RANGE SCAN          | I_IP         |     1 |     2 |
|    TABLE ACCESS FULL          | SESSIONS     | 99999 |    95 |
----------------------------------------------------------------

Interesting. We can now use our index statistics to estimate that our predicate will return only one row from the IP_ADDRESSES table. Yet, we still get a hash join with significant cost, for what now appears to be a rather trivial query.

What would the world look like without surrogate keys?

Easy. We no longer need the join every time we need something IP-address-related from the sessions table.

Our two queries become, trivially:

-- All counts
SELECT ip_address, count(*)
FROM sessions
GROUP BY ip_address

-- Filtered counts
SELECT ip_address, count(*)
FROM sessions
WHERE ip_address LIKE '192.168.%'
GROUP BY ip_address

The first query yields a simpler execution plan, with around the same cost estimate.

--------------------------------------------------
| Operation          | Name      | Rows  | Cost  |
--------------------------------------------------
| SELECT STATEMENT   |           |   256 |   119 |
|  HASH GROUP BY     |           |   256 |   119 |
|   TABLE ACCESS FULL| SESSIONS2 | 99999 |   116 |
--------------------------------------------------

We don’t seem to gain that much here, but what happens with the filtered query?

------------------------------------------------
| Operation            | Name  | Rows  | Cost  |
------------------------------------------------
| SELECT STATEMENT     |       |     1 |     4 |
|  SORT GROUP BY NOSORT|       |     1 |     4 |
|   INDEX RANGE SCAN   | I_IP2 |   391 |     4 |
------------------------------------------------

OMG! Where did our costs go? Huh, this seems to be extremely fast! Let’s benchmark!

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100000;
BEGIN
  v_ts := SYSTIMESTAMP;
     
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT ip_address, count(*)
      FROM ip_addresses
      JOIN sessions USING (ip_id)
      WHERE ip_address LIKE '192.168.%'
      GROUP BY ip_address
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
     
  dbms_output.put_line('Surrogate: '
    || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
     
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT ip_address, count(*)
      FROM sessions2
      WHERE ip_address LIKE '192.168.%'
      GROUP BY ip_address
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
     
  dbms_output.put_line('Natural  : '
    || (SYSTIMESTAMP - v_ts));
END;
/
Surrogate: +000000000 00:00:03.453000000
Natural  : +000000000 00:00:01.227000000

The improvement is significant, and we don’t have a lot of data here. Now, when you think about it, it is kind of obvious, no? For the semantically exact same query, we either run a JOIN, or we don’t. And this is using very very little data. The actual customer database has hundreds of millions of sessions, where all these JOINs waste valuable resources for nothing but an artificial surrogate key which was introduced… because that’s how things were always done.

And, as always, don’t trust your execution plan costs. Measure. Benchmarking 100 iterations of the unfiltered query (the one that produced a hash join) yields:

Surrogate: +000000000 00:00:06.053000000
Natural  : +000000000 00:00:02.408000000

Still obvious, when you think of it.

Do note that we’re not denormalising yet. There’s still an IP_ADDRESSES table, but now it contains the business key as the primary key (the address), not the surrogate key. In rarer querying use-cases, we’ll still join that table in order to get IP-address-related info (such as the country).
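
For illustration, a sketch of what that lookup table and the occasional join might look like in the natural key world (column names are assumptions, not the customer’s actual schema):

CREATE TABLE ip_addresses (
  ip_address VARCHAR2(15 CHAR) NOT NULL PRIMARY KEY,
  country    VARCHAR2(2 CHAR)  NOT NULL
  -- more IP-address related info here
);

-- The occasional query that still needs the join
SELECT s.sess, i.country
FROM sessions s
JOIN ip_addresses i USING (ip_address)
WHERE s.sess = :sess;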

Taking it to the extreme

The database at this customer site was designed by someone who appreciates purity, and I can certainly relate to that sometimes. But in this case, it clearly went wrong, because many of these queries were actually filtering on a “natural key” (i.e. a business value), but needed yet another JOIN just to do that.

There were more of these cases, for instance:

  • ISO 639 language codes
  • ISIN security numbers
  • Account numbers, which had a formal identifier from an external system

and many more. Each and every time, there were dozens of JOINs required just to get that additional mapping between surrogate key and natural key.
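
Take the ISO 639 language codes as an example. A sketch (documents and languages are hypothetical tables, not the customer’s schema) of the difference in everyday querying:

-- Surrogate key design: displaying the code always requires a join
SELECT d.title, l.iso_code
FROM documents d
JOIN languages l ON l.language_id = d.language_id;

-- Natural key design: the code lives in the referencing table itself
SELECT d.title, d.iso_code
FROM documents d;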

Sometimes, surrogate keys are nice. They do use a little less disk space. Consider the following query (credits go to Stack Overflow user WW.)

SELECT 
  owner,
  table_name,
  TRUNC(SUM(bytes)/1024) kb,
  ROUND(ratio_to_report(SUM(bytes)) OVER() * 100) Percent
FROM (
  SELECT 
    segment_name table_name,
    owner,
    bytes
  FROM dba_segments
  WHERE segment_type IN (
    'TABLE', 
    'TABLE PARTITION', 
    'TABLE SUBPARTITION'
  )
  UNION ALL
  SELECT 
    i.table_name,
    i.owner,
    s.bytes
  FROM dba_indexes i,
    dba_segments s
  WHERE s.segment_name = i.index_name
  AND s.owner          = i.owner
  AND s.segment_type  IN (
    'INDEX', 
    'INDEX PARTITION', 
    'INDEX SUBPARTITION'
  )
)
WHERE owner = 'TEST'
AND table_name IN (
  'SESSIONS', 
  'IP_ADDRESSES', 
  'SESSIONS2'
 )
GROUP BY table_name, owner
ORDER BY SUM(bytes) DESC;

This will show us the disk space used by each object:

TABLE_NAME                   KB    PERCENT
-------------------- ---------- ----------
SESSIONS2                 12288         58
SESSIONS                   8192         39
IP_ADDRESSES                768          4

Yes. We use a little more disk space, because now our primary key in the sessions table is a VARCHAR2(50) rather than a NUMBER(18). But disk space is extremely cheap, whereas wall clock time performance is essential. By removing just a little complexity, we’ve already greatly increased performance for a simple query.

Conclusion

Surrogate keys or natural keys? I say both. Surrogate keys do have advantages. You generally don’t want a 5-column composite primary key. And even less do you want a 5-column composite foreign key. In these cases, using a surrogate key can add value by removing complexity (and probably increasing performance again). But choose wisely. Every so often, blindly using surrogate keys will go very wrong, and you’ll pay for it dearly when it comes to querying your data.
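
For instance, here’s a sketch (made-up tables, purely for illustration) of the case where a surrogate key clearly earns its place:

-- A 5-column natural key...
CREATE TABLE order_line (
  company_id  NUMBER(18)  NOT NULL,
  region_code VARCHAR2(3) NOT NULL,
  order_year  NUMBER(4)   NOT NULL,
  order_no    NUMBER(10)  NOT NULL,
  line_no     NUMBER(4)   NOT NULL,
  PRIMARY KEY (company_id, region_code, order_year, order_no, line_no)
);

-- ... forces every child table to drag all 5 columns along
CREATE TABLE order_line_discount (
  company_id  NUMBER(18)  NOT NULL,
  region_code VARCHAR2(3) NOT NULL,
  order_year  NUMBER(4)   NOT NULL,
  order_no    NUMBER(10)  NOT NULL,
  line_no     NUMBER(4)   NOT NULL,
  discount    NUMBER(5,2) NOT NULL,
  FOREIGN KEY (company_id, region_code, order_year, order_no, line_no)
    REFERENCES order_line
);

Here, a single surrogate key column on order_line would remove a lot of that complexity.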

The Java JIT Compiler is Darn Good at Optimization


“Challenge accepted” said Tagir Valeev when I recently asked the readers of the jOOQ blog to show if the Java JIT (Just-In-Time compilation) can optimise away a for loop.

Tagir is the author of StreamEx, a very useful Java 8 Stream extension library that adds additional parallelism features on top of standard streams. He’s a speaker at conferences, and has contributed a dozen patches to the OpenJDK Stream API (including bug fixes, performance optimizations and new features). He’s interested in static code analysis and works on a new Java bytecode analyzer.

I’m very happy to publish Tagir’s guest post here on the jOOQ blog.

The Java JIT Compiler

In a recent article, Lukas wondered whether the JIT could optimize code like this to remove an unnecessary iteration:

// ... than this, where we "know" the list
// only contains one value
for (Object object : Collections.singletonList("abc")) {
    doSomethingWith(object);
}

Here’s my answer: JIT can do even better. Let’s consider this simple method which calculates total length of all the strings of supplied list:

static int testIterator(List<String> list) {
    int sum = 0;
    for (String s : list) {
        sum += s.length();
    }
    return sum;
}

As you might know, this code is equivalent to the following:

static int testIterator(List<String> list) {
    int sum = 0;
    Iterator<String> it = list.iterator();
    while(it.hasNext()) {
        String s = it.next();
        sum += s.length();
    }
    return sum;
}

Of course, in the general case the list could be anything, so when creating an iterator and calling the hasNext and next methods, the JIT must emit honest virtual calls, which is not very fast. However, what will happen if you always supply the singletonList here? Let’s create a simple test:

public class Test {
    static int res = 0;

    public static void main(String[] args) {
        for (int i = 0; i < 100000; i++) {
            res += testIterator(Collections.singletonList("x"));
        }
        System.out.println(res);
    }
}

We are calling our testIterator in a loop so it’s called enough times to be JIT-compiled with the C2 JIT compiler. As you might know, in the HotSpot JVM there are two JIT compilers, namely the C1 (client) compiler and the C2 (server) compiler. In 64-bit Java 8 they work together. First, the method is compiled with C1, and special instructions are added to gather some statistics (which is called profiling). Among them are type statistics. The JVM will carefully check which exact types our list variable has. And in our case it will discover that in 100% of cases it’s a singleton list and nothing else. When the method is called often enough, it gets recompiled by the better C2 compiler, which can use this information. Thus when C2 compiles the method, it can assume that a singleton list will also appear quite often in the future.

You may ask the JIT compiler to output the assembly generated for methods. To do this you should install hsdis on your system. After that you may use convenient tools like JITWatch, or write a JMH benchmark and use the -perfasm option. Here we will not use third-party tools and will simply launch the JVM with the following command line options:

$ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintAssembly Test >output.txt

This will generate quite a huge output which may scare the children. The assembly generated by the C2 compiler for our testIterator method looks like this (on the Intel x64 platform):

  # {method} {0x0000000055120518} 
  # 'testIterator' '(Ljava/util/List;)I' in 'Test'
  # parm0:    rdx:rdx   = 'java/util/List'
  #           [sp+0x20]  (sp of caller)
  0x00000000028e7560: mov    %eax,-0x6000(%rsp)
  0x00000000028e7567: push   %rbp

  ;*synchronization entry
  ; - Test::testIterator@-1 (line 15)
  0x00000000028e7568: sub    $0x10,%rsp         
                                                
  ; implicit exception: dispatches to 0x00000000028e75bd
  0x00000000028e756c: mov    0x8(%rdx),%r10d    

  ;   {metadata('java/util/Collections$SingletonList')}
  0x00000000028e7570: cmp    $0x14d66a20,%r10d  

  ;*synchronization entry
  ; - java.util.Collections::singletonIterator@-1
  ; - java.util.Collections$SingletonList::iterator@4
  ; - Test::testIterator@3 (line 16)
  0x00000000028e7577: jne    0x00000000028e75a0 

  ;*getfield element
  ; - java.util.Collections$SingletonList::iterator@1
  ; - Test::testIterator@3 (line 16)
  0x00000000028e7579: mov    0x10(%rdx),%ebp    

  ; implicit exception: dispatches to 0x00000000028e75c9
  0x00000000028e757c: mov    0x8(%rbp),%r11d    

  ;   {metadata('java/lang/String')}
  0x00000000028e7580: cmp    $0x14d216d0,%r11d  
  0x00000000028e7587: jne    0x00000000028e75b1

  ;*checkcast
  ; - Test::testIterator@24 (line 16)
  0x00000000028e7589: mov    %rbp,%r10          
                                                
  ;*getfield value
  ; - java.lang.String::length@1
  ; - Test::testIterator@30 (line 17)
  0x00000000028e758c: mov    0xc(%r10),%r10d    

  ;*synchronization entry
  ; - Test::testIterator@-1 (line 15)
  ; implicit exception: dispatches to 0x00000000028e75d5
  0x00000000028e7590: mov    0xc(%r10),%eax     
                                                
                                               
  0x00000000028e7594: add    $0x10,%rsp
  0x00000000028e7598: pop    %rbp

  # 0x0000000000130000
  0x00000000028e7599: test   %eax,-0x27b759f(%rip)        
         
  ;   {poll_return}                                       
  0x00000000028e759f: retq   
  ... // slow paths follow

What you can notice is that it’s surprisingly short. I took the liberty of annotating what happens here:

// Standard stack frame: every method has such prolog
mov    %eax,-0x6000(%rsp)
push   %rbp
sub    $0x10,%rsp         
// Load the class identifier from the list argument (which is stored in the rdx
// register), like list.getClass(). This also does an implicit null check: if
// null is supplied, the CPU will trigger a hardware exception. The exception
// will be caught by the JVM and translated into a NullPointerException
mov    0x8(%rdx),%r10d
// Compare list.getClass() with class ID of Collections$SingletonList class 
// which is constant and known to JIT
cmp    $0x14d66a20,%r10d
// If list is not singleton list, jump out to the slow path
jne    0x00000000028e75a0
// Read Collections$SingletonList.element private field into rbp register
mov    0x10(%rdx),%ebp
// Read its class identifier and check whether it's actually a String
mov    0x8(%rbp),%r11d
cmp    $0x14d216d0,%r11d
// Jump out to the exceptional path if not (this will create and throw
// ClassCastException)
jne    0x00000000028e75b1
// Read the private field String.value into r10, which is the char[] array
// containing the String content
mov    %rbp,%r10
mov    0xc(%r10),%r10d
// Read the array length field into eax register (by default method returns
// its value via eax/rax)
mov    0xc(%r10),%eax
// Standard method epilog
add    $0x10,%rsp
pop    %rbp
// Safe-point check (so JVM can take the control if necessary, for example,
// to perform garbage collection)
test   %eax,-0x27b759f(%rip)
// Return
retq   

If it’s still hard to understand, let’s rewrite it via pseudo-code:

if (list.class != Collections$SingletonList) {
  goto SLOW_PATH;
}
str = ((Collections$SingletonList)list).element;
if (str.class != String) {
  goto EXCEPTIONAL_PATH;
}
return ((String)str).value.length;

So for the hot path we have no iterator allocated and no loop, just several dereferences and two quick checks (which are always false, so the CPU branch predictor will predict them nicely). The iterator object is evaporated completely, though originally it has additional bookkeeping like tracking whether it was already called and throwing a NoSuchElementException in that case. The JIT compiler statically proved that these parts of the code are unnecessary and removed them. The sum variable is also evaporated. Nevertheless the method is correct: if in the future it is called with something different from a singleton list, it will handle this situation on the SLOW_PATH (which is of course much longer). Other cases like list == null or the list element not being a String are also handled.

What will occur if your program pattern changes? Imagine that at some point you are no longer using singleton lists and pass different list implementations here. When the JIT discovers that the SLOW_PATH is hit too often, it will recompile the method to remove the special handling of the singleton list. This is different from pre-compiled applications: the JIT can change your code following the behavioral changes of your program.

How to Know if a Given Index Can be Dropped


It seems that perfection is attained not when there is nothing more to add, but when there is nothing more to remove.

Antoine de Saint Exupéry in Terre des Hommes

As SQL developers, we keep adding more and more indexes to our tables. Every time we run new queries that are potentially slow, a new index might solve the problem. But it also creates new problems. The more indexes you have, the slower your insertions and updates will become, and obviously, the more disk space your data will use.

But how do we know if we can remove an existing index? How do we know if there won’t be any performance regression?

In Oracle, there’s a way to answer this question! Just look at the cursor cache of your production system. In a previous blog post about UNIQUE constraints, we added an index called “I1DATA” to our database, and we ran some queries that used this index.

Several days later, we want to know if that index is still being used in our “production” database. Just run the following query:

SELECT *
FROM v$sql
JOIN v$sql_plan USING (sql_id)
WHERE object_name = 'I1DATA'

This query uses Oracle’s wonderful system views, which reveal tons of information about your production system. Learn them, especially v$sql and v$sql_plan, and impress your coworkers!

The above query returns something like the following data:

SQL_TEXT                             LAST_ACTIVE
------------------------------------------------
SELECT COUNT(*) FROM X1 JOIN Y1  ... 14.07.16
SELECT count(*) FROM x1 JOIN y1  ... 14.07.16
SELECT count(*) FROM x1 WHERE a  ... 14.07.16
SELECT COUNT(*) FROM X1 WHERE A  ... 14.07.16

A whole lot of additional information about the actual execution plan is contained in the result set, e.g. whether the index was used for an INDEX RANGE SCAN, or something else. The interesting thing at this point is:

Yes, the index is still being used by several queries in the cursor cache

… so we probably cannot delete it yet.
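
If you want to see a complete cached plan (not just the fact that the index appears in it), you can feed one of the SQL_IDs from the previous query into DBMS_XPLAN. A sketch, with a placeholder sql_id value you’d replace with a real one from v$sql:

SELECT *
FROM TABLE(dbms_xplan.display_cursor(
  sql_id          => '7h35uxf5uhmm1',  -- placeholder: use a SQL_ID from v$sql
  cursor_child_no => NULL,
  format          => 'BASIC +PREDICATE'
));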

Caution

These views won’t reliably tell you whether an index is used. They are based on the cursor cache, which may be configured in various ways, e.g. such that cursors aren’t kept for very long, or that not all cursors are cached.

In any case, the following rules can be established:

  • If an index appears in the cursor cache several times, you shouldn’t remove it
  • If an index appears in the cursor cache only once, perhaps it isn’t really needed for that particular query. Analyse
  • If an index does not appear in the cursor cache, run the query again, frequently, and after some time (and if you’re sure), you can remove it

Caveats

  • Some indexes may have been created for reports that run very infrequently. The suggested approach won’t cover that scenario
  • This information is most reliable from the production system. You cannot possibly obtain any meaningful information from your developer box

Remember: This is just a nice trick. Not a reliable rule. But it certainly does help you assess whether that big, expensive index that doesn’t appear to be useful can be safely removed.
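
If you’re on Oracle and want a second opinion, index usage monitoring is another option. A sketch (note that V$OBJECT_USAGE shows only indexes owned by the current schema, and “used at least once” is still a weak signal, it says nothing about how often):

-- Start recording whether the index is used at all
ALTER INDEX i1data MONITORING USAGE;

-- ... let the production workload run for a while, then check:
SELECT index_name, used, start_monitoring
FROM v$object_usage
WHERE index_name = 'I1DATA';

-- Stop recording
ALTER INDEX i1data NOMONITORING USAGE;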

How Adding a UNIQUE Constraint on a OneToOne Relationship Helps Performance


A lot of people use SQL constraints mainly to enforce data integrity, and that’s already a very good thing. A UNIQUE constraint, for instance, makes sure that there is at most one instance of any possible value (or tuple, in the case of a composite constraint) in a table. For instance:

CREATE TABLE x (
  a NUMBER(10),
  UNIQUE (a)
);

-- This works:
INSERT INTO x VALUES (1);

-- This fails:
INSERT INTO x VALUES (1);

Constraints are also good for (SELECT) performance

One thing that people often do not think about, though, is the fact that constraints can also be used as very valuable meta information by the database optimiser to make a better decision when finding the most optimal execution plan. It is a big difference for an optimiser…

  • To be unaware of how many different values there are in any given column (worst case)
  • To be able to estimate the number of different values in any given column (statistics are present)
  • To know that each value can appear at most once

In the last case, a UNIQUE (or PRIMARY KEY) constraint tells the optimiser exactly that. Once the optimiser knows that a certain operation returns at most one row (and not 10, 100, or even 10M rows), a nested loop suddenly becomes extremely cheap, as it can abort as soon as one row was found.

A Java analogy

Compare this with the following Java code:

// This is much "harder" ...
List<Object> objects = unknownSize();
for (Object object : objects) {
    doSomethingWith(object);
}

// ... than this, where we "know" the list
// only contains one value
for (Object object : Collections.singletonList("abc")) {
    doSomethingWith(object);
}

If Java had such optimisation capabilities (bonus question: can the JIT do it?), then the second loop could be optimised as such:

// The loop can be avoided:
doSomethingWith("abc");

Let’s look at Oracle

Let’s look at a concrete example with Oracle:

CREATE TABLE x1 (
  a NUMBER(10) NOT NULL,
  data VARCHAR2(100),
  CONSTRAINT pk_y1 PRIMARY KEY (a)
);

CREATE TABLE y1 (
  a NUMBER(10) NOT NULL, 
  optional_data VARCHAR2(100),
  CONSTRAINT fk_y1 FOREIGN KEY (a) REFERENCES x1(a)
);

CREATE INDEX i1 ON y1(a);
CREATE INDEX i1data ON x1(data);

INSERT INTO x1
SELECT level, dbms_random.string('a', 100)
FROM dual
CONNECT BY level <= 10000;

INSERT INTO y1
SELECT level, dbms_random.string('a', 100)
FROM dual
CONNECT BY level <= 5000;

This is a typical one-to-many relationship between x1 and y1. With the constraints in place, there can be between 0 and N rows in y1 for each row in x1. As a user, we know that there is only one value in y1 for any value in x1, but we don’t enforce this knowledge with a constraint.

Let’s look at the following query:

SELECT count(*)
FROM x1
JOIN y1 USING (a)
WHERE data LIKE 'a%';

What we’re doing here is fetching all values in x1 whose data starts with the letter ‘a’, and for which we also have some optional_data. The execution plan is:

-----------------------------------------------------
| Operation                | Name   | Rows  | Cost  |
-----------------------------------------------------
| SELECT STATEMENT         |        |       |    29 |
|  SORT AGGREGATE          |        |     1 |       |
|   HASH JOIN              |        |   176 |    29 |
|    VIEW                  |        |   176 |    24 |
|     HASH JOIN            |        |       |       |
|      INDEX RANGE SCAN    | I1DATA |   176 |     4 |
|      INDEX FAST FULL SCAN| PK_Y1  |   176 |    24 |
|    INDEX FAST FULL SCAN  | I1     |  5000 |     5 |
-----------------------------------------------------

As you can see, Oracle chooses to run a hash join operation, which means that all the values from x1 starting with ‘a’ are fetched (around 176), and joined in a hashmap with the entire set of values in y1, fetched from the index i1 (5000 values).

How does this compare with using a UNIQUE constraint?

We’ll create almost the exact same schema as follows:

CREATE TABLE x2 (
  a NUMBER(10) NOT NULL,
  data VARCHAR2(100),
  CONSTRAINT pk_x2 PRIMARY KEY (a)
);

CREATE TABLE y2 (
  a NUMBER(10) NOT NULL, 
  optional_data VARCHAR2(100),
  CONSTRAINT uk_y2 UNIQUE (a),
  CONSTRAINT fk_y2 FOREIGN KEY (a) REFERENCES x2(a)
);

CREATE INDEX i2data ON x2(data);

INSERT INTO x2
SELECT * FROM x1;

INSERT INTO y2
SELECT * FROM y1;

BEGIN
  dbms_stats.gather_table_stats('TEST', 'X2');
  dbms_stats.gather_table_stats('TEST', 'Y2');
END;
/

The data is exactly the same, but now we enforce a UNIQUE constraint on y2’s foreign key, making this effectively a one-to-one relationship. Check out what happens when we run the exact same query…

SELECT count(*)
FROM x2
JOIN y2 USING (a)
WHERE data LIKE 'a%';

Its execution plan is now:

-----------------------------------------------------
| Operation                | Name   | Rows  | Cost  |
-----------------------------------------------------
| SELECT STATEMENT         |        |       |    25 |
|  SORT AGGREGATE          |        |     1 |       |
|   NESTED LOOPS           |        |   176 |    25 |
|    VIEW                  |        |   176 |    25 |
|     HASH JOIN            |        |       |       |
|      INDEX RANGE SCAN    | I2DATA |   176 |     5 |
|      INDEX FAST FULL SCAN| PK_X2  |   176 |    24 |
|    INDEX UNIQUE SCAN     | UK_Y2  |     1 |     0 |
-----------------------------------------------------

As you can see, the overall cost has decreased from 29 to 25 as we’re no longer using a hash join, but a nested loop join operation, which is probably faster if our statistics are not way off, as we only have to look up the single value in y2 corresponding to x2 for each of x2’s estimated 176 rows that start with the letter ‘a’.

But let’s not trust the execution plan, let’s benchmark things (as discussed in a previous article). Run the following code in SQL Developer, for instance:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 1000;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM x1
      JOIN y1 USING (a)
      WHERE data LIKE 'a%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Without UNIQUE constraint: '
    || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM x2
      JOIN y2 USING (a)
      WHERE data LIKE 'a%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('With    UNIQUE constraint: '
    || (SYSTIMESTAMP - v_ts));
END;
/

The above benchmark repeats each individual query 1000 times. The results speak for themselves:

Without UNIQUE constraint: +000000000 00:00:04.250000000
With    UNIQUE constraint: +000000000 00:00:02.847000000

Remark

The lack of a UNIQUE constraint may happen in situations where you prefer using a surrogate primary key in the referencing table (which I omitted in the examples for brevity). If you’re “sharing” the primary key or use natural keys, this issue won’t happen to you, of course.
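
In other words, if the referencing table had looked like this (a sketch of the surrogate-PK variant, with a hypothetical table y3), the UNIQUE constraint on the foreign key column would still be worth adding:

CREATE TABLE y3 (
  id            NUMBER(10) NOT NULL,
  a             NUMBER(10) NOT NULL,
  optional_data VARCHAR2(100),
  CONSTRAINT pk_y3 PRIMARY KEY (id),
  CONSTRAINT uk_y3 UNIQUE (a),
  CONSTRAINT fk_y3 FOREIGN KEY (a) REFERENCES x2(a)
);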

Conclusion

The execution planner can make a more informed decision if it has formal knowledge about your data via an additional UNIQUE constraint. This formal knowledge is by far more powerful than any statistics that might indicate the same thing. In the absence of a formal UNIQUE constraint, the database will always have to make sure there is not another row once it has found one. With a formal UNIQUE constraint, it can stop looking as soon as that unique row was found. This can drastically speed up queries. As we’ve seen in the above example, this improves things by a factor of 1.5, so the second query is 50% faster!

Always tell your database as much as you can. Your SELECT performance will greatly increase, at the small cost of a little overhead when inserting data.

Quantified Comparison Predicates – Some of SQL’s Rarest Species


A recent Tweet by Aaron Bertrand (whom you’ve certainly encountered on Stack Overflow) has triggered my interest.

Indeed, few people I’ve met and who’ve visited my SQL masterclass have heard about the ANY and ALL quantifiers, which you can use in SQL, let alone used them on a regular basis (or ever used them at all).

What are these things?

The formal definition is rather simple. Let

R <comp op> <quantifier> S

… be a quantified comparison predicate where

  • R is a row
  • <comp op> is a comparison operator (like =, !=, etc.)
  • <quantifier> is ANY or ALL
  • S is a subquery

You can now write things like:

SELECT *
FROM author AS a
WHERE a.author_id = ANY (
  SELECT b.author_id
  FROM book AS b
)

The above reads relatively intuitively (depending on your understanding of “intuitive”):

I want all authors whose author_id is equal to “ANY” of the author_id values contained in book

In fact, the SQL standard defines the IN predicate as being just syntax sugar for the = ANY() quantified comparison predicate.

8.4 <in predicate>

Let RVC be the <row value predicand> and 
let IPV be the <in predicate value>.

...

The expression

  RVC IN IPV

is equivalent to

  RVC = ANY IPV
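
So the first example query could just as well have been written with IN — per the standard, the two are the same thing:

SELECT *
FROM author AS a
WHERE a.author_id IN (
  SELECT b.author_id
  FROM book AS b
)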

OK, so why even use these weird operators?

The interesting case for these operators is not where the comparison operator is an equality (“=”) or non-equality (“!=” or “<>”). The interesting case is where we’re comparing things with less-than or greater-than operators, such as:

SELECT *
FROM author AS a
WHERE a.age > ALL (
  SELECT c.age
  FROM customer AS c
)

Or, in plain English:

I want all authors whose age is greater than the age of all customers

That’s neat, isn’t it? Of course, there are (at least) two alternative ways of tackling the same query, namely:

Comparing only with the MAX()

SELECT *
FROM author AS a
WHERE a.age > (
  SELECT MAX(c.age)
  FROM customer AS c
)

This solution compares the author’s age only with the maximum customer age, which obviously yields the same result. Once we’ve found an author older than the oldest customer, then that author is older than ALL the customers.

Instead of using MAX(), we can use the following, semantically equivalent query:

Comparing only with the “first”

SELECT *
FROM author AS a
WHERE a.age > (
  SELECT c.age
  FROM customer AS c
  ORDER BY c.age DESC
  LIMIT 1
)

I’m using PostgreSQL syntax (LIMIT 1), although most databases have some way of returning only one row from a query (see the jOOQ manual for details). This is the same as using MAX(). Some databases recognise the fact that we’re only returning a single row, and they won’t sort the entire result set first in O(N log N), but yield the maximum value only in O(N), and only once for all authors, instead of once per author. Whether this optimisation is done in your database, I can’t say. You should measure this yourself, and check out execution plans.
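
A minimal way to do that measurement, assuming PostgreSQL and the author / customer tables from the examples:

-- Quantified comparison predicate
EXPLAIN ANALYZE
SELECT * FROM author AS a
WHERE a.age > ALL (SELECT c.age FROM customer AS c);

-- Aggregate function
EXPLAIN ANALYZE
SELECT * FROM author AS a
WHERE a.age > (SELECT MAX(c.age) FROM customer AS c);

-- ORDER BY ... LIMIT 1
EXPLAIN ANALYZE
SELECT * FROM author AS a
WHERE a.age > (SELECT c.age FROM customer AS c ORDER BY c.age DESC LIMIT 1);

Compare the actual run times, and whether the planner sorts, aggregates, or uses an index for each variant.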

So, again, why use quantifiers?

Aaron, from the Twitter conversation, would advise against using this “unintuitive” syntax.

Perhaps. Ultimately, you should always choose performance first, and then – most certainly – intuitiveness second (because some poor soul might need to maintain your query). But personally, I find these quantifiers quite elegant for three reasons:

  1. They express the quantification right where it belongs. With the comparison operator. Compare this with the solution using LIMIT, which may be far away, visually, from the greater-than operator. Quantifiers are much more concise, even than when using MAX() (in my opinion)
  2. They’re very set oriented. I like thinking in terms of sets when I work with SQL. Whenever I can omit the ORDER BY clause, I will. If only to avoid potentially slow operations (in case the database doesn’t optimise this, and a full O(N log N) sort operation is invoked)
  3. Quantified comparison predicates work on rows too, not just on single values.

Check this out:

SELECT (c1.last_name, c1.first_name) >= ALL (
  SELECT c2.last_name, c2.first_name
  FROM customer AS c2
)
FROM customer AS c1
WHERE c1.id = 1

The above query will yield a single column containing a “TRUE” or “FALSE” value (e.g. in PostgreSQL, which supports this exact syntax). The idea is that we’re running a query for customer with id = 1 and we want to see if that customer’s (last_name, first_name) tuple is “after” all the other customers’. Or in plain English:

Am I at the end of the phone book?

Again, you could do this with LIMIT:

SELECT (c1.last_name, c1.first_name) >= (
  SELECT c2.last_name, c2.first_name
  FROM customer AS c2
  ORDER BY c2.last_name DESC, c2.first_name DESC
  LIMIT 1
)
FROM customer AS c1
WHERE c1.id = 1

… but I find this much less elegant.

Unfortunately, this time there’s no way to solve this problem with MAX(). No database (that I’m aware of) supports using row value expressions (tuples) for aggregate functions like MAX(). This would be cool:

SELECT (c1.last_name, c1.first_name) >= (
  SELECT MAX((c2.last_name, c2.first_name))
  FROM customer AS c2
)
FROM customer AS c1
WHERE c1.id = 1

Or this:

SELECT (c1.last_name, c1.first_name) >= (
  SELECT MAX(ROW(c2.last_name, c2.first_name))
  FROM customer AS c2
)
FROM customer AS c1
WHERE c1.id = 1

Of course, the above is just a constructed example. Yes, you could also use window functions instead (e.g. LEAD()).
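
For completeness, a sketch of the window function variant (PostgreSQL syntax, same customer table): compute the “next” customer in phone book order, and you’re at the end when there is none.

SELECT at_the_end
FROM (
  SELECT
    c.id,
    LEAD(c.id) OVER (ORDER BY c.last_name, c.first_name) IS NULL AS at_the_end
  FROM customer AS c
) t
WHERE t.id = 1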

Conclusion

Quantified comparison predicates are very rarely seen in the wild. One reason is, few people know about them, or think of them when they’re solving a problem. Another reason might be, they’re not well optimised in some databases, so it’s actually a bad idea to use them.

Nonetheless, I find them to be very interesting SQL trivia to know about. And who knows, perhaps one day, you do run into a problem that is indeed best solved with quantified comparison predicates. Then you’ll shine with the most optimal solution.

“What Java ORM do You Prefer, and Why?” – SQL of Course!


Catchy headline, yes. But check out this Stack Overflow question by user Mike:

(I’m duplicating it here on the blog, as it might be deleted soon)

It’s a pretty open ended question. I’ll be starting out a new project and am looking at different ORMs to integrate with database access.

Do you have any favorites? Are there any you would advise staying clear of?

And the top voted answer (164 points, by David Crawshaw) is: “Just use SQL”:

I have stopped using ORMs.

The reason is not any great flaw in the concept. Hibernate works well. Instead, I have found that queries have low overhead and I can fit lots of complex logic into large SQL queries, and shift a lot of my processing into the database.

So consider just using the JDBC package.

The second answer (66 points by user simon) is, again: “Just use SQL”:

None, because having an ORM takes too much control away with small benefits. The time savings gained are easily blown away when you have to debug abnormalities resulting from the use of the ORM. Furthermore, ORMs discourage developers from learning SQL and how relational databases work and using this for their benefit.

The third answer (51 points, by myself) says, once more: “Use SQL” (and use it with jOOQ).

Only the fourth answer (46 points, by Abdullah Jibaly) mentions Hibernate, the most popular ORM in the Java ecosystem.

The truth is, as we’ve shown numerous times on this blog: Hibernate/JPA/ORMs are good tools to get rid of boring (and complex) CRUD. But that’s just boilerplate logic with little value to your business logic. The interesting stuff – the queries, the batch and bulk processing, the analytics, the reporting – is all best done with SQL.

Stay tuned as we’re entering an era of programming where object orientation fades, and functional / declarative programming makes data processing extremely easy and lean again.

Say NO to Venn Diagrams When Explaining JOINs


In recent times, there have been a couple of tremendously popular blog posts explaining JOINs using Venn Diagrams. After all, relational algebra and SQL are set oriented theories and languages, so it only makes sense to illustrate set operations like JOINs using Venn Diagrams. Right?

Google seems to say so:

[Image: Google image search results for SQL JOINs, full of Venn diagrams]

Everyone uses Venn Diagrams to explain JOINs. But that’s…

PLAIN WRONG!

Venn Diagrams are perfect to illustrate … actual set operations! SQL knows three of them:

  • UNION
  • INTERSECT
  • EXCEPT

And they can be explained as such:

[Images: Venn diagrams illustrating UNION, INTERSECT, and EXCEPT]

(all of these slides are taken from our Data Geekery SQL Training, do check it out!)

Most of you use UNION occasionally. INTERSECT and EXCEPT are more exotic, but do come in handy every now and then.

The point here is: these set operations operate on sets of elements (tuples), which are all of the same type. As in the examples above, all elements are people with first and last names. This is also why INTERSECT and EXCEPT are more exotic, because they’re usually not very useful. JOIN is much more useful. For instance, you want to combine the set of actors with their corresponding set of films.
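
To make the difference concrete, a sketch using a Sakila-style schema (actor, film, film_actor and customer are assumed here, they’re not part of the examples above):

-- Set operation: both sides must produce the same row type
SELECT first_name, last_name FROM actor
UNION
SELECT first_name, last_name FROM customer;

-- JOIN: combines two different row types into one wider row
SELECT a.first_name, a.last_name, f.title
FROM actor a
JOIN film_actor fa ON fa.actor_id = a.actor_id
JOIN film f ON f.film_id = fa.film_id;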

A JOIN is really a cartesian product (also cross product) with a filter. Here’s a nice illustration of a cartesian product:

[Image: illustration of a cartesian product]

So, what’s a better way to illustrate JOIN operations?

JOIN diagrams! Let’s look at CROSS JOIN first, because all other JOIN types can be derived from CROSS JOIN:

[Image: CROSS JOIN diagram]

Remember, a cross join (in SQL historically also written with a comma-separated table list) just takes every item on the left side and combines it with every item on the right side. When you CROSS JOIN a table of 3 rows with a table of 4 rows, you will get 3×4=12 result rows. See, I’m using an “x” character to write the multiplication. I.e. a “cross”.
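
In SQL, that looks like this (a sketch with made-up tables t1 of 3 rows and t2 of 4 rows):

-- Both spellings produce 3 x 4 = 12 rows
SELECT *
FROM t1
CROSS JOIN t2;

SELECT *
FROM t1, t2;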

INNER JOIN

All other joins are still based on cross joins, but with additional filters, and perhaps unions. Here’s an explanation of each individual JOIN type.

[Image: INNER JOIN diagram]

In plain text, an INNER JOIN is a CROSS JOIN in which only those combinations are retained which fulfil a given predicate. For instance:

-- "Classic" ANSI JOIN syntax
SELECT *
FROM author a
JOIN book b ON a.author_id = b.author_id

-- "Nice" ANSI JOIN syntax
SELECT *
FROM author a
JOIN book b USING (author_id)

-- "Old" syntax using a "CROSS JOIN"
SELECT *
FROM author a, book b
WHERE a.author_id = b.author_id

OUTER JOIN

OUTER JOIN types help where we also want to retain those rows from the LEFT side, the RIGHT side, or both (FULL) sides, for which there was no matching row, i.e. no row for which the predicate yielded true.

A LEFT OUTER JOIN in relational algebra is defined as such:

R ⟕ S = (R ⋈ S) ∪ ((R − π_{r1, …, rn}(R ⋈ S)) × {(ω, …, ω)})   (where ω stands for NULL)

Or more verbosely in SQL:

SELECT *
FROM author a
LEFT JOIN book b USING (author_id)

This will produce all the authors and their books, but if an author doesn’t have any book, we still want to get the author with NULL as their only book value. So, it’s the same as writing:

SELECT *
FROM author a
JOIN book b USING (author_id)

UNION

SELECT a.*, NULL, NULL, NULL, ..., NULL
FROM (
  SELECT a.*
  FROM author a
  
  EXCEPT
  
  SELECT a.*
  FROM author a
  JOIN book b USING (author_id)
) a

But no one wants to write that much SQL, so OUTER JOIN was implemented.

Conclusion: Say NO to Venn Diagrams

JOINs are relatively easy to understand intuitively. And they’re relatively easy to explain using Venn Diagrams. But whenever you do that, remember that you’re making a wrong analogy. A JOIN is not strictly a set operation that can be described with Venn Diagrams. A JOIN is always a cross product with a predicate, and possibly a UNION to add additional rows to the OUTER JOIN result.

So, if in doubt, please use JOIN diagrams rather than Venn Diagrams. They’re more accurate and visually more useful.

[Image: the same Google image search results, with a “NO” stamped across them]

(Remember, all of these slides are taken from our Data Geekery SQL Training, do get in touch, if you’re interested)