How to Fetch All Current Identity Values in Oracle

Oracle 12c has introduced the useful SQL standard IDENTITY feature, which is essentially just syntax sugar for binding a sequence to a column default. We can use it like this:

create table t1 (col1 number generated always as identity);
create table t2 (col2 number generated always as identity);

insert into t1 values (default);
insert into t1 values (default);
insert into t1 values (default);
insert into t2 values (default);

select * from t1;
select * from t2;

Which produces

COL1
----
  1
  2
  3

COL2
----
  1

For unit testing against our database, we might want to know what “state” our identities are in. For each table, we would like to know the next value such an identity would produce. If we knew all the backing sequence names, we could query their seq.currval, but we don’t know those sequence names as they are generated.

However, we can query the dictionary views to get this information as follows:

select data_default
from user_tab_cols
where data_default is not null
and identity_column = 'YES'
and table_name in ('T1', 'T2');

An alternative is to query user_tab_identity_cols

This would produce:

"TEST"."ISEQ$$_116601".nextval
"TEST"."ISEQ$$_116603".nextval

Now, if we’re lazy, we could just run EXECUTE IMMEDIATE on each of those expressions and we’re done:

set serveroutput on
declare
  v_current number;
begin
  for rec in (
    select table_name, data_default
    from user_tab_cols
    where data_default is not null
    and identity_column = 'YES'
    and table_name in ('T1', 'T2')
  ) loop
    execute immediate replace(
      'select ' || rec.data_default || ' from dual', 
      '.nextval', 
      '.currval'
    ) into v_current;
    dbms_output.put_line(
      'Table : ' || rec.table_name || 
      ', currval : ' || v_current
    );
  end loop;
end;
/

This would produce:

Table : T1, currval : 3
Table : T2, currval : 1

Alternatively, if you want this result to be a SQL result instead of DBMS_OUTPUT content, you could run this:

with
  function current_value(p_table_name varchar2) return number is
    v_current number;
  begin
    for rec in (
      select data_default
      from user_tab_cols
      where table_name = p_table_name
      and data_default is not null
      and identity_column = 'YES'
    )
    loop
      execute immediate replace(
        'select ' || rec.data_default || ' from dual', 
        '.nextval', 
        '.currval'
      ) into v_current;
      return v_current;
    end loop;
    
    return null;
  end;
select *
from (
  select table_name, current_value(table_name) current_value
  from user_tables
  where table_name in ('T1', 'T2')
)
where current_value is not null
order by table_name;
/

The alternative using user_tab_identity_cols would look like this:

with
  function current_value(p_table_name varchar2) return number is
    v_current number;
  begin
    for rec in (
      select sequence_name
      from user_tab_identity_cols
      where table_name = p_table_name
    )
    loop
      execute immediate 
        'select ' || rec.sequence_name || '.currval from dual'
      into v_current;
      return v_current;
    end loop;
     
    return null;
  end;
select *
from (
  select table_name, current_value(table_name) current_value
  from user_tables
)
where current_value is not null
order by table_name;
/

The result is now a nice SQL result set:

TABLE_NAME   CURRENT_VALUE
--------------------------
T1           3
T2           1

How to Use jOOQ’s Commercial Distributions with Spring Boot

Spring Boot is great to get started very quickly with what the Spring Boot authors have evaluated to be useful defaults. This can be a lot of help when you’re doing things for the first time, and have no way to copy paste working Maven pom.xml files from existing projects, for example.

When working with the jOOQ Open Source Edition, just go to https://start.spring.io, add the jOOQ dependency, and start working!

It is a bit different when you want to work with the commercial distributions of jOOQ, for two reasons:

  1. They are not on Maven Central, but in your own repository or artifactory, after you’ve installed the latest version from our website: https://www.jooq.org/download/versions
  2. They use a different Maven groupId, to make sure the different distributions can be easily distinguished.

The different groupIds for jOOQ distributions are:

org.jooq For the jOOQ Open Source Edition
org.jooq.trial For the jOOQ Trial Edition
org.jooq.pro For the jOOQ Express, Professional and Enterprise Edition (supporting the latest JDK versions)
org.jooq.pro-java-6 For the jOOQ Express, Professional and Enterprise Edition (supporting Java 6+)
org.jooq.pro-java-8 For the jOOQ Express, Professional and Enterprise Edition (supporting Java 8+, starting from jOOQ 3.12)

Spring Boot doesn’t know this, and doesn’t have to. All of these distributions are largely source and binary compatible, so you can switch editions in your application simply by replacing dependencies. A vanilla https://start.spring.io pom.xml configuration might look like this.

Notice: I’m leaving out spring-boot-starter-test, spring-boot-maven-plugin, and other things not essential for this blog post, please use https://start.spring.io to generate a more complete pom.xml stub!

<project>
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.1.6.RELEASE</version>
  </parent>
  <groupId>com.example</groupId>
  <artifactId>demo</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <dependencies>
    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-jooq</artifactId>
    </dependency>
  </dependencies>
</project>

What dependencies are we getting from this?

mvn dependency:tree

We’re getting:

[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ demo ---
[INFO] com.example:demo:jar:0.0.1-SNAPSHOT
[INFO] \- org.springframework.boot:spring-boot-starter-jooq:jar:2.1.6.RELEASE:compile
[INFO]    +- org.springframework.boot:spring-boot-starter-jdbc:jar:2.1.6.RELEASE:compile
[INFO]    |  +- org.springframework.boot:spring-boot-starter:jar:2.1.6.RELEASE:compile
[INFO]    |  |  +- org.springframework.boot:spring-boot:jar:2.1.6.RELEASE:compile
[INFO]    |  |  |  \- org.springframework:spring-context:jar:5.1.8.RELEASE:compile
[INFO]    |  |  |     +- org.springframework:spring-aop:jar:5.1.8.RELEASE:compile
[INFO]    |  |  |     \- org.springframework:spring-expression:jar:5.1.8.RELEASE:compile
[INFO]    |  |  +- org.springframework.boot:spring-boot-autoconfigure:jar:2.1.6.RELEASE:compile
[INFO]    |  |  +- org.springframework.boot:spring-boot-starter-logging:jar:2.1.6.RELEASE:compile
[INFO]    |  |  |  +- ch.qos.logback:logback-classic:jar:1.2.3:compile
[INFO]    |  |  |  |  \- ch.qos.logback:logback-core:jar:1.2.3:compile
[INFO]    |  |  |  +- org.apache.logging.log4j:log4j-to-slf4j:jar:2.11.2:compile
[INFO]    |  |  |  |  \- org.apache.logging.log4j:log4j-api:jar:2.11.2:compile
[INFO]    |  |  |  \- org.slf4j:jul-to-slf4j:jar:1.7.26:compile
[INFO]    |  |  +- javax.annotation:javax.annotation-api:jar:1.3.2:compile
[INFO]    |  |  \- org.yaml:snakeyaml:jar:1.23:runtime
[INFO]    |  +- com.zaxxer:HikariCP:jar:3.2.0:compile
[INFO]    |  |  \- org.slf4j:slf4j-api:jar:1.7.26:compile
[INFO]    |  \- org.springframework:spring-jdbc:jar:5.1.8.RELEASE:compile
[INFO]    +- org.springframework:spring-tx:jar:5.1.8.RELEASE:compile
[INFO]    |  +- org.springframework:spring-beans:jar:5.1.8.RELEASE:compile
[INFO]    |  \- org.springframework:spring-core:jar:5.1.8.RELEASE:compile
[INFO]    |     \- org.springframework:spring-jcl:jar:5.1.8.RELEASE:compile
[INFO]    \- org.jooq:jooq:jar:3.11.11:compile
[INFO]       \- javax.xml.bind:jaxb-api:jar:2.3.1:compile
[INFO]          \- javax.activation:javax.activation-api:jar:1.2.0:compile

When this blog post was written, 3.11.11 was the latest jOOQ Open Source Edition version. But perhaps, you want a newer version or an older version. You can override this easily by specifying the ${jooq.version} property in Maven:

<project>
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.1.6.RELEASE</version>
  </parent>
  <groupId>com.example</groupId>
  <artifactId>demo</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  
  <properties>
    <jooq.version>3.11.0</jooq.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-jooq</artifactId>
    </dependency>
  </dependencies>
</project>

The dependency tree is now:

[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ demo ---
[INFO] com.example:demo:jar:0.0.1-SNAPSHOT
[INFO] \- org.springframework.boot:spring-boot-starter-jooq:jar:2.1.6.RELEASE:compile
[INFO]    +- org.springframework.boot:spring-boot-starter-jdbc:jar:2.1.6.RELEASE:compile
[INFO]    |  +- org.springframework.boot:spring-boot-starter:jar:2.1.6.RELEASE:compile
[INFO]    |  |  +- org.springframework.boot:spring-boot:jar:2.1.6.RELEASE:compile
[INFO]    |  |  |  \- org.springframework:spring-context:jar:5.1.8.RELEASE:compile
[INFO]    |  |  |     +- org.springframework:spring-aop:jar:5.1.8.RELEASE:compile
[INFO]    |  |  |     \- org.springframework:spring-expression:jar:5.1.8.RELEASE:compile
[INFO]    |  |  +- org.springframework.boot:spring-boot-autoconfigure:jar:2.1.6.RELEASE:compile
[INFO]    |  |  +- org.springframework.boot:spring-boot-starter-logging:jar:2.1.6.RELEASE:compile
[INFO]    |  |  |  +- ch.qos.logback:logback-classic:jar:1.2.3:compile
[INFO]    |  |  |  |  \- ch.qos.logback:logback-core:jar:1.2.3:compile
[INFO]    |  |  |  +- org.apache.logging.log4j:log4j-to-slf4j:jar:2.11.2:compile
[INFO]    |  |  |  |  \- org.apache.logging.log4j:log4j-api:jar:2.11.2:compile
[INFO]    |  |  |  \- org.slf4j:jul-to-slf4j:jar:1.7.26:compile
[INFO]    |  |  +- javax.annotation:javax.annotation-api:jar:1.3.2:compile
[INFO]    |  |  \- org.yaml:snakeyaml:jar:1.23:runtime
[INFO]    |  +- com.zaxxer:HikariCP:jar:3.2.0:compile
[INFO]    |  |  \- org.slf4j:slf4j-api:jar:1.7.26:compile
[INFO]    |  \- org.springframework:spring-jdbc:jar:5.1.8.RELEASE:compile
[INFO]    +- org.springframework:spring-tx:jar:5.1.8.RELEASE:compile
[INFO]    |  +- org.springframework:spring-beans:jar:5.1.8.RELEASE:compile
[INFO]    |  \- org.springframework:spring-core:jar:5.1.8.RELEASE:compile
[INFO]    |     \- org.springframework:spring-jcl:jar:5.1.8.RELEASE:compile
[INFO]    \- org.jooq:jooq:jar:3.11.0:compile
[INFO]       \- javax.xml.bind:jaxb-api:jar:2.3.1:compile
[INFO]          \- javax.activation:javax.activation-api:jar:1.2.0:compile

But it’s still the jOOQ Open Source Edition. What if you want a commercial distribution, e.g. to try out jOOQ? One way is to explicitly exclude Spring Boot’s transitive jOOQ Open Source Edition dependency, and introduce your own explicit dependency. For example:

<project>
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.1.6.RELEASE</version>
  </parent>
  <groupId>com.example</groupId>
  <artifactId>demo</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  
  <properties>
    <jooq.version>3.11.11</jooq.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-jooq</artifactId>
    
      <!-- Exclude the jOOQ Open Source Edition -->
      <exclusions>
        <exclusion>
          <groupId>org.jooq</groupId>
          <artifactId>jooq</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  
    <!-- Include a commercial jOOQ distribution -->
    <dependency>
      <groupId>org.jooq.trial</groupId>
      <artifactId>jooq</artifactId>
      <version>${jooq.version}</version>
    </dependency>
  </dependencies>
</project>

The new dependency tree is now:

[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ demo ---
[INFO] com.example:demo:jar:0.0.1-SNAPSHOT
[INFO] +- org.springframework.boot:spring-boot-starter-jooq:jar:2.1.6.RELEASE:compile
[INFO] |  +- org.springframework.boot:spring-boot-starter-jdbc:jar:2.1.6.RELEASE:compile
[INFO] |  |  +- org.springframework.boot:spring-boot-starter:jar:2.1.6.RELEASE:compile
[INFO] |  |  |  +- org.springframework.boot:spring-boot:jar:2.1.6.RELEASE:compile
[INFO] |  |  |  |  \- org.springframework:spring-context:jar:5.1.8.RELEASE:compile
[INFO] |  |  |  |     +- org.springframework:spring-aop:jar:5.1.8.RELEASE:compile
[INFO] |  |  |  |     \- org.springframework:spring-expression:jar:5.1.8.RELEASE:compile
[INFO] |  |  |  +- org.springframework.boot:spring-boot-autoconfigure:jar:2.1.6.RELEASE:compile
[INFO] |  |  |  +- org.springframework.boot:spring-boot-starter-logging:jar:2.1.6.RELEASE:compile
[INFO] |  |  |  |  +- ch.qos.logback:logback-classic:jar:1.2.3:compile
[INFO] |  |  |  |  |  \- ch.qos.logback:logback-core:jar:1.2.3:compile
[INFO] |  |  |  |  +- org.apache.logging.log4j:log4j-to-slf4j:jar:2.11.2:compile
[INFO] |  |  |  |  |  \- org.apache.logging.log4j:log4j-api:jar:2.11.2:compile
[INFO] |  |  |  |  \- org.slf4j:jul-to-slf4j:jar:1.7.26:compile
[INFO] |  |  |  +- javax.annotation:javax.annotation-api:jar:1.3.2:compile
[INFO] |  |  |  \- org.yaml:snakeyaml:jar:1.23:runtime
[INFO] |  |  +- com.zaxxer:HikariCP:jar:3.2.0:compile
[INFO] |  |  |  \- org.slf4j:slf4j-api:jar:1.7.26:compile
[INFO] |  |  \- org.springframework:spring-jdbc:jar:5.1.8.RELEASE:compile
[INFO] |  \- org.springframework:spring-tx:jar:5.1.8.RELEASE:compile
[INFO] |     +- org.springframework:spring-beans:jar:5.1.8.RELEASE:compile
[INFO] |     \- org.springframework:spring-core:jar:5.1.8.RELEASE:compile
[INFO] |        \- org.springframework:spring-jcl:jar:5.1.8.RELEASE:compile
[INFO] \- org.jooq.trial:jooq:jar:3.11.11:compile
[INFO]    \- javax.xml.bind:jaxb-api:jar:2.3.1:compile
[INFO]       \- javax.activation:javax.activation-api:jar:1.2.0:compile

And you’re all set!

How to Write a Simple, yet Extensible API

How to write a simple API is already an art on its own.

I didn’t have time to write a short letter, so I wrote a long one instead.

― Mark Twain

But keeping an API simple for beginners and most users, and making it extensible for power users seems even more of a challenge. But is it?

What does “extensible” mean?

Imagine an API like, oh say, jOOQ. In jOOQ, you can write SQL predicates like this:

ctx.select(T.A, T.B)
   .from(T)
   .where(T.C.eq(1)) // Predicate with bind value here
   .fetch();

By default (as this should always be the default), jOOQ will generate and execute this SQL statement on your JDBC driver, using a bind variable:

SELECT t.a, t.b
FROM t
WHERE t.c = ?

The API made the most common use case simple. Just pass your bind variable as if the statement was written in e.g. PL/SQL, and let the language / API do the rest. So we passed that test.

The use case for power users is to occasionally not use bind variables, for whatever reasons (e.g. skew in data and bad statistics, see also this post about bind variables).Will we pass that test as well?

jOOQ mainly offers two ways to fix this:

On a per-query basis

You can turn your variable into an inline value explicitly for this single occasion:

ctx.select(T.A, T.B)
   .from(T)
   .where(T.C.eq(inline(1))) // Predicate without bind value here
   .fetch();

This is using the static imported DSL.inline() method. Works, but not very convenient, if you have to do this for several queries, for several bind values, or worse, depending on some context.

This is a necessary API enhancement, but it does not make the API extensible.

On a global basis

Notice that ctx object there? It is the DSLContext object, the “contextual DSL”, i.e. the DSL API that is in the context of a jOOQ Configuration. You can thus set:

ctx2 = DSL.using(ctx
    .configuration()
    .derive()
    .set(new Settings()
    .withStatementType(StatementType.STATIC_STATEMENT));

// And now use this new DSLContext instead of the old one
ctx2.select(T.A, T.B)
    .from(T)
    .where(T.C.eq(1)) // No longer a bind variable
    .fetch();

Different approaches to offering such extensibility

We have our clean and simple API. Now some user wants to extend it. So often, we’re tempted to resort to a hack, e.g. by using thread locals, because they would work easily when under the assumption of a thread-bound execution model – such as e.g. classic Java EE Servlets

The price we’re paying for such a hack is high.

  1. It’s a hack, and as such it will break easily. If we offer this as functionality to a user, they will start depending on it, and we will have to support and maintain it
  2. It’s a hack, and it is based on assumptions, such as thread bound ness. It will not work in an async / reactive / parallel stream context, where our logic may jump back and forth between threads
  3. It’s a hack, and deep inside, we know it’s wrong. Obligatory XKCD: https://xkcd.com/292

This might obviously work, just like global (static) variables. You can set this variable globally (or “globally” for your own thread), and then the API’s internals will be able to read it. No need to pass around parameters, so no need to compromise on the APIs simplicity by adding optional and often ugly, distractive parameters.

What are better approaches to offering such extensibility?

Dependency Injection

One way is to use explicit Dependency Injection (DI). If you have a container like Spring, you can rely on Spring injecting arbitrary objects into your method call / whatever, where you need access to it:

This way, if you maintain several contextual objects of different lifecycle scopes, you can let the DI framework make appropriate decisions to figure out where to get that contextual information from. For example, when using JAX-RS, you can do this using an annotation based approach:


// These annotations bind the method to some HTTP address
@GET
@Produces("text/plain")
@Path("/api")
public String method(

    // This annotation fetches a request-scoped object
    // from the method call's context
    @Context HttpServletRequest request,

    // This annotation produces an argument from the
    // URL's query parameters
    @QueryParam("arg") String arg
) {
    ...
}

This approach works quite nicely for static environments (annotations being static), where you do not want to react to dynamic URLs or endpoints. It is declarative, and a bit magic, but well designed, so once you know all the options, you can choose the right one for your use case very easily.

While @QueryParam is mere convenience (you could have gotten the argument also from the HttpServletRequest), the @Context is powerful. It can help inject values of arbitrary lifecycle scope into your method / class / etc.

I personally favour explicit programming over annotation-based magic (e.g. using Guice for DI), but that’s probably a matter of taste. Both are a great way for implementors of APIs (e.g. HTTP APIs) to help get access to framework objects.

However, if you’re an API vendor, and want to give users of your API a way to extend the API, I personally favour jOOQ’s SPI approach.

SPIs

One of jOOQ’s strengths, IMO, is precisely this single, central place to register all SPI implementations that can be used for all sorts of purposes: The Configuration.

For example, on such a Configuration you can specify a JSR-310 java.time.Clock. This clock will be used by jOOQ’s internals to produce client side timestamps, instead of e.g. using System.currentTimeMillis(). Definitely a use case for power users only, but once you have this use case, you really only want to tweak a single place in jOOQ’s API: The Configuration.

All of jOOQ’s internals will always have a Configuration reference available. And it’s up to the user to decide what the scope of this object is, jOOQ doesn’t care. E.g.

  • per query
  • per thread
  • per request
  • per session
  • per application

In other words, to jOOQ, it doesn’t matter at all if you’re implementing a thread-bound, blocking, classic servlet model, or if you’re running your code reactively, or in parallel, or whatever. Just manage your own Configuration lifecycle, jOOQ doesn’t care.

In fact, you can have a global, singleton Configuration and implement thread bound components of it, e.g. the ConnectionProvider SPI, which takes care of managing the JDBC Connection lifecycle for jOOQ. Typically, users will use e.g. a Spring DataSource, which manages JDBC Connection (and transactions) using a thread-bound model, internally using ThreadLocal. jOOQ does not care. The SPI specifies that jOOQ will:

Again, it does not matter to jOOQ what the specific ConnectionProvider implementation does. You can implement it in any way you want if you’re a power user. By default, you’ll just pass jOOQ a DataSource, and it will wrap it in a default implementation called DataSourceConnectionProvider for you.

The key here is again:

  • The API is simple by default, i.e. by default, you don’t have to know about this functionality, just pass jOOQ a DataSource as always when working with Java and SQL, and you’re ready to go
  • The SPI allows for easily extending the API without compromising on its simplicity, by providing a single, central access point to this kind of functionality

Other SPIs in Configuration include:

  • ExecuteListener: An extremely useful and simple way to hook into the entire jOOQ query management lifecycle, from generating the SQL string to preparing the JDBC statement, to binding variables, to execution, to fetching result sets. A single SPI can accomodate various use cases like SQL logging, patching SQL strings, patching JDBC statements, listening to result set events, etc.
  • ExecutorProvider: Whenever jOOQ runs something asynchronously, it will ask this SPI to provide a standard JDK Executor, which will be used to run the asynchronous code block. By default, this will be the JDK default (the default ForkJoinPool), as always. But you probably want to override this default, and you want to be in full control of this, and not think about it every single time you run a query.
  • MetaProvider: Whenever jOOQ needs to look up database meta information (schemas, tables, columns, types, etc.), it will ask this MetaProvider about the available meta information. By default, this will run queries on the JDBC DatabaseMetaData, which is good enough, but maybe you want to wire these calls to your jOOQ-generated classes, or something else.
  • RecordMapperProvider and RecordUnmapperProvider: jOOQ has a quite versatile default implementation of how to map between a jOOQ Record and an arbitrary Java class, supporting a variety of standard approaches including JavaBeans getter/setter naming conventions, JavaBeans @ConstructorProperties, and much more. These defaults apply e.g. when writing query.fetchInto(MyBean.class). But sometimes, the defaults are not good enough, and you want this particular mapping to work differently. Sure, you could write query.fetchInto(record -> mymapper(record)), but you may not want to remember this for every single query. Just override the mapper (and unmapper) at a single, central spot for your own chosen Configuration scope (e.g. per query, per request, per session, etc.) and you’re done

Conclusion

Writing a simple API is difficult. Making it extensible in a simple way, however, is not. If your API has achieved “simplicity”, then it is very easy to support injecting arbitrary SPIs for arbitrary purposes at a single, central location, such as jOOQ’s Configuration.

In my most recent talk “10 Reasons Why we Love Some APIs and Why we Hate Some Others”, I’ve made a point that things like simplicity, discoverability, consistency, and convenience are among the most important aspects of a great API. How do you define a good API? The most underrated answer on this (obviously closed) Stack Overflow question is this one:

.

Again, this is hard in terms of creating a simple API. But it is extremely easy when making this simple API extensible. Make your SPIs very easily discoverable. A jOOQ power user will always look for extension points in jOOQ’s Configuration. And because the extension points are explicit types which have to be implemented (as opposed to annotations and their magic), no documentation is needed to learn the SPI (of course it is still beneficial as a reference).

I’d love to hear your alternative approaches to this API design challenge in the comments.

Watch the full talk here:

Using IGNORE NULLS With SQL Window Functions to Fill Gaps

I found a very interesting SQL question on Twitter recently:

Rephrasing the question: We have a set of sparse data points:

+------------+-------+
| VALUE_DATE | VALUE |
+------------+-------+
| 2019-01-01 |   100 |
| 2019-01-02 |   120 |
| 2019-01-05 |   125 |
| 2019-01-06 |   128 |
| 2019-01-10 |   130 |
+------------+-------+

Since dates can be listed as discrete, continuous data points, why not fill in the gaps between 2019-01-02 and 2019-01-05 or 2019-01-06 and 2019-01-10? The desired output would be:

+------------+-------+
| VALUE_DATE | VALUE |
+------------+-------+
| 2019-01-01 |   100 |
| 2019-01-02 |   120 | <-+
| 2019-01-03 |   120 |   | -- Generated
| 2019-01-04 |   120 |   | -- Generated
| 2019-01-05 |   125 |
| 2019-01-06 |   128 | <-+
| 2019-01-07 |   128 |   | -- Generated
| 2019-01-08 |   128 |   | -- Generated
| 2019-01-09 |   128 |   | -- Generated
| 2019-01-10 |   130 |
+------------+-------+

In the generated columns, we’ll just repeat the most recent value.

How to do this with SQL?

For the sake of this example, I’m using Oracle SQL, as the OP was expecting to do this with Oracle. The idea is to do this in two steps:

  1. Generate all the dates between the first and the last data points
  2. For each date, find either the current data point, or the most recent one

But first, let’s create the data:

create table t (value_date, value) as
  select date '2019-01-01', 100 from dual union all
  select date '2019-01-02', 120 from dual union all
  select date '2019-01-05', 125 from dual union all
  select date '2019-01-06', 128 from dual union all
  select date '2019-01-10', 130 from dual;

1. Generating all the dates

In Oracle, we can use the convenient CONNECT BY syntax for this. We could also use some other tool to generate dates to fill the gaps, including SQL standard recursion using WITH, or some PIPELINED function, but I like CONNECT BY for this purpose.

We’ll write:

select (
  select min(t.value_date) 
  from t
) + level - 1 as value_date
from dual
connect by level <= (
  select max(t.value_date) - min(t.value_date) + 1
  from t
)

This produces:

VALUE_DATE|
----------|
2019-01-01|
2019-01-02|
2019-01-03|
2019-01-04|
2019-01-05|
2019-01-06|
2019-01-07|
2019-01-08|
2019-01-09|
2019-01-10|

Now we wrap the above query in a derived table and left join the actual data set:

select 
  d.value_date,
  t.value
from (
  select (
    select min(t.value_date) 
    from t
  ) + level - 1 as value_date
  from dual
  connect by level <= (
    select max(t.value_date) - min(t.value_date) + 1
    from t
  )
) d
left join t
on d.value_date = t.value_date
order by d.value_date;

The date gaps are now filled, but our values column is still sparse:

VALUE_DATE|VALUE|
----------|-----|
2019-01-01|  100|
2019-01-02|  120|
2019-01-03|     |
2019-01-04|     |
2019-01-05|  125|
2019-01-06|  128|
2019-01-07|     |
2019-01-08|     |
2019-01-09|     |
2019-01-10|  130|

2. Fill the value gaps

On each row, the VALUE column should either contain the actual value, or the “last_value” preceding the current row, ignoring all the nulls. Note that I specifically wrote this requirement using specific English language. We can now translate that sentence directly to SQL:

last_value (t.value) ignore nulls over (order by d.value_date)

Since we have added an ORDER BY clause to the window function, the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW applies, which colloquially means “all the preceding rows”. (Technically, that’s not accurate. It means all rows with values less than or equal to the value of the current row – see Kim Berg Hansen’s comment)

Convenient! We’re trying to find the last value in the window of all the preceding rows, ignoring the nulls.

This is standard SQL, but unfortunately not all RDBMS support IGNORE NULLS. Among the ones supported by jOOQ, currently these ones support the syntax:

  • DB2
  • H2
  • Informix
  • Oracle
  • Redshift
  • Sybase SQL Anywhere
  • Teradata

Sometimes, not the exact standard syntax is supported, but the standard feature. Use https://www.jooq.org/translate to see different syntax variants.

The full query now reads:

select 
  d.value_date,
  last_value (t.value) ignore nulls over (order by d.value_date)
from (
  select (
    select min(t.value_date) 
    from t
  ) + level - 1 as value_date
  from dual
  connect by level <= (
    select max(t.value_date) - min(t.value_date) + 1
    from t
  )
) d
left join t
on d.value_date = t.value_date
order by d.value_date;

… and it yields the desired result:

VALUE_DATE         |VALUE|
-------------------|-----|
2019-01-01 00:00:00|  100|
2019-01-02 00:00:00|  120|
2019-01-03 00:00:00|  120|
2019-01-04 00:00:00|  120|
2019-01-05 00:00:00|  125|
2019-01-06 00:00:00|  128|
2019-01-07 00:00:00|  128|
2019-01-08 00:00:00|  128|
2019-01-09 00:00:00|  128|
2019-01-10 00:00:00|  130|

Other RDBMS

This solution made use of some Oracle specific features such as CONNECT BY. In other RDBMS, the same idea can be implemented by using a different way of generating data. This article focuses only on using IGNORE NULLS. If you’re interested, feel free to post an alternative solution in the comments for your RDBMS.

Calling an Oracle Function with PL/SQL BOOLEAN Type from SQL

One of the most wanted features in the Oracle database is the BOOLEAN type. The SQL standard specified it a while ago, and RDBMS like PostgreSQL show how powerful it can be, e.g. when using the EVERY() aggregate function.

The PL/SQL language already has support for boolean types. We can write:

CREATE OR REPLACE FUNCTION number_to_boolean (i NUMBER) 
RETURN BOOLEAN 
IS
BEGIN
  RETURN NOT i = 0;
END number_to_boolean;
/

CREATE OR REPLACE FUNCTION boolean_to_number (b BOOLEAN) 
RETURN NUMBER 
IS
BEGIN
  RETURN CASE WHEN b THEN 1 WHEN NOT b THEN 0 END;
END boolean_to_number;
/

From PL/SQL, we can now easily call the above functions:

SET SERVEROUTPUT ON
BEGIN
  IF number_to_boolean(1) THEN
    dbms_output.put_line('1 is true');
  END IF;
  IF NOT number_to_boolean(0) THEN
    dbms_output.put_line('0 is false');
  END IF;
  IF number_to_boolean(NULL) IS NULL THEN
    dbms_output.put_line('null is null');
  END IF;
END;
/

The above prints

1 is true
0 is false
null is null

But we cannot do the same from the SQL engine:

SELECT 
  number_to_boolean(1), 
  number_to_boolean(0), 
  number_to_boolean(null) 
FROM dual;

This yields:

ORA-00902: invalid datatype

Eventually, Oracle will fix this by supporting boolean types in the SQL engine (show your love to Oracle here).

The WITH clause

Until then, we can make use of a nice workaround using new functionality from Oracle 12c. We can declare functions in the WITH clause! Run this:

WITH
  FUNCTION f RETURN NUMBER IS 
  BEGIN 
    RETURN 1; 
  END f;
SELECT f
FROM dual;

You’ll get

 F
---
 1

That’s wonderful, and what’s even better, this part of the WITH clause is written in PL/SQL, where we can use the BOOLEAN type again. So we can define bridge functions for each function call. Instead of this:

SELECT 
  number_to_boolean(1), 
  number_to_boolean(0), 
  number_to_boolean(null) 
FROM dual;

We can write this:

WITH
  FUNCTION number_to_boolean_(i NUMBER)
  RETURN NUMBER
  IS
    b BOOLEAN;
  BEGIN
    -- Actual function call
    b := number_to_boolean(i);
    
    -- Translation to numeric result
    RETURN CASE b WHEN TRUE THEN 1 WHEN FALSE THEN 0 END;
  END number_to_boolean_;
SELECT 
  number_to_boolean_(1) AS a, 
  number_to_boolean_(0) AS b, 
  number_to_boolean_(null) AS c
FROM dual;

This now yields:

 A   B   C
-------------
 1   0   null

Of course, we don’t get an actual boolean type back in the result set, as the SQL engine cannot process that. But if you’re calling this function from JDBC, 1/0/null can be translated transparently to true/false/null.

It also works for chaining. Instead of the following, which still yields ORA-00902:

SELECT 
  boolean_to_number(number_to_boolean(1)), 
  boolean_to_number(number_to_boolean(0)), 
  boolean_to_number(number_to_boolean(null))
FROM dual;

We can write this:

WITH
  FUNCTION number_to_boolean_(i NUMBER)
  RETURN NUMBER
  IS
    b BOOLEAN;
  BEGIN
    -- Actual function call
    b := number_to_boolean(i);
    
    -- Translation to numeric result
    RETURN CASE b WHEN TRUE THEN 1 WHEN FALSE THEN 0 END;
  END number_to_boolean_;
  
  FUNCTION boolean_to_number_(b NUMBER)
  RETURN NUMBER
  IS
  BEGIN
    -- Actual function call
    RETURN boolean_to_number(NOT b = 0);
  END boolean_to_number_;
SELECT 
  boolean_to_number_(number_to_boolean_(1)) AS a, 
  boolean_to_number_(number_to_boolean_(0)) AS b, 
  boolean_to_number_(number_to_boolean_(null)) AS c
FROM dual;

… which again yields

 A   B   C
-------------
 1   0   null

And now, the 1/0/null integers are the actual desired result types.

This technique can be automated for any type of PL/SQL function that accepts and/or returns a PL/SQL BOOLEAN type, or even for functions that accept %ROWTYPE parameters, which we’ll work into jOOQ soon, in the near future.

A more real world example can be seen in this Stack Overflow question.

jOOQ 3.12 support

In jOOQ 3.12, we will add native support for using such functions in SQL through #8522. We have already supported PL/SQL boolean types in standalone procedure calls since jOOQ 3.8. With the next version, we can call a function like this one:

FUNCTION f_bool (i BOOLEAN) RETURN BOOLEAN;

From anywhere within a jOOQ statement, e.g.

Record1<Integer> r =
create()
    .select(one())
    .where(PlsObjects.fBool(false))
    .fetchOne();

assertNull(r);

When the above is called, the following SQL statement is generated by jOOQ 3.12, behind the scenes:

with
  function "F_BOOL_"(I integer)
  return integer
  is
    "r" boolean;
  begin
    "r" := "TEST"."PLS_OBJECTS"."F_BOOL"(not I = 0);
    return case when "r" then 1 when not "r" then 0 end;
  end "F_BOOL_";
  select 1
from dual
where (F_BOOL_(0) = 1)

Notice how the boolean expression codes like a true boolean / predicate?

The Difference Between SQL’s JOIN .. ON Clause and the Where Clause

A question that is frequently occurring among my SQL training‘s participants is:

What’s the difference between putting a predicate in the JOIN .. ON clause and the WHERE clause?

I can definitely see how that’s confusing some people, as there seems to be no difference at first sight, when running queries like these, e.g. in Oracle. I’m using the Sakila database, as always:

-- First query
SELECT a.actor_id, a.first_name, a.last_name, count(fa.film_id)
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
WHERE fa.film_id < 10
GROUP BY a.actor_id, a.first_name, a.last_name
ORDER BY count(fa.film_id) DESC;

This will yield something like:

ACTOR_ID  FIRST_NAME  LAST_NAME  COUNT
--------------------------------------
108       WARREN      NOLTE      3
162       OPRAH       KILMER     3
19        BOB         FAWCETT    2
10        CHRISTIAN   GABLE      2
53        MENA        TEMPLE     2
137       MORGAN      WILLIAMS   1
2         NICK        WAHLBERG   1

Of course, we could have written this instead, and received the same result:

-- Second query
SELECT a.actor_id, a.first_name, a.last_name, count(fa.film_id)
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
  AND fa.film_id < 10
GROUP BY a.actor_id, a.first_name, a.last_name
ORDER BY count(fa.film_id) DESC;

Now, I’ve moved the FILM_ID < 10 filter from the WHERE clause to the ON clause. But the execution plan is the same for both queries:

---------------------------------------------------------
| Id  | Operation               | Name          | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT        |               |    49 |
|   1 |  SORT ORDER BY          |               |    49 |
|   2 |   HASH GROUP BY         |               |    49 |
|*  3 |    HASH JOIN            |               |    49 |
|*  4 |     INDEX FAST FULL SCAN| PK_FILM_ACTOR |    49 |
|   5 |     VIEW                | VW_GBF_7      |   200 |
|   6 |      TABLE ACCESS FULL  | ACTOR         |   200 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("ITEM_1"="FA"."ACTOR_ID")
   4 - filter("FA"."FILM_ID"<10)

It does not seem matter at all. Both queries yield the same result as well as the same plan. So…

Are ON and WHERE really the same thing?

They are when you run an inner join. But they are not when you run an outer join.

And now, let’s compare these two queries here:

-- First query
SELECT a.actor_id, a.first_name, a.last_name, count(fa.film_id)
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
WHERE fa.film_id < 10
GROUP BY a.actor_id, a.first_name, a.last_name
ORDER BY count(fa.film_id) ASC;

Yielding

ACTOR_ID  FIRST_NAME  LAST_NAME  COUNT
--------------------------------------
194       MERYL       ALLEN      1
198       MARY        KEITEL     1
30        SANDRA      PECK       1
85        MINNIE      ZELLWEGER  1
123       JULIANNE    DENCH      1

Notice that with this syntax, we’re not getting any actors that have no films with FILM_ID < 10. We should get dozens! How about this:

-- Second query
SELECT a.actor_id, a.first_name, a.last_name, count(fa.film_id)
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
  AND fa.film_id < 10
GROUP BY a.actor_id, a.first_name, a.last_name
ORDER BY count(fa.film_id) ASC;

This used to produce the same result for an (INNER) JOIN, but given the LEFT JOIN, we’re now not getting Susan Davis in the result:

ACTOR_ID  FIRST_NAME  LAST_NAME     COUNT
-----------------------------------------
3         ED          CHASE         0
4         JENNIFER    DAVIS         0
5         JOHNNY      LOLLOBRIGIDA  0
6         BETTE       NICHOLSON	    0
...
1         PENELOPE    GUINESS       1
200       THORA       TEMPLE        1
2         NICK        WAHLBERG      1
198       MARY        KEITEL        1

The plans are also different:

---------------------------------------------------------
| Id  | Operation               | Name          | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT        |               |    49 |
|   1 |  SORT ORDER BY          |               |    49 |
|   2 |   HASH GROUP BY         |               |    49 |
|*  3 |    HASH JOIN            |               |    49 |
|*  4 |     INDEX FAST FULL SCAN| PK_FILM_ACTOR |    49 |
|   5 |     VIEW                | VW_GBF_7      |   200 |
|   6 |      TABLE ACCESS FULL  | ACTOR         |   200 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("ITEM_1"="FA"."ACTOR_ID")
   4 - filter("FA"."FILM_ID"<10)

No outer join here! Versus

---------------------------------------------------------------
| Id  | Operation                     | Name          | Rows  |
---------------------------------------------------------------
|   0 | SELECT STATEMENT              |               |   200 |
|   1 |  SORT ORDER BY                |               |   200 |
|   2 |   MERGE JOIN OUTER            |               |   200 |
|   3 |    TABLE ACCESS BY INDEX ROWID| ACTOR         |   200 |
|   4 |     INDEX FULL SCAN           | PK_ACTOR      |   200 |
|*  5 |    SORT JOIN                  |               |    44 |
|   6 |     VIEW                      | VW_GBC_5      |    44 |
|   7 |      HASH GROUP BY            |               |    44 |
|*  8 |       INDEX FAST FULL SCAN    | PK_FILM_ACTOR |    49 |
---------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - access("A"."ACTOR_ID"="ITEM_1"(+))
       filter("A"."ACTOR_ID"="ITEM_1"(+))
   8 - filter("FILM_ID"(+)<10)

The first query did not produce an outer join operation, the second one did!

What’s the difference?

The difference is:

  • An INNER JOIN produces all the actors who played in at least one film, filtering out the actors who did not play in a film. That’s the very definition of an inner join. If we filter the films with FILM_ID < 10, that simply means we don’t want any actors without such films either.
  • A LEFT JOIN will produce all the rows from the left side of the join, regardless if there is a matching row on the right side of the join.

In both cases, the matching rows are determined by the ON clause. If two rows don’t match, then:

  • The INNER JOIN removes them both from the result
  • The LEFT JOIN retains the left row in the result

But regardless what the JOIN produces, the WHERE clause will again remove rows that do not satisfy the filter. So,

  • In the INNER JOIN case, it does not matter if we remove actors with no films, and then actors without films with FILM_ID < 10, OR if we remove actors with no films with FILM_ID < 10 directly. They’re going to be removed anyway.
  • In the LEFT JOIN case, it does matter if we retain actors with no films, and then remove actors without films with FILM_ID < 10 (in case of which actors without films will be removed again), OR if we retain actors without films with FILM_ID < 10, and then not apply any further filters.

Conclusion

For INNER JOIN, WHERE predicates and ON predicates have the same effect.

For OUTER JOIN, WHERE predicates and ON predicates have a different effect.

In general, it is always best to put a predicate where it belongs, logically. If the predicate is related to a JOIN operation, it belongs in the ON clause. If a predicate is related to a filter applied to the entire FROM clause, it belongs in the WHERE clause.

The Cost of Useless Surrogate Keys in Relationship Tables

What’s a good natural key?

This is a very difficult question for most entities when you design your schema. In some rare cases, there seems to be an “obvious” candidate, such as a variety of ISO standards, including:

But even in those cases, there might be exceptions and the worst thing that can happen is a key change. Most database designs play it safe and use surrogate keys instead. Nothing wrong with that. But…

Relationship tables

There is one exception where a surrogate key is never really required. Those are relationship tables. For example, in the Sakila database, all relationship tables lack a surrogate key and use their respective foreign keys as a compound “natural” primary key instead:

So, the FILM_ACTOR table, for example, is defined as such:

CREATE TABLE film_actor (
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,

  CONSTRAINT film_actor_pkey PRIMARY KEY (actor_id, film_id)
);

There is really no point in adding another column FILM_ACTOR_ID or ID for an individual row in this table, even if a lot of ORMs and non-ORM-defined schemas will do this, simply for “consistency” reasons (and in a few cases, because they cannot handle compound keys).

Now, the presence or absence of such a surrogate key is usually not too relevant in every day work with this table. If you’re using an ORM, it will likely make no difference to client code. If you’re using SQL, it definitely doesn’t. You just never use that additional column.

But in terms of performance, it might make a huge difference!

Clustered indexes

In many RDBMS, when creating a table, you get to choose whether to use a “clustered index” or a “non clustered index” table layout. The main difference is:

Clustered index

… is a primary key index that “clusters” data together, which belongs together. In other words:

  • All the index column values are contained in the index tree structure
  • All the other column values are contained in the index leaf nodes

The benefit of this table layout is that primary key lookups can be much faster because your entire row is located in the index, which requires less disk I/O than the non clustered index for primary key lookups. The price for this is slower secondary index searches (e.g. searching for last names). The algorithmic complexities are:

  • O(log N) for primary key lookups
  • O(log N) for secondary key lookups plus O(M log N) for projections of non-secondary-key columns (quite a high price to pay)

… where

  • N is the size of the table
  • M is the number of rows that are searched in secondary keys

OLTP usage often profits from clustered indexes.

Non clustered index

… is a primary key index that resides “outside” of the table structure, which is a heap table. In other words:

  • All the index column values are contained in the index tree structure
  • All the index column values and other column values are contained in the heap table

The benefit of this table layout is that all lookups are equally fast, regardless if you’re using a primary key lookup or a secondary key search. There’s always an additional, constant time heap table lookup. The algorithmic complexities are:

  • O(log N) for primary key lookups plus O(1) for projections of non-primary-key columns (a moderate price to pay)
  • O(log N) for secondary key lookups plus O(M) for projections of non-secondary-key columns (a moderate price to pay)

OLAP usage definitely profits from heap tables.

Defaults

  • MySQL’s InnoDB offers clustered indexes only.
  • MySQL’s MyISAM offers heap tables only.
  • Oracle offers both and defaults to heap tables
  • PostgreSQL offers both and defaults to heap tables
  • SQL Server offers both and defaults to clustered indexes

Note that Oracle calls clustered indexes “index organised tables”

Performance

In this article, I’m checking MySQL’s performance as MySQL’s InnoDB doesn’t offer to switch the table layout. Curiously, the problems shown below could not be reproduced on PostgreSQL as shown by reddit user /u/ForeverAlot. Details here.

With the algorithmic complexities above, we can easily guess what I’m trying to hint at here. In the presence of a clustered index, we should avoid expensive secondary key searches when possible. Of course, these searches cannot always be avoided, but if we review the alternative design of these two tables:

CREATE TABLE film_actor_surrogate (
  id int NOT NULL,
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,

  CONSTRAINT film_actor_surrogate_pkey PRIMARY KEY (id)
);

CREATE TABLE film_actor_natural (
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,

  CONSTRAINT film_actor_pkey PRIMARY KEY (actor_id, film_id)
);

… we can see that if we’re using a clustered index here, the clustering will be made based on either:

  • FILM_ACTOR_SURROGATE.ID, which is a very useless clustering
  • (FILM_ACTOR_NATURAL.ACTOR_ID, FILM_ACTOR_NATURAL.FILM_ID), which is a very useful clustering

In the latter case, whenever we look up an actor’s films, we can use the clustering index as a covering index, regardless if we project anything additional from that table or not.

In the former case, we have to rely on an additional secondary key index that contains (ACTOR_ID, FILM_ID), and chances are that secondary index is not covering if we have additional projections.

The surrogate key clustering is really useless, because we never use the table this way.

Does it matter?

We can easily design a benchmark for this case. You can find the complete benchmark code here on GitHub, to validate the results on your environment. The benchmark uses this database design:

create table parent_1 (id int not null primary key);
create table parent_2 (id int not null primary key);

create table child_surrogate (
  id int auto_increment, 
  parent_1_id int not null references parent_1, 
  parent_2_id int not null references parent_2, 
  payload_1 int, 
  payload_2 int, 
  primary key (id), 
  unique (parent_1_id, parent_2_id)
) -- ENGINE = MyISAM /* uncomment to use MyISAM (heap tables) */
;

create table child_natural (
  parent_1_id int not null references parent_1, 
  parent_2_id int not null references parent_2, 
  payload_1 int, 
  payload_2 int, 
  primary key (parent_1_id, parent_2_id)
) -- ENGINE = MyISAM /* uncomment to use MyISAM (heap tables) */
;

Unlike in the Sakila database, we’re now adding some “payload” to the relationship table, which is not unlikely. Recent versions of MySQL will default to InnoDB, which only supports a clustered index layout. You can uncomment the ENGINE storage clause to see how this would perform with MyISAM, which only supports heap tables.

The benchmark adds:

  • 10 000 rows in PARENT_1
  • 100 rows in PARENT_2
  • 1 000 000 rows in both CHILD tables (just a cross join of the above)

And then, it runs 5 iterations of 10000 repetitions of the following two queries, following our standard SQL benchmark technique:

-- Query 1
SELECT c.payload_1 + c.payload_2 AS a 
FROM parent_1 AS p1 
JOIN child_surrogate AS c ON p1.id = c.parent_1_id 
WHERE p1.id = 4;

-- Query 2
SELECT c.payload_1 + c.payload_2 AS a 
FROM parent_1 AS p1 
JOIN child_natural AS c ON p1.id = c.parent_1_id 
WHERE p1.id = 4;

Notice that MySQL does not implement join elimination, otherwise, the useless join to PARENT_1 would be eliminated. The benchmark results are very clear:

Using InnoDB (clustered indexes)

Run 0, Statement 1 : 3104
Run 0, Statement 2 : 1910
Run 1, Statement 1 : 3097
Run 1, Statement 2 : 1905
Run 2, Statement 1 : 3045
Run 2, Statement 2 : 2276
Run 3, Statement 1 : 3589
Run 3, Statement 2 : 1910
Run 4, Statement 1 : 2961
Run 4, Statement 2 : 1897

Using MyISAM (heap tables)

Run 0, Statement 1 : 3473
Run 0, Statement 2 : 3288
Run 1, Statement 1 : 3328
Run 1, Statement 2 : 3341
Run 2, Statement 1 : 3674
Run 2, Statement 2 : 3307
Run 3, Statement 1 : 3373
Run 3, Statement 2 : 3275
Run 4, Statement 1 : 3298
Run 4, Statement 2 : 3322

You shouldn’t read this as a comparison between InnoDB and MyISAM in general, but as a comparison of the different table structures within the boundaries of the same engine. Very obviously, the additional search complexity of the badly clustered index in CHILD_SURROGATE causes a 50% slower query execution on this type of query, without gaining anything.

In the case of the heap table, the additional surrogate key column did not have any significant effect.

Again, the full benchmark can be found here on GitHub, if you want to repeat it.

Conclusion

Not everyone agrees what is generally better: clustered or non clustered indexes. Not everyone agrees on the utility of surrogate keys on every table. These are both quite opinionated discussions.

But this article clearly showed that on relationship tables, which have a very clear candidate key, namely the set of outgoing foreign keys that defines the many-to-many relationship, the surrogate key not only doesn’t add value, but it actively hurts your performance on a set of queries when your table is using a clustered index.

MySQL’s InnoDB and SQL Server use clustered indexes by default, so if you’re using any of those RDBMS, do check if you have room for significant improvement by dropping your surrogate keys.