Calculate Percentiles to Learn About Data Set Skew in SQL

B-Tree indexes are perfect when your data is uniformly distributed. They are not really useful when you have skewed data. I’ll explain later why this is the case, but let’s first learn how to detect “skew”.

What is skew?

Skew is a term from statistics describing a probability distribution that is not symmetric. The example given on Wikipedia shows a distribution like this:

In RDBMS, we sometimes use the term “skew” colloquially to mean the same thing as non-uniform distribution, i.e. in that sense, even a normal distribution would be “skewed”. We simply mean that some values appear more often than others. Thus, I will put the term “skew” in double quotes in this article. While your RDBMS’s statistics contain this information once they are calculated, we can also detect such “skew” manually in ad-hoc queries using percentiles, which are defined in the SQL standard and supported as ordinary aggregate functions in a variety of databases, including:

  • Oracle
  • PostgreSQL
  • SQL Server (regrettably, only as window functions)
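In SQL Server, that caveat means we have to add an OVER () clause and remove the duplicate rows ourselves. A minimal sketch (assuming the Sakila film table used below) to get the median film_id:

-- SQL Server: percentile functions exist only as window functions
SELECT DISTINCT
  percentile_disc(0.5) WITHIN GROUP (ORDER BY film_id) OVER () AS median_film_id
FROM film;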

Uniform distribution

Let’s look at the FILM_ID values in the Sakila database:

SELECT
  percentile_disc(0.0) WITHIN GROUP (ORDER BY film_id) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY film_id) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY film_id) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY film_id) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY film_id) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY film_id) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY film_id) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY film_id) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY film_id) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY film_id) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY film_id) AS "100%"
FROM film;

What are we calculating here? We’re trying to find 11 different values for which we can say that:

  • 0% of the film_ids are lower than the “0%” value
  • 10% of the film_ids are lower than the “10%” value

Or in other words:

  • 0% is the MIN(film_id) value
  • 50% is the MEDIAN(film_id) value
  • 100% is the MAX(film_id) value
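As a quick sanity check, we could cross-check those three special percentiles against the ordinary aggregate functions (a sketch in PostgreSQL syntax; there is no standard MEDIAN() aggregate, which is why the 50% percentile is used here):

SELECT
  min(film_id)                                         AS "0%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY film_id) AS "50%",
  max(film_id)                                         AS "100%"
FROM film;

Both queries agree on those three points; the full 11-percentile query above fills in the rest of the picture.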

The result shows an unsurprisingly uniform distribution:

0% |10% |20% |30% |40% |50% |60% |70% |80% |90% |100% |
---|----|----|----|----|----|----|----|----|----|-----|
1  |100 |200 |300 |400 |500 |600 |700 |800 |900 |1000 |

We can plot this in Microsoft Excel or some other tool to get this nice curve:

This is not surprising, as the IDs are just consecutive values, which is a desired property of surrogate keys.

“Skewed” distribution

It’s a different story when we look at the distribution of amounts in the payment table:

SELECT
  percentile_disc(0.0) WITHIN GROUP (ORDER BY amount) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY amount) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY amount) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY amount) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY amount) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY amount) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY amount) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY amount) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY amount) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY amount) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY amount) AS "100%"
FROM payment;

We’re now getting:

0%   |10%  |20%  |30%  |40%  |50%  |60%  |70%  |80%  |90%  |100% 
-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----
0.00 |0.99 |1.99 |2.99 |2.99 |3.99 |4.99 |4.99 |5.99 |6.99 |11.99

This looks … “skewed”, although clearly the bias is mainly caused by the fact that this data is generated. When we plot the above, we’re getting:

The slope is less steep at the beginning of this curve, which essentially means that more values exist at the lower end of the range than at the upper end. We can validate this with another query:

SELECT amount, count(*)
FROM (
  SELECT trunc(amount) AS amount
  FROM payment
) t 
GROUP BY amount
ORDER BY amount;

… which yields:

amount |count |
-------|------|
0      |3003  |
1      |641   |
2      |3542  |
3      |1117  |
4      |3789  |
5      |1306  |
6      |1119  |
7      |675   |
8      |486   |
9      |257   |
10     |104   |
11     |10    |

Plotted:

When plotting this, we can see that there are more amounts in the lower half of the range than in the upper half, which leads to percentiles growing slower.

Correlations

This technique can also be applied to detect correlations in data. We can, for instance, try to find the percentiles of the length of films, and group the data by rating. I’m using a GROUPING SETS shorthand here, the ROLLUP() function, to calculate the grand total as well. Just check out the query and its results, and you’ll see:

SELECT
  rating,
  count(*),
  percentile_disc(0.0) WITHIN GROUP (ORDER BY length) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY length) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY length) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY length) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY length) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY length) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY length) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY length) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY length) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY length) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY length) AS "100%"
FROM film
GROUP BY ROLLUP(rating);

This yields:

rating |count |0% |10% |20% |30% |40% |50% |60% |70% |80% |90% |100% |
-------|------|---|----|----|----|----|----|----|----|----|----|-----|
G      |178   |47 |57  |67  |80  |93  |107 |121 |138 |156 |176 |185  |
PG     |194   |46 |58  |72  |85  |99  |113 |122 |137 |151 |168 |185  |
PG-13  |223   |46 |61  |76  |92  |110 |125 |138 |150 |162 |176 |185  |
R      |195   |49 |68  |82  |90  |104 |115 |129 |145 |160 |173 |185  |
NC-17  |210   |46 |58  |74  |84  |97  |112 |125 |138 |153 |174 |184  |
       |1000  |46 |60  |74  |86  |102 |114 |128 |142 |156 |173 |185  |

So, the GROUP BY clause produced one row per rating, plus an additional grand total row at the bottom. For illustration purposes, I’ve added the COUNT(*) column to show how many films are in each group. The counts of the first 5 rows sum up to 1000, which is again the grand total at the bottom.
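ROLLUP() is just a shorthand. If you prefer spelling things out, the same grouping could be written with explicit GROUPING SETS (a sketch, reduced to the count column for brevity):

SELECT rating, count(*)
FROM film
GROUP BY GROUPING SETS (
  (rating), -- one row per rating
  ()        -- the grand total row
);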

Let’s plot the percentiles now as line and bar charts:

We can “see” that there is no strong correlation between rating and length. The lengths are close to uniformly distributed, quite independently of the rating, with the exception of PG-13, which is just slightly skewed towards longer film lengths.

Again, this isn’t terribly interesting as the data set was generated, probably using some randomness to avoid perfectly uniform distribution. In real world scenarios, the above data would have been more “skewed”.

How does this help with performance?

A balanced tree index is very useful when data is quite uniformly distributed, because in that case, it can help access data points or ranges of data in O(log(N)) time. This is quite a useful property for queries that look for film_id values, e.g.

SELECT *
FROM film
WHERE film_id = 1

When accessing “skewed” data, some values are more equal than others. This means that for example if we’re looking for amounts in the payment table, these two queries are not the same:

-- A lot of rows returned (3644)
SELECT * FROM payment WHERE amount BETWEEN 0 AND 2;

-- Few rows returned (361)
SELECT * FROM payment WHERE amount BETWEEN 9 AND 11;

An index on the amount column could have been useful for the second query, but maybe not for the first one.
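If we wanted to experiment with this, such an index might be created as follows (a sketch; the index name is just illustrative):

CREATE INDEX idx_payment_amount ON payment (amount);

-- Selective predicate: an index range scan is likely to pay off
SELECT * FROM payment WHERE amount BETWEEN 9 AND 11;

-- Unselective predicate: a full table scan may well win
SELECT * FROM payment WHERE amount BETWEEN 0 AND 2;

Whether the optimiser actually picks the index for the unselective range depends on what it knows about the data distribution, which brings us to the next point.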

There are several things we can do to make sure optimal index usage is being applied for all sorts of queries. In case of uniformly distributed data, we usually don’t have to do anything as SQL developers. In case of “skewed” data sets, it may be worth thinking about:

  • Using histogram statistics (sketched below)
  • Hinting the optimiser (in Oracle or SQL Server)
  • Avoiding bind variables (only in extreme cases)
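For the first option, in Oracle for instance, a histogram on the AMOUNT column might be gathered roughly like this (a sketch; the schema name is hypothetical, and SIZE 254 is just the classic maximum bucket count for frequency histograms):

BEGIN
  dbms_stats.gather_table_stats(
    ownname    => 'SAKILA',                     -- hypothetical schema
    tabname    => 'PAYMENT',
    method_opt => 'FOR COLUMNS amount SIZE 254' -- request a histogram on AMOUNT
  );
END;
/

With such a histogram in place, the optimiser can tell the two BETWEEN predicates from above apart and pick a different plan for each.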

Conclusion

Not all data sets are equal. They are often “skewed”. By “skewed”, in SQL, we don’t mean the statistical meaning of an asymmetric distribution. We mean that a distribution is not uniform, so in this sense, even a normal distribution is “skewed”. When that is the case, some values appear way more often than others. Some examples are:

Uniform distribution

  • Surrogate keys generated from sequences (consecutive)
  • Surrogate keys generated from UUIDs (random)
  • Foreign keys on one-to-one relationships

Slight “skew”

Possibly significant “skew”

This really depends on the actual data set, but do expect significant “skew” in these data types:

  • Foreign keys on to-many relationships (e.g. some customers have more assets than others)
  • Numeric values (e.g. amount)
  • Codes and other discrete values (e.g. film rating, payment settlement codes, etc.)

This article has shown how we can use simple SQL aggregate functions, including the percentiles, to calculate and visualise such “skew”.

How to Work Around ORA-38104: Columns referenced in the ON Clause cannot be updated

Standard SQL is a beautiful language. Vendor specific implementations, however, have their warts. In Oracle, for example, it’s not possible to update any columns in a MERGE statement that have been referenced by the ON clause. For example:

CREATE TABLE person (
  id NUMBER(18) NOT NULL PRIMARY KEY,
  user_name VARCHAR2(50) NOT NULL UNIQUE,
  score NUMBER(18)
);

Now, in MySQL, we can run a non-standard INSERT .. ON DUPLICATE KEY UPDATE statement like this:

INSERT INTO person (id, user_name, score)
VALUES (1, 'foo', 100)
ON DUPLICATE KEY UPDATE
  user_name = 'foo', score = 100

Behind the scenes, MySQL will check all unique constraints for duplicates and reject the insert, replacing it by the update statement instead. It’s debatable whether this is really useful (ideally, we want to check only a single unique constraint for duplicates), but that’s what MySQL offers.
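For comparison, PostgreSQL’s ON CONFLICT clause lets us name exactly which constraint (or column list) should be checked for duplicates. A sketch of what the above would look like there:

INSERT INTO person (id, user_name, score)
VALUES (1, 'foo', 100)
ON CONFLICT (id) -- check only the primary key for duplicates
DO UPDATE
  SET user_name = EXCLUDED.user_name, score = EXCLUDED.score;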

In case we want to get the same behaviour from Oracle, we could use the MERGE statement:

MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
ON (t.id = s.id OR t.user_name = s.user_name)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

That looks reasonable, but it doesn’t work. We’ll get:

SQL Error: ORA-38104: Columns referenced in the ON Clause cannot be updated: "T"."USER_NAME"

Obviously, this is some protection against the situation where such an update would suddenly move a row from the matched to the not matched group. In this particular example, it might not look like something that could cause problems, but if vendor specific extensions such as the WHERE or DELETE clauses were used, things might look different.

However, the parser is not very smart, in fact, it is almost not smart at all. While it detects extremely silly attempts at circumventing this limitation, such as this:

MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
-- Circumvention attempt here: NVL()
ON (t.id = s.id OR nvl(t.user_name, null) = s.user_name)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

It does not detect any of these attempts:

Using row value expressions

MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
ON (t.id = s.id OR 
-- Circumvention attempt here: row value expressions
  (t.user_name, 'dummy') = ((s.user_name, 'dummy')))
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

Seemingly without any penalty on the execution plan. Both indexes are being used:

---------------------------------------------------------------------------
| Id  | Operation                               | Name            | Rows  |
---------------------------------------------------------------------------
|   0 | MERGE STATEMENT                         |                 |     1 |
|   1 |  MERGE                                  | PERSON          |       |
|   2 |   VIEW                                  |                 |       |
|   3 |    NESTED LOOPS OUTER                   |                 |     1 |
|   4 |     FAST DUAL                           |                 |     1 |
|   5 |     VIEW                                | VW_LAT_8626BD41 |     1 |
|   6 |      TABLE ACCESS BY INDEX ROWID BATCHED| PERSON          |     1 |
|   7 |       BITMAP CONVERSION TO ROWIDS       |                 |       |
|   8 |        BITMAP OR                        |                 |       |
|   9 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 10 |          INDEX RANGE SCAN               | SYS_C00106110   |       |
|  11 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 12 |          INDEX RANGE SCAN               | SYS_C00106111   |       |
---------------------------------------------------------------------------

Correlated subquery

MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
ON (t.id = s.id OR 
-- Circumvention attempt here: correlated subquery
  (SELECT t.user_name FROM dual) = s.user_name)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

This seems to prevent any index usage, and should thus be avoided:

----------------------------------------------------------
| Id  | Operation              | Name            | Rows  |
----------------------------------------------------------
|   0 | MERGE STATEMENT        |                 |     1 |
|   1 |  MERGE                 | PERSON          |       |
|   2 |   VIEW                 |                 |       |
|   3 |    NESTED LOOPS OUTER  |                 |     1 |
|   4 |     FAST DUAL          |                 |     1 |
|   5 |     VIEW               | VW_LAT_1846A928 |     1 |
|*  6 |      FILTER            |                 |       |
|   7 |       TABLE ACCESS FULL| PERSON          |     1 |
|   8 |       FAST DUAL        |                 |     1 |
----------------------------------------------------------

Using NVL() and updating a view instead

Just plain simple usage of NVL() inside of the ON clause didn’t work before. The parser was smart enough to detect that. But it isn’t smart enough to detect NVL() inside of a view / derived table.

MERGE INTO (
  SELECT id, user_name, nvl(user_name, null) n, score
  FROM person
) t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
-- Circumvention attempt here: renamed column
ON (t.id = s.id OR t.n = s.user_name)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

Notice that both USER_NAME and N columns are the same thing, but the parser doesn’t notice this and thinks we’re fine.

The execution plan is still optimal, as Oracle seems to have a way to optimise NVL() expressions (but not coalesce and others!):

---------------------------------------------------------------------------
| Id  | Operation                               | Name            | Rows  |
---------------------------------------------------------------------------
|   0 | MERGE STATEMENT                         |                 |     1 |
|   1 |  MERGE                                  | PERSON          |       |
|   2 |   VIEW                                  |                 |       |
|   3 |    NESTED LOOPS OUTER                   |                 |     1 |
|   4 |     FAST DUAL                           |                 |     1 |
|   5 |     VIEW                                | VW_LAT_46651921 |     1 |
|   6 |      TABLE ACCESS BY INDEX ROWID BATCHED| PERSON          |     1 |
|   7 |       BITMAP CONVERSION TO ROWIDS       |                 |       |
|   8 |        BITMAP OR                        |                 |       |
|   9 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 10 |          INDEX RANGE SCAN               | SYS_C00106110   |       |
|  11 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 12 |          INDEX RANGE SCAN               | SYS_C00106111   |       |
---------------------------------------------------------------------------

Using the WHERE clause

If we hadn’t had an OR predicate in our ON clause, but an AND predicate, then we could have used the WHERE clause in Oracle. This works:

-- NOT the same query as the original one!
MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
ON (t.id = s.id)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
  WHERE t.user_name = s.user_name
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score);

This is not the same query as the original one. I just listed it here for completeness’ sake, and also to remind readers of the fact that this approach, too, doesn’t seem to use indexes optimally. Only the primary key index (from the ON clause) seems to be used. The unique key is not being used:

----------------------------------------------------------------
| Id  | Operation                      | Name          | Rows  |
----------------------------------------------------------------
|   0 | MERGE STATEMENT                |               |     1 |
|   1 |  MERGE                         | PERSON        |       |
|   2 |   VIEW                         |               |       |
|   3 |    NESTED LOOPS OUTER          |               |     1 |
|   4 |     VIEW                       |               |     1 |
|   5 |      FAST DUAL                 |               |     1 |
|   6 |     TABLE ACCESS BY INDEX ROWID| PERSON        |     1 |
|*  7 |      INDEX UNIQUE SCAN         | SYS_C00106110 |     1 |
----------------------------------------------------------------

Careful

Be careful when applying the above workarounds. Assuming that ORA-38104 is a good thing (i.e. that Oracle still thinks it should be enforced), then the above workarounds simply expose bugs in the parser, which should detect such cases. The above behaviour has been observed in Oracle 12c and 18c.

I personally believe that ORA-38104 should be abandoned entirely, and the root cause for this restriction should be removed. But it is certainly worth exploring alternative options rather than relying on the above workarounds in production code, apart from the occasional one-shot migration query, where such loopholes are always nice tools to exploit.
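One always-available alternative, for example, is to give up on MERGE for this particular case and split the upsert into two plain statements. A sketch, which needs to run in a single transaction and still has to handle concurrent inserts (e.g. by retrying on a unique constraint violation):

-- Update the row if it exists by either key
UPDATE person
SET user_name = 'foo', score = 100
WHERE id = 1 OR user_name = 'foo';

-- Insert it only if neither key matched anything
INSERT INTO person (id, user_name, score)
SELECT 1, 'foo', 100
FROM dual
WHERE NOT EXISTS (
  SELECT 1
  FROM person
  WHERE id = 1 OR user_name = 'foo'
);

Not as elegant as a single MERGE, but at least it doesn’t depend on parser loopholes.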

How to Unit Test Your Annotation Processor using jOOR

Annotation processors can be useful as a hacky workaround to get some language feature into the Java language. The best example is Lombok, which enhances the Java language with quite a few annotation-based features.

jOOQ also has an annotation processor that helps validate SQL syntax for:

  • Plain SQL usage (SQL injection risk)
  • SQL dialect support (prevent using an Oracle only feature on MySQL)

You can read about it in more detail here.

Unit testing annotation processors

Unit testing annotation processors is a bit trickier than using them. Your processor hooks into the Java compiler and manipulates the compiled AST (or does other things). If you want to test your own processor, you need the test to run a Java compiler, but that is difficult to do in a normal project setup, especially if the expected behaviour for a given test is a compilation error.

Let’s assume we have the following two annotations:

@interface A {}
@interface B {}

And now, we would like to establish a rule that @A must always be accompanied by @B. For example:

// This must not compile
@A
class Bad {}

// This is fine
@A @B
class Good {}

We’ll enforce that with an annotation processor:

class AProcessor implements Processor {
    boolean processed;

    @Override
    public Set<String> getSupportedOptions() {
        return Collections.emptySet();
    }

    @Override
    public Set<String> getSupportedAnnotationTypes() {
        return Collections.singleton("*");
    }

    @Override
    public SourceVersion getSupportedSourceVersion() {
        return SourceVersion.RELEASE_8;
    }

    @Override
    public void init(ProcessingEnvironment processingEnv) {
    }

    @Override
    public boolean process(Set<? extends TypeElement> annotations, RoundEnvironment roundEnv) {
        for (TypeElement e1 : annotations)
            if (e1.getQualifiedName().contentEquals(A.class.getName()))
                for (Element e2 : roundEnv.getElementsAnnotatedWith(e1))
                    if (e2.getAnnotation(B.class) == null)
                        throw new RuntimeException("Annotation A must be accompanied by annotation B");

        this.processed = true;
        return false;
    }

    @Override
    public Iterable<? extends Completion> getCompletions(Element element, AnnotationMirror annotation, ExecutableElement member, String userText) {
        return Collections.emptyList();
    }
}

Now, this works. We can easily verify that manually by adding the annotation processor to some Maven compiler configuration and by annotating a few classes with A and B. But then, someone changes the code and we don’t notice the regression. How can we unit test this, rather than doing things manually?

jOOR 0.9.10 support for annotation processors

jOOR is our little open source reflection library that we’re using internally in jOOQ.

jOOR has a convenient API to invoke the javax.tools.JavaCompiler API through Reflect.compile(). The most recent release 0.9.10 now takes an optional CompileOptions argument where annotation processors can be registered.

This means, we can now write a very simple unit test as follows (and if you’re using Java 12, you can profit from raw string literals! For a Java 11 compatible version without raw string literals, see our unit tests on github):

@Test
public void testCompileWithAnnotationProcessors() {
    AProcessor p = new AProcessor();

    try {
        Reflect.compile(
            "org.joor.test.FailAnnotationProcessing",
            ```
             package org.joor.test; 
             @A 
             public class FailAnnotationProcessing {
             }
            ```,
            new CompileOptions().processors(p)
        ).create().get();
        Assert.fail();
    }
    catch (ReflectException expected) {
        assertFalse(p.processed);
    }

    Reflect.compile(
        "org.joor.test.SucceedAnnotationProcessing",
        ```
         package org.joor.test; 
         @A @B 
         public class SucceedAnnotationProcessing {
         }
        ```,
        new CompileOptions().processors(p)
    ).create().get();
    assertTrue(p.processed);
}

So easy! Never have regressions in your annotation processors again!

How to Create a Good MCVE (Minimal Complete Verifiable Example)

Reporting a bug takes time, and trust me, every vendor appreciates your reporting of a bug! Your voice counts for many voices, because the other customers of a product who do not want to or cannot take the time to report the same bug are numerous.

So, first off, thanks for taking that time and reaching out to us vendors. We really appreciate your help!

Having said so, reporting a bug can be a tedious exercise, for both parties: the one reporting the bug and the one receiving it. There are extremely simple bugs, such as typos in documentation. They can be easily pointed out and just as easily fixed. There are much trickier bugs, such as concurrency issues in complicated project setups. They take time to reproduce. This is why an MCVE (Minimal Complete Verifiable Example) is so useful. The linked Stack Overflow page explains why it is so useful for answering questions. But the same arguments apply when reporting a bug.

And that’s where the tricky part starts. It isn’t easy to create an example that is:

  • Minimal: Your real world application code is huge. You cannot dump the entirety of it to the vendor for various reasons. And the vendor cannot look through it all to try to reproduce it. So, the problem has to be isolated into an example of minimal scope, with no unnecessary additional functionality. That’s hard too, because your project was set up months or years ago. You don’t want to spend too much time setting up a new project.
  • Complete: When reducing the problem to a minimal one, we’re tempted to just describe it in prose. But that can be difficult as well, because prose is hardly complete. It’s difficult to describe a problem when it would be quite easy to show the code. But that brings us back to the minimal part. We want to show only the relevant code, not all of it.
  • Verifiable: Ultimately, the ideal example can be used by the vendor to reproduce the problem, because once that’s possible, the vendor can start debugging it and finding the right spots to fix quite easily. Otherwise, it’s just guessing and going back and forth with the reporter, just to write more prose. That’s tiring on both sides.

This is why we now have an example project on GitHub to help you create that MCVE:

https://github.com/jOOQ/jOOQ-mcve

It is a minimal example that uses:

This example can be forked on GitHub and modified by you directly, in order to show how to reproduce your issue. In the future, we’ll add more example setups that may be helpful to reproduce your specific issue.

Thanks again for taking the time to report issues. We vendors really appreciate your work!

How to Aggregate an Archive Log’s Deltas into a Snapshot with SQL

A customer of my popular SQL training (which you should book!) has recently challenged me to optimise a hierarchical query that merges an archive log’s deltas in order to obtain a snapshot of some record at a given point in time. In this article, I will reproduce their problem statement in a simplified version and show how this can be done with SQL Server, using a few cool SQL features:

All of these are topics covered in the training, which were immediately applicable to this problem statement.

The problem statement

This was their archive design. They designed for uncertainty, meaning that for some entities in their system, they did not know what kinds of attributes would be part of the entity in the future. Given their application design, users could even add their own custom attributes to an entity.

This kind of thing is typically solved with the EAV (Entity Attribute Value) model, a “workaround” to denormalise data sets in SQL databases in the event of such schema uncertainty.

EAV can be implemented in several ways:

Through classic SQL tables only

An example implementation is this:

CREATE TABLE eav_classic (
  entity_type     VARCHAR (100) NOT NULL,
  entity_id       BIGINT        NOT NULL,
  attribute_name  VARCHAR (100) NOT NULL,
  attribute_type  VARCHAR (100) NOT NULL,
  attribute_value VARCHAR (100)     NULL,

  CONSTRAINT eav_classic_pk 
    PRIMARY KEY (entity_type, entity_id, attribute_name)
);

The drawbacks of this non-normalised design are immediately obvious. Most notably, there is no simple way to establish referential integrity. But this may be totally OK, especially for archive logs, and for smaller databases (Datomic does something similar).
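For illustration, the Person entity used later in this article might be stored like this in the classic model (a sketch; the attribute type encoding is made up):

INSERT INTO eav_classic 
  (entity_type, entity_id, attribute_name, attribute_type, attribute_value)
VALUES 
  ('Person', 1, 'first_name', 'string', 'John'),
  ('Person', 1, 'last_name' , 'string', 'Doe'),
  ('Person', 1, 'age'       , 'number', '37');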

Through tables containing JSON or XML data

Whenever you have schema-on-read data, JSON or XML data types may be appropriate, so this is a perfectly valid alternative:

CREATE TABLE eav_json (
  entity_type     VARCHAR (100)   NOT NULL,
  entity_id       BIGINT          NOT NULL,
  attributes      VARCHAR (10000) NOT NULL 
    CHECK (ISJSON(attributes) = 1),

  CONSTRAINT eav_json_pk 
    PRIMARY KEY (entity_type, entity_id)
);

If your database supports a JSON data type, obviously, you will prefer that over the above emulation.
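In PostgreSQL, for example, such a table might use the native JSONB type directly, instead of emulating it with a CHECK constraint (a sketch):

CREATE TABLE eav_json (
  entity_type TEXT   NOT NULL,
  entity_id   BIGINT NOT NULL,
  attributes  JSONB  NOT NULL,

  CONSTRAINT eav_json_pk 
    PRIMARY KEY (entity_type, entity_id)
);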

For the rest of this article, I will use the JSON approach.

Versioning the EAV table

Versioning data in an EAV model is quite a bit easier than in a normalised schema. We can just add a version number and/or timestamp to the record. In their case, something like this may make sense:

CREATE TABLE history (
  id          BIGINT IDENTITY (1, 1) NOT NULL PRIMARY KEY,
  ts          DATETIME               NOT NULL,
  entity_type VARCHAR(100)           NOT NULL,
  entity_id   BIGINT                 NOT NULL,
  delta       VARCHAR(8000)          NOT NULL 
    CHECK (ISJSON(delta) = 1)
);

INSERT INTO history (entity_type, entity_id, ts, delta)
VALUES ('Person', 1, '2000-01-01 00:00:00', '{"first_name": "John", "last_name": "Doe"}'),
       ('Person', 1, '2000-01-01 01:00:00', '{"age": 37}'),
       ('Person', 1, '2000-01-01 02:00:00', '{"age": 38}'),
       ('Person', 1, '2000-01-01 03:00:00', '{"city": "New York"}'),
       ('Person', 1, '2000-01-01 04:00:00', '{"city": "Zurich", "age": null}')
;

This table now contains a set of deltas applied to the Person entity with ID = 1. It corresponds to the following sequence of SQL statements on an ordinary entity:

INSERT INTO person (id, first_name, last_name) 
  VALUES (1, 'John', 'Doe');
UPDATE person SET age = 37 WHERE id = 1;
UPDATE person SET age = 38 WHERE id = 1;
UPDATE person SET city = 'New York' WHERE id = 1;
UPDATE person SET city = 'Zurich', age = null WHERE id = 1;

You could even see their hand-written log as a kind of transaction log of the database system, kinda like what you can extract using products like Golden Gate or Debezium. If you think of the transaction log as an event stream, the RDBMS’s current data representation is like a snapshot that you can get when applying any number of deltas to your tables.

Sometimes, you don’t want to completely change your architecture and go full “event sourcing”, but just need this kind of log for a specific set of auditable entities. And, e.g. for reasons like still supporting very old SQL Server versions, or supporting other databases as well, you may also choose not to use the SQL:2011 temporal table feature, which has been implemented in SQL Server 2016 and more recent versions.

With that out of our way…

How to access any arbitrary snapshot version?

When we visually process our HISTORY table, we can see that Person ID = 1 had the following values at any given time:

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York
04:00:00    John          Doe                 Zurich

Remember, this is always the same record of Person ID = 1, its snapshots represented at different times in the time axis. The goal here is to be able to find the record of John Doe at any given time.

Again, if we had been using the SQL:2011 temporal table feature, we could write

-- SQL Server
SELECT * 
FROM Person
FOR SYSTEM_TIME AS OF '2000-01-01 02:00:00.0000000'; 

-- Oracle (flashback query)
SELECT *
FROM Person
AS OF TIMESTAMP TIMESTAMP '2000-01-01 02:00:00'

Side note: Do note that Oracle’s flashback query needs to be properly configured:

  • Not all data is “flashbackable”
  • DDL tends to destroy the archive
  • Proper grants are needed to access the flashback archive

Similar limitations may apply in SQL Server.
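For completeness, this is roughly what declaring such a system-versioned table looks like in SQL Server 2016 and later (a sketch; table, column, and history table names are illustrative):

CREATE TABLE Person (
  id         BIGINT NOT NULL PRIMARY KEY,
  first_name VARCHAR(100),
  last_name  VARCHAR(100),
  age        INT,
  city       VARCHAR(100),

  -- System-maintained validity period
  valid_from DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
  valid_to   DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
  PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.PersonHistory));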

What if the RDBMS can’t help us?

If again for some reason, we cannot use the RDBMS’s temporal table features, we’ll roll our own as we’ve seen. So, our query in SQL Server to access the snapshot at any given time may be this:

SELECT 
  '{' 
+ string_agg(
    CASE type WHEN 0 THEN NULL ELSE 
      '"' + [key] + '": ' + 
      CASE type WHEN 1 THEN '"' + value + '"' ELSE value END
    END, ', ') 
+ '}'
FROM (
  SELECT *, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history
  OUTER APPLY openjson(delta)
  
  -- Apply all deltas prior to any given snapshot
  WHERE ts <= '2000-01-01 02:00:00'
) t
WHERE rn = 1;

What does this query do? Consider again our deltas at 04:00:00:

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York
04:00:00    John          Doe          -      Zurich

Observe how each value has some color encoding:

  • Strong, red: The current snapshot’s attribute value, when the last delta was applied to any given attribute
  • Strong, black: A previous snapshot’s attribute value, when a previous, superseded delta was applied to any given attribute
  • Light grey: A previous snapshot’s attribute value that was inherited from another previous delta

For any given snapshot, we want to find the Strong, red values. E.g. at a previous snapshot time, the color encoding would have been:

At 03:00:00

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York

04:00:00    John          Doe          -      Zurich

At 02:00:00

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38

03:00:00    John          Doe          38     New York
04:00:00    John          Doe          -      Zurich

So, our query needs to find the delta that was applied last for any given attribute.

With SQL, we can find that easily. We can assign a row number to each delta per attribute in reverse order, something like this:

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John (1)      Doe (1)
01:00:00    John          Doe          37 (3)
02:00:00    John          Doe          38 (2)
03:00:00    John          Doe          38     New York (2)
04:00:00    John          Doe          - (1)  Zurich (1)

Once we have that row number, we just filter out only those deltas whose row number is 1. Something like:

SELECT [key], value, row_number() OVER (
  PARTITION BY [key] ORDER BY ts DESC) rn
FROM history OUTER APPLY openjson(delta)
ORDER BY [key], ts;

Notice the OUTER APPLY openjson(delta) syntax. This just expands the JSON structure into key/value/type columns, which we can use more easily in a SQL query. Other database systems may have similar syntax for similar purposes. The result of the above query is:

key        |value    |rn 
-----------|---------|---
age        |37       |3  
age        |38       |2  
age        |         |1  
city       |New York |2  
city       |Zurich   |1  
first_name |John     |1  
last_name  |Doe      |1  

Filtering the ones whose row number is 1:

SELECT [key], value
FROM (
  SELECT ts, [key], value, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history OUTER APPLY openjson(delta)
) t
WHERE rn = 1
ORDER BY ts, [key]

This yields:

key        |value  
-----------|-------
first_name |John   
last_name  |Doe    
age        |       
city       |Zurich 

Exactly the data we wanted, in key/value form. Notice that this filtering step could have been done with DISTINCT ON in PostgreSQL, or with KEEP (DENSE_RANK FIRST ORDER BY ..) in Oracle – an exercise which I shall leave to the reader (feel free to leave the solution in the comments!)

And now, finally, just re-assemble the JSON using SQL Server 2017’s STRING_AGG. PostgreSQL would offer us JSON_OBJECT_AGG here, Oracle has JSON_OBJECTAGG. With STRING_AGG, you have to take care of manually escaping all values according to JSON syntax rules, which is bad. In my example, I just replaced " by \". Other characters need escaping too, so if there is a built-in feature, use that instead of string processing.

The STRING_AGG function aggregates a CASE expression which translates different JSON data types into different formats, where:

  • 0 is NULL (and nulls are not aggregated)
  • 1 is string
  • everything else can be taken at its value for simplicity, e.g. numbers or booleans

Every value (except nulls) is prefixed by the JSON object’s attribute name (“key”).

SELECT 
  '{' 
+ string_agg(
    CASE type WHEN 0 THEN NULL ELSE 
      '"' + replace([key], '"', '\"') + '": ' + 
      CASE type WHEN 1 THEN '"' + replace(value, '"', '\"') + '"' ELSE value END
    END, ', ') 
+ '}'
FROM (
  SELECT *, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history
  OUTER APPLY openjson(delta)
  
  -- Apply all deltas prior to any given snapshot
  WHERE ts <= '2000-01-01 04:00:00'
) t
WHERE rn = 1;

This produces

{"city": "Zurich", "first_name": "John", "last_name": "Doe"}

A final query that gets us the entire history of snapshots (watch the performance on this one; it could definitely be optimised):

SELECT ts, (
  SELECT 
    '{' 
  + string_agg(
      CASE type WHEN 0 THEN NULL ELSE 
        '"' + replace([key], '"', '\"') + '": ' + 
        CASE type WHEN 1 THEN '"' + replace(value, '"', '\"') + '"' ELSE value END
      END, ', ') 
  + '}'
  FROM (
    SELECT *, row_number() OVER (
      PARTITION BY [key] ORDER BY ts DESC) rn
    FROM history
    OUTER APPLY openjson(delta)
    
    -- Apply all deltas prior to any given snapshot
    WHERE ts <= x.ts
  ) t
  WHERE rn = 1
)
FROM history x
GROUP BY ts;

It yields:

ts       |                                                                          
---------|--------------------------------------------------------------------------
00:00:00 |{"first_name": "John", "last_name": "Doe"}                                
01:00:00 |{"age": 37, "first_name": "John", "last_name": "Doe"}                     
02:00:00 |{"age": 38, "first_name": "John", "last_name": "Doe"}                     
03:00:00 |{"age": 38, "city": "New York", "first_name": "John", "last_name": "Doe"} 
04:00:00 |{"city": "Zurich", "first_name": "John", "last_name": "Doe"}              

So, the complete history of all the snapshot versions of the Person with ID = 1.

Very cool, and definitely good enough for their archive / audit query requirements.

How to Use jOOQ’s UpdatableRecord for CRUD to Apply a Delta

While jOOQ is not a full-fledged ORM (as in an object graph persistence framework), there is still some convenience available to avoid hand-writing boring SQL for everyday CRUD.

That’s the UpdatableRecord API. It has a few very useful features, including:

A 1:1 mapping to the underlying table

Every UpdatableRecord is mapped on a 1:1 basis to the underlying table (or view) of your database. For example, if you create a table like this (using PostgreSQL syntax):

CREATE TABLE customer (
  id BIGSERIAL NOT NULL PRIMARY KEY,
  first_name TEXT NOT NULL,
  last_name TEXT NOT NULL,
  vip BOOLEAN DEFAULT FALSE
);

… and if you’re using the code generator (which you should), then apart from the generated table that is used in SQL queries, you also have an UpdatableRecord that allows you to do this:

CustomerRecord customer = ctx.newRecord(CUSTOMER);
customer.setFirstName("John");
customer.setLastName("Doe");
customer.store();

The above statement will generate a JDBC statement like this:

INSERT INTO customer (first_name, last_name)
VALUES (?, ?)

Notice…

SQL DEFAULT expressions

… we only set the FIRST_NAME and LAST_NAME values on the record, and only those values have been included in the insert statement. The other columns are not included, which allows for SQL DEFAULT expressions to be applied. This is true both for the ID serial column and for the VIP defaulted column.

The actual database record will look something like this:

id     first_name   last_name   vip
-------------------------------------
1337   John         Doe         false

Notice how the VIP column is nullable, which means that an unknown VIP status is something our database design explicitly allows for. Of course, we could easily set that value to NULL as well, preventing the DEFAULT from applying. For example:

// Assuming you actually *need* the client round trip...
CustomerRecord customer = ctx.fetchOne(
  CUSTOMER, CUSTOMER.ID.eq(1337));
customer.setVip(null);
customer.store();

The above query would produce the following update statement:

UPDATE customer SET vip = ? WHERE id = ?

Side note: please do not use this feature if you don’t actually need the data in the client. In that case, just write the UPDATE statement directly and save a client-server round trip!

Notice again that only the column that we’ve explicitly changed will be included in the update statement, not all the others. This has advantages and disadvantages:

Advantages

  • Even if the CustomerRecord was fetched incompletely (e.g. through a query), we don’t need all the other columns to modify this one.
  • If a trigger listens on columns that have been included in the actual query, we can distinguish between a value being set to NULL explicitly, and a value not being available from the DML statement. Think of JavaScript’s difference between null and undefined.

Disadvantages

  • If we allow users to change arbitrary columns on such a record, there is a very high number of possible resulting queries, which can hurt execution plan cache performance in the database.

This is the tradeoff. jOOQ opted for full support of SQL DEFAULT values, whose application is activated when a Record.changed() flag is not set. As soon as the flag is set, jOOQ assumes that the explicit setting of the value is intentional, and it will thus be propagated to the generated SQL query.

SQL DEFAULT expressions and POJOs

jOOQ Records can encode that difference. Plain Old Java Objects (POJOs) cannot. There is only null, if you choose a reference type for your data. What does null mean in that context? It’s not possible to know. A lot of jOOQ users will want that null value to encode the SQL DEFAULT value, not the SQL NULL value. But when you’re loading your POJO content into a jOOQ record, you’re effectively setting all the column values to NULL explicitly, because all the Record.changed() flags will be set:

// Load the customer record from a POJO
CustomerRecord customer = ctx.newRecord(CUSTOMER, customerPojo);
customer.update();

Now, quite likely, the resulting query will be:

UPDATE customer
SET
  first_name = ?,
  last_name = ?,
  vip = ?
WHERE id = ?

But what if your CustomerPojo was retrieved from a JSON document, e.g. like this one?

{
  "id": 1337,
  "lastName": "Smith"
}

It looks like the intent of this document is for the record “John Doe” to be renamed to “John Smith”, so this update statement should have been generated instead:

UPDATE customer
SET last_name = ?
WHERE id = ?

Clearly, we shouldn’t set the FIRST_NAME or VIP values to NULL.

Both the JSON document structure and the jOOQ UpdatableRecord can encode the difference between undefined / DEFAULT and null / NULL. But the POJO cannot. So, the only reasonable thing for jOOQ to do is to set all Record.changed() flags to true.

Of course, you can manually set them to false again to work around this limitation, or perhaps don’t use the intermediary POJO data structure, but load the JSON data directly into the jOOQ Record.

Imperative Loop or Functional Stream Pipeline? Beware of the Performance Impact!

I like weird, yet concise language constructs and API usages

Yes. I am guilty. Evil? Don’t know. But guilty. I heavily use and abuse the java.lang.Boolean type to implement three-valued logic in Java:

  • Boolean.TRUE means true (duh)
  • Boolean.FALSE means false
  • null can mean anything like “unknown” or “uninitialised”, etc.

I know – a lot of enterprise developers will bikeshed and cargo cult the old saying:

Code is read more often than it is written

But as with everything, there is a tradeoff. For instance, in algorithm-heavy, micro optimised library code, it is usually more important to have code that really performs well, rather than code that apparently doesn’t need comments because the author has written it in such a clear and beautiful way.

I don’t think it matters much in the case of the boolean type (where I’m just too lazy to encode every three valued situation in an enum). But here’s a more interesting example from that same twitter thread. The code is simple:

woot:
if (something) {
  for (Object o : list) 
    if (something(o))
      break woot;

  throw new E();
}

Yes. You can break out of “labeled ifs”. Because in Java, any statement can be labeled, and if the statement is a compound statement (observe the curly braces following the if), then it may make sense to break out of it. Even if you’ve never seen that idiom, I think it’s quite immediately clear what it does.

Gasp!

If Java were a bit more classic, it might have supported this syntax:

if (something) {
  for (Object o : list) 
    if (something(o))
      goto woot;

  throw new E();
}
woot:

Nicolai suggested that the main reason I hadn’t written the following, equivalent, and arguably more elegant logic, is because jOOQ still supports Java 6:

if (something && list.stream().noneMatch(this::something))
  throw new E();

It’s more concise! So, it’s better, right? Everything new is always better.

A third option would have been the less concise solution that essentially just replaces break by return:

if (something && noneMatchSomething(list))
  throw new E();

// And then:
private boolean noneMatchSomething(List<?> list) {
  for (Object o : list)
    if (something(o))
      return false;
  return true;
}

There’s an otherwise useless method that has been extracted. The main benefit is that people are not used to breaking out of labeled statements (other than loops, and even then it’s rare), so this is again about some subjective “readability”. I personally find this particular example less readable, because the extracted method is no longer local. I have to jump around in the class and interrupt my train of thoughts. But of course, YMMV with respect to the two imperative alternatives.

Back to objectivity: Performance

When I tweet about Java these days, I’m mostly tweeting about my experience writing jOOQ. A library. A library that has been tuned so much over the past years, that the big client side bottleneck (apart from the obvious database call) is the internal StringBuilder that is used to generate dynamic SQL. And compared to most database queries, you will not even notice that.

But sometimes you do. E.g. if you’re using an in-memory H2 database and run some rather trivial queries, then jOOQ’s overhead can become measurable again. Yes. There are some use-cases, which I do want to take seriously as well, where the difference between an imperative loop and a stream pipeline is measurable.

In the above examples, let’s remove the throw statement and replace it by something simpler (because exceptions have their own significant overhead).

I’ve created this JMH benchmark, which compares the 3 approaches:

  • Imperative with break
  • Imperative with return
  • Stream

Here’s the benchmark

package org.jooq.test.benchmark;

import java.util.ArrayList;
import java.util.List;

import org.openjdk.jmh.annotations.*;

@Fork(value = 3, jvmArgsAppend = "-Djmh.stack.lines=3")
@Warmup(iterations = 5, time = 3)
@Measurement(iterations = 7, time = 3)
public class ImperativeVsStream {

    @State(Scope.Benchmark)
    public static class BenchmarkState {

        boolean something = true;

        @Param({ "2", "8" })
        int listSize;

        List<Integer> list = new ArrayList<>();

        boolean something() {
            return something;
        }

        boolean something(Integer o) {
            return o > 2;
        }

        @Setup(Level.Trial)
        public void setup() throws Exception {
            for (int i = 0; i < listSize; i++)
                list.add(i);
        }

        @TearDown(Level.Trial)
        public void teardown() throws Exception {
            list = null;
        }
    }

    @Benchmark
    public Object testImperativeWithBreak(BenchmarkState state) {
        woot:
        if (state.something()) {
            for (Integer o : state.list)
                if (state.something(o))
                    break woot;

            return 1;
        }

        return 0;
    }

    @Benchmark
    public Object testImperativeWithReturn(BenchmarkState state) {
        if (state.something() && woot(state))
            return 1;

        return 0;
    }

    private boolean woot(BenchmarkState state) {
        for (Integer o : state.list)
            if (state.something(o))
                return false;

        return true;
    }

    @Benchmark
    public Object testStreamNoneMatch(BenchmarkState state) {
        if (state.something() && state.list.stream().noneMatch(state::something))
            return 1;

        return 0;
    }

    @Benchmark
    public Object testStreamAnyMatch(BenchmarkState state) {
        if (state.something() && !state.list.stream().anyMatch(state::something))
            return 1;

        return 0;
    }

    @Benchmark
    public Object testStreamAllMatch(BenchmarkState state) {
        if (state.something() && state.list.stream().allMatch(s -> !state.something(s)))
            return 1;

        return 0;
    }
}

The results are pretty clear:

Benchmark                                    (listSize)   Mode  Cnt         Score          Error  Units
ImperativeVsStream.testImperativeWithBreak            2  thrpt   14  86513288.062 ± 11950020.875  ops/s
ImperativeVsStream.testImperativeWithBreak            8  thrpt   14  74147172.906 ± 10089521.354  ops/s
ImperativeVsStream.testImperativeWithReturn           2  thrpt   14  97740974.281 ± 14593214.683  ops/s
ImperativeVsStream.testImperativeWithReturn           8  thrpt   14  81457864.875 ±  7376337.062  ops/s
ImperativeVsStream.testStreamAllMatch                 2  thrpt   14  14924513.929 ±  5446744.593  ops/s
ImperativeVsStream.testStreamAllMatch                 8  thrpt   14  12325486.891 ±  1365682.871  ops/s
ImperativeVsStream.testStreamAnyMatch                 2  thrpt   14  15729363.399 ±  2295020.470  ops/s
ImperativeVsStream.testStreamAnyMatch                 8  thrpt   14  13696297.091 ±   829121.255  ops/s
ImperativeVsStream.testStreamNoneMatch                2  thrpt   14  18991796.562 ±   147748.129  ops/s
ImperativeVsStream.testStreamNoneMatch                8  thrpt   14  15131005.381 ±   389830.419  ops/s

With this simple example, break or return don’t matter. At some point, adding additional methods might start getting in the way of inlining (because of stacks getting too deep), but not creating additional methods might be getting in the way of inlining as well (because of method bodies getting too large). I don’t want to bet on either approach here at this level, nor is jOOQ tuned that much. Like most similar libraries, the traversal of the jOOQ expression tree generates stacks that are too deep to completely inline anyway.

But the very obvious loser here is the Stream approach, which is roughly 6.5x slower in this benchmark than the imperative approaches. This isn’t surprising. The stream pipeline has to be set up every single time to represent something as trivial as the above imperative loop. I’ve already blogged about this in the past, where I compared replacing simple for loops by Stream.forEach().

Meh, does it matter?

In your business logic? Probably not. Your business logic is I/O bound, mostly because of the database. Wasting a few CPU cycles on a client side loop is not the main issue. Even if it is, the waste probably happens because your loop shouldn’t even be at the client side in the first place, but moved into the database as well. I’m currently touring conferences with a talk about that topic.

In your infrastructure logic? Maybe! If you’re writing a library, or if you’re using a library like jOOQ, then yes. Chances are that a lot of your logic is CPU bound. You should occasionally profile your application and spot such bottlenecks, both in your code and in third party libraries. E.g. in most of jOOQ’s internals, using a stream pipeline might be a very bad choice, because ultimately, jOOQ is something that might be invoked from within your loops, thus adding significant overhead to your application, if your queries are not heavy (e.g. again when run against an H2 in-memory database).

So, given that you’re clearly “micro-losing” on the performance side by using the Stream API, you may need to evaluate the readability tradeoff more carefully. When business logic is complex, readability is very important compared to micro optimisations. With infrastructure logic, readability is much less important, in my opinion. And I’m not alone in that opinion.

Note: there’s that other cargo cult of premature optimisation going around. Yes, you shouldn’t worry about these details too early in your application implementation. But you should still know when to worry about them, and be aware of the tradeoffs.

And while you’re still debating what name to give to that extracted method, I’ve written 5 new labeled if statements! ;-)