How to Emulate PERCENTILE_DISC in MySQL and Other RDBMS

In my previous article, I showed what the very useful percentile functions (also known as inverse distribution functions) can be used for.

Unfortunately, these functions are not ubiquitously available in SQL dialects. As of jOOQ 3.11, they are known to work in these dialects:

Dialect          As aggregate function  As window function
-----------------------------------------------------------
MariaDB 10.3.3   No                     Yes
Oracle 18c       Yes                    Yes
PostgreSQL 11    Yes                    No
SQL Server 2017  No                     Yes
Teradata 16      Yes                    No

Oracle has the most sophisticated implementation, which supports both the ordered set aggregate function, and the window function version:

  • Aggregate function: PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY x)
  • Window function: PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY x) OVER (PARTITION BY y)

Workarounds if the feature is unavailable

Luckily, as soon as an RDBMS supports window functions, we can easily emulate PERCENTILE_DISC using PERCENT_RANK and FIRST_VALUE as follows. We’re using the Sakila database in this example.

Emulating window functions

Let’s emulate these first, as they require a bit less SQL transformation. This query works out of the box in Oracle:

SELECT DISTINCT
  rating,
  percentile_disc(0.5) 
    WITHIN GROUP (ORDER BY length) 
    OVER() x1,
  percentile_disc(0.5) 
    WITHIN GROUP (ORDER BY length) 
    OVER (PARTITION BY rating) x2
FROM film
ORDER BY rating;

Yielding

RATING  X1      X2
-------------------
G       114     107
NC-17   114     112
PG      114     113
PG-13   114     125
R       114     115

What we can read from this is that the median length of all films is 114 minutes, and the median lengths of films per rating range from 107 minutes to 125 minutes. I’ve used DISTINCT because we don’t care about visualising these values on a per-row basis in this case. This also works in SQL Server.

Now, let’s assume we’re using PostgreSQL, which doesn’t support inverse distribution window functions, or MySQL, which doesn’t support inverse distribution functions at all, but both support PERCENT_RANK and FIRST_VALUE. Here’s the complete query:

SELECT DISTINCT
  rating,
  first_value(length) OVER (
    ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1,
  first_value(length) OVER (
    PARTITION BY rating 
    ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2
FROM (
  SELECT
    rating,
    length,
    percent_rank() OVER (ORDER BY length) p1,
    percent_rank() OVER (PARTITION BY rating ORDER BY length) p2
  FROM film
) t
ORDER BY rating;

So, we’re doing this in two steps (visual example further down):

  1. PERCENT_RANK: In a derived table, we’re calculating the PERCENT_RANK value, which attributes a rank between 0 and 1 to each row, ordered by length. This makes sense: when looking for the median value, we’re really looking for the value whose PERCENT_RANK is 0.5 or less; when looking for the 90th percentile, we’re looking for the value whose PERCENT_RANK is 0.9 or less.
  2. FIRST_VALUE: Once we’ve found the PERCENT_RANK, we’re not quite done yet. We need to find the last row whose PERCENT_RANK is less than or equal to the percentile we’re interested in. I could have used LAST_VALUE, but then I would have needed to resort to the quite verbose RANGE clause of window functions. Instead, when ordering the rows by PERCENT_RANK (p1 or p2), I translated all ranks higher than the percentile I’m looking for into NULL using a CASE expression, and then I made sure, using NULLS LAST, that the percentile I’m looking for becomes the first row in the FIRST_VALUE function’s window specification. Easy!

To visualise this, let’s run these queries, which also project the p1 and p2 values respectively:

SELECT
  length,
  CASE WHEN p1 <= 0.5 THEN p1 END::numeric(3,2) p1,
  first_value(length) OVER (
    ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1
FROM (
  SELECT
    length,
    percent_rank() OVER (ORDER BY length) p1
  FROM film
) t
ORDER BY length;

The result is

length |p1   |x1  |
-------|-----|----|
46     |0.00 |114 |
46     |0.00 |114 |
46     |0.00 |114 |
46     |0.00 |114 |
46     |0.00 |114 |
47     |0.01 |114 |
...
113    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 | <-- Last row whose PERCENT_RANK is <= 0.5
115    |     |114 |
115    |     |114 |
115    |     |114 |
115    |     |114 |
115    |     |114 |
115    |     |114 |
...
185    |     |114 |
185    |     |114 |
185    |     |114 |

So the FIRST_VALUE function just searches for that first row (descendingly, i.e. bottom up) whose p1 value is non-null.

The same for p2:

SELECT 
  length,
  rating,
  CASE WHEN p2 <= 0.5 THEN p2 END::numeric(3,2) p2,
  first_value(length) OVER (
    PARTITION BY rating 
    ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2
FROM (
  SELECT
    rating,
    length,
    percent_rank() OVER (PARTITION BY rating ORDER BY length) p2
  FROM film
) t
ORDER BY rating, length;

Yielding:

length |rating |p2   |x2  |
-------|-------|-----|----|
47     |G      |0.00 |107 |
47     |G      |0.00 |107 |
48     |G      |0.01 |107 |
48     |G      |0.01 |107 |
...
105    |G      |0.47 |107 |
106    |G      |0.49 |107 |
107    |G      |0.49 |107 |
107    |G      |0.49 |107 | <-- Last row in G partition whose
108    |G      |     |107 |     PERCENT_RANK is <= 0.5
108    |G      |     |107 |
109    |G      |     |107 |
...
185    |G      |     |107 |
185    |G      |     |107 |
46     |PG     |0.00 |113 |
47     |PG     |0.01 |113 |
47     |PG     |0.01 |113 |
...
111    |PG     |0.49 |113 |
113    |PG     |0.49 |113 |
113    |PG     |0.49 |113 | <-- Last row in PG partition whose
114    |PG     |     |113 |     PERCENT_RANK is <= 0.5
114    |PG     |     |113 |
...

Perfect! Notice that if your RDBMS doesn’t support the NULLS LAST clause in the ORDER BY clause (e.g. MySQL), you can either hope that it sorts NULLs last by default in this descending ordering (MySQL does), or you can emulate it as such (a complete MySQL version of the partitioned query follows right after this snippet):

-- This
ORDER BY x NULLS LAST

-- Is the same as this
ORDER BY
  CASE WHEN x IS NULL THEN 1 ELSE 0 END,
  x
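Putting the two ideas together, here’s a minimal sketch of the partitioned median (the x2 column) for MySQL 8.0, which supports window functions but not the NULLS LAST syntax:

SELECT DISTINCT
  rating,
  first_value(length) OVER (
    PARTITION BY rating 
    ORDER BY 
      CASE WHEN p2 <= 0.5 THEN 0 ELSE 1 END,
      CASE WHEN p2 <= 0.5 THEN p2 END DESC) x2
FROM (
  SELECT
    rating,
    length,
    percent_rank() OVER (PARTITION BY rating ORDER BY length) p2
  FROM film
) t
ORDER BY rating;

The first sort key pushes the “NULL” ranks (those above 0.5) to the end, emulating DESC NULLS LAST.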

Emulating aggregate functions

If you’re using SQL Server and want aggregate function behaviour, I recommend using the window function version instead and emulating aggregation using DISTINCT, as shown above. It will probably be easier than the emulation below. Do check for performance, though!
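For example, on SQL Server, a minimal sketch of that DISTINCT-based approach might look like this:

SELECT DISTINCT
  rating,
  percentile_disc(0.5) 
    WITHIN GROUP (ORDER BY length) 
    OVER (PARTITION BY rating) x2
FROM film
ORDER BY rating;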

When you’re using e.g. MySQL, which doesn’t have inverse distribution function support at all, then this chapter is for you.

Here’s how to use the aggregate function version in Oracle:

-- Without GROUP BY
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY length) x1
FROM film;

-- With GROUP BY
SELECT
  rating,
  percentile_disc(0.5) WITHIN GROUP (ORDER BY length) x2
FROM film
GROUP BY rating
ORDER BY rating;

Trivial! The result is the same as before:

X1
---
114


RATING  X2
-----------
G       107
NC-17   112
PG      113
PG-13   125
R       115

Now, let’s emulate these on e.g. MySQL, using window functions.

-- Without GROUP BY
SELECT
  MAX(x1) x1
FROM (
  SELECT first_value(length) OVER (
    ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1
  FROM (
    SELECT
      length,
      percent_rank() OVER (ORDER BY length) p1
    FROM film
  ) t
) t;

It’s exactly the same technique as before, except we now have to turn the window function behaviour (don’t group, preserve rows, repeat aggregation value on each row) back into aggregate function behaviour (group, collapse rows) by using an aggregate function, such as MAX(). This is the same as what I did before with DISTINCT, for illustration purposes.

-- With GROUP BY
SELECT
  rating,
  MAX(x2) x2
FROM (
  SELECT
    rating,
    first_value(length) OVER (
      PARTITION BY rating 
      ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2
  FROM (
    SELECT
      rating,
      length,
      percent_rank() OVER (
        PARTITION BY rating 
        ORDER BY length) p2
    FROM film
  ) t
) t
GROUP BY rating
ORDER BY rating;

All we’re really doing (again) is translate the GROUP BY expression to a PARTITION BY expression in the window function, and then redo the previous exercise.

Conclusion

Window functions are extremely powerful. They can be used and combined to calculate a variety of other aggregations. With the above approach, we can emulate the PERCENTILE_DISC inverse distribution function, which is not readily available in most RDBMS, using a more verbose but equally powerful combination of PERCENT_RANK and FIRST_VALUE that works in all RDBMS supporting window functions. A similar exercise could be made with PERCENTILE_CONT, with a slightly trickier approach to finding that FIRST_VALUE, which I’ll leave as an exercise to the reader.

A future jOOQ version might emulate this for you, automatically.

Liked this article? You may also like 10 SQL Tricks That You Didn’t Think Were Possible.

Calculate Percentiles to Learn About Data Set Skew in SQL

B-Tree indexes are perfect when your data is uniformly distributed. They are not really useful when you have skewed data. I’ll explain later why this is the case, but let’s first learn how to detect “skew”.

What is skew?

Skew is a term from statistics describing a distribution that is not symmetric. The example given on Wikipedia shows a distribution like this:

In RDBMS, we sometimes use the term skew colloquially to mean the same thing as a non-uniform distribution, i.e. even a (symmetric) normal distribution would be “skewed” in this sense. We simply mean that some values appear more often than others. Thus, I will put the term “skew” in double quotes in this article. While your RDBMS’s statistics contain this information once they are calculated, we can also detect such “skew” manually in ad-hoc queries using percentiles, which are defined in the SQL standard and supported as ordinary aggregate functions in a variety of databases, including:

  • Oracle
  • PostgreSQL
  • SQL Server (regrettably, only as window functions)

Uniform distribution

Let’s look at the FILM_ID values in the Sakila database:

SELECT
  percentile_disc(0.0) WITHIN GROUP (ORDER BY film_id) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY film_id) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY film_id) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY film_id) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY film_id) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY film_id) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY film_id) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY film_id) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY film_id) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY film_id) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY film_id) AS "100%"
FROM film;

What are we calculating here? We’re trying to find 11 different values for which we can say that:

  • 0% of the film_ids are lower than the “0%” value
  • 10% of the film_ids are lower than the “10%” value

Or in other words:

  • 0% is the MIN(film_id) value
  • 50% is the MEDIAN(film_id) value
  • 100% is the MAX(film_id) value

The result shows an unsurprisingly uniform distribution:

0% |10% |20% |30% |40% |50% |60% |70% |80% |90% |100% |
---|----|----|----|----|----|----|----|----|----|-----|
1  |100 |200 |300 |400 |500 |600 |700 |800 |900 |1000 |

We can plot this in Microsoft Excel or some other tool to get this nice curve:

This is not surprising, as the IDs are just consecutive values, which is a desired property of surrogate keys.
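As a quick sanity check of the “in other words” equivalences above, we can put the extreme percentiles right next to MIN() and MAX():

SELECT
  min(film_id)                                          AS "MIN",
  percentile_disc(0.0) WITHIN GROUP (ORDER BY film_id)  AS "0%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY film_id)  AS "50%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY film_id)  AS "100%",
  max(film_id)                                          AS "MAX"
FROM film;

MIN() and the “0%” percentile both yield 1, and MAX() and the “100%” percentile both yield 1000.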

“Skewed” distribution

It’s a different story when we look at the distribution of amounts in the payment table:

SELECT
  percentile_disc(0.0) WITHIN GROUP (ORDER BY amount) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY amount) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY amount) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY amount) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY amount) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY amount) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY amount) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY amount) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY amount) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY amount) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY amount) AS "100%"
FROM payment;

We’re now getting:

0%   |10%  |20%  |30%  |40%  |50%  |60%  |70%  |80%  |90%  |100% 
-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----
0.00 |0.99 |1.99 |2.99 |2.99 |3.99 |4.99 |4.99 |5.99 |6.99 |11.99

This looks … “skewed”, although clearly the bias is mainly caused by the fact that this data is generated. When we plot the above, we’re getting:

The slope is less steep at the beginning of this curve, which essentially means that more values exist at the lower end of the range than at the upper end. We can validate this with another query:

SELECT amount, count(*)
FROM (
  SELECT trunc(amount) AS amount
  FROM payment
) t 
GROUP BY amount
ORDER BY amount;

… which yields:

amount |count |
-------|------|
0      |3003  |
1      |641   |
2      |3542  |
3      |1117  |
4      |3789  |
5      |1306  |
6      |1119  |
7      |675   |
8      |486   |
9      |257   |
10     |104   |
11     |10    |

Plotted:

When plotting this, we can see that there are more amounts in the lower half of the range than in the upper half, which leads to percentiles growing slower.

Correlations

This technique can also be applied to detect correlations in data. We can, for instance, try to find the percentiles of the length of films, and group the data set by rating. I’m using a GROUPING SETS shorthand here, the ROLLUP() clause, to calculate the grand total as well. Just check out the query and its results, and you’ll see:

SELECT
  rating,
  count(*),
  percentile_disc(0.0) WITHIN GROUP (ORDER BY length) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY length) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY length) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY length) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY length) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY length) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY length) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY length) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY length) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY length) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY length) AS "100%"
FROM film
GROUP BY ROLLUP(rating);

This yields:

rating |count |0% |10% |20% |30% |40% |50% |60% |70% |80% |90% |100% |
-------|------|---|----|----|----|----|----|----|----|----|----|-----|
G      |178   |47 |57  |67  |80  |93  |107 |121 |138 |156 |176 |185  |
PG     |194   |46 |58  |72  |85  |99  |113 |122 |137 |151 |168 |185  |
PG-13  |223   |46 |61  |76  |92  |110 |125 |138 |150 |162 |176 |185  |
R      |195   |49 |68  |82  |90  |104 |115 |129 |145 |160 |173 |185  |
NC-17  |210   |46 |58  |74  |84  |97  |112 |125 |138 |153 |174 |184  |
       |1000  |46 |60  |74  |86  |102 |114 |128 |142 |156 |173 |185  |

So, the GROUP BY clause produced one row per rating, plus an additional grand total row at the bottom. For illustration purposes, I’ve added the COUNT(*) column to show how many films are in each group. The first 5 rows sum up to 1000, which is again the grand total at the bottom.

Let’s plot the percentiles now as line and bar charts:

We can “see” that there is no strong correlation between rating and length. Both data sets are close to uniformly distributed, quite independently of the rating, with the exception of PG-13, which is just slightly skewed towards longer film lengths.

Again, this isn’t terribly interesting as the data set was generated, probably using some randomness to avoid perfectly uniform distribution. In real world scenarios, the above data would have been more “skewed”.

How does this help with performance?

A balanced tree index is very useful when data is quite uniformly distributed, because in that case, it can help access data points or ranges of data in O(log(N)) time. This is quite a useful property for queries that look for film_id values, e.g.

SELECT *
FROM film
WHERE film_id = 1

When accessing “skewed” data, some values are more equal than others. This means that for example if we’re looking for amounts in the payment table, these two queries are not the same:

-- A lot of rows returned (3644)
SELECT * FROM payment WHERE amount BETWEEN 0 AND 2;

-- Few rows returned (361)
SELECT * FROM payment WHERE amount BETWEEN 9 AND 11;

An index on the amount column could have been useful for the second query, but maybe not for the first one.

There are several things we can do to make sure optimal index usage is being applied for all sorts of queries. In case of uniformly distributed data, we usually don’t have to do anything as SQL developers. In case of “skewed” data sets, it may be worth thinking about:

  • Using histogram statistics (a sketch follows after this list)
  • Hinting the optimiser (in Oracle or SQL Server)
  • Avoiding bind variables (only in extreme cases)
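For the first option, here’s a minimal sketch of how a histogram on the AMOUNT column might be gathered in Oracle (the SAKILA owner name is an assumption; adjust it to your schema):

BEGIN
  dbms_stats.gather_table_stats(
    ownname    => 'SAKILA',
    tabname    => 'PAYMENT',
    method_opt => 'FOR COLUMNS amount SIZE 254'
  );
END;
/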

Conclusion

Not all data sets are equal. They are often “skewed”. By “skewed”, in SQL, we don’t mean the statistical sense of a normal distribution being asymmetric. We mean that a distribution is not uniform, so even a normal distribution is “skewed”. When it is, some values appear way more often than others. Some examples are:

Uniform distribution

  • Surrogate keys generated from sequences (consecutive)
  • Surrogate keys generated from UUIDs (random)
  • Foreign keys on one-to-one relationships

Slight “skew”

Possibly significant “skew”

This really depends on the actual data set, but do expect significant “skew” in these kinds of data:

  • Foreign keys on to-many relationships (e.g. some customers have more assets than others)
  • Numeric values (e.g. amount)
  • Codes and other discrete values (e.g. film rating, payment settlement codes, etc.)

This article has shown how we can use simple SQL aggregate functions, including the percentiles, to calculate and visualise such “skew”.

How to Work Around ORA-38104: Columns referenced in the ON Clause cannot be updated

Standard SQL is a beautiful language. Vendor specific implementations, however, have their warts. In Oracle, for example, it’s not possible to update any columns in a MERGE statement that have been referenced by the ON clause. For example:

CREATE TABLE person (
  id NUMBER(18) NOT NULL PRIMARY KEY,
  user_name VARCHAR2(50) NOT NULL UNIQUE,
  score NUMBER(18)
);

Now, in MySQL, we can run a non-standard INSERT .. ON DUPLICATE KEY UPDATE statement like this:

INSERT INTO person (id, user_name, score)
VALUES (1, 'foo', 100)
ON DUPLICATE KEY UPDATE
  user_name = 'foo', score = 100;

Behind the scenes, MySQL will check all unique constraints for duplicates and reject the insert, replacing it by the update statement instead. It’s debatable whether this is really useful (ideally, we want to check only a single unique constraint for duplicates), but that’s what MySQL offers.

If we want to achieve the same behaviour in Oracle, we could use the MERGE statement:

MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
ON (t.id = s.id OR t.user_name = s.user_name)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

That looks reasonable, but it doesn’t work. We’ll get:

SQL Error: ORA-38104: Columns referenced in the ON Clause cannot be updated: "T"."USER_NAME"

Obviously, this is some protection against the situation where such an update would suddenly move a row from the matched to the not matched group. In this particular example, it might not look like something that could cause problems, but if vendor specific extensions such as the WHERE or DELETE clauses were used, things might look different.

However, the parser is not very smart; in fact, it is almost not smart at all. While it detects extremely silly attempts at circumventing this limitation, such as this one:

MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
-- Circumvention attempt here: NVL()
ON (t.id = s.id OR nvl(t.user_name, null) = s.user_name)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

It does not detect any of these attempts:

Using row value expressions

MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
ON (t.id = s.id OR 
-- Circumvention attempt here: row value expressions
  (t.user_name, 'dummy') = ((s.user_name, 'dummy')))
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

Seemingly without any penalty on the execution plan. Both indexes are being used:

---------------------------------------------------------------------------
| Id  | Operation                               | Name            | Rows  |
---------------------------------------------------------------------------
|   0 | MERGE STATEMENT                         |                 |     1 |
|   1 |  MERGE                                  | PERSON          |       |
|   2 |   VIEW                                  |                 |       |
|   3 |    NESTED LOOPS OUTER                   |                 |     1 |
|   4 |     FAST DUAL                           |                 |     1 |
|   5 |     VIEW                                | VW_LAT_8626BD41 |     1 |
|   6 |      TABLE ACCESS BY INDEX ROWID BATCHED| PERSON          |     1 |
|   7 |       BITMAP CONVERSION TO ROWIDS       |                 |       |
|   8 |        BITMAP OR                        |                 |       |
|   9 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 10 |          INDEX RANGE SCAN               | SYS_C00106110   |       |
|  11 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 12 |          INDEX RANGE SCAN               | SYS_C00106111   |       |
---------------------------------------------------------------------------

Correlated subquery

MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
ON (t.id = s.id OR 
-- Circumvention attempt here: correlated subquery
  (SELECT t.user_name FROM dual) = s.user_name)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

This seems to prevent any index usage, and should thus be avoided:

----------------------------------------------------------
| Id  | Operation              | Name            | Rows  |
----------------------------------------------------------
|   0 | MERGE STATEMENT        |                 |     1 |
|   1 |  MERGE                 | PERSON          |       |
|   2 |   VIEW                 |                 |       |
|   3 |    NESTED LOOPS OUTER  |                 |     1 |
|   4 |     FAST DUAL          |                 |     1 |
|   5 |     VIEW               | VW_LAT_1846A928 |     1 |
|*  6 |      FILTER            |                 |       |
|   7 |       TABLE ACCESS FULL| PERSON          |     1 |
|   8 |       FAST DUAL        |                 |     1 |
----------------------------------------------------------

Using NVL() and updating a view instead

Just plain simple usage of NVL() inside of the ON clause didn’t work before. The parser was smart enough to detect that. But it isn’t smart enough to detect NVL() inside of a view / derived table.

MERGE INTO (
  SELECT id, user_name, nvl(user_name, null) n, score
  FROM person
) t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
-- Circumvention attempt here: renamed column
ON (t.id = s.id OR t.n = s.user_name)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

Notice that both USER_NAME and N columns are the same thing, but the parser doesn’t notice this and thinks we’re fine.

The execution plan is still optimal, as Oracle seems to have a way to optimise NVL() expressions (but not coalesce and others!):

---------------------------------------------------------------------------
| Id  | Operation                               | Name            | Rows  |
---------------------------------------------------------------------------
|   0 | MERGE STATEMENT                         |                 |     1 |
|   1 |  MERGE                                  | PERSON          |       |
|   2 |   VIEW                                  |                 |       |
|   3 |    NESTED LOOPS OUTER                   |                 |     1 |
|   4 |     FAST DUAL                           |                 |     1 |
|   5 |     VIEW                                | VW_LAT_46651921 |     1 |
|   6 |      TABLE ACCESS BY INDEX ROWID BATCHED| PERSON          |     1 |
|   7 |       BITMAP CONVERSION TO ROWIDS       |                 |       |
|   8 |        BITMAP OR                        |                 |       |
|   9 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 10 |          INDEX RANGE SCAN               | SYS_C00106110   |       |
|  11 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 12 |          INDEX RANGE SCAN               | SYS_C00106111   |       |
---------------------------------------------------------------------------

Using the WHERE clause

If we hadn’t had an OR predicate in our ON clause, but an AND predicate, then we could have used the WHERE clause in Oracle. This works:

-- NOT the same query as the original one!
MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
ON (t.id = s.id)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
  WHERE t.user_name = s.user_name
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score);

This is not the same query as the original one. I just listed it here for completeness’ sake, and also to remind readers that this approach, too, doesn’t seem to use indexes optimally. Only the primary key index (from the ON clause) seems to be used. The unique key is not being used:

----------------------------------------------------------------
| Id  | Operation                      | Name          | Rows  |
----------------------------------------------------------------
|   0 | MERGE STATEMENT                |               |     1 |
|   1 |  MERGE                         | PERSON        |       |
|   2 |   VIEW                         |               |       |
|   3 |    NESTED LOOPS OUTER          |               |     1 |
|   4 |     VIEW                       |               |     1 |
|   5 |      FAST DUAL                 |               |     1 |
|   6 |     TABLE ACCESS BY INDEX ROWID| PERSON        |     1 |
|*  7 |      INDEX UNIQUE SCAN         | SYS_C00106110 |     1 |
----------------------------------------------------------------

Careful

Be careful when applying the above workarounds. Assuming that ORA-38104 is a good thing (i.e. that Oracle still thinks it should be enforced), the above workarounds simply expose bugs in the parser, which should detect such cases. The above behaviour has been observed in Oracle 12c and 18c.

I personally believe that ORA-38104 should be abandoned entirely, and the root cause for this restriction should be removed. But it is certainly worth exploring alternative options rather than relying on the above workarounds in production code, apart from the occasional one-shot migration query, where such loopholes are always nice tools to exploit.

How to Aggregate an Archive Log’s Deltas into a Snapshot with SQL

A customer of my popular SQL training (which you should book!) has recently challenged me to optimise a hierarchical query that merges an archive log’s deltas in order to obtain a snapshot of some record at a given point in time. In this article, I will reproduce their problem statement in a simplified version and show how this can be done with SQL Server, using a few cool SQL features:

All of these are topics covered in the training, which were immediately applicable to this problem statement.

The problem statement

This was their archive design. They designed for uncertainty, meaning that for some entities in their system, they did not know what kinds of attributes would be part of the entity in the future. Given their application design, users could even add their own custom attributes to an entity.

This kind of thing is typically solved with the EAV (Entity Attribute Value) model, a “workaround” to denormalise data sets in SQL databases in the event of such schema uncertainty.

EAV can be implemented in several ways:

Through classic SQL tables only

An example implementation is this:

CREATE TABLE eav_classic (
  entity_type     VARCHAR (100) NOT NULL,
  entity_id       BIGINT        NOT NULL,
  attribute_name  VARCHAR (100) NOT NULL,
  attribute_type  VARCHAR (100) NOT NULL,
  attribute_value VARCHAR (100)     NULL,

  CONSTRAINT eav_classic_pk 
    PRIMARY KEY (entity_type, entity_id, attribute_name)
);

The drawbacks of this non-normalised design are immediately obvious. Most specifically, there is no simple way to establish referential integrity. But this may be totally OK, especially for archive logs, and for smaller databases (datomic does something similar).

Through tables containing JSON or XML data

Whenever you have schema-on-read data, JSON or XML data types may be appropriate, so this is a perfectly valid alternative:

CREATE TABLE eav_json (
  entity_type     VARCHAR (100)   NOT NULL,
  entity_id       BIGINT          NOT NULL,
  attributes      VARCHAR (10000) NOT NULL 
    CHECK (ISJSON(attributes) = 1),

  CONSTRAINT eav_json_pk 
    PRIMARY KEY (entity_type, entity_id)
);

If your database supports a JSON data type, you will obviously prefer that over the above emulation.
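For example, on PostgreSQL, a sketch of the same table using a native jsonb column might look like this:

CREATE TABLE eav_jsonb (
  entity_type     VARCHAR (100) NOT NULL,
  entity_id       BIGINT        NOT NULL,
  attributes      JSONB         NOT NULL,

  CONSTRAINT eav_jsonb_pk 
    PRIMARY KEY (entity_type, entity_id)
);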

For the rest of this article, I will use the JSON approach.

Versioning the EAV table

Versioning data in an EAV model is quite a bit easier than in a normalised schema. We can just add a version number and/or timestamp to each record. In their case, something like this may make sense:

CREATE TABLE history (
  id          BIGINT IDENTITY (1, 1) NOT NULL PRIMARY KEY,
  ts          DATETIME               NOT NULL,
  entity_type VARCHAR(100)           NOT NULL,
  entity_id   BIGINT                 NOT NULL,
  delta       VARCHAR(8000)          NOT NULL 
    CHECK (ISJSON(delta) = 1)
);

INSERT INTO history (entity_type, entity_id, ts, delta)
VALUES ('Person', 1, '2000-01-01 00:00:00', '{"first_name": "John", "last_name": "Doe"}'),
       ('Person', 1, '2000-01-01 01:00:00', '{"age": 37}'),
       ('Person', 1, '2000-01-01 02:00:00', '{"age": 38}'),
       ('Person', 1, '2000-01-01 03:00:00', '{"city": "New York"}'),
       ('Person', 1, '2000-01-01 04:00:00', '{"city": "Zurich", "age": null}')
;

This table now contains a set of deltas applied to the Person entity with ID = 1. It corresponds to the following sequence of SQL statements on an ordinary entity:

INSERT INTO person (id, first_name, last_name) 
  VALUES (1, 'John', 'Doe');
UPDATE person SET age = 37 WHERE id = 1;
UPDATE person SET age = 38 WHERE id = 1;
UPDATE person SET city = 'New York' WHERE id = 1;
UPDATE person SET city = 'Zurich', age = null WHERE id = 1;

You could even see their hand-written log as a transaction log of the database system, kind of like what you can extract using products like Golden Gate or Debezium. If you think of the transaction log as an event stream, the RDBMS’s current data representation is like a snapshot that you can get by applying any number of deltas to your tables.

Sometimes, you don’t want to completely change your architecture and go full “event sourcing”, but you just need this kind of log for a specific set of auditable entities. And for reasons like still having to support very old SQL Server versions, as well as other databases, you may also choose not to use the SQL:2011 temporal table feature, which has also been implemented in SQL Server 2016 and more recent versions.

With that out of the way…

How to access any arbitrary snapshot version?

When we visually process our HISTORY table, we can see that Person ID = 1 had the following values at any given time:

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York
04:00:00    John          Doe                 Zurich

Remember, this is always the same record of Person ID = 1, its snapshots represented at different times in the time axis. The goal here is to be able to find the record of John Doe at any given time.

Again, if we had been using the SQL:2011 temporal table feature, we could write

-- SQL Server
SELECT * 
FROM Person
FOR SYSTEM_TIME AS OF '2000-01-01 02:00:00.0000000'; 

-- Oracle (flashback query)
SELECT *
FROM Person
AS OF TIMESTAMP TIMESTAMP '2000-01-01 02:00:00'

Side note: Do note that Oracle’s flashback query needs to be properly configured:

  • Not all data is “flashbackable”
  • DDL tends to destroy the archive
  • Proper grants are needed to access the flashback archive

Similar limitations may apply in SQL Server.
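For reference, a rough sketch of what declaring such a system-versioned temporal table might look like in SQL Server 2016+ (table and column names here are purely illustrative):

CREATE TABLE person_temporal (
  id         BIGINT        NOT NULL PRIMARY KEY,
  first_name VARCHAR (100),
  last_name  VARCHAR (100),
  age        INT,
  city       VARCHAR (100),
  valid_from DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
  valid_to   DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
  PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.person_temporal_history));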

What if the RDBMS can’t help us?

If again for some reason, we cannot use the RDBMS’s temporal table features, we’ll roll our own as we’ve seen. So, our query in SQL Server to access the snapshot at any given time may be this:

SELECT 
  '{' 
+ string_agg(
    CASE type WHEN 0 THEN NULL ELSE 
      '"' + [key] + '": ' + 
      CASE type WHEN 1 THEN '"' + value + '"' ELSE value END
    END, ', ') 
+ '}'
FROM (
  SELECT *, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history
  OUTER APPLY openjson(delta)
  
  -- Apply all deltas prior to any given snapshot
  WHERE ts <= '2000-01-01 02:00:00'
) t
WHERE rn = 1;

What does this query do? Consider again our deltas at 04:00:00:

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York
04:00:00    John          Doe          -      Zurich

Observe how each value has some color encoding:

  • Strong, red: The current snapshot’s attribute value, when the last delta was applied to any given attribute
  • Strong, black: A previous snapshot’s attribute value, when a previous, superseded delta was applied to any given attribute
  • Light grey: A previous snapshot’s attribute value that was inherited from another previous delta

For any given snapshot, we want to find the Strong, red values. E.g. at a previous snapshot time, the color encoding would have been:

At 03:00:00

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York

04:00:00    John          Doe          -      Zurich

At 02:00:00

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38

03:00:00    John          Doe          38     New York
04:00:00    John          Doe          -      Zurich

So, our query needs to find the delta that was applied last for any given attribute.

With SQL, we can find that easily. We can assign a row number to each delta per attribute in reverse order, something like this:

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John (1)      Doe (1)
01:00:00    John          Doe          37 (3)
02:00:00    John          Doe          38 (2)
03:00:00    John          Doe          38     New York (2)
04:00:00    John          Doe          - (1)  Zurich (1)

Once we have that row number, we just filter out only those deltas whose row number is 1. Something like:

SELECT [key], value, row_number() OVER (
  PARTITION BY [key] ORDER BY ts DESC) rn
FROM history OUTER APPLY openjson(delta)
ORDER BY [key], ts;

Notice the OUTER APPLY openjson(delta) syntax. This just expands the JSON structure into key/value/type columns, which we can use more easily in a SQL query. Other database systems may have similar syntax for similar purposes. The result of the above query is:

key        |value    |rn 
-----------|---------|---
age        |37       |3  
age        |38       |2  
age        |         |1  
city       |New York |2  
city       |Zurich   |1  
first_name |John     |1  
last_name  |Doe      |1  

Filtering the ones whose row number is 1:

SELECT [key], value
FROM (
  SELECT ts, [key], value, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history OUTER APPLY openjson(delta)
) t
WHERE rn = 1
ORDER BY ts, [key]

This yields:

key        |value  
-----------|-------
first_name |John   
last_name  |Doe    
age        |       
city       |Zurich 

Exactly the data we wanted, in key/value form. Notice that this filtering step could have been done with DISTINCT ON in PostgreSQL, or with KEEP (DENSE_RANK FIRST ORDER BY ..) in Oracle – an exercise which I shall leave to the reader (feel free to leave the solution in the comments!)
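To give an idea of the PostgreSQL direction, a minimal DISTINCT ON sketch of this filtering step might look like the following, assuming the delta column were stored as jsonb and jsonb_each_text() were used in place of OPENJSON():

SELECT DISTINCT ON (d.key) d.key, d.value
FROM history
CROSS JOIN LATERAL jsonb_each_text(delta) AS d(key, value)
WHERE ts <= TIMESTAMP '2000-01-01 04:00:00'
ORDER BY d.key, ts DESC;

DISTINCT ON keeps, per key, the first row of the ORDER BY, i.e. the latest delta that touched that attribute. The Oracle KEEP (DENSE_RANK FIRST ...) version remains the reader’s exercise.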

And now, finally, just re-assemble the JSON using SQL Server 2017’s STRING_AGG. PostgreSQL would offer us JSON_AGG here, Oracle has JSON_OBJECTAGG. With STRING_AGG, you have to take care of manually escaping all values according to JSON syntax rules, which is bad. In my example, I just replaced " by \". Other characters need escaping too, so if there is a built-in feature, use that instead of string processing.
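In SQL Server 2016 and later, for example, the manual REPLACE() calls could presumably be swapped for the built-in STRING_ESCAPE(..., 'json') function:

-- SQL Server: built-in JSON escaping
SELECT string_escape('She said "hi"', 'json');
-- She said \"hi\"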

The STRING_AGG function aggregates a CASE expression which translates different JSON data types into different formats, where:

  • 0 is NULL (and nulls are not aggregated)
  • 1 is string
  • everything else can be taken at its value for simplicity, e.g. numbers or booleans

Every value (except nulls) is prefixed by the JSON object’s attribute name (“key”).

SELECT 
  '{' 
+ string_agg(
    CASE type WHEN 0 THEN NULL ELSE 
      '"' + replace([key], '"', '\"') + '": ' + 
      CASE type WHEN 1 THEN '"' + replace(value, '"', '\"') + '"' ELSE value END
    END, ', ') 
+ '}'
FROM (
  SELECT *, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history
  OUTER APPLY openjson(delta)
  
  -- Apply all deltas prior to any given snapshot
  WHERE ts <= '2000-01-01 04:00:00'
) t
WHERE rn = 1;

This produces

{"city": "Zurich", "first_name": "John", "last_name": "Doe"}

A final query that gets us the entire history of snapshots (watch the performance on this one; it could definitely be optimised):

SELECT ts, (
  SELECT 
    '{' 
  + string_agg(
      CASE type WHEN 0 THEN NULL ELSE 
        '"' + replace([key], '"', '\"') + '": ' + 
        CASE type WHEN 1 THEN '"' + replace(value, '"', '\"') + '"' ELSE value END
      END, ', ') 
  + '}'
  FROM (
    SELECT *, row_number() OVER (
      PARTITION BY [key] ORDER BY ts DESC) rn
    FROM history
    OUTER APPLY openjson(delta)
    
    -- Apply all deltas prior to any given snapshot
    WHERE ts <= x.ts
  ) t
  WHERE rn = 1
)
FROM history x
GROUP BY ts;

It yields:

ts       |                                                                          
---------|--------------------------------------------------------------------------
00:00:00 |{"first_name": "John", "last_name": "Doe"}                                
01:00:00 |{"age": 37, "first_name": "John", "last_name": "Doe"}                     
02:00:00 |{"age": 38, "first_name": "John", "last_name": "Doe"}                     
03:00:00 |{"age": 38, "city": "New York", "first_name": "John", "last_name": "Doe"} 
04:00:00 |{"city": "Zurich", "first_name": "John", "last_name": "Doe"}              

So, the complete history of all the snapshot versions of the Person with ID = 1.

Very cool, and definitely good enough for their archive / audit query requirements.

How to Write a Multiplication Aggregate Function in SQL

Everyone knows the SQL SUM() aggregate function (and many people also know its window function variant).

When querying the Sakila database, we can get the daily revenue (using PostgreSQL syntax):

WITH p AS (
  SELECT
    CAST (payment_date AS DATE) AS date,
    amount
  FROM payment
)
SELECT
  date,
  SUM (amount) AS daily_revenue,
  SUM (SUM (amount)) OVER (ORDER BY date) AS cumulative_revenue
FROM p
GROUP BY date
ORDER BY date

The result will look something like this:

date       |daily_revenue |cumulative_revenue 
-----------|--------------|-------------------
2005-05-24 |29.92         |29.92              
2005-05-25 |573.63        |603.55             
2005-05-26 |754.26        |1357.81            
2005-05-27 |685.33        |2043.14            
2005-05-28 |804.04        |2847.18            
2005-05-29 |648.46        |3495.64            
2005-05-30 |628.42        |4124.06            
2005-05-31 |700.37        |4824.43            
2005-06-14 |57.84         |4882.27            
...

Doing the same with multiplication

This is already quite useful. Very occasionally, however, we do not need to aggregate multiple values in a sum (through addition), but in a product (through multiplication). I recently stumbled upon such a case on Stack Overflow.

The question wanted to achieve the following result:

date        factor          accumulated
---------------------------------------
1986-01-10  null            1000
1986-01-13  -0.026595745    973.4042548
1986-01-14  0.005464481     978.7234036
1986-01-15  -0.016304348    962.7659569
1986-01-16  0               962.7659569
1986-01-17  0               962.7659569
1986-01-20  0               962.7659569
1986-01-21  0.005524862     968.0851061
1986-01-22  -0.005494506    962.765957
1986-01-23  0               962.765957
1986-01-24  -0.005524862    957.4468078
1986-01-27  0.005555556     962.7659569
1986-01-28  0               962.7659569
1986-01-29  0               962.7659569
1986-01-30  0               962.7659569
1986-01-31  0.027624309     989.3617013
1986-02-03  0.016129032     1005.319148
1986-02-04  0.042328041     1047.872338
1986-02-05  0.04568528      1095.744679

If this were a Microsoft Excel spreadsheet, the ACCUMULATED column would simply start with 1000 and have the following formula in all other rows:

accumulated(i) = accumulated(i - 1) * (1 + factor)

In other words (values truncated for simplicity):

1000.0 = start
 973.4 = 1000.0 * (1 - 0.026)
 978.7 =  973.4 * (1 + 0.005)
 962.7 =  978.7 * (1 - 0.016)
 962.7 =  962.7 * (1 - 0.000)
 962.7 =  962.7 * (1 - 0.000)
 962.7 =  962.7 * (1 - 0.000)
 968.0 =  962.7 * (1 + 0.005)
 ...

This is exciting because we’re not only requiring multiplicative aggregation, but even cumulative multiplicative aggregation. So, another window function.

But regrettably, SQL doesn’t offer a MUL() aggregate function, even if it were relatively simple to implement. We have two options:

  • Implementing a custom aggregate function (stay tuned for a future blog post; a quick PostgreSQL sketch follows after this list)
  • Using a trick by summing logarithms, rather than multiplying operands directly
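To give an idea of the first option, here’s a minimal PostgreSQL sketch that simply reuses the built-in numeric_mul() function (the function backing the * operator for NUMERIC) as the state transition function of a user-defined aggregate:

-- PostgreSQL: a user-defined multiplication aggregate
CREATE AGGREGATE mul(numeric) (
  SFUNC = numeric_mul,
  STYPE = numeric
);

-- Usage:
SELECT mul(i) AS multiplication
FROM (VALUES (2.0), (3.0), (4.0)) AS v(i);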

We’re implementing the latter for now. Check out this cool Wikipedia website about logarithmic identities, which we are going to blindly trust. In the middle of it, we have:

b^x * b^y = b^(x + y)

Which leads to:

log_b(x * y) = log_b(x) + log_b(y)

How cool is that? And thus:

x * y = b^(log_b(x) + log_b(y))

So, we can define any multiplication in terms of a bunch of exponentiation to some base (say e) and logarithms to some base (say e). Or, in SQL:

x * y = EXP(LN(x) + LN(y))

Or, as an aggregate function:

MUL(x) = EXP(SUM(LN(x)))

Heh!
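A quick sanity check of this identity, e.g. in PostgreSQL (give or take some floating point rounding):

WITH v(i) AS (VALUES (2), (3), (4))
SELECT EXP(SUM(LN(i))) AS mul
FROM v;
-- Yields approximately 24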

Our original problem can thus be solved very easily using this, as shown in my stack overflow answer:

SELECT
  date,
  factor,
  1000 * EXP(SUM(LN(1 + COALESCE(factor, 0))) 
       OVER (ORDER BY date)) AS accumulated
FROM t

And we get the nice result as previously shown. You may have to replace LN() by LOG() depending on your database.

Caveat: Negative numbers

Try running this:

SELECT LN(-1)

You’ll get:

SQL Error [2201E]: ERROR: cannot take logarithm of a negative number

Logarithms are defined only for strictly positive numbers, unless your database is capable of handling complex numbers as well, in which case a single zero value would still break the aggregation.

But if your data set is defined to contain only strictly positive numbers, you’ll be fine – give or take some floating point rounding errors. Or, you’ll do some sign handling, which looks like this:

WITH v(i) AS (VALUES (-2), (-3), (-4))
SELECT 
  CASE 
    WHEN SUM (CASE WHEN i < 0 THEN -1 END) % 2 < 0 
    THEN -1 
    ELSE 1 
  END * EXP(SUM(LN(ABS(i)))) multiplication1
FROM v;

WITH v(i) AS (VALUES (-2), (-3), (-4), (-5))
SELECT 
  CASE 
    WHEN SUM (CASE WHEN i < 0 THEN -1 END) % 2 < 0 
    THEN -1 
    ELSE 1 
  END * EXP(SUM(LN(ABS(i)))) multiplication2
FROM v;

The above yielding

multiplication1      
--------------------
-23.999999999999993 


multiplication2     
-------------------
119.99999999999997 

Close enough.

Caveat: Zero

Try running this:

SELECT LN(0)

You’ll get:

SQL Error [2201E]: ERROR: cannot take logarithm of zero

Zero is different from negative numbers. A product that has a zero operand is always zero, so we should be able to handle this. We’ll do it in two steps:

  • Exclude zero values from the actual aggregation that uses EXP() and LN()
  • Add an additional CASE expression that checks if any of the operands is zero

The first step might not be necessary depending on how your database optimiser executes the second step.

WITH v(i) AS (VALUES (2), (3), (0))
SELECT 
  CASE 
    WHEN SUM (CASE WHEN i = 0 THEN 1 END) > 0
    THEN 0
    WHEN SUM (CASE WHEN i < 0 THEN -1 END) % 2 < 0 
    THEN -1 
    ELSE 1 
  END * EXP(SUM(LN(ABS(NULLIF(i, 0))))) multiplication
FROM v;

Extension: DISTINCT

Calculating the product of all DISTINCT values requires repeating the DISTINCT keyword in 2 out of the above 3 sums:

WITH v(i) AS (VALUES (2), (3), (3))
SELECT 
  CASE 
    WHEN SUM (CASE WHEN i = 0 THEN 1 END) > 0
    THEN 0
    WHEN SUM (DISTINCT CASE WHEN i < 0 THEN -1 END) % 2 < 0 
    THEN -1 
    ELSE 1 
  END * EXP(SUM(DISTINCT LN(ABS(NULLIF(i, 0))))) multiplication
FROM v;

The result is now:

multiplication |
---------------|
6              |

Notice that the first SUM(), which checks for the presence of zero values, doesn’t require a DISTINCT keyword, so we omit it to improve performance.

Extension: Window functions

Of course, if we are able to emulate a PRODUCT() aggregate function, we’d love to turn it into a window function as well. This can be done simply by transforming each individual SUM() into a window function:

WITH v(i, j) AS (
  VALUES (1, 2), (2, -3), (3, 4), 
         (4, -5), (5, 0), (6, 1)
)
SELECT i, j, 
  CASE 
    WHEN SUM (CASE WHEN j = 0 THEN 1 END) 
      OVER (ORDER BY i) > 0
    THEN 0
    WHEN SUM (CASE WHEN j < 0 THEN -1 END) 
      OVER (ORDER BY i) % 2 < 0 
    THEN -1 
    ELSE 1 
  END * EXP(SUM(LN(ABS(NULLIF(j, 0)))) 
    OVER (ORDER BY i)) multiplication
FROM v;

The result is now:

i |j  |multiplication      |
--|---|--------------------|
1 | 2 |2                   |
2 |-3 |-6                  |
3 | 4 |-23.999999999999993 |
4 |-5 |119.99999999999997  |
5 | 0 |0                   |
6 | 1 |0                   |

So cool! The cumulative product gets bigger and bigger until it hits the first zero; from then on, it stays zero.

jOOQ support

jOOQ 3.12 will support this as well and emulate it correctly on all databases:
https://github.com/jOOQ/jOOQ/issues/5939

A note on Oracle performance

Do note that Oracle is very slow to calculate LN(number_type). It can be MUCH faster to calculate LN(binary_double_type) instead. An explicit type cast produced a 100x performance improvement in a simple test, documented here.
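In other words, the idea is to cast the operand to BINARY_DOUBLE before taking the logarithm. A minimal Oracle sketch, using DUAL to generate a few strictly positive sample rows (sign and zero handling omitted for brevity):

-- Oracle: cast to BINARY_DOUBLE before taking the logarithm
SELECT i, j,
  EXP(SUM(LN(ABS(CAST(NULLIF(j, 0) AS BINARY_DOUBLE))))
    OVER (ORDER BY i)) multiplication
FROM (
  SELECT 1 i, 2 j FROM dual UNION ALL
  SELECT 2 i, 3 j FROM dual UNION ALL
  SELECT 3 i, 4 j FROM dual
) v;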

Beware of Hidden PL/SQL to SQL Context Switches

I recently stumbled upon a curious query on a customer’s productive Oracle database:

SELECT USER FROM SYS.DUAL

Two things caught my attention:

  • The query was executed many billions of times per month, accounting for about 0.3% of that system’s load. That’s 0.3% for something extremely silly!
  • I don’t think that customer would ever qualify the DUAL table as SYS.DUAL, which hints at some system functionality

I found it in Oracle Enterprise Manager, but you could also find it using a query like this one:

SELECT 
  sql_id, 
  executions, 
  elapsed_time, 
  ratio_to_report(elapsed_time) over() p, 
  sql_text
FROM v$sql
ORDER BY p DESC;

Why was this query being run so often? In Enterprise Manager, the query’s statistics overview displayed that the query originated from a function called STANDARD.USER (I don’t know yet where I could find this information in the dictionary views, manually).

Naively, I had always thought that the USER pseudo column or pseudo constant is some value from the context, but like many other functions, it’s really just a function in that package.

What does STANDARD.USER() do?

Now, I’m not 100% sure if that source code is something that I am allowed to reproduce from a legal perspective, this being Oracle and all. But if you run this query here, which I am freely allowing you to:

WITH s AS (
  SELECT s.*,
    MIN(CASE 
      WHEN upper(text) LIKE '%FUNCTION USER%' 
      THEN line END
    ) OVER () s
  FROM all_source s
  WHERE owner = 'SYS' 
  AND name = 'STANDARD'
  AND type = 'PACKAGE BODY'
)
SELECT text
FROM s
WHERE line >= s AND line < s + 6;

Then you might be able to see something like this:

  function USER return varchar2 is
  c varchar2(255);
  begin
        select user into c from sys.dual;
        return c;
  end;

This is just the result of some SQL query I’ve shown you. Any correspondence with actual source code is merely coincidental.

Let’s assume this were the actual source code of the STANDARD.USER() function. We can now clearly see that this very SQL query that I’ve observed before is being executed! Want to verify this?

Let’s benchmark

As always, I’m using the benchmark technique described here. The full benchmark logic is at the end of the article.

In essence, I’m comparing the performances of 500000 executions of this loop:

FOR i IN 1 .. v_repeat LOOP
  v := USER;
END LOOP;

With this one:

FOR i IN 1 .. v_repeat LOOP
  SELECT USER INTO v FROM dual;
END LOOP;

And this one:

FOR i IN 1 .. v_repeat LOOP
  -- Note: According to the doc, use 'SESSION_USER' instead
  v := sys_context('USERENV', 'CURRENT_USER');
END LOOP;

The result of this benchmark is:

Run 1, Statement 1 : 2.40509 (avg : 2.43158)
Run 1, Statement 2 : 2.13208 (avg : 2.11816)
Run 1, Statement 3 : 1.01452 (avg : 1.02081)

Run 2, Statement 1 : 2.41889 (avg : 2.43158)
Run 2, Statement 2 : 2.09753 (avg : 2.11816)
Run 2, Statement 3 : 1.00203 (avg : 1.02081)

Run 3, Statement 1 : 2.45384 (avg : 2.43158)
Run 3, Statement 2 : 2.09060 (avg : 2.11816)
Run 3, Statement 3 : 1.02239 (avg : 1.02081)

Run 4, Statement 1 : 2.39516 (avg : 2.43158)
Run 4, Statement 2 : 2.14140 (avg : 2.11816)
Run 4, Statement 3 : 1.06512 (avg : 1.02081)

Run 5, Statement 1 : 2.48493 (avg : 2.43158)
Run 5, Statement 2 : 2.12922 (avg : 2.11816)
Run 5, Statement 3 : 1.00000 (avg : 1.02081)

How to read this benchmark result? These aren’t actual execution times (which aren’t interesting), but relative times compared to the fastest run (run 5, statement 3 = 1). The explicit SELECT USER FROM DUAL is about half as fast as the SYS_CONTEXT call, and the USER call is even a bit slower than that.

When re-running this query:

SELECT 
  sql_id, 
  executions, 
  ratio_to_report(elapsed_time) over() p, 
  sql_text
FROM v$sql
ORDER BY p DESC;

We can see:

SQL_ID          EXECUTIONS  P     SQL_TEXT
6r9s58qfu339c   1           0.26  DECLARE ...
1v717nvrhgbn9   2500000     0.14  SELECT USER FROM SYS.DUAL
...

So, this query has definitely been run way too many times, including the PL/SQL to SQL context switch that is involved.

I’m running this benchmark on Oracle 18.0.0.0.0 in Docker on a Windows machine. More close-to-the-metal and less virtualised setups might achieve more drastic results. See, for example, how Connor McDonald got a much better improvement from using SYS_CONTEXT.

In this particular case, the STANDARD.USER() reference was used very often in triggers to fill in audit columns of many tables. Very easy to fix. Just use sys_context('USERENV', 'CURRENT_USER') instead.
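For illustration, such an audit trigger might end up looking like this minimal sketch (the person table and its modified_by / modified_at columns are hypothetical):

CREATE OR REPLACE TRIGGER person_audit_trg
BEFORE INSERT OR UPDATE ON person
FOR EACH ROW
BEGIN
  -- Hypothetical audit columns; sys_context() avoids the hidden
  -- SELECT USER FROM SYS.DUAL context switch
  :new.modified_by := sys_context('USERENV', 'CURRENT_USER');
  :new.modified_at := systimestamp;
END;
/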

Full benchmark logic

SET SERVEROUTPUT ON

ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;

CREATE TABLE results (
  run     NUMBER(2),
  stmt    NUMBER(2),
  elapsed NUMBER
);

DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 500000;
  v NUMBER;
BEGIN

  -- Repeat the whole benchmark several times to 
  -- avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1 .. v_repeat LOOP
      v := v + length(USER);
    END LOOP;
  
    INSERT INTO results VALUES (r, 1, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1 .. v_repeat LOOP
      SELECT v + length(USER) INTO v FROM dual;
    END LOOP;
      
    INSERT INTO results VALUES (r, 2, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1 .. v_repeat LOOP
      -- Note: According to the doc, use 'SESSION_USER' instead
      v := v + length(sys_context('USERENV', 'CURRENT_USER'));
    END LOOP;
      
    INSERT INTO results VALUES (r, 3, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
  END LOOP;
  
  FOR rec IN (
    SELECT 
      run, stmt, 
      CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio,
      CAST(AVG(elapsed) OVER (PARTITION BY stmt) / 
           MIN(elapsed) OVER() AS NUMBER(10, 5)) avg_ratio
    FROM results
    ORDER BY run, stmt
  )
  LOOP
    dbms_output.put_line('Run ' || rec.run || 
      ', Statement ' || rec.stmt || 
      ' : ' || rec.ratio || ' (avg : ' || rec.avg_ratio || ')');
  END LOOP;
  
  dbms_output.put_line('');
  dbms_output.put_line('Copyright Data Geekery GmbH');
  dbms_output.put_line('https://www.jooq.org/benchmark');
END;
/

DROP TABLE results;

Find the Next Non-NULL Row in a Series With SQL

I’ve recently stumbled across this fun SQL question on reddit. The question was looking at a time series of data points where some events happened. For each event, we have the start time and the end time:

timestamp             start    end
-----------------------------------
2018-09-03 07:00:00   1        null
2018-09-03 08:00:00   null     null
2018-09-03 09:00:00   null     null
2018-09-03 10:00:00   null     1
2018-09-03 12:00:00   null     null
2018-09-03 13:00:00   null     null
2018-09-03 14:00:00   1        null
2018-09-03 15:00:00   null     1

The desired output of the query should be this additional count column:

timestamp             start    end    count
-------------------------------------------
2018-09-03 07:00:00   1        null   4
2018-09-03 08:00:00   null     null   null
2018-09-03 09:00:00   null     null   null
2018-09-03 10:00:00   null     1      null
2018-09-03 12:00:00   null     null   null
2018-09-03 13:00:00   null     null   null
2018-09-03 14:00:00   1        null   2
2018-09-03 15:00:00   null     1      null

So, the rule is simple. Whenever an event starts, we would like to know how many consecutive entries it takes until the event stops again. We can visually see how that makes sense:

timestamp             start    end    count
-------------------------------------------
2018-09-03 07:00:00   1        null   4     -- 4 Rows in this event
2018-09-03 08:00:00   null     null   null
2018-09-03 09:00:00   null     null   null
2018-09-03 10:00:00   null     1      null

2018-09-03 12:00:00   null     null   null  -- No event here
2018-09-03 13:00:00   null     null   null

2018-09-03 14:00:00   1        null   2     -- 2 Rows in this event
2018-09-03 15:00:00   null     1      null

Some observations and assumptions about the problem at hand:

  • No two events will ever overlap
  • The time series does not progress monotonically, i.e. even if most data points are 1h apart, there can be larger or smaller gaps between data points
  • There are, however, no two identical timestamps in the series

How can we solve this problem?

Create the data set, first

We’re going to be using PostgreSQL for this example, but it will work with any database that supports window functions, which most databases do these days.

In PostgreSQL, we can use the VALUES() clause to generate data in memory easily. For the sake of simplicity, we’re not going to use timestamps, but integer representations of the timestamps. I’ve included the same out-of-the-ordinary gap between 4 and 6:

values (1, 1, null),
       (2, null, null),
       (3, null, null),
       (4, null, 1),
       (6, null, null),
       (7, null, null),
       (8, 1, null),
       (9, null, 1)

If we run this statement (yes, this is a standalone statement in PostgreSQL!), then the database will simply echo back the values we’ve sent it:

column1 |column2 |column3 |
--------|--------|--------|
1       |1       |        |
2       |        |        |
3       |        |        |
4       |        |1       |
6       |        |        |
7       |        |        |
8       |1       |        |
9       |        |1       |

How to deal with non-monotonically growing series

The fact that column1 is not growing monotonically means that we cannot use it / trust it as a means to calculate the length of an event. We need to calculate an additional column that has a guaranteed monotonically growing set of integers in it. The ROW_NUMBER() window function is perfect for that.

Consider this SQL statement:

with 
  d(a, b, c) as (
	values (1, 1, null),
	       (2, null, null),
	       (3, null, null),
	       (4, null, 1),
	       (6, null, null),
	       (7, null, null),
	       (8, 1, null),
	       (9, null, 1)
  ),
  t as (
    select 
      row_number() over (order by a) as rn, a, b, c
    from d
  )
select * from t;

The new rn column is a row number calculated for each row based on the ordering of a. For simplicity, I’ve aliased:

  • a = timestamp
  • b = start
  • c = end

The result of this query is:

rn |a |b |c |
---|--|--|--|
1  |1 |1 |  |
2  |2 |  |  |
3  |3 |  |  |
4  |4 |  |1 |
5  |6 |  |  |
6  |7 |  |  |
7  |8 |1 |  |
8  |9 |  |1 |

Nothing fancy yet.

Now, how to use this rn column to find the length of an event?

Visually, we can get the idea quickly, seeing that an event’s length can be calculated using the formula RN2 - RN1 + 1:

rn |a |b |c |
---|--|--|--|
1  |1 |1 |  | RN1 = 1
2  |2 |  |  |
3  |3 |  |  |
4  |4 |  |1 | RN2 = 4

5  |6 |  |  |
6  |7 |  |  |

7  |8 |1 |  | RN1 = 7
8  |9 |  |1 | RN2 = 8

We have two events:

  • 4 – 1 + 1 = 4
  • 8 – 7 + 1 = 2

So, all we have to do is for each starting point of an event at RN1, find the corresponding RN2, and run the arithmetic. This is quite a bit of syntax, but it isn’t so hard, so bear with me while I explain:

with 
  d(a, b, c) as (
	values (1, 1, null),
	       (2, null, null),
	       (3, null, null),
	       (4, null, 1),
	       (6, null, null),
	       (7, null, null),
	       (8, 1, null),
	       (9, null, 1)
  ),
  t as (
    select 
      row_number() over (order by a) as rn, a, b, c
    from d
  )

-- Interesting bit here:
select
  a, b, c,
  case 
    when b is not null then 
      min(case when c is not null then rn end) 
        over (order by rn 
          rows between 1 following and unbounded following) 
      - rn + 1 
  end cnt
from t;

Let’s look at this new cnt column, step by step. First, the easy part:

The CASE expression

There’s a case expression that goes like this:

case 
  when b is not null then 
    ...
end cnt

All this does is check if b is not null and if this is true, then calculate something. Remember, b = start, so we’re putting a calculated value in the row where an event started. That was the requirement.

The new window function

So, what do we calculate there?

min(...) over (...) ...

A window function that finds the minimum value over a window of data. That minimum value is RN2, the next row number value where the event ends. So, what do we put in the min() function to get that value?

min(case when c is not null then rn end) 
over (...) 
...

Another case expression. When c is not null, we know the event has ended (remember, c = end). And if the event has ended, we want to find that row’s rn value. So that would be the minimum value of that case expression for all the rows after the row that started the event. Visually:

rn |a |b |c | case expr | minimum "next" value
---|--|--|--|-----------|---------------------
1  |1 |1 |  | null      | 4
2  |2 |  |  | null      | null
3  |3 |  |  | null      | null
4  |4 |  |1 | 4         | null

5  |6 |  |  | null      | null
6  |7 |  |  | null      | null

7  |8 |1 |  | null      | 8
8  |9 |  |1 | 8         | null

Now, we only need to specify that OVER() clause to form a window of all rows that follow the current row.

min(case when c is not null then rn end) 
  over (order by rn 
    rows between 1 following and unbounded following) 
...

The window is ordered by rn and it starts 1 row after the current row (1 following) and ends in infinity (unbounded following).

The only thing left to do now is do the arithmetic:

min(case when c is not null then rn end) 
  over (order by rn 
    rows between 1 following and unbounded following) 
- rn + 1

This is a verbose way of calculating RN2 - RN1 + 1, and we’re doing that only in those rows that start an event. The result of the complete query above is now:

a |b |c |cnt |
--|--|--|----|
1 |1 |  |4   |
2 |  |  |    |
3 |  |  |    |
4 |  |1 |    |
6 |  |  |    |
7 |  |  |    |
8 |1 |  |2   |
9 |  |1 |    |

Read more about window functions on this blog.