## Calculate Percentiles to Learn About Data Set Skew in SQL

B-tree indexes are perfect when your data is uniformly distributed, but not very useful when your data is skewed. I’ll explain later why this is the case, but let’s first learn how to detect “skew”.

### What is skew?

Skew is a term from statistics describing a distribution that is not symmetric. The example given on Wikipedia shows such a skewed distribution.

In the RDBMS world, we sometimes use the term skew colloquially to mean the same thing as non-uniform distribution; in that sense, even a normal distribution would be “skewed”. We simply mean that some values appear more often than others, which is why I will put the term “skew” in double quotes throughout this article. Your RDBMS’s statistics contain this information once they have been calculated, but we can also detect such “skew” manually in ad-hoc queries using percentiles, which are defined in the SQL standard and supported as ordinary aggregate functions in a variety of databases, including:

• Oracle
• PostgreSQL
• SQL Server (regrettably, only as window functions)

### Uniform distribution

Let’s look at the FILM_ID values in the Sakila database:

```sql
SELECT
percentile_disc(0.0) WITHIN GROUP (ORDER BY film_id) AS "0%",
percentile_disc(0.1) WITHIN GROUP (ORDER BY film_id) AS "10%",
percentile_disc(0.2) WITHIN GROUP (ORDER BY film_id) AS "20%",
percentile_disc(0.3) WITHIN GROUP (ORDER BY film_id) AS "30%",
percentile_disc(0.4) WITHIN GROUP (ORDER BY film_id) AS "40%",
percentile_disc(0.5) WITHIN GROUP (ORDER BY film_id) AS "50%",
percentile_disc(0.6) WITHIN GROUP (ORDER BY film_id) AS "60%",
percentile_disc(0.7) WITHIN GROUP (ORDER BY film_id) AS "70%",
percentile_disc(0.8) WITHIN GROUP (ORDER BY film_id) AS "80%",
percentile_disc(0.9) WITHIN GROUP (ORDER BY film_id) AS "90%",
percentile_disc(1.0) WITHIN GROUP (ORDER BY film_id) AS "100%"
FROM film;
```

What are we calculating here? We’re trying to find 11 different values for which we can say that:

• 0% of the film_ids are lower than the “0%” value
• 10% of the film_ids are lower than the “10%” value
• … and so on, up to the “100%” value

Or in other words:

• 0% is the MIN(film_id) value
• 50% is the MEDIAN(film_id) value
• 100% is the MAX(film_id) value

The result shows an unsurprisingly uniform distribution:

```
0% |10% |20% |30% |40% |50% |60% |70% |80% |90% |100% |
---|----|----|----|----|----|----|----|----|----|-----|
1  |100 |200 |300 |400 |500 |600 |700 |800 |900 |1000 |
```

We can plot this in Microsoft Excel or some other tool to get a nice, straight curve.

This is not surprising, as the IDs are just consecutive values, which is a desired property of surrogate keys.
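As an aside, the discrete percentile logic is easy to reproduce outside the database. The following Java sketch (an illustration of `percentile_disc` semantics, not of how any RDBMS implements them; the class name is made up) reproduces the values from the table above:

```java
import java.util.Arrays;

public class PercentileDisc {
    // percentile_disc(p): the first value in the sorted list whose
    // cumulative distribution (1-based rank / N) reaches p
    static int percentileDisc(int[] values, double p) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        if (p <= 0.0)
            return sorted[0];
        int index = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.min(index, sorted.length - 1)];
    }

    public static void main(String[] args) {
        // film_id values 1..1000, as in the Sakila example
        int[] ids = new int[1000];
        for (int i = 0; i < 1000; i++)
            ids[i] = i + 1;

        for (double p : new double[] { 0.0, 0.1, 0.5, 0.9, 1.0 })
            System.out.println((int) (p * 100) + "%: " + percentileDisc(ids, p));
        // 0%: 1, 10%: 100, 50%: 500, 90%: 900, 100%: 1000
    }
}
```

For p = 0.1 over 1000 consecutive IDs, the 1-based rank is ceil(0.1 × 1000) = 100, matching the “10%” column above.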

### “Skewed” distribution

It’s a different story when we look at the distribution of amounts in the payment table:

```sql
SELECT
percentile_disc(0.0) WITHIN GROUP (ORDER BY amount) AS "0%",
percentile_disc(0.1) WITHIN GROUP (ORDER BY amount) AS "10%",
percentile_disc(0.2) WITHIN GROUP (ORDER BY amount) AS "20%",
percentile_disc(0.3) WITHIN GROUP (ORDER BY amount) AS "30%",
percentile_disc(0.4) WITHIN GROUP (ORDER BY amount) AS "40%",
percentile_disc(0.5) WITHIN GROUP (ORDER BY amount) AS "50%",
percentile_disc(0.6) WITHIN GROUP (ORDER BY amount) AS "60%",
percentile_disc(0.7) WITHIN GROUP (ORDER BY amount) AS "70%",
percentile_disc(0.8) WITHIN GROUP (ORDER BY amount) AS "80%",
percentile_disc(0.9) WITHIN GROUP (ORDER BY amount) AS "90%",
percentile_disc(1.0) WITHIN GROUP (ORDER BY amount) AS "100%"
FROM payment;
```

We’re now getting:

```
0%   |10%  |20%  |30%  |40%  |50%  |60%  |70%  |80%  |90%  |100%
-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----
0.00 |0.99 |1.99 |2.99 |2.99 |3.99 |4.99 |4.99 |5.99 |6.99 |11.99
```

This looks … “skewed”, although clearly the bias is mainly caused by the fact that this data is generated. When we plot the above, the “skew” becomes visible in the curve.

The slope is less steep at the beginning of this curve, which essentially means that more values exist at the lower end of the range than at the upper end. We can validate this with another query:

```sql
SELECT amount, count(*)
FROM (
SELECT trunc(amount) AS amount
FROM payment
) t
GROUP BY amount
ORDER BY amount;
```

… which yields:

```
amount |count |
-------|------|
0      |3003  |
1      |641   |
2      |3542  |
3      |1117  |
4      |3789  |
5      |1306  |
6      |1119  |
7      |675   |
8      |486   |
9      |257   |
10     |104   |
11     |10    |
```

When plotting this, we can see that there are more amounts in the lower half of the range than in the upper half, which leads to percentiles growing more slowly.
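The bucketing step itself is just a truncate-and-count reduction. Here is a minimal Java sketch of the same idea (the `Buckets` class and the sample amounts are made up; the real query runs against `payment.amount`):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.DoubleStream;

public class Buckets {
    // Equivalent of SELECT trunc(amount), count(*) ... GROUP BY trunc(amount):
    // counts values per integer bucket, sorted by bucket
    static Map<Long, Long> bucketCounts(double[] amounts) {
        return DoubleStream.of(amounts).boxed()
            .collect(Collectors.groupingBy(
                a -> (long) Math.floor(a), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up sample amounts; the real query runs against payment.amount
        System.out.println(bucketCounts(
            new double[] { 0.99, 1.99, 2.99, 2.99, 3.99, 11.99 }));
        // {0=1, 1=1, 2=2, 3=1, 11=1}
    }
}
```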

### Correlations

This technique can also be applied to detect correlations in data. We can, for instance, find the percentiles of film lengths, grouped by rating. I’m using a grouping sets feature here, the ROLLUP() clause, to calculate the grand total as well. Just check out the query and its results, and you’ll see:

```sql
SELECT
rating,
count(*),
percentile_disc(0.0) WITHIN GROUP (ORDER BY length) AS "0%",
percentile_disc(0.1) WITHIN GROUP (ORDER BY length) AS "10%",
percentile_disc(0.2) WITHIN GROUP (ORDER BY length) AS "20%",
percentile_disc(0.3) WITHIN GROUP (ORDER BY length) AS "30%",
percentile_disc(0.4) WITHIN GROUP (ORDER BY length) AS "40%",
percentile_disc(0.5) WITHIN GROUP (ORDER BY length) AS "50%",
percentile_disc(0.6) WITHIN GROUP (ORDER BY length) AS "60%",
percentile_disc(0.7) WITHIN GROUP (ORDER BY length) AS "70%",
percentile_disc(0.8) WITHIN GROUP (ORDER BY length) AS "80%",
percentile_disc(0.9) WITHIN GROUP (ORDER BY length) AS "90%",
percentile_disc(1.0) WITHIN GROUP (ORDER BY length) AS "100%"
FROM film
GROUP BY ROLLUP(rating);
```

This yields:

```
rating |count |0% |10% |20% |30% |40% |50% |60% |70% |80% |90% |100% |
-------|------|---|----|----|----|----|----|----|----|----|----|-----|
G      |178   |47 |57  |67  |80  |93  |107 |121 |138 |156 |176 |185  |
PG     |194   |46 |58  |72  |85  |99  |113 |122 |137 |151 |168 |185  |
PG-13  |223   |46 |61  |76  |92  |110 |125 |138 |150 |162 |176 |185  |
R      |195   |49 |68  |82  |90  |104 |115 |129 |145 |160 |173 |185  |
NC-17  |210   |46 |58  |74  |84  |97  |112 |125 |138 |153 |174 |185  |
       |1000  |46 |60  |74  |86  |102 |114 |128 |142 |156 |173 |185  |
```

So, the `GROUP BY` clause produced one row per rating, and an additional grand total row at the bottom. For illustration purposes, I’ve added the `COUNT(*)` column to show how many films are in each group. The first 5 rows sum up to 1000, which is again the grand total at the bottom.

Let’s plot the percentiles now as line and bar charts.

We can “see” that there is no strong correlation between rating and length. Both data sets are close to uniformly distributed, quite independently of the rating, with the exception of PG-13, which is just slightly skewed towards longer film lengths.

Again, this isn’t terribly interesting as the data set was generated, probably using some randomness to avoid perfectly uniform distribution. In real world scenarios, the above data would have been more “skewed”.

### How does this help with performance?

A balanced tree index is very useful when data is quite uniformly distributed, because in that case, it can help access data points or ranges of data in O(log(N)) time. This is quite a useful property for queries that look for film_id values, e.g.

```sql
SELECT *
FROM film
WHERE film_id = 1
```

When accessing “skewed” data, some values are more equal than others. This means that for example if we’re looking for amounts in the payment table, these two queries are not the same:

```sql
-- A lot of rows returned (3644)
SELECT * FROM payment WHERE amount BETWEEN 0 AND 2;

-- Few rows returned (361)
SELECT * FROM payment WHERE amount BETWEEN 9 AND 11;
```

An index on the amount column could have been useful for the second query, but maybe not for the first one.

There are several things we can do to make sure optimal index usage is being applied for all sorts of queries. In case of uniformly distributed data, we usually don’t have to do anything as SQL developers. In case of “skewed” data sets, it may be worth thinking about:

• Using histogram statistics
• Hinting the optimiser (in Oracle or SQL Server)
• Avoiding bind variables (only in extreme cases)

### Conclusion

Not all data sets are equal. They are often “skewed”. By “skewed”, in SQL, we don’t mean the statistical meaning of a normal distribution being asymmetric. We mean that a distribution is not uniform; in that sense, even a normal distribution is “skewed”. In a non-uniform distribution, some values appear way more often than others. Some examples:

Uniform distribution

• Surrogate keys generated from sequences (consecutive)
• Surrogate keys generated from UUIDs (random)
• Foreign keys on one-to-one relationships

Slight “skew”

Possibly significant “skew”

This really depends on the actual data set, but do expect significant “skew” in these data types:

• Foreign keys on to-many relationships (e.g. some customers have more assets than others)
• Numeric values (e.g. amount)
• Codes and other discrete values (e.g. film rating, payment settlement codes, etc.)

This article has shown how we can use simple SQL aggregate functions, including the percentiles, to calculate and visualise such “skew”.

## How to Work Around ORA-38104: Columns referenced in the ON Clause cannot be updated

Standard SQL is a beautiful language. Vendor-specific implementations, however, have their warts. In Oracle, for example, it’s not possible in a MERGE statement to update any columns that have been referenced by the ON clause. For example:

```sql
CREATE TABLE person (
id NUMBER(18) NOT NULL PRIMARY KEY,
user_name VARCHAR2(50) NOT NULL UNIQUE,
score NUMBER(18)
);
```

Now, in MySQL, we can run a non-standard `INSERT .. ON DUPLICATE KEY UPDATE` statement like this:

```sql
INSERT INTO person (id, user_name, score)
VALUES (1, 'foo', 100)
ON DUPLICATE KEY UPDATE
  user_name = 'foo', score = 100;
```

Behind the scenes, MySQL will check all unique constraints for duplicates and reject the insert, replacing it by the update statement instead. It’s debatable whether this is really useful (ideally, we want to check only a single unique constraint for duplicates), but that’s what MySQL offers.

If we want the same behaviour in Oracle, we could use the `MERGE` statement:

```sql
MERGE INTO person t
USING (
SELECT 1 id, 'foo' user_name, 100 score
FROM dual
) s
ON (t.id = s.id OR t.user_name = s.user_name)
WHEN MATCHED THEN UPDATE
SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
VALUES (s.id, s.user_name, s.score)
```

That looks reasonable, but it doesn’t work. We’ll get:

```
SQL Error: ORA-38104: Columns referenced in the ON Clause cannot be updated: "T"."USER_NAME"
```

Obviously, this is some protection against the situation where such an update would suddenly move a row from the matched to the not matched group. In this particular example, it might not look like something that could cause problems, but if vendor-specific extensions such as the `WHERE` or `DELETE` clauses were used, things might look different.

However, the parser is not very smart, in fact, it is almost not smart at all. While it detects extremely silly attempts at circumventing this limitation, such as this:

```sql
MERGE INTO person t
USING (
SELECT 1 id, 'foo' user_name, 100 score
FROM dual
) s
-- Circumvention attempt here: NVL()
ON (t.id = s.id OR nvl(t.user_name, null) = s.user_name)
WHEN MATCHED THEN UPDATE
SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
VALUES (s.id, s.user_name, s.score)
```

It does not detect any of these attempts:

### Using row value expressions

```sql
MERGE INTO person t
USING (
SELECT 1 id, 'foo' user_name, 100 score
FROM dual
) s
ON (t.id = s.id OR
-- Circumvention attempt here: row value expressions
(t.user_name, 'dummy') = ((s.user_name, 'dummy')))
WHEN MATCHED THEN UPDATE
SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
VALUES (s.id, s.user_name, s.score)
```

Seemingly without any penalty on the execution plan. Both indexes are being used:

```
---------------------------------------------------------------------------
| Id  | Operation                               | Name            | Rows  |
---------------------------------------------------------------------------
|   0 | MERGE STATEMENT                         |                 |     1 |
|   1 |  MERGE                                  | PERSON          |       |
|   2 |   VIEW                                  |                 |       |
|   3 |    NESTED LOOPS OUTER                   |                 |     1 |
|   4 |     FAST DUAL                           |                 |     1 |
|   5 |     VIEW                                | VW_LAT_8626BD41 |     1 |
|   6 |      TABLE ACCESS BY INDEX ROWID BATCHED| PERSON          |     1 |
|   7 |       BITMAP CONVERSION TO ROWIDS       |                 |       |
|   8 |        BITMAP OR                        |                 |       |
|   9 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 10 |          INDEX RANGE SCAN               | SYS_C00106110   |       |
|  11 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 12 |          INDEX RANGE SCAN               | SYS_C00106111   |       |
---------------------------------------------------------------------------
```

### Correlated subquery

```sql
MERGE INTO person t
USING (
SELECT 1 id, 'foo' user_name, 100 score
FROM dual
) s
ON (t.id = s.id OR
-- Circumvention attempt here: correlated subquery
(SELECT t.user_name FROM dual) = s.user_name)
WHEN MATCHED THEN UPDATE
SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
VALUES (s.id, s.user_name, s.score)
```

This seems to prevent any index usage, and should thus be avoided:

```
----------------------------------------------------------
| Id  | Operation              | Name            | Rows  |
----------------------------------------------------------
|   0 | MERGE STATEMENT        |                 |     1 |
|   1 |  MERGE                 | PERSON          |       |
|   2 |   VIEW                 |                 |       |
|   3 |    NESTED LOOPS OUTER  |                 |     1 |
|   4 |     FAST DUAL          |                 |     1 |
|   5 |     VIEW               | VW_LAT_1846A928 |     1 |
|*  6 |      FILTER            |                 |       |
|   7 |       TABLE ACCESS FULL| PERSON          |     1 |
|   8 |       FAST DUAL        |                 |     1 |
----------------------------------------------------------
```

### Using NVL() and updating a view instead

Just plain simple usage of NVL() inside of the ON clause didn’t work before. The parser was smart enough to detect that. But it isn’t smart enough to detect NVL() inside of a view / derived table.

```sql
MERGE INTO (
SELECT id, user_name, nvl(user_name, null) n, score
FROM person
) t
USING (
SELECT 1 id, 'foo' user_name, 100 score
FROM dual
) s
-- Circumvention attempt here: renamed column
ON (t.id = s.id OR t.n = s.user_name)
WHEN MATCHED THEN UPDATE
SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
VALUES (s.id, s.user_name, s.score)
```

Notice that both `USER_NAME` and `N` columns are the same thing, but the parser doesn’t notice this and thinks we’re fine.

The execution plan is still optimal, as Oracle seems to have a way to optimise NVL() expressions (but not coalesce and others!):

```
---------------------------------------------------------------------------
| Id  | Operation                               | Name            | Rows  |
---------------------------------------------------------------------------
|   0 | MERGE STATEMENT                         |                 |     1 |
|   1 |  MERGE                                  | PERSON          |       |
|   2 |   VIEW                                  |                 |       |
|   3 |    NESTED LOOPS OUTER                   |                 |     1 |
|   4 |     FAST DUAL                           |                 |     1 |
|   5 |     VIEW                                | VW_LAT_46651921 |     1 |
|   6 |      TABLE ACCESS BY INDEX ROWID BATCHED| PERSON          |     1 |
|   7 |       BITMAP CONVERSION TO ROWIDS       |                 |       |
|   8 |        BITMAP OR                        |                 |       |
|   9 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 10 |          INDEX RANGE SCAN               | SYS_C00106110   |       |
|  11 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 12 |          INDEX RANGE SCAN               | SYS_C00106111   |       |
---------------------------------------------------------------------------
```

### Using the WHERE clause

If we hadn’t had an `OR` predicate in our `ON` clause, but an `AND` predicate, then we could have used the `WHERE` clause in Oracle. This works:

```sql
-- NOT the same query as the original one!
MERGE INTO person t
USING (
SELECT 1 id, 'foo' user_name, 100 score
FROM dual
) s
ON (t.id = s.id)
WHEN MATCHED THEN UPDATE
SET t.user_name = s.user_name, t.score = 100
WHERE t.user_name = s.user_name
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
VALUES (s.id, s.user_name, s.score);
```

This is not the same query as the original one; I just listed it here for completeness’ sake, and to remind readers that this approach, too, doesn’t seem to use indexes optimally. Only the primary key index (from the `ON` clause) is used; the unique key is not:

```
----------------------------------------------------------------
| Id  | Operation                      | Name          | Rows  |
----------------------------------------------------------------
|   0 | MERGE STATEMENT                |               |     1 |
|   1 |  MERGE                         | PERSON        |       |
|   2 |   VIEW                         |               |       |
|   3 |    NESTED LOOPS OUTER          |               |     1 |
|   4 |     VIEW                       |               |     1 |
|   5 |      FAST DUAL                 |               |     1 |
|   6 |     TABLE ACCESS BY INDEX ROWID| PERSON        |     1 |
|*  7 |      INDEX UNIQUE SCAN         | SYS_C00106110 |     1 |
----------------------------------------------------------------
```

### Careful

Be careful when applying the above workarounds. Assuming that ORA-38104 is a good thing (i.e. that Oracle still thinks it should be enforced), then the above workarounds simply expose bugs in the parser, which should detect such cases. The above behaviour has been observed in Oracle 12c and 18c.

I personally believe that ORA-38104 should be abandoned entirely, and the root cause for this restriction should be removed. But it is certainly worth exploring alternative options rather than relying on the above workarounds in production code, apart from the occasional one-shot migration query, where such loopholes are always nice tools to exploit.

## How to Aggregate an Archive Log’s Deltas into a Snapshot with SQL

A customer of my popular SQL training (which you should book!) has recently challenged me to optimise a hierarchical query that merges an archive log’s deltas in order to obtain a snapshot of some record at a given point in time. In this article, I will reproduce their problem statement in a simplified version and show how this can be done with SQL Server, using a few cool SQL features.

The features used here are all covered in the training, and they were immediately applicable to this problem statement.

### The problem statement

This was their archive design: they designed for uncertainty, meaning that for some entities in their system, they did not know what kinds of attributes would be part of the entity in the future. Given their application design, users could even add their own custom attributes to an entity.

This kind of thing is typically solved with the EAV (Entity Attribute Value) model, a “workaround” to denormalise data sets in SQL databases in the event of such schema uncertainty.

EAV can be implemented in several ways:

### Through classic SQL tables only

An example implementation is this:

```sql
CREATE TABLE eav_classic (
entity_type     VARCHAR (100) NOT NULL,
entity_id       BIGINT        NOT NULL,
attribute_name  VARCHAR (100) NOT NULL,
attribute_type  VARCHAR (100) NOT NULL,
attribute_value VARCHAR (100)     NULL,

CONSTRAINT eav_classic_pk
PRIMARY KEY (entity_type, entity_id, attribute_name)
);
```

The drawbacks of this non-normalised design are immediately obvious. Most importantly, there is no simple way to establish referential integrity. But this may be totally OK, especially for archive logs and for smaller databases (Datomic does something similar).

### Through tables containing JSON or XML data

Whenever you have schema-on-read data, JSON or XML data types may be appropriate, so this is a perfectly valid alternative:

```sql
CREATE TABLE eav_json (
entity_type     VARCHAR (100)   NOT NULL,
entity_id       BIGINT          NOT NULL,
attributes      VARCHAR (10000) NOT NULL
CHECK (ISJSON(attributes) = 1),

CONSTRAINT eav_json_pk
PRIMARY KEY (entity_type, entity_id)
);
```

If your database supports a JSON data type, you will obviously prefer that over the above emulation.

For the rest of this article, I will use the JSON representation.

### Versioning the EAV table

Versioning data in an EAV model is much easier than in a normalised schema: we can just add a version number and/or timestamp to the record. In their case, something like this may make sense:

```sql
CREATE TABLE history (
id          BIGINT IDENTITY (1, 1) NOT NULL PRIMARY KEY,
ts          DATETIME               NOT NULL,
entity_type VARCHAR(100)           NOT NULL,
entity_id   BIGINT                 NOT NULL,
delta       VARCHAR(8000)          NOT NULL
CHECK (ISJSON(delta) = 1)
);

INSERT INTO history (entity_type, entity_id, ts, delta)
VALUES ('Person', 1, '2000-01-01 00:00:00', '{"first_name": "John", "last_name": "Doe"}'),
('Person', 1, '2000-01-01 01:00:00', '{"age": 37}'),
('Person', 1, '2000-01-01 02:00:00', '{"age": 38}'),
('Person', 1, '2000-01-01 03:00:00', '{"city": "New York"}'),
('Person', 1, '2000-01-01 04:00:00', '{"city": "Zurich", "age": null}')
;
```

This table now contains a set of deltas applied to the Person entity with ID = 1. It corresponds to the following sequence of SQL statements on an ordinary entity:

```sql
INSERT INTO person (id, first_name, last_name)
VALUES (1, 'John', 'Doe');
UPDATE person SET age = 37 WHERE id = 1;
UPDATE person SET age = 38 WHERE id = 1;
UPDATE person SET city = 'New York' WHERE id = 1;
UPDATE person SET city = 'Zurich', age = null WHERE id = 1;
```

You could even see their hand-written log as a transaction log of the database system, kinda like what you can extract using products like Golden Gate or Debezium. If you think of the transaction log as an event stream, the RDBMS’s current data representation is like a snapshot that you can get by applying any number of deltas to your tables.
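To make the delta/snapshot idea concrete, here is a minimal Java sketch (the `Snapshot` class is hypothetical, not part of their system) that replays a list of deltas into the current state, treating an explicit `null` as attribute removal, just like the `{"age": null}` delta above:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Snapshot {
    // Replays a list of deltas (oldest first) into a snapshot map.
    // An explicit null value in a delta removes the attribute.
    static Map<String, Object> snapshot(List<Map<String, Object>> deltas) {
        Map<String, Object> state = new LinkedHashMap<>();
        for (Map<String, Object> delta : deltas)
            for (Map.Entry<String, Object> e : delta.entrySet())
                if (e.getValue() == null)
                    state.remove(e.getKey());
                else
                    state.put(e.getKey(), e.getValue());
        return state;
    }

    public static void main(String[] args) {
        Map<String, Object> d1 = new LinkedHashMap<>();
        d1.put("first_name", "John");
        d1.put("last_name", "Doe");
        Map<String, Object> d2 = new LinkedHashMap<>();
        d2.put("age", 37);
        Map<String, Object> d3 = new LinkedHashMap<>();
        d3.put("age", 38);
        Map<String, Object> d4 = new LinkedHashMap<>();
        d4.put("city", "New York");
        Map<String, Object> d5 = new LinkedHashMap<>();
        d5.put("city", "Zurich");
        d5.put("age", null);

        // Same sequence of deltas as in the history table above
        System.out.println(snapshot(List.of(d1, d2, d3, d4, d5)));
        // {first_name=John, last_name=Doe, city=Zurich}
    }
}
```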

Sometimes, you don’t want to completely change your architecture and go full “event sourcing”, but just need this kind of log for a specific set of auditable entities. And for reasons like still supporting very old SQL Server versions, as well as supporting other databases, you may also choose not to use the SQL:2011 temporal table feature, which has been implemented in SQL Server 2016 and more recent versions.

With that out of the way…

### How to access any arbitrary snapshot version?

When we visually process our HISTORY table, we can see that Person ID = 1 had the following values at any given time:

```
TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York
04:00:00    John          Doe                 Zurich
```

Remember, this is always the same record of Person ID = 1, with its snapshots represented at different times along the time axis. The goal here is to be able to find the record of John Doe at any given time.

Again, if we had been using the SQL:2011 temporal table feature, we could write

```sql
-- SQL Server
SELECT *
FROM Person
FOR SYSTEM_TIME AS OF '2000-01-01 02:00:00.0000000';

-- Oracle (flashback query)
SELECT *
FROM Person
AS OF TIMESTAMP TIMESTAMP '2000-01-01 02:00:00'
```

Side note: Do note that Oracle’s flashback query needs to be properly configured:

• Not all data is “flashbackable”
• DDL tends to destroy the archive
• Proper grants are needed to access the flashback archive

Similar limitations may apply in SQL Server.

### What if the RDBMS can’t help us?

If again for some reason, we cannot use the RDBMS’s temporal table features, we’ll roll our own as we’ve seen. So, our query in SQL Server to access the snapshot at any given time may be this:

```sql
SELECT
'{'
+ string_agg(
CASE type WHEN 0 THEN NULL ELSE
'"' + [key] + '": ' +
CASE type WHEN 1 THEN '"' + value + '"' ELSE value END
END, ', ')
+ '}'
FROM (
SELECT *, row_number() OVER (
PARTITION BY [key] ORDER BY ts DESC) rn
FROM history
OUTER APPLY openjson(delta)

-- Apply all deltas prior to any given snapshot
WHERE ts <= '2000-01-01 02:00:00'
) t
WHERE rn = 1;
```

What does this query do? Consider again our deltas at 04:00:00:

```
TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York
04:00:00    John          Doe          -      Zurich
```

Observe how each value in the table falls into one of three categories (colour-coded in the original article):

• Strong, red: The current snapshot’s attribute value, when the last delta was applied to any given attribute
• Strong, black: A previous snapshot’s attribute value, when a previous, superseded delta was applied to any given attribute
• Light grey: A previous snapshot’s attribute value that was inherited from another previous delta

For any given snapshot, we want to find the Strong, red values. E.g. at a previous snapshot time, the color encoding would have been:

At 03:00:00

```
TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York

04:00:00    John          Doe          -      Zurich
```

At 02:00:00

```
TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38

03:00:00    John          Doe          38     New York
04:00:00    John          Doe          -      Zurich
```

So, our query needs to find the delta that was applied last for any given attribute.

With SQL, we can find that easily. We can assign a row number to each delta per attribute in reverse order, something like this:

```
TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John (1)      Doe (1)
01:00:00    John          Doe          37 (3)
02:00:00    John          Doe          38 (2)
03:00:00    John          Doe          38     New York (2)
04:00:00    John          Doe          - (1)  Zurich (1)
```

Once we have that row number, we just filter out only those deltas whose row number is 1. Something like:

```sql
SELECT [key], value, row_number() OVER (
PARTITION BY [key] ORDER BY ts DESC) rn
FROM history OUTER APPLY openjson(delta)
ORDER BY [key], ts;
```

Notice the `OUTER APPLY openjson(delta)` syntax. This just expands the JSON structure into key/value/type columns, which we can use more easily in a SQL query. Other database systems may have similar syntax for similar purposes. The result of the above query is:

```
key        |value    |rn
-----------|---------|---
age        |37       |3
age        |38       |2
age        |         |1
city       |New York |2
city       |Zurich   |1
first_name |John     |1
last_name  |Doe      |1
```

Filtering the ones whose row number is 1:

```sql
SELECT [key], value
FROM (
SELECT ts, [key], value, row_number() OVER (
PARTITION BY [key] ORDER BY ts DESC) rn
FROM history OUTER APPLY openjson(delta)
) t
WHERE rn = 1
ORDER BY ts, [key]
```

This yields:

```
key        |value
-----------|-------
first_name |John
last_name  |Doe
age        |
city       |Zurich
```

Exactly the data we wanted, in key/value form. Notice that this filtering step could have been done with `DISTINCT ON` in PostgreSQL, or with `KEEP (DENSE_RANK FIRST ORDER BY ..)` in Oracle – an exercise which I shall leave to the reader (feel free to leave the solution in the comments!)
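The window function approach boils down to “keep the row with the latest `ts` per key”. A minimal Java sketch of that same reduction (class and record names are made up):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LatestPerKey {
    record Delta(int ts, String key, String value) {}

    // Keeps, per key, the row with the highest ts, i.e. exactly the
    // rn = 1 rows of the row_number() approach
    static Map<String, String> latest(List<Delta> rows) {
        Map<String, Delta> best = new HashMap<>();
        for (Delta d : rows)
            best.merge(d.key(), d, (a, b) -> a.ts() >= b.ts() ? a : b);
        Map<String, String> result = new TreeMap<>();
        best.forEach((k, d) -> result.put(k, d.value()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(latest(List.of(
            new Delta(0, "first_name", "John"),
            new Delta(1, "age", "37"),
            new Delta(2, "age", "38"),
            new Delta(4, "age", null),
            new Delta(4, "city", "Zurich"))));
        // {age=null, city=Zurich, first_name=John}
    }
}
```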

And now, finally, we just re-assemble the JSON using the SQL Server 2017 `STRING_AGG` function. PostgreSQL would offer us `JSON_AGG` here, and Oracle has `JSON_OBJECTAGG`. With `STRING_AGG`, you have to take care of manually escaping all values according to JSON syntax rules, which is bad. In my example, I just replaced `"` by `\"`. Other characters need escaping too, so if there is a built-in feature, use that instead of string processing.

The `STRING_AGG` function aggregates a `CASE` expression which translates different JSON data types into different formats, where:

• 0 is NULL (and nulls are not aggregated)
• 1 is string
• everything else can be taken at its value for simplicity, e.g. numbers or booleans

Every value (except nulls) is prefixed by the JSON object’s attribute name (“key”).
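The `CASE` logic above can be sketched in Java as well (the `JsonAgg` class is made up). Unlike the quick fix in the article, this sketch also escapes backslashes, which proper JSON escaping requires; a built-in JSON function would still handle more cases:

```java
public class JsonAgg {
    // Mirrors the CASE expression: type 0 is SQL NULL (skipped by the
    // aggregate), type 1 is a string (quoted and escaped), anything
    // else (numbers, booleans) is emitted as-is
    static String entry(String key, String value, int type) {
        if (type == 0)
            return null;
        String v = type == 1 ? "\"" + escape(value) + "\"" : value;
        return "\"" + escape(key) + "\": " + v;
    }

    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        String json = "{" + String.join(", ",
            entry("first_name", "John", 1),
            entry("age", "38", 2)) + "}";
        System.out.println(json); // {"first_name": "John", "age": 38}
    }
}
```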

```sql
SELECT
'{'
+ string_agg(
CASE type WHEN 0 THEN NULL ELSE
'"' + replace([key], '"', '\"') + '": ' +
CASE type WHEN 1 THEN '"' + replace(value, '"', '\"') + '"' ELSE value END
END, ', ')
+ '}'
FROM (
SELECT *, row_number() OVER (
PARTITION BY [key] ORDER BY ts DESC) rn
FROM history
OUTER APPLY openjson(delta)

-- Apply all deltas prior to any given snapshot
WHERE ts <= '2000-01-01 04:00:00'
) t
WHERE rn = 1;
```

This produces

```
{"city": "Zurich", "first_name": "John", "last_name": "Doe"}
```

A final query, that gets us the entire history of snapshots (watch the performance on this one, could definitely be optimised):

```sql
SELECT ts, (
SELECT
'{'
+ string_agg(
CASE type WHEN 0 THEN NULL ELSE
'"' + replace([key], '"', '\"') + '": ' +
CASE type WHEN 1 THEN '"' + replace(value, '"', '\"') + '"' ELSE value END
END, ', ')
+ '}'
FROM (
SELECT *, row_number() OVER (
PARTITION BY [key] ORDER BY ts DESC) rn
FROM history
OUTER APPLY openjson(delta)

-- Apply all deltas prior to any given snapshot
WHERE ts <= x.ts
) t
WHERE rn = 1
)
FROM history x
GROUP BY ts;
```

It yields:

```
ts       |
---------|--------------------------------------------------------------------------
00:00:00 |{"first_name": "John", "last_name": "Doe"}
01:00:00 |{"age": 37, "first_name": "John", "last_name": "Doe"}
02:00:00 |{"age": 38, "first_name": "John", "last_name": "Doe"}
03:00:00 |{"age": 38, "city": "New York", "first_name": "John", "last_name": "Doe"}
04:00:00 |{"city": "Zurich", "first_name": "John", "last_name": "Doe"}
```

So, the complete history of all the snapshot versions of the Person with ID = 1.

Very cool, and definitely good enough for their archive / audit query requirements.

## Writing Custom Aggregate Functions in SQL Just Like a Java 8 Stream Collector

All SQL databases support the standard aggregate functions `COUNT()`, `SUM()`, `AVG()`, `MIN()`, `MAX()`.

Some databases support other aggregate functions, like:

• `EVERY()`
• `STDDEV_POP()`
• `STDDEV_SAMP()`
• `VAR_POP()`
• `VAR_SAMP()`
• `ARRAY_AGG()`
• `STRING_AGG()`

But what if you want to roll your own?

### Java 8 Stream Collector

When using Java 8 streams, we can easily roll our own aggregate function (i.e. a `Collector`). Let’s assume we want to find the second highest value in a stream. The highest value can be obtained like this:

```System.out.println(
    Stream.of(1, 2, 3, 4)
          .collect(Collectors.maxBy(Integer::compareTo))
);
```

Yielding:

```Optional[4]
```

Now, what about the second highest value? We can write the following collector:

```System.out.println(
    Stream.of(1, 6, 2, 3, 4, 4, 5).parallel()
          .collect(Collector.of(

              // Supplier
              () -> new int[] {
                  Integer.MIN_VALUE,
                  Integer.MIN_VALUE
              },

              // Accumulator
              (a, i) -> {
                  if (a[0] < i) {
                      a[1] = a[0];
                      a[0] = i;
                  }
                  else if (a[1] < i)
                      a[1] = i;
              },

              // Combiner
              (a1, a2) -> {
                  if (a2[0] > a1[0]) {
                      a1[1] = a1[0];
                      a1[0] = a2[0];

                      if (a2[1] > a1[1])
                          a1[1] = a2[1];
                  }
                  else if (a2[0] > a1[1])
                      a1[1] = a2[0];

                  return a1;
              },

              // Finisher
              a -> a[1]
          ))
);
```

It doesn’t do anything fancy. It has these 4 functions:

• Supplier<int[]>: A supplier that provides an intermediary `int[]` of length 2, each element initialised with `Integer.MIN_VALUE`. This array will remember the `MAX()` value in the stream at position 0 and the `SECOND_MAX()` value in the stream at position 1
• BiConsumer<int[], Integer>: An accumulator that accumulates new values from the stream into our intermediary data structure.
• BinaryOperator<int[]>: A combiner that combines two intermediary data structures. This is used for parallel streams only.
• Function<int[], Integer>: The finisher function that extracts the `SECOND_MAX()` value from the second position in our intermediary array.

The output is now:

```5
```

### How to do the same thing with SQL?

Many SQL databases offer a very similar way of calculating custom aggregate functions. Here’s how to do the exact same thing with…

Oracle:

With the usual syntactic ceremony…

```CREATE TYPE u_second_max AS OBJECT (

  -- Intermediary data structure
  MAX NUMBER,
  SECMAX NUMBER,

  -- Corresponds to the Collector.supplier() function
  STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT u_second_max) RETURN NUMBER,

  -- Corresponds to the Collector.accumulator() function
  MEMBER FUNCTION ODCIAggregateIterate(self IN OUT u_second_max, value IN NUMBER) RETURN NUMBER,

  -- Corresponds to the Collector.combiner() function
  MEMBER FUNCTION ODCIAggregateMerge(self IN OUT u_second_max, ctx2 IN u_second_max) RETURN NUMBER,

  -- Corresponds to the Collector.finisher() function
  MEMBER FUNCTION ODCIAggregateTerminate(self IN u_second_max, returnValue OUT NUMBER, flags IN NUMBER) RETURN NUMBER
)
/

-- This is our "collector" implementation
CREATE OR REPLACE TYPE BODY u_second_max IS
  STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT u_second_max)
  RETURN NUMBER IS
  BEGIN
    SCTX := U_SECOND_MAX(0, 0);
    RETURN ODCIConst.Success;
  END;

  MEMBER FUNCTION ODCIAggregateIterate(self IN OUT u_second_max, value IN NUMBER) RETURN NUMBER IS
  BEGIN
    IF VALUE > SELF.MAX THEN
      SELF.SECMAX := SELF.MAX;
      SELF.MAX := VALUE;
    ELSIF VALUE > SELF.SECMAX THEN
      SELF.SECMAX := VALUE;
    END IF;
    RETURN ODCIConst.Success;
  END;

  MEMBER FUNCTION ODCIAggregateTerminate(self IN u_second_max, returnValue OUT NUMBER, flags IN NUMBER) RETURN NUMBER IS
  BEGIN
    RETURNVALUE := SELF.SECMAX;
    RETURN ODCIConst.Success;
  END;

  MEMBER FUNCTION ODCIAggregateMerge(self IN OUT u_second_max, ctx2 IN u_second_max) RETURN NUMBER IS
  BEGIN
    IF CTX2.MAX > SELF.MAX THEN
      SELF.SECMAX := SELF.MAX;
      SELF.MAX := CTX2.MAX;

      IF CTX2.SECMAX > SELF.SECMAX THEN
        SELF.SECMAX := CTX2.SECMAX;
      END IF;
    ELSIF CTX2.MAX > SELF.SECMAX THEN
      SELF.SECMAX := CTX2.MAX;
    END IF;

    RETURN ODCIConst.Success;
  END;
END;
/

-- Finally, we have to give this aggregate function a name
CREATE FUNCTION SECOND_MAX (input NUMBER) RETURN NUMBER
  PARALLEL_ENABLE AGGREGATE USING u_second_max;
/
```

We can now run the above on the Sakila database:

```SELECT
max(film_id),
second_max(film_id)
FROM film;
```

To get:

```MAX     SECOND_MAX
------------------
1000    999
```

And what’s even better, we can use the aggregate function as a window function for free!

```SELECT
film_id,
length,
max(film_id) OVER (PARTITION BY length),
second_max(film_id) OVER (PARTITION BY length)
FROM film
ORDER BY length, film_id;
```

The above yields:

```FILM_ID  LENGTH  MAX   SECOND_MAX
---------------------------------
15       46      730   505
469      46      730   505
504      46      730   505
505      46      730   505
730      46      730   505
237      47      869   784
247      47      869   784
393      47      869   784
398      47      869   784
407      47      869   784
784      47      869   784
869      47      869   784
2        48      931   866
410      48      931   866
575      48      931   866
630      48      931   866
634      48      931   866
657      48      931   866
670      48      931   866
753      48      931   866
845      48      931   866
866      48      931   866
931      48      931   866
```

Beautiful, right?

PostgreSQL

PostgreSQL supports a slightly more concise syntax in the `CREATE AGGREGATE` statement. If we don’t allow for parallelism, we can write this minimal implementation:

```CREATE FUNCTION second_max_sfunc (
  state INTEGER[], data INTEGER
) RETURNS INTEGER[] AS
$$
BEGIN
  IF state IS NULL THEN
    RETURN ARRAY[data, NULL];
  ELSE
    RETURN CASE
      WHEN state[1] > data
      THEN CASE
        WHEN state[2] > data
        THEN state
        ELSE ARRAY[state[1], data]
      END
      ELSE ARRAY[data, state[1]]
    END;
  END IF;
END;
$$ LANGUAGE plpgsql;

CREATE FUNCTION second_max_ffunc (
  state INTEGER[]
) RETURNS INTEGER AS
$$
BEGIN
  RETURN state[2];
END;
$$ LANGUAGE plpgsql;

CREATE AGGREGATE second_max (INTEGER) (
  SFUNC     = second_max_sfunc,
  STYPE     = INTEGER[],
  FINALFUNC = second_max_ffunc
);
```

Here, we use the `STYPE` (`Collector.supplier()`), the `SFUNC` (`Collector.accumulator()`), and the `FINALFUNC` (`Collector.finisher()`) specifications.

Other databases

Many other databases allow for specifying user-defined aggregate functions. Look up your database manual for details. They generally work in the same way as a Java 8 `Collector`.
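For a quick taste without any DDL, Python's built-in sqlite3 driver exposes the same concept through `create_aggregate()`: the aggregate class plays the role of the Java 8 `Collector`, with the constructor as supplier, `step()` as accumulator, and `finalize()` as finisher (there is no combiner, since SQLite doesn't parallelise aggregation). A minimal `SECOND_MAX` sketch:

```python
import sqlite3

class SecondMax:
    """Constructor = supplier, step() = accumulator, finalize() = finisher."""

    def __init__(self):
        self.max = self.secmax = float('-inf')

    def step(self, value):
        # Same logic as the Java accumulator
        if value > self.max:
            self.secmax, self.max = self.max, value
        elif value > self.secmax:
            self.secmax = value

    def finalize(self):
        return None if self.secmax == float('-inf') else self.secmax

conn = sqlite3.connect(':memory:')
conn.create_aggregate('second_max', 1, SecondMax)
conn.execute('CREATE TABLE v (i INTEGER)')
conn.executemany('INSERT INTO v VALUES (?)', [(i,) for i in (1, 6, 2, 3, 4, 4, 5)])
print(conn.execute('SELECT max(i), second_max(i) FROM v').fetchone())
# (6, 5)
```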

## How to Use SQL UPDATE .. RETURNING to Run DML More Efficiently

At a customer site, I recently refactored a “slow-by-slow” PL/SQL loop and turned that into an efficient set based `UPDATE` statement saving many lines of code and running much faster. In this blog post, I will show how that can be done. The blog post will focus on Oracle and `UPDATE`, but rest assured, this technique can be implemented in other databases too, and also with other DML statements, such as `INSERT`, `DELETE`, and depending on the vendor, even `MERGE`.

### The Schema

The original logic that needed refactoring worked on the following data set (simplified for this blog post):

```-- Table definition
CREATE TABLE t (
  id NUMBER(10) GENERATED ALWAYS AS IDENTITY NOT NULL PRIMARY KEY,
  category NUMBER(10) NOT NULL,
  counter NUMBER(10),
  text VARCHAR2(10) NOT NULL
);

-- Sample data
INSERT INTO t (category, text)
SELECT dbms_random.value(1, 10), dbms_random.string('a', 10)
FROM dual
CONNECT BY level <= 100;

-- Output of data
SELECT *
FROM t
ORDER BY counter DESC NULLS LAST, category, id;
```

The sample data generated above might look like this:

```ID   CATEGORY   COUNTER   TEXT
16   1                    UIXSzJxDez
25   1                    hkvvrTRbTC
29   1                    IBOJYveDgf
44   1                    VhcwOugrWB
46   1                    gBJFJrPQYy
47   1                    bVzfHznOUj
10   2                    KpHHgsRXwR
11   2                    vpkhTrkaaU
14   2                    fDlNtRdvBE
```

So, there were certain records belonging to some category, and there’s a counter indicating how often each record has been encountered in some system.

### The “slow-by-slow” PL/SQL Logic

(“slow-by-slow” rhymes with “row-by-row”. You get the idea)

Every now and then, there was a message from another system that should:

• Fetch all the rows of a category
• Increase the counter on each element of that category
• Concatenate all the texts of that category and return those

Sounds like something that can be done very easily using a loop. In PL/SQL (but imagine you could be doing this in Java just the same):

```SET SERVEROUTPUT ON
DECLARE
  v_text VARCHAR2(2000);
  v_updated PLS_INTEGER := 0;
BEGIN
  FOR r IN (
    SELECT * FROM t WHERE category = 1
  ) LOOP
    v_updated := v_updated + 1;

    IF v_text IS NULL THEN
      v_text := r.text;
    ELSE
      v_text := v_text || ', ' || r.text;
    END IF;

    IF r.counter IS NULL THEN
      UPDATE t SET counter = 1 WHERE id = r.id;
    ELSE
      UPDATE t SET counter = counter + 1 WHERE id = r.id;
    END IF;
  END LOOP;

  COMMIT;
  dbms_output.put_line('Rows updated: ' || v_updated);
  dbms_output.put_line('Returned:     ' || v_text);
END;
/
```

The result of this block would be:

```Rows updated: 6
Returned:     UIXSzJxDez, hkvvrTRbTC, IBOJYveDgf, VhcwOugrWB, gBJFJrPQYy, bVzfHznOUj
```

And the data is now:

```ID   CATEGORY   COUNTER   TEXT
16   1          1         UIXSzJxDez
25   1          1         hkvvrTRbTC
29   1          1         IBOJYveDgf
44   1          1         VhcwOugrWB
46   1          1         gBJFJrPQYy
47   1          1         bVzfHznOUj
10   2                    KpHHgsRXwR
11   2                    vpkhTrkaaU
14   2                    fDlNtRdvBE
```

Wonderful. What’s wrong with this? The logic is straightforward and runs quite quickly. Until you run this many many many times per second – then it suddenly starts to hurt.

### Thinking Set Based

Whenever you work with RDBMS, try to think in terms of data sets and try running a bulk operation on such a data set. (Exceptions exist, see caveats below). The modification of the data can be written in a single SQL statement, instead of updating the same table many times.

Here’s the SQL statement in Oracle, that does precisely the same thing:

```SET SERVEROUTPUT ON
DECLARE
  v_text VARCHAR2(2000);
  v_updated PLS_INTEGER := 0;
BEGIN
  UPDATE t
  SET counter = nvl(counter, 0) + 1
  WHERE category = 1
  RETURNING
    listagg (text, ', ') WITHIN GROUP (ORDER BY text),
    count(*)
  INTO
    v_text,
    v_updated;

  COMMIT;
  dbms_output.put_line('Rows updated: ' || v_updated);
  dbms_output.put_line('Returned:     ' || v_text);
END;
/
```

Again, the same output:

```Rows updated: 6
Returned:     UIXSzJxDez, hkvvrTRbTC, IBOJYveDgf, VhcwOugrWB, gBJFJrPQYy, bVzfHznOUj
```

And the data set is now:

```ID   CATEGORY   COUNTER   TEXT
16   1          2         UIXSzJxDez
25   1          2         hkvvrTRbTC
29   1          2         IBOJYveDgf
44   1          2         VhcwOugrWB
46   1          2         gBJFJrPQYy
47   1          2         bVzfHznOUj
10   2                    KpHHgsRXwR
11   2                    vpkhTrkaaU
14   2                    fDlNtRdvBE
```

Below, you can see each piece of logic of the original PL/SQL block, and the corresponding logic in the revised SQL statement.

There are 4 areas of interest:

1. Red: The category predicate
In the PL/SQL version, this predicate is a simple access predicate for the `SELECT` statement, over whose implicit cursor we’re iterating. In the set based SQL version, that predicate has been moved into the single bulk `UPDATE` statement. Thus: we’re modifying the exact same set of rows.
2. Blue: The number of updated rows
Before, we had a count variable that counted the number of iterations over the implicit cursor. Now, we can simply count the number of rows being updated in the bulk update statement, conveniently in the `RETURNING` clause. An alternative (in Oracle) would have been to use `SQL%ROWCOUNT`, which is available for free after a single bulk `UPDATE` statement.
3. Orange: The string concatenation
The requirement was to concatenate all the texts which are being updated. In the “slow-by-slow” PL/SQL approach, we’re again keeping around a local variable and concatenate new values to it, doing some `NULL` handling, initially. In the set based SQL version, we can simply use `LISTAGG()` in the `RETURNING` clause. Notice, there seems to be a bug with this usage of `LISTAGG`. The `ORDER BY` clause has no effect.
4. Green: The actual update
In the “slow-by-slow” version, we run 1 `UPDATE` statement per row, which can turn out to be devastating, if we’re updating a lot of rows. Besides, in this particular case, the developer(s) have been unaware of the possibility of `NULL` handling using `NVL()` (or `COALESCE()` or similar). There is really only one `UPDATE` statement necessary here.

That already looks a lot neater.

### How does it perform?

In a quick test script, which I’ve linked here, I could observe the following times for the above test data set, when running each approach 5 x 10000 times:

```Run 1, Statement 1 : 2.63841 (avg : 2.43714)
Run 1, Statement 2 : 1.11019 (avg : 1.04562)
Run 2, Statement 1 : 2.35626 (avg : 2.43714)
Run 2, Statement 2 : 1.05716 (avg : 1.04562)
Run 3, Statement 1 : 2.38004 (avg : 2.43714)
Run 3, Statement 2 : 1.05153 (avg : 1.04562)
Run 4, Statement 1 : 2.47451 (avg : 2.43714)
Run 4, Statement 2 : 1.00921 (avg : 1.04562)
Run 5, Statement 1 : 2.33649 (avg : 2.43714)
Run 5, Statement 2 : 1.00000 (avg : 1.04562)
```

As always, I’m not publishing actual benchmark times, but relative times compared to the fastest run. The set based approach is consistently about 2.3x faster on my machine (Oracle 18c on Docker on Windows 10 / SSD). This is updating 6 rows per execution.

When we remove the `WHERE category = 1` predicate, updating the entirety of the 100 rows each time, we get even more drastic results. I’m now running this 5 x 2000 times to get:

```Run 1, Statement 1 : 10.21833 (avg : 11.98154)
Run 1, Statement 2 : 1.219130 (avg : 1.739260)
Run 2, Statement 1 : 10.17014 (avg : 11.98154)
Run 2, Statement 2 : 3.027930 (avg : 1.739260)
Run 3, Statement 1 : 9.444620 (avg : 11.98154)
Run 3, Statement 2 : 1.000000 (avg : 1.739260)
Run 4, Statement 1 : 20.54692 (avg : 11.98154)
Run 4, Statement 2 : 1.193560 (avg : 1.739260)
Run 5, Statement 1 : 9.527690 (avg : 11.98154)
Run 5, Statement 2 : 2.255680 (avg : 1.739260)
```

At this point, no one needs to be convinced anymore that a set based approach is much better for updating your data than a row-by-row approach in a language like PL/SQL or Java, etc.

### Caveats

Bulk updates are much better than row-by-row (remember: “slow-by-slow”) updates, regardless of whether you’re using PL/SQL or Java or some other client language. This is because the optimiser can plan the update much more efficiently when it knows which rows will be updated in bulk, rather than seeing each individual row update afresh, unable to plan ahead for the remaining updates.

However, in situations where a lot of other processes are reading the same data while you’re bulk updating them, you need to be more careful. In such cases, a bulk update can cause trouble keeping locks and log files busy while you’re updating and while the other processes need to access the data prior to your update.

One size never fits all, but at least, in every situation where you loop over a result set to update some data (or fetch additional data), ask yourself: Could I have written that logic in a single SQL statement? The answer is very often: Yes.

### Other databases

A few other databases support similar language features. These include:

• DB2: Implements the SQL standard (see below)
• Firebird: Exactly like Oracle: `RETURNING`
• PostgreSQL: Exactly like Oracle: `RETURNING`
• SQL Server: Similar, with the slightly less powerful `OUTPUT` clause

The DB2 syntax is quite noteworthy, because:

• It is very elegant
• It corresponds to the SQL standard
```SELECT
  listagg (text, ', ') WITHIN GROUP (ORDER BY id),
  count(*)
FROM FINAL TABLE (
  UPDATE t
  SET counter = nvl(counter, 0) + 1
  WHERE category = 1
)
```
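Note that the PostgreSQL-style `RETURNING` clause produces one row per updated row, rather than Oracle's aggregated `INTO` variables, so any concatenation happens on the client. Here's a sketch against SQLite, which also supports `RETURNING` as of version 3.35 (bundled with recent Python builds):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE t (
        id INTEGER PRIMARY KEY,
        category INTEGER NOT NULL,
        counter INTEGER,
        text TEXT NOT NULL
    )
""")
conn.executemany(
    'INSERT INTO t (category, text) VALUES (?, ?)',
    [(1, 'UIXSzJxDez'), (1, 'hkvvrTRbTC'), (2, 'KpHHgsRXwR')]
)

# One set based UPDATE instead of a row-by-row loop; RETURNING hands
# back the updated rows, which we aggregate on the client
rows = conn.execute("""
    UPDATE t
    SET counter = coalesce(counter, 0) + 1
    WHERE category = 1
    RETURNING text
""").fetchall()

print('Rows updated:', len(rows))
print('Returned:    ', ', '.join(sorted(r[0] for r in rows)))
```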

## How to Write a Multiplication Aggregate Function in SQL

Everyone knows the SQL SUM() aggregate function (and many people also know its window function variant).

When querying the Sakila database, we can get the daily revenue (using PostgreSQL syntax):

```WITH p AS (
  SELECT
    CAST (payment_date AS DATE) AS date,
    amount
  FROM payment
)
SELECT
  date,
  SUM (amount) AS daily_revenue,
  SUM (SUM (amount)) OVER (ORDER BY date) AS cumulative_revenue
FROM p
GROUP BY date
ORDER BY date
```

The result will look something like this:

```date       |daily_revenue |cumulative_revenue
-----------|--------------|-------------------
2005-05-24 |29.92         |29.92
2005-05-25 |573.63        |603.55
2005-05-26 |754.26        |1357.81
2005-05-27 |685.33        |2043.14
2005-05-28 |804.04        |2847.18
2005-05-29 |648.46        |3495.64
2005-05-30 |628.42        |4124.06
2005-05-31 |700.37        |4824.43
2005-06-14 |57.84         |4882.27
...
```

### Doing the same with multiplication

This is already quite useful. Very occasionally, however, we do not need to aggregate multiple values in a sum (through addition), but in a product (through multiplication). I’ve just stumbled upon such a case on Stack Overflow, recently.

The question wanted to achieve the following result:

```date        factor          accumulated
---------------------------------------
1986-01-10  null            1000
1986-01-13  -0.026595745    973.4042548
1986-01-14  0.005464481     978.7234036
1986-01-15  -0.016304348    962.7659569
1986-01-16  0               962.7659569
1986-01-17  0               962.7659569
1986-01-20  0               962.7659569
1986-01-21  0.005524862     968.0851061
1986-01-22  -0.005494506    962.765957
1986-01-23  0               962.765957
1986-01-24  -0.005524862    957.4468078
1986-01-27  0.005555556     962.7659569
1986-01-28  0               962.7659569
1986-01-29  0               962.7659569
1986-01-30  0               962.7659569
1986-01-31  0.027624309     989.3617013
1986-02-03  0.016129032     1005.319148
1986-02-04  0.042328041     1047.872338
1986-02-05  0.04568528      1095.744679
```

If this were a Microsoft Excel spreadsheet, the `ACCUMULATED` column would simply start with 1000 and have the following formula in all other rows:

```accumulated(i) = accumulated(i - 1) * (1 + factor)
```

In other words (values truncated for simplicity):

```1000.0 = start
973.4 = 1000.0 * (1 - 0.026)
978.7 =  973.4 * (1 + 0.005)
962.7 =  978.7 * (1 - 0.016)
962.7 =  962.7 * (1 - 0.000)
962.7 =  962.7 * (1 - 0.000)
962.7 =  962.7 * (1 - 0.000)
968.0 =  962.7 * (1 + 0.005)
...
```
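Computed procedurally, the recurrence is a trivial loop; here's a tiny Python sketch using the first few factors from the table above:

```python
# The first few factors from the table above; None is the starting row
factors = [None, -0.026595745, 0.005464481, -0.016304348]

accumulated = [1000.0]
for f in factors[1:]:
    # accumulated(i) = accumulated(i - 1) * (1 + factor)
    accumulated.append(accumulated[-1] * (1 + f))

print([round(a, 2) for a in accumulated])
# [1000.0, 973.4, 978.72, 962.77]
```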

This is exciting because we’re not only requiring multiplicative aggregation, but even cumulative multiplicative aggregation. So, another window function.

But regrettably, SQL doesn’t offer a `MUL()` aggregate function, even if it were relatively simple to implement. We have two options:

• Implementing a custom aggregate function (stay tuned for a future blog post)
• Using a trick by summing logarithms, rather than multiplying operands directly

We’re implementing the latter for now. Check out this cool Wikipedia website about logarithmic identities, which we are going to blindly trust. In the middle of it, we have:

```b^x * b^y = b^(x + y)
```

```log_b(x * y) = log_b(x) + log_b(y)
```

How cool is that? And thus:

```x * y = b^(log_b(x) + log_b(y))
```

So, we can define any multiplication in terms of exponentiation and logarithms to some base (say, e). Or, in SQL:

```x * y = EXP(LN(x) + LN(y))
```

Or, as an aggregate function:

```MUL(x) = EXP(SUM(LN(x)))
```

Heh!

Our original problem can thus be solved very easily using this, as shown in my stack overflow answer:

```SELECT
  date,
  factor,
  1000 * EXP(SUM(LN(1 + COALESCE(factor, 0)))
         OVER (ORDER BY date)) AS accumulated
FROM t
```

And we get the nice result as previously shown. You may have to replace `LN()` by `LOG()` depending on your database.
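The identity is easy to sanity-check outside the database; a minimal Python sketch of the emulated `MUL()` aggregate:

```python
import math

def mul(values):
    # MUL(x) = EXP(SUM(LN(x))), valid for strictly positive inputs only
    return math.exp(sum(math.log(v) for v in values))

print(mul([2, 3, 4]))   # ~24.0, give or take floating point error
print(2 * 3 * 4)        # 24, the exact product for comparison
```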

### Caveat: Negative numbers

Try running this:

```SELECT LN(-1)
```

You’ll get:

```SQL Error [2201E]: ERROR: cannot take logarithm of a negative number
```

Logarithms are defined only for strictly positive numbers, unless your database is capable of handling complex numbers as well, in which case a single zero value would still break the aggregation.

But if your data set is defined to contain only strictly positive numbers, you’ll be fine – give or take some floating point rounding errors. Or, you’ll do some sign handling, which looks like this:

```WITH v(i) AS (VALUES (-2), (-3), (-4))
SELECT
  CASE
    WHEN SUM (CASE WHEN i < 0 THEN -1 END) % 2 < 0
    THEN -1
    ELSE 1
  END * EXP(SUM(LN(ABS(i)))) multiplication1
FROM v;

WITH v(i) AS (VALUES (-2), (-3), (-4), (-5))
SELECT
  CASE
    WHEN SUM (CASE WHEN i < 0 THEN -1 END) % 2 < 0
    THEN -1
    ELSE 1
  END * EXP(SUM(LN(ABS(i)))) multiplication2
FROM v;
```

The above yielding

```multiplication1
--------------------
-23.999999999999993

multiplication2
-------------------
119.99999999999997
```

Close enough.

### Caveat: Zero

Try running this:

```SELECT LN(0)
```

You’ll get:

```SQL Error [2201E]: ERROR: cannot take logarithm of zero
```

Zero is different from negative numbers. A product that has a zero operand is always zero, so we should be able to handle this. We’ll do it in two steps:

• Exclude zero values from the actual aggregation that uses `EXP()` and `LN()`
• Add an additional `CASE` expression that checks if any of the operands is zero

The first step might not be necessary depending on how your database optimiser executes the second step.

```WITH v(i) AS (VALUES (2), (3), (0))
SELECT
  CASE
    WHEN SUM (CASE WHEN i = 0 THEN 1 END) > 0
    THEN 0
    WHEN SUM (CASE WHEN i < 0 THEN -1 END) % 2 < 0
    THEN -1
    ELSE 1
  END * EXP(SUM(LN(ABS(NULLIF(i, 0))))) multiplication
FROM v;
```
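Ported to Python, the same zero and sign handling looks like this (a sketch of the emulated aggregate, not of any particular database's behaviour):

```python
import math

def mul(values):
    # A product emulated via EXP(SUM(LN(ABS(x)))), with the same zero
    # and sign handling as the SQL version above
    if any(v == 0 for v in values):
        return 0.0
    sign = -1.0 if sum(1 for v in values if v < 0) % 2 else 1.0
    return sign * math.exp(sum(math.log(abs(v)) for v in values))

print(mul([-2, -3, -4]))   # ~ -24 (odd number of negative operands)
print(mul([2, 3, 0]))      # 0.0 (a zero operand short-circuits)
```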

### Extension: DISTINCT

Calculating the product of all `DISTINCT` values requires repeating the `DISTINCT` keyword in 2 out of the above 3 sums:

```WITH v(i) AS (VALUES (2), (3), (3))
SELECT
  CASE
    WHEN SUM (CASE WHEN i = 0 THEN 1 END) > 0
    THEN 0
    WHEN SUM (DISTINCT CASE WHEN i < 0 THEN -1 END) % 2 < 0
    THEN -1
    ELSE 1
  END * EXP(SUM(DISTINCT LN(ABS(NULLIF(i, 0))))) multiplication
FROM v;
```

The result is now:

```multiplication |
---------------|
6              |
```

Notice that the first `SUM()` that checks for the presence of `NULL` values doesn’t require a `DISTINCT` keyword, so we omit it to improve performance.

### Extension: Window functions

Of course, if we are able to emulate a `PRODUCT()` aggregate function, we’d love to turn it into a window function as well. This can be done simply by transforming each individual `SUM()` into a window function:

```WITH v(i, j) AS (
  VALUES (1, 2), (2, -3), (3, 4),
         (4, -5), (5, 0), (6, 0)
)
SELECT i, j,
  CASE
    WHEN SUM (CASE WHEN j = 0 THEN 1 END)
         OVER (ORDER BY i) > 0
    THEN 0
    WHEN SUM (CASE WHEN j < 0 THEN -1 END)
         OVER (ORDER BY i) % 2 < 0
    THEN -1
    ELSE 1
  END * EXP(SUM(LN(ABS(NULLIF(j, 0))))
        OVER (ORDER BY i)) multiplication
FROM v;
```

The result is now:

```i |j  |multiplication      |
--|---|--------------------|
1 | 2 |2                   |
2 |-3 |-6                  |
3 | 4 |-23.999999999999993 |
4 |-5 |119.99999999999997  |
5 | 0 |0                   |
6 | 0 |0                   |
```

So cool! The cumulative product gets bigger and bigger until it hits the first zero; from then on, it stays zero.
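For comparison, the same cumulative product is a one-liner in ordinary code, without the detour through logarithms; a Python sketch using the j values from above:

```python
from itertools import accumulate
from operator import mul

# The j values from the example above, ordered by i
j_values = [2, -3, 4, -5, 0, 0]

# Running product, the procedural equivalent of the window function
cumulative = list(accumulate(j_values, mul))
print(cumulative)
# [2, -6, -24, 120, 0, 0]
```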

### jOOQ support

jOOQ 3.12 will support this as well and emulate it correctly on all databases:
https://github.com/jOOQ/jOOQ/issues/5939

## How to Reduce Syntactic Overhead Using the SQL WINDOW Clause

SQL is a verbose language, and one of the most verbose features are window functions.

In a Stack Overflow question that I’ve encountered recently, someone asked how to calculate the difference between the first and the last value in a time series for any given day:

Input

```volume  tstamp
---------------------------
29011   2012-12-28 09:00:00
28701   2012-12-28 10:00:00
28830   2012-12-28 11:00:00
28353   2012-12-28 12:00:00
28642   2012-12-28 13:00:00
28583   2012-12-28 14:00:00
28800   2012-12-29 09:00:00
28751   2012-12-29 10:00:00
28670   2012-12-29 11:00:00
28621   2012-12-29 12:00:00
28599   2012-12-29 13:00:00
28278   2012-12-29 14:00:00
```

Desired output

```first  last   difference  date
------------------------------------
29011  28583  428         2012-12-28
28800  28278  522         2012-12-29
```

### How to write the query?

Notice that the volume and timestamp progressions do not correlate as the sample data may suggest, i.e. there is no rule that if `Timestamp2 > Timestamp1` then `Value2 < Value1`. Otherwise, this simple query would work (using PostgreSQL syntax):

```SELECT
max(volume)               AS first,
min(volume)               AS last,
max(volume) - min(volume) AS difference,
CAST(tstamp AS DATE)      AS date
FROM t
GROUP BY CAST(tstamp AS DATE);
```

There are several ways to find the first and last values within a group that do not involve window functions. For example:

• In Oracle, you can use the FIRST and LAST functions, which for some arcane reason are not written `FIRST(...) WITHIN GROUP (ORDER BY ...)` or `LAST(...) WITHIN GROUP (ORDER BY ...)`, like other sorted set aggregate functions, but `some_aggregate_function(...) KEEP (DENSE_RANK FIRST ORDER BY ...)`. Go figure
• In PostgreSQL, you could use the `DISTINCT ON` syntax along with `ORDER BY` and `LIMIT`

More details about the various approaches can be found here:
https://blog.jooq.org/2017/09/22/how-to-write-efficient-top-n-queries-in-sql

The best performing approach would be to use an aggregate function like Oracle’s, but few databases have this function. So, we’ll resort to using the `FIRST_VALUE` and `LAST_VALUE` window functions:

```SELECT DISTINCT
  first_value(volume) OVER (
    PARTITION BY CAST(tstamp AS DATE)
    ORDER BY tstamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS first,
  last_value(volume) OVER (
    PARTITION BY CAST(tstamp AS DATE)
    ORDER BY tstamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS last,
  first_value(volume) OVER (
    PARTITION BY CAST(tstamp AS DATE)
    ORDER BY tstamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  )
  - last_value(volume) OVER (
    PARTITION BY CAST(tstamp AS DATE)
    ORDER BY tstamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS diff,
  CAST(tstamp AS DATE) AS date
FROM t
ORDER BY CAST(tstamp AS DATE)
```

Oops 🤔

That doesn’t look too readable. But it will yield the correct result. Granted, we could wrap the definition for the columns `FIRST` and `LAST` in a derived table, but that would still leave us with two repetitions of the window definition:

```PARTITION BY CAST(tstamp AS DATE)
ORDER BY tstamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
```

### WINDOW clause to the rescue

Luckily, at least 3 databases have implemented the SQL standard `WINDOW` clause:

• MySQL
• PostgreSQL
• Sybase SQL Anywhere

The above query can be refactored to this one:

```SELECT DISTINCT
  first_value(volume) OVER w AS first,
  last_value(volume) OVER w AS last,
  first_value(volume) OVER w
    - last_value(volume) OVER w AS diff,
  CAST(tstamp AS DATE) AS date
FROM t
WINDOW w AS (
  PARTITION BY CAST(tstamp AS DATE)
  ORDER BY tstamp
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
ORDER BY CAST(tstamp AS DATE)
```
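Incidentally, SQLite also implements the standard `WINDOW` clause (since version 3.25), so the refactored query can be tried right from a Python shell; a small sketch with part of the sample data:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (volume INTEGER, tstamp TEXT)')
conn.executemany('INSERT INTO t VALUES (?, ?)', [
    (29011, '2012-12-28 09:00:00'), (28353, '2012-12-28 12:00:00'),
    (28583, '2012-12-28 14:00:00'), (28800, '2012-12-29 09:00:00'),
    (28599, '2012-12-29 13:00:00'), (28278, '2012-12-29 14:00:00'),
])

# The named window w is defined once and referenced three times
rows = conn.execute("""
    SELECT DISTINCT
      first_value(volume) OVER w AS first,
      last_value(volume) OVER w AS last,
      first_value(volume) OVER w - last_value(volume) OVER w AS difference,
      date(tstamp) AS date
    FROM t
    WINDOW w AS (
      PARTITION BY date(tstamp)
      ORDER BY tstamp
      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    )
    ORDER BY date
""").fetchall()

for row in rows:
    print(row)
# (29011, 28583, 428, '2012-12-28')
# (28800, 28278, 522, '2012-12-29')
```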

Notice how I can specify a window name with a window specification in a similar way as I can define a common table expression (`WITH` clause):

```WINDOW
<window-name> AS (<window-specification>)
{  ,<window-name> AS (<window-specification>)... }
```

Not only can I reuse entire specifications, I could also build a specification from a partial specification, and reuse only parts. My previous query could have been rewritten as such:

```SELECT DISTINCT
  first_value(volume) OVER w3 AS first,
  last_value(volume) OVER w3 AS last,
  first_value(volume) OVER w3
    - last_value(volume) OVER w3 AS diff,
  CAST(tstamp AS DATE) AS date
FROM t
WINDOW
  w1 AS (PARTITION BY CAST(tstamp AS DATE)),
  w2 AS (w1 ORDER BY tstamp),
  w3 AS (w2 ROWS BETWEEN UNBOUNDED PRECEDING
            AND UNBOUNDED FOLLOWING)
ORDER BY CAST(tstamp AS DATE)
```

Each window specification can be created from scratch, or be based on a previously defined window specification. Note that this also works when referencing a window directly from the `OVER` clause. If I wanted to reuse the `PARTITION BY` clause and the `ORDER BY` clause, but change the `FRAME` clause (`ROWS ...`), then I could have written this:

```SELECT DISTINCT
  first_value(volume) OVER (
    w2 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS first,
  last_value(volume) OVER (
    w2 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  ) AS last,
  first_value(volume) OVER (
    w2 ROWS UNBOUNDED PRECEDING
  ) - last_value(volume) OVER (
    w2 ROWS BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING
  ) AS diff,
  CAST(tstamp AS DATE) AS date
FROM t
WINDOW
  w1 AS (PARTITION BY CAST(tstamp AS DATE)),
  w2 AS (w1 ORDER BY tstamp)
ORDER BY CAST(tstamp AS DATE)
```

### What if my database doesn’t support the WINDOW clause?

In that case, you have to either manually write the window specification on each window function, or you can use a SQL builder like jOOQ, which emulates the window clause.

You can try this translation online on our website: https://www.jooq.org/translate