A customer of my popular SQL training (which you should book!) has recently challenged me to optimise a hierarchical query that merges an archive log's deltas in order to obtain a snapshot of some record at a given point in time. In this article, I will reproduce their problem statement in a simplified version and show how this can be done with SQL Server, using a few cool SQL features, all of which are covered in the training and were immediately applicable to this problem statement.
The problem statement
This was their archive design. They designed for uncertainty, meaning that for some entities in their system, they did not know what kinds of attributes would be part of the entity in the future. Given their application design, users could even add their own custom attributes to an entity.
This kind of thing is typically solved with the EAV (Entity Attribute Value) model, a "workaround" to denormalise data sets in SQL databases in the event of such schema uncertainty.
EAV can be implemented in several ways:
Through classic SQL tables only
An example implementation is this:
CREATE TABLE eav_classic (
  entity_type     VARCHAR(100) NOT NULL,
  entity_id       BIGINT       NOT NULL,
  attribute_name  VARCHAR(100) NOT NULL,
  attribute_type  VARCHAR(100) NOT NULL,
  attribute_value VARCHAR(100) NULL,

  CONSTRAINT eav_classic_pk
    PRIMARY KEY (entity_type, entity_id, attribute_name)
);
The drawbacks of this non-normalised design are immediately obvious. Most specifically, there is no simple way to establish referential integrity. But this may be totally OK, especially for archive logs, and for smaller databases (datomic does something similar).
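Just to illustrate the shape of such data, a Person entity might be stored like this (hypothetical sample rows, not taken from the customer's system):

-- Hypothetical sample data: one row per attribute of a single Person entity
INSERT INTO eav_classic (entity_type, entity_id, attribute_name, attribute_type, attribute_value)
VALUES ('Person', 1, 'first_name', 'string', 'John'),
       ('Person', 1, 'last_name', 'string', 'Doe'),
       ('Person', 1, 'age', 'int', '37');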
Through tables containing JSON or XML data
Whenever you have schema-on-read data, JSON or XML data types may be appropriate, so this is a perfectly valid alternative:
CREATE TABLE eav_json (
  entity_type VARCHAR(100)   NOT NULL,
  entity_id   BIGINT         NOT NULL,
  attributes  VARCHAR(10000) NOT NULL
    CHECK (ISJSON(attributes) = 1),

  CONSTRAINT eav_json_pk
    PRIMARY KEY (entity_type, entity_id)
);
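Again purely for illustration (hypothetical data), attributes can then be read back schema-on-read style with SQL Server's built-in JSON functions, e.g. JSON_VALUE:

-- Hypothetical sample row, plus a schema-on-read lookup of a single attribute
INSERT INTO eav_json (entity_type, entity_id, attributes)
VALUES ('Person', 1, '{"first_name": "John", "last_name": "Doe", "age": 37}');

SELECT entity_id, JSON_VALUE(attributes, '$.age') AS age
FROM eav_json
WHERE entity_type = 'Person';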
If your database supports a JSON data type, you will obviously prefer that over the above emulation. For the rest of this article, I will use the JSON representation.
Versioning the EAV table
Versioning data in an EAV model is quite a bit easier than in a normalised schema. We can just add a version number and/or a timestamp to the record. In their case, something like this may make sense:
CREATE TABLE history (
  id          BIGINT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
  ts          DATETIME              NOT NULL,
  entity_type VARCHAR(100)          NOT NULL,
  entity_id   BIGINT                NOT NULL,
  delta       VARCHAR(8000)         NOT NULL
    CHECK (ISJSON(delta) = 1)
);
INSERT INTO history (entity_type, entity_id, ts, delta)
VALUES ('Person', 1, '2000-01-01 00:00:00', '{"first_name": "John", "last_name": "Doe"}'),
       ('Person', 1, '2000-01-01 01:00:00', '{"age": 37}'),
       ('Person', 1, '2000-01-01 02:00:00', '{"age": 38}'),
       ('Person', 1, '2000-01-01 03:00:00', '{"city": "New York"}'),
       ('Person', 1, '2000-01-01 04:00:00', '{"city": "Zurich", "age": null}');
This table now contains a set of deltas applied to the Person entity with ID = 1. It corresponds to the following sequence of SQL statements on an ordinary entity:
INSERT INTO person (id, first_name, last_name)
VALUES (1, 'John', 'Doe');

UPDATE person SET age = 37 WHERE id = 1;
UPDATE person SET age = 38 WHERE id = 1;
UPDATE person SET city = 'New York' WHERE id = 1;
UPDATE person SET city = 'Zurich', age = null WHERE id = 1;
You could even see their hand-written log as a kind of transaction log of the database system, similar to what you can extract using products like Golden Gate or Debezium. If you think of the transaction log as an event stream, the RDBMS's current data representation is like a snapshot that you get by applying any number of deltas to your tables.

Sometimes, you don't want to completely change your architecture and go full "event sourcing", but you just need this kind of log for a specific set of auditable entities. And for reasons like still having to support very old SQL Server versions, as well as other databases, you may also choose not to use the SQL:2011 temporal table feature, which has been implemented in SQL Server 2016 and more recent versions.
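For comparison only, here is a minimal sketch of what such a system-versioned table could look like in SQL Server 2016+ (table and column names are made up for illustration; this is not the design used in the rest of the article):

-- Minimal sketch of a system-versioned (temporal) table, SQL Server 2016+
CREATE TABLE person_versioned (
  id         BIGINT NOT NULL PRIMARY KEY,
  first_name VARCHAR(100),
  last_name  VARCHAR(100),
  valid_from DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
  valid_to   DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
  PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.person_versioned_history));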
With that out of the way…
How to access any arbitrary snapshot version?
When we visually process our HISTORY table, we can see that Person ID = 1 had the following values at any given time:
TIME      FIRST_NAME  LAST_NAME  AGE  CITY
------------------------------------------------------
00:00:00  John        Doe
01:00:00  John        Doe        37
02:00:00  John        Doe        38
03:00:00  John        Doe        38   New York
04:00:00  John        Doe             Zurich
Remember, this is always the same record of Person ID = 1, just its snapshots represented at different points in time. The goal here is to be able to find the record of John Doe at any given time.
Again, if we had been using the SQL:2011 temporal table feature, we could write
-- SQL Server
SELECT *
FROM Person
FOR SYSTEM_TIME AS OF '2000-01-01 02:00:00.0000000';
-- Oracle (flashback query)
SELECT *
FROM Person
AS OF TIMESTAMP TIMESTAMP '2000-01-01 02:00:00'
Side note: Do note that Oracle’s flashback query needs to be properly configured:
- Not all data is “flashbackable”
- DDL tends to destroy the archive
- Proper grants are needed to access the flashback archive
Similar limitations may apply in SQL Server.
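For reference, setting up an Oracle flashback data archive involves statements along these lines (archive, tablespace, and user names below are placeholders; the details depend on your version and privileges):

-- Rough sketch of Oracle flashback data archive configuration (names are placeholders)
CREATE FLASHBACK ARCHIVE fda_archive TABLESPACE users RETENTION 1 YEAR;
ALTER TABLE person FLASHBACK ARCHIVE fda_archive;
GRANT FLASHBACK ARCHIVE ON fda_archive TO some_user;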
What if the RDBMS can’t help us?
If, again for some reason, we cannot use the RDBMS's temporal table features, we'll roll our own as we've seen. So, our query in SQL Server to access the snapshot at any given time may be this:
SELECT
  '{'
  + string_agg(
      CASE type WHEN 0 THEN NULL ELSE
        '"' + [key] + '": ' +
        CASE type WHEN 1 THEN '"' + value + '"' ELSE value END
      END, ', ')
  + '}'
FROM (
  SELECT *, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history
  OUTER APPLY openjson(delta)

  -- Apply all deltas prior to any given snapshot
  WHERE ts <= '2000-01-01 02:00:00'
) t
WHERE rn = 1;
What does this query do? Consider again our deltas at 04:00:00:
TIME      FIRST_NAME  LAST_NAME  AGE     CITY
------------------------------------------------------
00:00:00  *John*      *Doe*
01:00:00  (John)      (Doe)       37
02:00:00  (John)      (Doe)       38
03:00:00  (John)      (Doe)      (38)     New York
04:00:00  (John)      (Doe)      *-*     *Zurich*

Observe how each value is marked:
- *value*: the current snapshot's attribute value, i.e. the last delta that was applied to that attribute
- value: a previous snapshot's attribute value, i.e. a previous, superseded delta applied to that attribute
- (value): a value that does not contribute to the snapshot, either because it was merely inherited from a previous delta, or (in the tables below) because it was applied only after the snapshot time

For any given snapshot, we want to find the *current* values. E.g. at a previous snapshot time, the marking would have been:
At 03:00:00
TIME      FIRST_NAME  LAST_NAME  AGE     CITY
------------------------------------------------------
00:00:00  *John*      *Doe*
01:00:00  (John)      (Doe)       37
02:00:00  (John)      (Doe)      *38*
03:00:00  (John)      (Doe)      (38)    *New York*
04:00:00  (John)      (Doe)      (-)     (Zurich)
At 02:00:00
TIME      FIRST_NAME  LAST_NAME  AGE     CITY
------------------------------------------------------
00:00:00  *John*      *Doe*
01:00:00  (John)      (Doe)       37
02:00:00  (John)      (Doe)      *38*
03:00:00  (John)      (Doe)      (38)    (New York)
04:00:00  (John)      (Doe)      (-)     (Zurich)
So, our query needs to find the delta that was applied last for any given attribute. With SQL, we can find that easily. We can assign a row number to each delta per attribute in reverse order, something like this:
TIME      FIRST_NAME  LAST_NAME  AGE     CITY
------------------------------------------------------
00:00:00  John (1)    Doe (1)
01:00:00  John        Doe        37 (3)
02:00:00  John        Doe        38 (2)
03:00:00  John        Doe        38      New York (2)
04:00:00  John        Doe        - (1)   Zurich (1)
Once we have that row number, we just keep the deltas whose row number is 1. Something like:
SELECT [key], value, row_number() OVER (
  PARTITION BY [key] ORDER BY ts DESC) rn
FROM history
OUTER APPLY openjson(delta)
ORDER BY [key], ts;
Notice the OUTER APPLY openjson(delta) syntax. This just expands the JSON structure into key/value/type columns, which we can use more easily in a SQL query. Other database systems may have similar syntax for similar purposes (a PostgreSQL sketch follows after the result below). The result of the above query is:
key |value |rn
-----------|---------|---
age |37 |3
age |38 |2
age | |1
city |New York |2
city |Zurich |1
first_name |John |1
last_name |Doe |1
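As an aside, other databases can do a similar expansion. In PostgreSQL, for instance, a lateral join over jsonb_each_text could produce the same key/value rows (a sketch, assuming the delta column were stored as jsonb rather than VARCHAR):

-- PostgreSQL sketch (assumes delta is jsonb): expand each JSON document into key/value rows
SELECT h.ts, d.key, d.value,
       row_number() OVER (PARTITION BY d.key ORDER BY h.ts DESC) AS rn
FROM history h
CROSS JOIN LATERAL jsonb_each_text(h.delta) AS d(key, value);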
Filtering the ones whose row number is 1:
SELECT [key], value
FROM (
  SELECT ts, [key], value, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history
  OUTER APPLY openjson(delta)
) t
WHERE rn = 1
ORDER BY ts, [key];
This yields:
key |value
-----------|-------
first_name |John
last_name |Doe
age |
city |Zurich
Exactly the data we wanted, in key/value form. Notice that this filtering step could have been done with DISTINCT ON in PostgreSQL, or with KEEP (DENSE_RANK FIRST ORDER BY ..) in Oracle – an exercise which I shall leave to the reader (feel free to leave the solution in the comments!)
And now, finally, just re-assemble the JSON using SQL Server 2017's STRING_AGG. PostgreSQL would offer us JSON_AGG here, Oracle has JSON_OBJECTAGG. With STRING_AGG, you have to take care of manually escaping all values according to JSON syntax rules, which is bad. In my example, I just replaced " by \". Other characters need escaping too, so if there is a built-in feature, use that instead of string processing.
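As it turns out, SQL Server 2016+ does ship a built-in helper for the escaping part: STRING_ESCAPE(..., 'json'). It could be used instead of the manual REPLACE() calls in the query below; a quick sketch of what it does:

-- STRING_ESCAPE (SQL Server 2016+) escapes quotes, backslashes, control characters
-- and line breaks according to JSON rules
SELECT STRING_ESCAPE('He said "hi"' + CHAR(10) + 'on two lines', 'json');
-- --> He said \"hi\"\non two lines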
The STRING_AGG function aggregates a CASE expression which translates the different JSON data types (as reported by OPENJSON's type column) into different formats, where:
- 0 is NULL (and nulls are not aggregated)
- 1 is string
- everything else can be used verbatim for simplicity, e.g. numbers or booleans

Every value (except nulls) is prefixed by the JSON object's attribute name ("key").
SELECT
  '{'
  + string_agg(
      CASE type WHEN 0 THEN NULL ELSE
        '"' + replace([key], '"', '\"') + '": ' +
        CASE type WHEN 1 THEN '"' + replace(value, '"', '\"') + '"' ELSE value END
      END, ', ')
  + '}'
FROM (
  SELECT *, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history
  OUTER APPLY openjson(delta)

  -- Apply all deltas prior to any given snapshot
  WHERE ts <= '2000-01-01 04:00:00'
) t
WHERE rn = 1;
This produces
{"city": "Zurich", "first_name": "John", "last_name": "Doe"}
A final query that gets us the entire history of snapshots (watch the performance on this one, it could definitely be optimised):
SELECT ts, (
  SELECT
    '{'
    + string_agg(
        CASE type WHEN 0 THEN NULL ELSE
          '"' + replace([key], '"', '\"') + '": ' +
          CASE type WHEN 1 THEN '"' + replace(value, '"', '\"') + '"' ELSE value END
        END, ', ')
    + '}'
  FROM (
    SELECT *, row_number() OVER (
      PARTITION BY [key] ORDER BY ts DESC) rn
    FROM history
    OUTER APPLY openjson(delta)

    -- Apply all deltas prior to any given snapshot
    WHERE ts <= x.ts
  ) t
  WHERE rn = 1
)
FROM history x
GROUP BY ts;
It yields:
ts |
---------|--------------------------------------------------------------------------
00:00:00 |{"first_name": "John", "last_name": "Doe"}
01:00:00 |{"age": 37, "first_name": "John", "last_name": "Doe"}
02:00:00 |{"age": 38, "first_name": "John", "last_name": "Doe"}
03:00:00 |{"age": 38, "city": "New York", "first_name": "John", "last_name": "Doe"}
04:00:00 |{"city": "Zurich", "first_name": "John", "last_name": "Doe"}
So, the complete history of all the snapshot versions of the Person with ID = 1.
Very cool, and definitely good enough for their archive / audit query requirements.
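If performance matters, an obvious first step (an untested suggestion, not part of the original solution) would be an index supporting the correlated subquery's WHERE ts <= x.ts predicate:

-- Untested suggestion: index supporting the ts range scan; a real system would presumably
-- also filter by entity_type / entity_id, in which case those columns should lead the index
CREATE INDEX ix_history_ts ON history (ts) INCLUDE (entity_type, entity_id, delta);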
Comments

Hehe, yet another cool thing we can do with SQL :)
As a side note, writing JSON by hand should be avoided. Here the JSON object will be broken if the value contains a quote or a line break.
Yes of course. I've added escaping for quotes. Will ignore line breaks for now. I really wish SQL Server had a JSON_AGG function. Without it, people are inevitably going to resort to string concatenation.