The SQL MERGE statement is a device whose mystery is only exceeded by its power.
A simple example shows its full power according to standard SQL. Imagine you have a production table for product prices, and a staging table from which you want to load the latest prices. For once, I’m using the Db2 LuW MERGE syntax, because that’s the most standards compliant syntax out there (among the dialects we support in jOOQ):
DROP TABLE IF EXISTS prices;
DROP TABLE IF EXISTS staging;
CREATE TABLE prices (
product_id BIGINT NOT NULL PRIMARY KEY,
price DECIMAL(10, 2) NOT NULL,
price_date DATE NOT NULL,
update_count BIGINT NOT NULL
);
CREATE TABLE staging (
product_id BIGINT NOT NULL PRIMARY KEY,
price DECIMAL(10, 2) NOT NULL
);
DELETE FROM prices;
DELETE FROM staging;
INSERT INTO staging
VALUES (1, 100.00),
(2, 125.00),
(3, 150.00);
So, we’ve loaded a few records in our staging table, which we now want to merge over to the prices table. We could just insert them, easily, but we’ll stage more prices later on, e.g. these ones:
DELETE FROM staging;
INSERT INTO staging
VALUES (1, 100.00),
(2, 99.00),
(4, 300.00);
We want our logic to be this:
- All new IDs in the staging table should just be inserted into the prices table.
- Existing IDs should be updated if and only if the price has changed. In that case, the update_count should increase.
- Prices that are no longer encountered in the staging table should be deleted from the prices table, to implement a full sync, rather than a delta sync, for the sake of the example. We could also add a “command” column that contains the instruction on whether data should be updated or deleted, to implement a delta sync.
So, this is the Db2 (and standards compliant) MERGE statement that we use for the job:
MERGE INTO prices AS p
USING (
SELECT COALESCE(p.product_id, s.product_id) AS product_id, s.price
FROM prices AS p
FULL JOIN staging AS s ON p.product_id = s.product_id
) AS s
ON (p.product_id = s.product_id)
WHEN MATCHED AND s.price IS NULL THEN DELETE
WHEN MATCHED AND p.price != s.price THEN UPDATE SET
price = s.price,
price_date = CURRENT_DATE,
update_count = update_count + 1
WHEN NOT MATCHED THEN INSERT
(product_id, price, price_date, update_count)
VALUES
(s.product_id, s.price, CURRENT_DATE, 0);
Easy eh?
Hm, not so simple if you haven’t written MERGE statements all your life. If so, don’t be scared. As most of SQL, the scary part is syntax (keywords, UPPER CASE, etc.). The underlying concepts are simpler than they may seem at first. Let’s go through it step by step. It has 4 parts:
1. The target table
Just like with INSERT statements, we can define where we want to MERGE the data INTO. This is the simple part:
MERGE INTO prices AS p
-- ...
2. The source table
The USING keyword wraps a source table that we want to merge. We could have just placed the staging table here, but I wanted to enrich the source data with some additional data first. I’m using a
FULL JOIN to produce the matching between old data (prices) and new data (staging). If, after filling the staging table for the second time, but before running the MERGE statement, we ran the USING clause alone (with some minor modifications for illustration):
SELECT
COALESCE(p.product_id, s.product_id) AS product_id,
p.price AS old_price,
s.price AS new_price
FROM prices AS p
FULL JOIN staging AS s ON p.product_id = s.product_id
ORDER BY product_id
Then we would get this:
PRODUCT_ID|OLD_PRICE|NEW_PRICE|
----------|---------|---------|
1| 100.00| 100.00| <-- same price
2| 125.00| 99.00| <-- updated price
3| 150.00| | <-- deleted price
4| | 300.00| <-- added price
Neat!
3. The ON clause
Next, we RIGHT JOIN the target table and source tables using an ON clause, just like with an ordinary JOIN:
ON (p.product_id = s.product_id)
MERGE always uses RIGHT JOIN semantics, which is why I placed a FULL JOIN in the source table, the USING clause. It is totally possible to write things a bit differently, such that we can avoid accessing the prices table twice, but I want to show the full power of this statement. Notice, SQL Server joins source and target tables using a FULL JOIN, as I’ll explain further down. I’ll also explain why RIGHT JOIN, right away.
4. The WHEN clauses
Now comes the interesting part! There can either be a match between the two tables (target and source), like when we get the result of an INNER JOIN, or no such match because the source table contains a row that is not matched by the target table (the RIGHT JOIN semantics). In our example, PRODUCT_ID IN (1, 2, 3) will produce a match (contained in both source and target tables), whereas PRODUCT_ID = 4 will not produce a match (not contained in the target table yet).
Colouring our source data set:
PRODUCT_ID|OLD_PRICE|NEW_PRICE|
----------|---------|---------|
1| 100.00| 100.00| <-- matched (but ignored)
2| 125.00| 99.00| <-- matched
3| 150.00| | <-- matched
4| | 300.00| <-- not matched
The following are a sequence of matching instructions that will be executed in the order of appearance,
for each row resulting from the previous RIGHT JOIN:
-- With my FULL JOIN, I've produced NULL price values
-- whenever a PRODUCT_ID is in the target table, but not
-- in the source table. These rows, we want to DELETE
WHEN MATCHED AND s.price IS NULL THEN DELETE
-- When there is a price change (and only then), we
-- want to update the price information in the target table.
WHEN MATCHED AND p.price != s.price THEN UPDATE SET
price = s.price,
price_date = CURRENT_DATE,
update_count = update_count + 1
-- Finally, when we don't have a match, i.e. a row is
-- in the source table, but not in the target table, then
-- we simply insert it into the target table.
WHEN NOT MATCHED THEN INSERT
(product_id, price, price_date, update_count)
VALUES
(s.product_id, s.price, CURRENT_DATE, 0);
It’s not too complicated, just a lot of keywords and syntax. So, after running this MERGE on the second set of data of the staging table, we’ll get this result in the price table:
PRODUCT_ID|PRICE |PRICE_DATE|UPDATE_COUNT|
----------|------|----------|------------|
1|100.00|2020-04-09| 0|
2| 99.00|2020-04-09| 1|
4|300.00|2020-04-09| 0|
The way I expressed this MERGE statement, it is idempotent, i.e. I can run it again on the same staging table content, and it won’t modify any data in the price table – because none of the WHEN clauses applies. Idempotence is not a property of MERGE, I just wrote my statement this way.
Dialect specifics
A few dialects support
MERGE. Among the ones that jOOQ 3.13 supports, there are at least:
- Db2
- Derby
- Firebird
- H2
- HSQLDB
- Oracle
- SQL Server
- Sybase SQL Anywhere
- Teradata
- Vertica
For once, regrettably, this list does not include PostgreSQL. But even the dialects in this list do not all agree on what MERGE really is. The SQL standard specifies 3 features, each one optional:
- F312 MERGE statement
- F313 Enhanced MERGE statement
- F314 MERGE statement with DELETE branch
But instead of looking at the standards and what they require, let’s look at what the dialects offer, and how it can be emulated if something is not available.
The AND clause
As you may have noticed, this post uses the syntax:
WHEN MATCHED AND <some predicate> THEN
It’s also possible to specify
WHEN NOT MATCHED AND <some predicate> THEN
With the exception of Teradata, most dialects support these AND clauses (Oracle has a specific syntax using WHERE for this, which I’ll get to later).
The point of these clauses is to be able to have several of these WHEN MATCHED or WHEN NOT MATCHED clauses, in fact an arbitrary number of them. Unfortunately, this isn’t supported by all dialects. Some dialects only support one clause of each type (INSERT, UPDATE, DELETE). It isn’t strictly necessary to support several clauses, but it is a lot more convenient as we’ll see below.
These dialects do not support multiple WHEN MATCHED or WHEN NOT MATCHED clauses:
- HSQLDB
- Oracle
- SQL Server
- Teradata
If a dialect does not support AND, or if it does not support multiple WHEN MATCHED clauses, simply translate the clauses to case expressions. Instead of our previous WHEN clauses, we’d get:
-- The DELETE clause doesn't make much sense without AND,
-- So there's not much we can do about this emulation in Teradata.
WHEN MATCHED AND s.price IS NULL THEN DELETE
-- Repeat the AND clause in every branch of the CASE
-- Expression where it applies
WHEN MATCHED THEN UPDATE SET
price = CASE
-- Update the price if the AND clause applies
WHEN p.price != s.price THEN s.price,
-- Otherwise, leave it untouched
ELSE p.price
END
-- Repeat for all columns
price_date = CASE
WHEN p.price != s.price THEN CURRENT_DATE
ELSE p.price_date
END,
update_count = CASE
WHEN p.price != s.price THEN update_count + 1
ELSE p.update_count
END
-- Unchanged, in this case
WHEN NOT MATCHED THEN INSERT
(product_id, price, price_date, update_count)
VALUES
(s.product_id, s.price, CURRENT_DATE, 0);
The formalism is this one:
If there is no AND, add AND
These are the same:
WHEN MATCHED THEN [ UPDATE | DELETE ]
WHEN MATCHED AND 1 = 1 THEN [ UPDATE | DELETE ]
This replacement may be needed in Firebird (which has a bug in this area) and SQL Server (which does not allow for WHEN MATCHED clauses after a WHEN MATCHED clause without AND clause, which is kind of a linting error). Instead of emulating stuff, you can just skip all the subsequent WHEN MATCHED branches, because they will not apply. Every row is updated only once, i.e. by only one WHEN clause:
Every row is updated only once
Make sure no row is updated more than once in the emulation, as required by the standard. When writing this:
WHEN MATCHED AND p1 THEN UPDATE SET c1 = 1
WHEN MATCHED AND p2 THEN DELETE
WHEN MATCHED AND p3 THEN UPDATE SET c1 = 3, c2 = 3
WHEN MATCHED AND p4 THEN DELETE
This really means the same as:
WHEN MATCHED AND p1 THEN UPDATE SET c1 = 1
WHEN MATCHED AND NOT p1 AND p2 THEN DELETE
WHEN MATCHED AND NOT p1 AND NOT p2 AND p3 THEN UPDATE SET c1 = 3,c2 = 3
WHEN MATCHED AND NOT p1 AND NOT p2 AND NOT p3 AND p4 THEN DELETE
To emulate the above, write this instead:
WHEN MATCHED AND
p1 OR
NOT p1 AND NOT p2 AND p3
THEN UPDATE SET
c1 = CASE
WHEN p1 THEN 1
WHEN NOT p1 AND NOT p2 AND p3 THEN 3
ELSE c1
END,
c2 = CASE
WHEN NOT p1 AND NOT p2 AND p3 THEN 3
ELSE c2
END
WHEN MATCHED AND
NOT p1 AND p2 OR
NOT p1 AND NOT p2 AND NOT p3 AND p4
THEN DELETE
Quite laborious, but that’s how it is.
H2 and HSQLDB
Notice that both H2 and HSQLDB do not follow the “every row is updated only once” rule. I’ve reported this to H2 already:
https://github.com/h2database/h2database/issues/2552. If you want to be standards compliant (jOOQ 3.14 will emulate this for you, don’t worry), then you have to do the above CASE expression madness in these dialects, or, in H2 (HSQLDB doesn’t support multiple WHEN MATCHED clauses of the same type) enhance all the WHEN MATCHED AND clauses as I’ve illustrated before:
WHEN MATCHED AND p1 THEN UPDATE SET c1 = 1
WHEN MATCHED AND NOT p1 AND p2 THEN DELETE
WHEN MATCHED AND NOT p1 AND NOT p2 AND p3 THEN UPDATE SET c1 = 3,c2 = 3
WHEN MATCHED AND NOT p1 AND NOT p2 AND NOT p3 AND p4 THEN DELETE
Oracle
Oracle doesn’t support AND here, but some interesting vendor specific syntax. It looks reasonable at first, but it’s really funky.
- After UPDATE, you can add a WHERE clause, which is the same thing as AND. So far so good.
- You can also add a DELETE WHERE clause, but only together with an UPDATE. So you cannot DELETE without updating. Fine, we weren’t going to, in our example.
- However, the interesting thing is that the UPDATE / DELETE command is executed together, and DELETE happens after UPDATE. So the same row gets processed twice. If you use WHERE in UPDATE, only rows included in the UPDATE can also be included in DELETE. I mean, why would you update the rows first, prior to deletion?
This means that our standard clauses:
WHEN MATCHED AND p1 THEN UPDATE SET c1 = 1
WHEN MATCHED AND p2 THEN DELETE
WHEN MATCHED AND p3 THEN UPDATE SET c1 = 3, c2 = 3
WHEN MATCHED AND p4 THEN DELETE
Would need to be emulated like this:
WHEN MATCHED
THEN UPDATE SET
c1 = CASE
WHEN p1 THEN 1 -- Normal update for WHEN MATCHED AND p1 clause
WHEN p2 THEN c1 -- "Touch" record for later deletion
WHEN p3 THEN 3 -- Normal update for WHEN MATCHED AND p3 clause
WHEN p4 THEN c1 -- "Touch" record for later deletion
ELSE c1
END,
c2 = CASE
WHEN p1 THEN c2 -- p1 is not affecting c2
WHEN p2 THEN c2 -- "Touch" record for later deletion
WHEN p3 THEN 3 -- Normal update for WHEN MATCHED AND p3 clause
WHEN p4 THEN c2 -- "Touch" record for later deletion
ELSE c2
END
-- Any predicate from any AND clause, regardless if UPDATE or DELETE
WHERE p1 OR p2 OR p3 OR p4
-- Repeat the predicates required for deletion
DELETE WHERE
NOT p1 AND p2 OR
NOT p1 AND NOT p2 AND NOT p3 AND p4
It was just a simple MERGE statement in standard SQL syntax!
There is an additional level of trickiness here, which I will not cover in this blog post (but we might handle it in jOOQ). In Oracle, the DELETE WHERE clause can already see the updates performed by the UPDATE clause. This means that if, for example, p2 depends on the value of c1:
c1 = CASE
...
WHEN p2 THEN c1 -- "Touch" record for later deletion
...
END,
Then the evaluation of p2 in DELETE WHERE will be affected by this
DELETE WHERE
NOT p1 AND p2 OR
NOT p1 AND NOT p2 AND NOT p3 AND p4
The c1 in these p2 expressions will not be the same c1 as the c1 in the UPDATE clause. It’s obviously possible to manage this as well to some extent, through variable substitution.
SQL Server BY SOURCE and BY TARGET
SQL Server has a useful extension to the WHEN NOT MATCHED clause, which I think belongs in the SQL standard!
With this extension, you can specify whether you want to perform INSERT actions WHEN NOT MATCHED [ BY TARGET ] (the default that everyone else supports as well), or WHEN NOT MATCHED BY SOURCE (in case of which you can perform another UPDATE or DELETE action.
The BY TARGET clause means that we found a row in the source table but not in the target table. The BY SOURCE clause means that we found a row in the target table but not in the source table. This means that in SQL Server, the target and source tables are FULL OUTER JOINed, not RIGHT OUTER JOINed, which would mean that our original statement can be greatly simplified:
MERGE INTO prices AS p
USING staging AS s
ON (p.product_id = s.product_id)
WHEN NOT MATCHED BY SOURCE THEN DELETE
WHEN MATCHED AND p.price != s.price THEN UPDATE SET
price = s.price,
price_date = getdate(),
update_count = update_count + 1
WHEN NOT MATCHED BY TARGET THEN INSERT
(product_id, price, price_date, update_count)
VALUES
(s.product_id, s.price, getdate(), 0);
We could colour again the rows encountered here:
PRODUCT_ID| P.PRICE| S.PRICE|
----------|---------|---------|
1| 100.00| 100.00| <-- matched (but ignored)
2| 125.00| 99.00| <-- matched
3| 150.00| | <-- not matched by source
4| | 300.00| <-- not matched by target
As can be seen, this is really just how a FULL OUTER JOIN works.
The emulation of these clauses back into standard SQL are laborious too, as we’d have to emulate this FULL OUTER JOIN explicitly. I think it’s possible, but we might not implement it in jOOQ.
Vertica
Only Vertica seems to not support the DELETE branch, meaning you cannot use a MERGE statement to DELETE data from your target table. You can use it only to INSERT or UPDATE data, which is good enough in almost all cases. Curiously, Teradata supports DELETE, but not AND, which seems kinda pointless, as DELETE and UPDATE cannot be combined this way.
Conclusion
The MERGE statement is a device whose mystery is only exceeded by its power. In a simple form (no AND or WHERE clauses, no DELETE clauses), all dialects pretty much agree, and that’s already a very useful feature set, which jOOQ has supported for a long time. Starting from jOOQ 3.14, we’re tackling also all the other features listed in this article to help you write complex, vendor agnostic MERGE statements and emulate them on all the dialects that have some MERGE support.
Want to play around with it already now?
Check out our free online SQL translation tool.
Like this:
Like Loading...
Nice post! Few ideas of future research and support across multiple dialects:
– support for RETURNING/OUTPUT/… clause from MERGE
– “INSERT over DML”(SQL Server) or nesting MERGE with other operations:
INSERT INTO tab1(…)
SELECT …
FROM (
MERGE tab2
USING tab3 ON …
WHEN MATCHED THEN…
WHEN NOT MATCHED THEN …
OUTPUT $action, …
) s(…);
“Upsert/Merge race conditions” – for instance with SQL Server it is advisable to use HOLDLOCK
Yes, the RETURNING / OUTPUT / data change delta table support is on the roadmap, in fact. I haven’t finished investigating this yet, and I don’t think it’s strictly related to MERGE. I prefer covering this topic for all 4 types of DML statments where this applies. That second query is really funky! Very regrettable that it requires the INSERT. Would be useful otherwise as well, just like the data change delta table…
This is the best description of the merge statement I’ve seen in print. I would like to use it in the sixth edition of my “SQL for smarties”, with your permission, and credit (I am assuming it will have Sixth edition). May I have your permission? Sorry I can’t give you any royalties but I can give you credit
Sure, go ahead! :)
I don’t have the details to hand, but Oracle merge behaves like 2 DML statement (not 1) which has implications to when deferred constraints are checked, row locking/contention etc Seems a bit buggy and documentation is unclear, so beware and test carefully.
Excellent article / site. Many thanks.
Yes indeed. The Oracle emulations still aren’t fully correct in jOOQ because of those flaws. Such a pity. The standard SQL MERGE is so elegant…
But I do recall getting into confused by Oracle’s concept of key-preserved table(s) when attempting updating via join / through a view – think this is Oracle trying to protect developer. Merge saved the day; suspect Merge will happily update same target row twice so developer’s responsibility to ensure data model is correct (primary/uniqueness, correct ON clause etc) when using merge.
If a row is updated twice, I’d say that’s a bug in Oracle.
I wonder what are your thoughts about implicit column resolution on MERGE statement. Here I am refering to the language extension
UPDATE SET */INSERT *
Normally we treat
SELECT *
as antipattern. This one is quite handy in context of building a generic solutions. I have often seen a dynamic code build on metadata to do exactly the same and the above code looks quite neat. It’s not about saving time by not typing code but rather have runtime resolution and kind of schema change resiliency(as long as both source/target column list matches).Reference: https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-merge-into.html
I would have used a different syntax. What does an asterisk even mean? It certainly doesn’t have the semantics of the UNION CORRESPONDING or NATURAL JOIN syntaxes. Does it mean target.* or source.*? Or something else?
For example, it’s not really clear whether source and target tables need to have matching column names, or matching column indexes, or if only those columns that are matched by name are being considered.
Ultimately, however, I think this syntax is as treacherous as the aforementioned ones. You alter a table, and suddenly you update something you didn’t intend to in some random MERGE statement that you would never think of. This is why NATURAL JOIN (or even JOIN USING) are not popular. They break too easily, and so does this syntax.