How to Use SQL UPDATE .. RETURNING to Run DML More Efficiently

At a customer site, I recently refactored a “slow-by-slow” PL/SQL loop and turned that into an efficient set based UPDATE statement saving many lines of code and running much faster. In this blog post, I will show how that can be done. The blog post will focus on Oracle and UPDATE, but rest assured, this technique can be implemented in other databases too, and also with other DML statements, such as INSERT, DELETE, and depending on the vendor, even MERGE.

The Schema

The original logic that needed refactoring worked on the following data set (simplified for this blog post):


-- Table definition
CREATE TABLE t (
  id NUMBER(10) GENERATED ALWAYS AS IDENTITY NOT NULL PRIMARY KEY,
  category NUMBER(10) NOT NULL,
  counter NUMBER(10),
  text VARCHAR2(10) NOT NULL
);

-- Sample data
INSERT INTO t (category, text)
SELECT dbms_random.value(1, 10), dbms_random.string('a', 10)
FROM dual
CONNECT BY level &lt;= 100;

-- Output of data
SELECT *
FROM t
ORDER BY counter DESC NULLS LAST, category, id;

The sample data generated above might look like this:

ID   CATEGORY   COUNTER   TEXT
16   1                    UIXSzJxDez
25   1                    hkvvrTRbTC
29   1                    IBOJYveDgf
44   1                    VhcwOugrWB
46   1                    gBJFJrPQYy
47   1                    bVzfHznOUj
10   2                    KpHHgsRXwR
11   2                    vpkhTrkaaU
14   2                    fDlNtRdvBE

So, there were certain records belonging to some category, and there’s a counter indicating how often each record has been encountered in some system.

The “slow-by-slow” PL/SQL Logic

(“slow-by-slow” rhymes with “row-by-row”. You get the idea) Every now and then, there was a message from another system that should:

Fetch all the rows of a category
Increase the counter on each element of that category
Concatenate all the texts of that category and return those

Sounds like something that can be done very easily using a loop. In PL/SQL (but imagine you could be doing this in Java just the same):


SET SERVEROUTPUT ON
DECLARE
  v_text VARCHAR2(2000);
  v_updated PLS_INTEGER := 0;
BEGIN
  FOR r IN (
    SELECT * FROM t WHERE category = 1
  ) LOOP
    v_updated := v_updated + 1;
    
    IF v_text IS NULL THEN
      v_text := r.text;
    ELSE
      v_text := v_text || ', ' || r.text;
    END IF;
    
    IF r.counter IS NULL THEN
      UPDATE t SET counter = 1 WHERE id = r.id;
    ELSE
      UPDATE t SET counter = counter + 1 WHERE id = r.id;
    END IF;
  END LOOP;
  
  COMMIT;
  dbms_output.put_line('Rows updated: ' || v_updated);
  dbms_output.put_line('Returned:     ' || v_text);
END;
/

The result of this block would be:

Rows updated: 6
Returned:     UIXSzJxDez, hkvvrTRbTC, IBOJYveDgf, VhcwOugrWB, gBJFJrPQYy, bVzfHznOUj

And the data is now:

ID   CATEGORY   COUNTER   TEXT
16   1          1         UIXSzJxDez
25   1          1         hkvvrTRbTC
29   1          1         IBOJYveDgf
44   1          1         VhcwOugrWB
46   1          1         gBJFJrPQYy
47   1          1         bVzfHznOUj
10   2                    KpHHgsRXwR
11   2                    vpkhTrkaaU
14   2                    fDlNtRdvBE

Wonderful. What’s wrong with this? The logic is straightforward and runs quite quickly. Until you run this many many many times per second – then it suddenly starts to hurt.

Thinking Set Based

Whenever you work with RDBMS, try to think in terms of data sets and try running a bulk operation on such a data set. (Exceptions exist, see caveats below). The modification of the data can be written in a single SQL statement, instead of updating the same table many times. Here’s the SQL statement in Oracle, that does precisely the same thing:


SET SERVEROUTPUT ON
DECLARE
  v_text VARCHAR2(2000);
  v_updated PLS_INTEGER := 0;
BEGIN
  UPDATE t
  SET counter = nvl(counter, 0) + 1
  WHERE category = 1
  RETURNING
    listagg (text, ', ') WITHIN GROUP (ORDER BY text),
    count(*)
  INTO
    v_text,
    v_updated;
    
  COMMIT;
  dbms_output.put_line('Rows updated: ' || v_updated);
  dbms_output.put_line('Returned:     ' || v_text);
END;
/

Again, the same output:

Rows updated: 6
Returned:     UIXSzJxDez, hkvvrTRbTC, IBOJYveDgf, VhcwOugrWB, gBJFJrPQYy, bVzfHznOUj

And the data set is now:

ID   CATEGORY   COUNTER   TEXT
16   1          2         UIXSzJxDez
25   1          2         hkvvrTRbTC
29   1          2         IBOJYveDgf
44   1          2         VhcwOugrWB
46   1          2         gBJFJrPQYy
47   1          2         bVzfHznOUj
10   2                    KpHHgsRXwR
11   2                    vpkhTrkaaU
14   2                    fDlNtRdvBE

Below, you can see each piece of logic of the original PL/SQL block, and the corresponding logic in the revised SQL statement

There are 4 areas of interest:

Red: The category predicate
In the PL/SQL version, this predicate is a simple access predicate for the SELECT statement, over whose implicit cursor we’re iterating. In the set based SQL version, that predicate has been moved into the single bulk UPDATE statement. Thus: we’re modifying the exact same set of rows.
Blue: The number of updated rows
Before, we had a count variable that counted the number of iterations over the implicit cursor. Now, we can simply count the number of rows being updated in the bulk update statement, conveniently in the RETURNING clause. An alternative (in Oracle) would have been to use SQL%ROWCOUNT, which is available for free after a single bulk UPDATE statement.
Orange: The string concatenation
The requirement was to concatenate all the texts which are being updated. In the “slow-by-slow” PL/SQL approach, we’re again keeping around a local variable and concatenate new values to it, doing some NULL handling, initially. In the set based SQL version, we can simply use LISTAGG() in the RETURNING clause. Notice, there seems to be a bug with this usage of LISTAGG. The ORDER BY clause has no effect.
Green: The actual update
In the “slow-by-slow” version, we run 1 UPDATE statement per row, which can turn out to be devastating, if we’re updating a lot of rows. Besides, in this particular case, the developer(s) have been unaware of the possibility of NULL handling using NVL() (or COALESCE() or similar). There is really only one UPDATE statement necessary here.

That already looks a lot neater.

How does it perform?

In a quick test script, which I’ve linked here, I could observe the following times for the above test data set, when running each approach 5 x 10000 times:

Run 1, Statement 1 : 2.63841 (avg : 2.43714)
Run 1, Statement 2 : 1.11019 (avg : 1.04562)
Run 2, Statement 1 : 2.35626 (avg : 2.43714)
Run 2, Statement 2 : 1.05716 (avg : 1.04562)
Run 3, Statement 1 : 2.38004 (avg : 2.43714)
Run 3, Statement 2 : 1.05153 (avg : 1.04562)
Run 4, Statement 1 : 2.47451 (avg : 2.43714)
Run 4, Statement 2 : 1.00921 (avg : 1.04562)
Run 5, Statement 1 : 2.33649 (avg : 2.43714)
Run 5, Statement 2 : 1.00000 (avg : 1.04562)

As always, I’m not publishing actual benchmark times, but relative times compared to the fastest run. The set based approach is consistently 2.5x faster on my machine (Oracle 18c on Docker on Windows 10 / SSD). This is updating 6 rows per execution. When we remove the WHERE category = 1 predicate, updating the entirety of the 100 rows each time, we get even more drastic results. I’m now running this 5 x 2000 times to get:

Run 1, Statement 1 : 10.21833 (avg : 11.98154)
Run 1, Statement 2 : 1.219130 (avg : 1.739260)
Run 2, Statement 1 : 10.17014 (avg : 11.98154)
Run 2, Statement 2 : 3.027930 (avg : 1.739260)
Run 3, Statement 1 : 9.444620 (avg : 11.98154)
Run 3, Statement 2 : 1.000000 (avg : 1.739260)
Run 4, Statement 1 : 20.54692 (avg : 11.98154)
Run 4, Statement 2 : 1.193560 (avg : 1.739260)
Run 5, Statement 1 : 9.527690 (avg : 11.98154)
Run 5, Statement 2 : 2.255680 (avg : 1.739260)

At this point, no one needs to be convinced anymore that a set based approach is much better for updating your data than a row-by-row approach in a language like PL/SQL or Java, etc.

Caveats

Bulk updates are much better than row-by-row (remember: “slow-by-slow”) updates, regardless if you’re using PL/SQL or Java or whatever client language. This is because the optimiser can plan the update much more efficiently when it knows which rows will be updated in bulk, rather than seeing each individual row update afresh, not being able to plan ahead for the remaining number of updates. However, in situations where a lot of other processes are reading the same data while you’re bulk updating them, you need to be more careful. In such cases, a bulk update can cause trouble keeping locks and log files busy while you’re updating and while the other processes need to access the data prior to your update. One size never fits all, but at least, in every situation where you loop over a result set to update some data (or fetch additional data), ask yourself: Could I have written that logic in a single SQL statement? The answer is very often: Yes.

Other databases

A few other databases support similar language features. These include:

DB2: Implements the SQL standard (see below)
Firebird: Exactly like Oracle: RETURNING
PostgreSQL: Exactly like Oracle: RETURNING
SQL Server: Similar, a bit less powerful OUTPUT clause

The DB2 syntax is quite noteworthy, because:

It is very elegant
It corresponds to the SQL standard

The UPDATE statement would have been nested in a SELECT statement:


SELECT 
  listagg (text, ', ') WITHIN GROUP (ORDER BY id),
  count(*)
FROM FINAL TABLE (
  UPDATE t
  SET counter = nvl(counter, 0) + 1
  WHERE category = 1
)

4 thoughts on “How to Use SQL UPDATE .. RETURNING to Run DML More Efficiently”

Lukasz says:

September 26, 2018 at 16:43

SQL Server doesn’t allow to use aggregate functions in OUTPUT.

CREATE TABLE tab(i INT);
INSERT INTO tab(i) VALUES (1),(2);

UPDATE tab
SET i = 10
OUTPUT COUNT(*);
–>> An aggregate may not appear in the OUTPUT clause.

And for PostgreSQL syntax from Oracle won’t work too:

UPDATE tab
SET i = 10
RETURNING COUNT(*);
— >> aggregate functions are not allowed in RETURNING

But it could be as elegant as DB2:

WITH cte AS (
UPDATE tab
SET i =10 * i
RETURNING *
)
SELECT COUNT(*), STRING_AGG(i::text, ‘,’ ORDER BY i)
FROM cte;

https://dbfiddle.uk/?rdbms=postgres_11&fiddle=e0f62de3b550e8396412ba2b19ede294

As for Oracle RETURNING combined with aggregate functions is really nice feature, but it looks like it is undocummented: https://stackoverflow.com/questions/52337932/oracle-returning-combined-with-aggregate-functions

Loading...

1. lukaseder says:
  
  September 26, 2018 at 16:55
  
  Thanks a lot for looking up these things
  
  Loading...
  
Ben Brumm says:

September 26, 2018 at 21:48

Great article Lukas! I was aware that updating data row-by-row inside a loop was usually worse than performing an update as a set, but it’s good to have some examples with runtime comparisons. Also the side-by-side comparison showing how the logic has moved between both versions of code is very helpful. Thanks!

Loading...

1. lukaseder says:
  
  September 28, 2018 at 12:02
  
  Thanks for the nice words, Ben. Indeed, that side-by-side comparison is a thing I had wanted to try for a while. It’s very suitable to compare imperative and set-based algorithms.
  
  Loading...

How to Use SQL UPDATE .. RETURNING to Run DML More Efficiently

The Schema

The “slow-by-slow” PL/SQL Logic

Thinking Set Based

How does it perform?

Caveats

Other databases

Like this:

Published by lukaseder

4 thoughts on “How to Use SQL UPDATE .. RETURNING to Run DML More Efficiently”

Leave a Reply to LukaszCancel reply

The Schema

The “slow-by-slow” PL/SQL Logic

Thinking Set Based

How does it perform?

Caveats

Other databases

Like this:

Published by lukaseder

4 thoughts on “How to Use SQL UPDATE .. RETURNING to Run DML More Efficiently”

Leave a Reply to LukaszCancel reply

Discover more from Java, SQL and jOOQ.