Something that has been said many times, but needs constant repeating until every developer is aware of it: the performance difference between row-by-row updating and bulk updating. If you cannot guess which one will be much faster, remember that row-by-row kinda rhymes with slow-by-slow (hint hint).
Disclaimer: This article will discuss only non-concurrent updates, which are much easier to reason about. In a concurrent update situation, a lot of additional factors will add complexity to the problem, including the locking strategy, transaction isolation levels, or simply how the database vendor implements things in detail. For the sake of simplicity, I’ll assume no concurrent updates are being made.
Example query
Let’s say we have a simple table for our blog posts (using Oracle syntax, but the effect is the same on all databases):
CREATE TABLE post (
id INT NOT NULL PRIMARY KEY,
text VARCHAR2(1000) NOT NULL,
archived NUMBER(1) NOT NULL CHECK (archived IN (0, 1)),
creation_date DATE NOT NULL
);
CREATE INDEX post_creation_date_i ON post (creation_date);
Now, let’s add 10000 rows (generated with Oracle’s CONNECT BY LEVEL syntax):
INSERT INTO post
SELECT
level,
lpad('a', 1000, 'a'),
0 AS archived,
DATE '2017-01-01' + (level / 10)
FROM dual
CONNECT BY level <= 10000;
EXEC dbms_stats.gather_table_stats('TEST', 'POST');
Now imagine we want to update this table and set all posts to ARCHIVED = 1 if they are from last year, e.g. CREATION_DATE < DATE '2018-01-01'. There are various ways to do this, but you should have built an intuition that doing the update in one single UPDATE statement is probably better than looping over each individual row and updating it explicitly. Right?
Right.
Then, why do we keep doing it?
Let me ask this differently:
Does it matter?
The best way to find out is to benchmark. I’m doing two benchmarks for this:
- One that is run in PL/SQL, showing the performance difference between different approaches that are available to PL/SQL (namely looping, the FORALL syntax, and a single bulk UPDATE)
- One that is run in Java, doing JDBC calls, showing the performance difference between different approaches available to Java (namely looping, caching PreparedStatement but still looping, batching, and a single bulk UPDATE)
Benchmarking PL/SQL
The code of the benchmark can be found in this gist. I will also include it at the bottom of this blog post. The results are:
Run 1, Statement 1 : .01457 (avg : .0098)
Run 1, Statement 2 : .0133 (avg : .01291)
Run 1, Statement 3 : .02351 (avg : .02519)
Run 2, Statement 1 : .00882 (avg : .0098)
Run 2, Statement 2 : .01159 (avg : .01291)
Run 2, Statement 3 : .02348 (avg : .02519)
Run 3, Statement 1 : .01012 (avg : .0098)
Run 3, Statement 2 : .01453 (avg : .01291)
Run 3, Statement 3 : .02544 (avg : .02519)
Run 4, Statement 1 : .00799 (avg : .0098)
Run 4, Statement 2 : .01346 (avg : .01291)
Run 4, Statement 3 : .02958 (avg : .02519)
Run 5, Statement 1 : .00749 (avg : .0098)
Run 5, Statement 2 : .01166 (avg : .01291)
Run 5, Statement 3 : .02396 (avg : .02519)
The numbers show the time in seconds it takes for each statement type to complete, each time updating 3649 of the 10000 rows. The difference between Statement 1 and Statement 3 is a factor of 2.5x. The winner is:
Statement 1, running a bulk update
It looks like this:
UPDATE post
SET archived = 1
WHERE archived = 0 AND creation_date < DATE '2018-01-01';
Runner-up (not too far away) is:
Statement 2, using the PL/SQL FORALL syntax
It works like this:
DECLARE
TYPE post_ids_t IS TABLE OF post.id%TYPE;
v_post_ids post_ids_t;
BEGIN
SELECT id
BULK COLLECT INTO v_post_ids
FROM post
WHERE archived = 0 AND creation_date < DATE '2018-01-01';
FORALL i IN 1 .. v_post_ids.count
UPDATE post
SET archived = 1
WHERE id = v_post_ids(i);
END;
Loser (by a factor of 2.5x on our specific data set) is:
Statement 3, using an ordinary LOOP and running row-by-row updates
FOR rec IN (
SELECT id
FROM post
WHERE archived = 0 AND creation_date < DATE '2018-01-01'
) LOOP
UPDATE post
SET archived = 1
WHERE id = rec.id;
END LOOP;
It does not really come as a surprise. We’re switching between the PL/SQL engine and the SQL engine many, many times, and instead of running through the post table only once in O(N) time, we’re looking up individual ID values in O(log N) time, N times. So the complexity went from O(N) to O(N log N), and we’d get far worse results for larger tables!
What about doing this from Java?
The difference is much more drastic if each call to the SQL engine has to be done over the network from another process.
Again, the benchmark code is available from a gist, and I will paste it at the end of this blog post as well.
The results are (again measuring elapsed time, printed as java.time.Duration values, i.e. in seconds):
Run 0, Statement 1: PT4.546S
Run 0, Statement 2: PT3.52S
Run 0, Statement 3: PT0.144S
Run 0, Statement 4: PT0.028S
Run 1, Statement 1: PT3.712S
Run 1, Statement 2: PT3.185S
Run 1, Statement 3: PT0.138S
Run 1, Statement 4: PT0.025S
Run 2, Statement 1: PT3.481S
Run 2, Statement 2: PT3.007S
Run 2, Statement 3: PT0.122S
Run 2, Statement 4: PT0.026S
Run 3, Statement 1: PT3.518S
Run 3, Statement 2: PT3.077S
Run 3, Statement 3: PT0.113S
Run 3, Statement 4: PT0.027S
Run 4, Statement 1: PT3.54S
Run 4, Statement 2: PT2.94S
Run 4, Statement 3: PT0.123S
Run 4, Statement 4: PT0.03S
The difference between Statement 1 and Statement 4 is a factor of more than 100x!
So, who’s winning? Again (by far):
Statement 4, running the bulk update
In fact, the time is not too far away from the time taken by PL/SQL. With larger data sets being updated, the two results will converge. The code is:
try (Statement s = c.createStatement()) {
s.executeUpdate(
"UPDATE post\n" +
"SET archived = 1\n" +
"WHERE archived = 0\n" +
"AND creation_date < DATE '2018-01-01'\n");
}
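As a side note, executeUpdate() returns the number of affected rows (3649 in our case), so even with the bulk statement, the effect of the update can still be verified from Java.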
Followed by the not-that-much-worse (but still almost 5x worse):
Statement 3, running the batch update
Batching can be compared to PL/SQL’s FORALL statement. While we’re running individual row-by-row updates, we’re sending all the update statements in one batch to the SQL engine. This saves a lot of time on the network and in all the layers in between.
The code looks like this:
try (Statement s = c.createStatement();
ResultSet rs = s.executeQuery(
"SELECT id FROM post WHERE archived = 0\n"
+ "AND creation_date < DATE '2018-01-01'"
);
PreparedStatement u = c.prepareStatement(
"UPDATE post SET archived = 1 WHERE id = ?"
)) {
while (rs.next()) {
u.setInt(1, rs.getInt(1));
u.addBatch();
}
u.executeBatch();
}
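One refinement not included in the benchmark: for much larger result sets, the batch itself can grow unbounded on the client. Here’s a minimal sketch of flushing it in chunks; the chunk size of 1000 is an assumption to be tuned per driver and data set:
// Same logic as Statement 3, but flushing every 1000 rows so the
// client-side batch stays bounded. Not part of the original benchmark.
try (Statement s = c.createStatement();
     ResultSet rs = s.executeQuery(
         "SELECT id FROM post WHERE archived = 0\n"
       + "AND creation_date < DATE '2018-01-01'"
     );
     PreparedStatement u = c.prepareStatement(
         "UPDATE post SET archived = 1 WHERE id = ?"
     )) {
    int count = 0;
    while (rs.next()) {
        u.setInt(1, rs.getInt(1));
        u.addBatch();
        if (++count % 1000 == 0)
            u.executeBatch(); // send a full chunk to the server
    }
    u.executeBatch(); // send the remainder
}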
Followed by the losers:
Statements 1 and 2, running row-by-row updates
The difference between Statement 1 and Statement 2 is that Statement 2 caches the PreparedStatement, which allows for reusing some resources. This can be a good thing, but it didn’t have a very significant effect in our case, compared to the batch / bulk alternatives. The code is:
// Statement 1:
try (Statement s = c.createStatement();
ResultSet rs = s.executeQuery(
"SELECT id FROM post\n"
+ "WHERE archived = 0\n"
+ "AND creation_date < DATE '2018-01-01'"
)) {
while (rs.next()) {
try (PreparedStatement u = c.prepareStatement(
"UPDATE post SET archived = 1 WHERE id = ?"
)) {
u.setInt(1, rs.getInt(1));
u.executeUpdate();
}
}
}
// Statement 2:
try (Statement s = c.createStatement();
ResultSet rs = s.executeQuery(
"SELECT id FROM post\n"
+ "WHERE archived = 0\n"
+ "AND creation_date < DATE '2018-01-01'"
);
PreparedStatement u = c.prepareStatement(
"UPDATE post SET archived = 1 WHERE id = ?"
)) {
while (rs.next()) {
u.setInt(1, rs.getInt(1));
u.executeUpdate();
}
}
Conclusion
As shown previously on this blog, JDBC server roundtrips have a significant cost, which can be seen in the JDBC benchmark. This cost is much more severe if we unnecessarily create many server roundtrips for a task that could be done in a single roundtrip, namely by using a SQL bulk UPDATE statement.
This is not only true for updates, but also for all the other statements, including SELECT, DELETE, INSERT, and MERGE. If doing everything in a single statement isn’t possible due to the limitations of SQL, we can still save roundtrips by grouping statements in a block, either by using an anonymous block in databases that support them:
BEGIN
statement1;
statement2;
statement3;
END;
(you can easily send these anonymous blocks over JDBC, as well!)
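For example, here’s a minimal sketch of sending such a block through plain JDBC; the three UPDATE statements are placeholders standing in for statement1 through statement3 above:
// The whole block travels to the server in a single roundtrip.
// The statements inside are placeholders, not part of the benchmark.
try (Statement s = c.createStatement()) {
    s.executeUpdate(
        "BEGIN\n" +
        "  UPDATE post SET archived = 1 WHERE id = 1;\n" +
        "  UPDATE post SET archived = 1 WHERE id = 2;\n" +
        "  UPDATE post SET archived = 1 WHERE id = 3;\n" +
        "END;");
}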
Or by emulating anonymous blocks using the JDBC batch API (which has its limitations), as sketched below, or by writing stored procedures.
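That emulation might look like this (again with placeholder statements), using JDBC’s static-statement batch API:
// Several unrelated static statements, sent in a single batch.
// Unlike PreparedStatement batching, no bind variables are involved.
try (Statement s = c.createStatement()) {
    s.addBatch("UPDATE post SET archived = 1 WHERE id = 1");
    s.addBatch("UPDATE post SET archived = 1 WHERE id = 2");
    s.addBatch("UPDATE post SET archived = 1 WHERE id = 3");
    s.executeBatch();
}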
The performance gain is not always worth the trouble of moving logic from the client to the server, but very often (as in the above case), the move is a no-brainer and there’s absolutely no reason not to make it.
So, remember: Stop doing row-by-row (slow-by-slow) operations when you could run the same operation in bulk, in a single SQL statement.
Hint: Always know what your ORM (if you’re using one) is doing, because the ORM can help you with automatic batching / bulking in many cases. But often it cannot, or it is too difficult to make it do so, and then resorting to SQL is the way to go.
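As a hedged example, assuming Hibernate as the ORM: its JDBC batching is off by default, and a single configuration property (set wherever you build your Hibernate configuration) turns the row-by-row flushes into batches:
// Assumption: Hibernate as the ORM, and hibernateProperties is the
// Properties object backing its configuration. This enables transparent
// JDBC batching for the flushed INSERT / UPDATE statements.
hibernateProperties.setProperty("hibernate.jdbc.batch_size", "50");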
Code
PL/SQL benchmark
SET SERVEROUTPUT ON
DROP TABLE post;
CREATE TABLE post (
id INT NOT NULL PRIMARY KEY,
text VARCHAR2(1000) NOT NULL,
archived NUMBER(1) NOT NULL CHECK (archived IN (0, 1)),
creation_date DATE NOT NULL
);
CREATE INDEX post_creation_date_i ON post (creation_date);
ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;
CREATE TABLE results (
run NUMBER(2),
stmt NUMBER(2),
elapsed NUMBER
);
DECLARE
v_ts TIMESTAMP WITH TIME ZONE;
PROCEDURE reset_post IS
BEGIN
EXECUTE IMMEDIATE 'TRUNCATE TABLE post';
INSERT INTO post
SELECT
level AS id,
lpad('a', 1000, 'a') AS text,
0 AS archived,
DATE '2017-01-01' + (level / 10) AS creation_date
FROM dual
CONNECT BY level <= 10000;
dbms_stats.gather_table_stats('TEST', 'POST');
END reset_post;
BEGIN
-- Repeat the whole benchmark several times to avoid warmup penalty
FOR r IN 1..5 LOOP
reset_post;
v_ts := SYSTIMESTAMP;
UPDATE post
SET archived = 1
WHERE archived = 0 AND creation_date < DATE '2018-01-01';
INSERT INTO results VALUES (r, 1, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
reset_post;
v_ts := SYSTIMESTAMP;
DECLARE
TYPE post_ids_t IS TABLE OF post.id%TYPE;
v_post_ids post_ids_t;
BEGIN
SELECT id
BULK COLLECT INTO v_post_ids
FROM post
WHERE archived = 0 AND creation_date < DATE '2018-01-01';
FORALL i IN 1 .. v_post_ids.count
UPDATE post
SET archived = 1
WHERE id = v_post_ids(i);
END;
INSERT INTO results VALUES (r, 2, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
reset_post;
v_ts := SYSTIMESTAMP;
FOR rec IN (
SELECT id
FROM post
WHERE archived = 0 AND creation_date < DATE '2018-01-01'
) LOOP
UPDATE post
SET archived = 1
WHERE id = rec.id;
END LOOP;
INSERT INTO results VALUES (r, 3, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
END LOOP;
FOR rec IN (
SELECT
run, stmt,
CAST(elapsed AS NUMBER(10, 5)) ratio,
CAST(AVG(elapsed) OVER (PARTITION BY stmt) AS NUMBER(10, 5)) avg_ratio
FROM results
ORDER BY run, stmt
)
LOOP
dbms_output.put_line('Run ' || rec.run ||
', Statement ' || rec.stmt ||
' : ' || rec.ratio || ' (avg : ' || rec.avg_ratio || ')');
END LOOP;
dbms_output.put_line('');
dbms_output.put_line('Copyright Data Geekery GmbH');
dbms_output.put_line('https://www.jooq.org/benchmark');
END;
/
DROP TABLE results;
JDBC benchmark
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.time.Duration;
import java.time.Instant;
import java.util.Properties;
public class OracleUpdate {
public static void main(String[] args) throws Exception {
Class.forName("oracle.jdbc.OracleDriver");
String url = "jdbc:oracle:thin:@192.168.99.100:1521:ORCLCDB";
String user = "TEST";
String password = "TEST";
Properties properties = new Properties();
properties.setProperty("user", user);
properties.setProperty("password", password);
try (Connection c = DriverManager.getConnection(url, properties)) {
for (int i = 0; i < 5; i++) {
Instant ts;
resetPost(c);
ts = Instant.now();
try (Statement s = c.createStatement();
ResultSet rs = s.executeQuery(
"SELECT id FROM post WHERE archived = 0 AND creation_date < DATE '2018-01-01'"
)) {
while (rs.next()) {
try (PreparedStatement u = c.prepareStatement(
"UPDATE post SET archived = 1 WHERE id = ?"
)) {
u.setInt(1, rs.getInt(1));
u.executeUpdate();
}
}
}
System.out.println("Run " + i + ", Statement 1: " + Duration.between(ts, Instant.now()));
resetPost(c);
ts = Instant.now();
try (Statement s = c.createStatement();
ResultSet rs = s.executeQuery(
"SELECT id FROM post WHERE archived = 0 AND creation_date < DATE '2018-01-01'"
);
PreparedStatement u = c.prepareStatement(
"UPDATE post SET archived = 1 WHERE id = ?"
)) {
while (rs.next()) {
u.setInt(1, rs.getInt(1));
u.executeUpdate();
}
}
System.out.println("Run " + i + ", Statement 2: " + Duration.between(ts, Instant.now()));
resetPost(c);
ts = Instant.now();
try (Statement s = c.createStatement();
ResultSet rs = s.executeQuery(
"SELECT id FROM post WHERE archived = 0 AND creation_date < DATE '2018-01-01'"
);
PreparedStatement u = c.prepareStatement(
"UPDATE post SET archived = 1 WHERE id = ?"
)) {
while (rs.next()) {
u.setInt(1, rs.getInt(1));
u.addBatch();
}
u.executeBatch();
}
System.out.println("Run " + i + ", Statement 3: " + Duration.between(ts, Instant.now()));
resetPost(c);
ts = Instant.now();
try (Statement s = c.createStatement()) {
s.executeUpdate("UPDATE post\n" +
"SET archived = 1\n" +
"WHERE archived = 0 AND creation_date < DATE '2018-01-01'\n");
}
System.out.println("Run " + i + ", Statement 4: " + Duration.between(ts, Instant.now()));
}
}
}
static void resetPost(Connection c) throws SQLException {
try (Statement s = c.createStatement()) {
s.executeUpdate("TRUNCATE TABLE post");
s.executeUpdate("INSERT INTO post\n" +
" SELECT \n" +
" level,\n" +
" lpad('a', 1000, 'a'),\n" +
" 0,\n" +
" DATE '2017-01-01' + (level / 10)\n" +
" FROM dual\n" +
" CONNECT BY level <= 10000");
s.executeUpdate("BEGIN dbms_stats.gather_table_stats('TEST', 'POST'); END;");
}
}
}
With jOOQ, there are two variants of a batch operation (batching several queries vs. batching a single query). Is there a relevant performance difference between these variants?
Yes, of course. The difference is the same as in JDBC directly. The first approach will batch several static statements (no bind variables), whereas the second one will batch a set of bind variables on a single PreparedStatement. If the queries in the first approach are identical, then the second approach should always be preferred. See this article for details: https://blog.jooq.org/2018/04/12/why-sql-bind-variables-are-important-for-performance
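A hedged sketch of the two variants (assuming a jOOQ DSLContext named ctx; the actual queries are placeholders):
// Variant 1: several static queries, batched without bind variables
ctx.batch(
    ctx.query("UPDATE post SET archived = 1 WHERE id = 1"),
    ctx.query("UPDATE post SET archived = 1 WHERE id = 2")
).execute();

// Variant 2: a single query, batched with several sets of bind variables
ctx.batch(ctx.query("UPDATE post SET archived = 1 WHERE id = ?"))
   .bind(1)
   .bind(2)
   .execute();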
Very nice performance diff, Lukas!
Using a bulk update is very efficient when we’re updating a few thousand rows, but what if we’re updating a very large table with millions of rows?
I mean, in a concurrent scenario where this same table is actively being accessed and updated by other users/apps, executing one single update statement can cause a very long lock on the table that prevents those users from accessing it. How do we deal with this kind of scenario? I think batch updating is one of the best approaches, be it via PL/SQL or even Java.
What’s your opinion?
I deliberately didn’t touch concurrency in this article, because that complicates things drastically, and the answer will depend on too many things, including the locking strategy (MVCC, 2PL) and transaction isolation level, as well as on the vendor and the actual data set being processed.
There’s a reason why JavaScript is so easy: It’s single-threaded :)
I’ll add a disclaimer to the article.
Hi Lukas
Of course a bulk update is to be preferred to batch updates if possible. I actually didn’t expect the batch update to be that much slower in this example (only 10k rows to update). As the fetch size is not explicitly defined, the default of 10 is used, which I thought would have quite some impact.
I ran your example with default fetch size 10 and with 1000. Here’s what I got:
Fetch size 1000:
Run 0, Statement 3: PT0.142S
Run 0, Statement 4: PT0.08S
Run 1, Statement 3: PT0.124S
Run 1, Statement 4: PT0.078S
Run 2, Statement 3: PT0.116S
Run 2, Statement 4: PT0.081S
Run 3, Statement 3: PT0.169S
Run 3, Statement 4: PT0.094S
Run 4, Statement 3: PT0.189S
Run 4, Statement 4: PT0.249S
Fetch size 10:
Run 0, Statement 3: PT0.276S
Run 0, Statement 4: PT0.078S
Run 1, Statement 3: PT0.33S
Run 1, Statement 4: PT0.089S
Run 2, Statement 3: PT0.301S
Run 2, Statement 4: PT0.273S
Run 3, Statement 3: PT0.409S
Run 3, Statement 4: PT0.092S
Run 4, Statement 3: PT0.377S
Run 4, Statement 4: PT0.259S
So with fetch size 10, the bulk update was on average ~3 times faster than the batch update. With fetch size 1000 it was only ~1.5 times faster, so a lot of the runtime is actually spent fetching the data.
Comparing the two methods (bulk vs. batch) on larger data sets, where multiple batches and thus more round trips are needed, the runtime difference between the methods becomes more visible again.
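For reference, a minimal sketch of raising the fetch size with the standard JDBC API (1000 matching the runs above):
// The Oracle JDBC driver fetches 10 rows per roundtrip by default;
// a larger fetch size reduces the number of fetch roundtrips for the SELECT.
try (Statement s = c.createStatement()) {
    s.setFetchSize(1000);
    // ... execute the SELECT and the batched UPDATE as before
}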
Very interesting, thanks for following up on this. Terrible mistake on my side to overlook the overhead produced by the SELECT statement; I was focusing only on the UPDATE statement.
You seem to have much smaller differences than me, though. In my case, the Oracle database runs inside of Docker on the same laptop. There might be some latency overhead due to Docker’s virtualisation and the fact that my laptop is rather crappy. You probably have a better setup, so the latency per call is less severe.
Any thoughts about using DBMS_PARALLEL_EXECUTE (https://docs.oracle.com/database/121/ARPLS/d_parallel_ex.htm#ARPLS233) for massive updates?
Sample scenario: chunk by rowid, chunk_size 10000, 8 workers.
I’ve never tried it, but with my past experience around parallel DML and other things parallel in Oracle, I wouldn’t trust this feature too much:
https://twitter.com/AndrewSayer_/status/1017085117725016064
Parallelism is really hard, especially in an OLTP system. It is often not even worth it, as scheduling overhead might cost more than the benefit you might be getting out of it.