Oracle’s BINARY_DOUBLE Can Be Much Faster Than NUMBER

Using the right data type for a calculation sounds like obvious advice.

There are many blogs about using temporal data types for temporal data instead of strings. An obvious reason is data integrity and correctness: we don’t gain much by storing dates as 2019-09-10 in one record, and as Nov 10, 2019 in the next one. Performance is also important in this case – for example, because the optimiser gets better statistics (an often overlooked side effect of unfit data types).

But at a customer site, I’ve recently discovered a surprising (not surprising in hindsight) performance issue when using NUMBER instead of BINARY_DOUBLE in an Oracle database.

NUMBER for monetary amounts

The NUMBER type (or DECIMAL, NUMERIC in other RDBMS) is perfectly suited for all decimal numbers that require correct precision and rounding. If you ever encounter an unrounded floating point amount on an invoice (such as when I purchase video games on Steam), your trust level for the platform immediately goes down by a factor of 1.0e+1 – even if, technically, it does not matter in this case. So the default in banking systems, for example, is to always use NUMBER for monetary amounts, or an equivalent type such as java.math.BigDecimal.

Calculations on numbers

But now, let’s assume we want to do some statistics on these amounts. Some aggregations of various numeric values, maybe even of their logarithms. We can establish that these two expressions are equal:

ln(a * b) = ln(a) + ln(b)

Or in other words, for positive numbers:

a * b = exp(ln(a) + ln(b))

We’ve already blogged about how this approach can be used to emulate a PRODUCT() aggregate function in SQL, which is very useful for some cases, but none of the SQL databases jOOQ supports has built-in support for this yet. Notice, the blog post also takes care of zero and negative numbers.
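
For positive values only, such an emulation could look like this minimal sketch (table and column names are hypothetical):

SELECT exp(sum(ln(some_number))) AS product
FROM some_table;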

But which number type to choose?

Now, we might be tempted to just calculate the LN(SOME_NUMBER) value, and sum that up using SUM(LN(SOME_NUMBER)) for this use case. This turns out to be terribly slow in Oracle. We first suspected bad indexes, and even doubted aggregation in general, until I suggested we try BINARY_DOUBLE instead. In our case, we didn’t care about the exact numeric value of the amount. An IEEE 754 floating point number with double precision was going to be good enough.

The results were surprising. In a simple benchmark, we compared 10 approaches to calculating this sum of logarithms:

  1. Using NUMBER(20, 10) and SUM(LN(the_number))
  2. Using NUMBER(20, 10) and SUM(LN(CAST(the_number AS BINARY_DOUBLE)))
  3. Using NUMBER(20, 10) and SUM(LN(TO_BINARY_DOUBLE(the_number)))
  4. Using NUMBER(20, 10), containing a pre-calculated LN value
  5. Using DOUBLE PRECISION and SUM(LN(the_double))
  6. Using DOUBLE PRECISION and SUM(LN(CAST(the_double AS BINARY_DOUBLE)))
  7. Using DOUBLE PRECISION and SUM(LN(TO_BINARY_DOUBLE(the_double)))
  8. Using DOUBLE PRECISION, containing a pre-calculated LN value
  9. Using BINARY_DOUBLE and SUM(LN(the_binary))
  10. Using BINARY_DOUBLE, containing a pre-calculated LN value

These were the thoughts:

  • We tried the above 3 possible numeric data types, expecting BINARY_DOUBLE to be the fastest
  • We tried to pre-calculate the LN() value for the benchmark, to see how much effort goes into summing, and how much effort goes into the LN() calculation with each type. While in general, in this system, such precalculation is impractical, we still wanted to have a benchmark comparison, in case a materialised view or other technique would be feasible.
  • We tried casting and converting each type to BINARY_DOUBLE prior to passing the value to the LN() function. Instead of migrating all the data (with possible side effects), we wanted to see if we can solve this to a reasonable extent “on the fly”

The benchmark that I’m posting here uses this table and example data (full benchmark code at the end of the article):

CREATE TABLE data (
  n1 NUMBER(20, 10),
  n2 NUMBER(20, 10),
  d1 DOUBLE PRECISION,
  d2 DOUBLE PRECISION,
  b1 BINARY_DOUBLE,
  b2 BINARY_DOUBLE
);

INSERT INTO data
SELECT level, ln(level), level, ln(level), level, ln(level)
FROM dual
CONNECT BY level <= 100000;

So, we have 100000 records, whose SUM(LN(x)) we want to calculate in the above 10 different ways. N1, D1, and B1 contain the raw numeric value in their respective data types, and N2, D2, and B2 contain the pre-calculated LN() values.

The benchmark technique is described here. Do note it has a lot of caveats and is only useful for a limited number of verifications. Please always be careful when running such benchmarks, they often do not test production-like situations – and never use such benchmarks to compare different RDBMS products directly. They are only useful to compare two approaches on the same RDBMS product.

The results as run on Oracle 18c XE in Docker on development hardware are below. Times are compared relative to the fastest run, as actual time spent in each execution is not interesting for comparison. We did have similar results on production-like hardware, though, and I’m sure you can produce similar results in other RDBMS:

NUMBER(20, 10)
-------------------------------------------------
Run 3, Statement 1 : 280.34143  (avg : 280.75347)
Run 3, Statement 2 : 7.99402    (avg : 8.03506)
Run 3, Statement 3 : 7.71383    (avg : 7.73017)
Run 3, Statement 4 : 1.05456    (avg : 1.11735)

DOUBLE PRECISION
------------------------------------------------
Run 3, Statement 5 : 278.89476 (avg : 279.72981)
Run 3, Statement 6 : 8.06512   (avg : 8.07033)
Run 3, Statement 7 : 7.81873   (avg : 7.80063)
Run 3, Statement 8 : 1.5315    (avg : 1.54347)

BINARY_DOUBLE
------------------------------------------------
Run 3, Statement 9 : 2.4963    (avg : 2.57184)
Run 3, Statement 10: 1         (avg : 1.02943)

How to read these results?

  • Statement 10 is the fastest one, unsurprisingly, as it aggregates pre-calculated LN(binary_double) values. The precalculation of the function means that all the work has been done already before the report, and the data type is the one we expected to perform best in general
  • Statements 4 and 8 are almost as fast (precalculated LN() values). Being only slight factors off, we can attribute the difference to the usual benchmark flaws, although it’s interesting to see that DOUBLE PRECISION seems 1.5x slower to sum than BINARY_DOUBLE and even NUMBER
  • Statements 1, 5, 9 are the ones where no data type conversion is applied and SUM(LN(the_value)) is being calculated. It is staggering how much slower both NUMBER and DOUBLE PRECISION are than BINARY_DOUBLE. Statements 1 and 5 are a factor of 112x slower than statement 9!
  • Statements 2-3, 6-7 prove that converting the NUMBER or DOUBLE PRECISION values to BINARY_DOUBLE on the fly provides a sufficiently performant workaround, which made statements 2-3, 6-7 only 3x slower than statement 9
  • Statements 2-3, 6-7 show that casting and converting are about equivalent

Note, we found similar results with other functions, such as EXP().

Analysis

The order of magnitude may seem surprising at first, but thinking about it, it is not. We would never do CPU intensive computation with java.math.BigDecimal in Java. The BigDecimal type is there for numeric accuracy, e.g. when it really matters what the monetary amount is, exactly. When we run analytics on monetary amounts, using double is sufficient in Java as well.

If our data is BigDecimal, and we cannot reasonably change that, it might still be better to use the BigDecimal::doubleValue conversion prior to further processing using e.g. Math::log. This translates directly to SQL, whose LN() implementations are data type specific – IEEE 754 having been designed precisely for this kind of computation.

When doing CPU intensive computations, whether in Java or in the database, we should always evaluate our various options of:

  • Quick fixing our data sets for the report (ad-hoc conversion prior to calculation)
  • Thoroughly fixing our data sets in the schema (migration towards a better data type)
  • Preprocessing our data sets (precalculating some very commonly used computations, as sketched below)
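
As an example of the third option, the LN() values could be precalculated once, e.g. in a materialised view, rather than computed in every report. A sketch, assuming such a view is feasible in the system at hand:

CREATE MATERIALIZED VIEW data_ln AS
SELECT b1, ln(b1) AS ln_b1
FROM data;

SELECT sum(ln_b1) FROM data_ln;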

Benchmark logic

Just run the below on your own hardware. I’m curious to see your results:

-- Copyright Data Geekery GmbH
--
-- Licensed under the Apache License, Version 2.0 (the "License");
-- you may not use this file except in compliance with the License.
-- You may obtain a copy of the License at
--
--     http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing, software
-- distributed under the License is distributed on an "AS IS" BASIS,
-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-- See the License for the specific language governing permissions and
-- limitations under the License.
--
-- This version displays relative execution times (fastest execution = 1)
SET SERVEROUTPUT ON

ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;

CREATE TABLE data (
  n1 NUMBER(20, 10),
  n2 NUMBER(20, 10),
  d1 DOUBLE PRECISION,
  d2 DOUBLE PRECISION,
  b1 BINARY_DOUBLE,
  b2 BINARY_DOUBLE
);

INSERT INTO data
SELECT level, ln(level), level, ln(level), level, ln(level)
FROM dual
CONNECT BY level <= 100000;

CREATE TABLE results (
  run     NUMBER(2),
  stmt    NUMBER(2),
  elapsed NUMBER
);

DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 10;
  v_stmt NUMBER;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
    v_stmt := 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(ln(n1)) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
  
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(ln(cast(n1 as binary_double))) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
  
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(ln(to_binary_double(n1))) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
  
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(n2) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
    
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(ln(d1)) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
    
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(ln(cast(d1 as binary_double))) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
    
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(ln(to_binary_double(d1))) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
  
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(d2) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(ln(b1)) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
  
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
    v_stmt := v_stmt + 1;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT sum(b2) FROM data
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
  END LOOP;
  
  FOR rec IN (
    SELECT 
      run, stmt, 
      CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio,
      CAST(AVG(elapsed) OVER (PARTITION BY stmt) / MIN(elapsed) OVER() AS NUMBER(10, 5)) avg_ratio
    FROM results
    ORDER BY run, stmt
  )
  LOOP
    dbms_output.put_line('Run ' || rec.run || 
      ', Statement ' || rec.stmt || 
      ' : ' || rec.ratio || ' (avg : ' || rec.avg_ratio || ')');
  END LOOP;
  
  dbms_output.put_line('');
  dbms_output.put_line('Copyright Data Geekery GmbH');
  dbms_output.put_line('https://www.jooq.org/benchmark');
END;
/

DROP TABLE data;
DROP TABLE results;

How to Fetch All Current Identity Values in Oracle

Oracle 12c has introduced the useful SQL standard IDENTITY feature, which is essentially just syntax sugar for binding a sequence to a column default. We can use it like this:

create table t1 (col1 number generated always as identity);
create table t2 (col2 number generated always as identity);

insert into t1 values (default);
insert into t1 values (default);
insert into t1 values (default);
insert into t2 values (default);

select * from t1;
select * from t2;

Which produces

COL1
----
  1
  2
  3

COL2
----
  1

For unit testing against our database, we might want to know what “state” our identities are in. For each table, we would like to know the next value such an identity would produce. If we knew all the backing sequence names, we could query their seq.currval, but we don’t know those sequence names as they are generated.

However, we can query the dictionary views to get this information as follows:

select data_default
from user_tab_cols
where data_default is not null
and identity_column = 'YES'
and table_name in ('T1', 'T2');

This would produce:

"TEST"."ISEQ$$_116601".nextval
"TEST"."ISEQ$$_116603".nextval

Now, if we’re lazy, we could just run EXECUTE IMMEDIATE on each of those expressions and we’re done:

set serveroutput on
declare
  v_current number;
begin
  for rec in (
    select table_name, data_default
    from user_tab_cols
    where data_default is not null
    and identity_column = 'YES'
    and table_name in ('T1', 'T2')
  ) loop
    execute immediate replace(
      'select ' || rec.data_default || ' from dual', 
      '.nextval', 
      '.currval'
    ) into v_current;
    dbms_output.put_line(
      'Table : ' || rec.table_name || 
      ', currval : ' || v_current
    );
  end loop;
end;
/

This would produce:

Table : T1, currval : 3
Table : T2, currval : 1

Alternatively, if you want this result to be a SQL result instead of DBMS_OUTPUT content, you could run this:

with
  function current_value(p_table_name varchar2) return number is
    v_current number;
  begin
    for rec in (
      select data_default
      from user_tab_cols
      where table_name = p_table_name
      and data_default is not null
      and identity_column = 'YES'
    )
    loop
      execute immediate replace(
        'select ' || rec.data_default || ' from dual', 
        '.nextval', 
        '.currval'
      ) into v_current;
      return v_current;
    end loop;
    
    return null;
  end;
select *
from (
  select table_name, current_value(table_name) current_value
  from user_tables
  where table_name in ('T1', 'T2')
)
where current_value is not null
order by table_name;
/

The alternative using user_tab_identity_cols would look like this:

with
  function current_value(p_table_name varchar2) return number is
    v_current number;
  begin
    for rec in (
      select sequence_name
      from user_tab_identity_cols
      where table_name = p_table_name
    )
    loop
      execute immediate 
        'select ' || rec.sequence_name || '.currval from dual'
      into v_current;
      return v_current;
    end loop;
     
    return null;
  end;
select *
from (
  select table_name, current_value(table_name) current_value
  from user_tables
)
where current_value is not null
order by table_name;
/

The result is now a nice SQL result set:

TABLE_NAME   CURRENT_VALUE
--------------------------
T1           3
T2           1

Using IGNORE NULLS With SQL Window Functions to Fill Gaps

I found a very interesting SQL question on Twitter recently.

Rephrasing the question: We have a set of sparse data points:

+------------+-------+
| VALUE_DATE | VALUE |
+------------+-------+
| 2019-01-01 |   100 |
| 2019-01-02 |   120 |
| 2019-01-05 |   125 |
| 2019-01-06 |   128 |
| 2019-01-10 |   130 |
+------------+-------+

Since dates form a continuous series of discrete points, why not fill in the gaps between 2019-01-02 and 2019-01-05, or between 2019-01-06 and 2019-01-10? The desired output would be:

+------------+-------+
| VALUE_DATE | VALUE |
+------------+-------+
| 2019-01-01 |   100 |
| 2019-01-02 |   120 | <-+
| 2019-01-03 |   120 |   | -- Generated
| 2019-01-04 |   120 |   | -- Generated
| 2019-01-05 |   125 |
| 2019-01-06 |   128 | <-+
| 2019-01-07 |   128 |   | -- Generated
| 2019-01-08 |   128 |   | -- Generated
| 2019-01-09 |   128 |   | -- Generated
| 2019-01-10 |   130 |
+------------+-------+

In the generated rows, we’ll just repeat the most recent value.

How to do this with SQL?

For the sake of this example, I’m using Oracle SQL, as the OP was expecting to do this with Oracle. The idea is to do this in two steps:

  1. Generate all the dates between the first and the last data points
  2. For each date, find either the current data point, or the most recent one

But first, let’s create the data:

create table t (value_date, value) as
  select date '2019-01-01', 100 from dual union all
  select date '2019-01-02', 120 from dual union all
  select date '2019-01-05', 125 from dual union all
  select date '2019-01-06', 128 from dual union all
  select date '2019-01-10', 130 from dual;

1. Generating all the dates

In Oracle, we can use the convenient CONNECT BY syntax for this. We could also use some other tool to generate dates to fill the gaps, including SQL standard recursion using WITH, or some PIPELINED function, but I like CONNECT BY for this purpose.
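
For reference, a standard recursive WITH variant might look roughly like this (a sketch; Oracle requires the explicit column list on the recursive common table expression):

with dates (value_date, max_date) as (
  select min(value_date), max(value_date) from t
  union all
  select value_date + 1, max_date
  from dates
  where value_date < max_date
)
select value_date
from dates;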

We’ll write:

select (
  select min(t.value_date) 
  from t
) + level - 1 as value_date
from dual
connect by level <= (
  select max(t.value_date) - min(t.value_date) + 1
  from t
)

This produces:

VALUE_DATE|
----------|
2019-01-01|
2019-01-02|
2019-01-03|
2019-01-04|
2019-01-05|
2019-01-06|
2019-01-07|
2019-01-08|
2019-01-09|
2019-01-10|

Now we wrap the above query in a derived table and left join the actual data set:

select 
  d.value_date,
  t.value
from (
  select (
    select min(t.value_date) 
    from t
  ) + level - 1 as value_date
  from dual
  connect by level <= (
    select max(t.value_date) - min(t.value_date) + 1
    from t
  )
) d
left join t
on d.value_date = t.value_date
order by d.value_date;

The date gaps are now filled, but our values column is still sparse:

VALUE_DATE|VALUE|
----------|-----|
2019-01-01|  100|
2019-01-02|  120|
2019-01-03|     |
2019-01-04|     |
2019-01-05|  125|
2019-01-06|  128|
2019-01-07|     |
2019-01-08|     |
2019-01-09|     |
2019-01-10|  130|

2. Fill the value gaps

On each row, the VALUE column should either contain the actual value, or the “last_value” preceding the current row, ignoring all the nulls. Note that I deliberately phrased this requirement in exactly those terms. We can now translate that sentence directly to SQL:

last_value (t.value) ignore nulls over (order by d.value_date)

Since we have added an ORDER BY clause to the window function, the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW applies, which colloquially means “all the preceding rows”. (Technically, that’s not accurate. It means all rows with values less than or equal to the value of the current row – see Kim Berg Hansen’s comment)

Convenient! We’re trying to find the last value in the window of all the preceding rows, ignoring the nulls.

This is standard SQL, but unfortunately not all RDBMS support IGNORE NULLS. Among the RDBMS supported by jOOQ, these currently support the syntax:

  • DB2
  • H2
  • Informix
  • Oracle
  • Redshift
  • Sybase SQL Anywhere
  • Teradata

Sometimes, the exact standard syntax is not supported, but the feature is still available through a vendor specific syntax. Use https://www.jooq.org/translate to see different syntax variants.

The full query now reads:

select 
  d.value_date,
  last_value (t.value) ignore nulls over (order by d.value_date)
from (
  select (
    select min(t.value_date) 
    from t
  ) + level - 1 as value_date
  from dual
  connect by level <= (
    select max(t.value_date) - min(t.value_date) + 1
    from t
  )
) d
left join t
on d.value_date = t.value_date
order by d.value_date;

… and it yields the desired result:

VALUE_DATE         |VALUE|
-------------------|-----|
2019-01-01 00:00:00|  100|
2019-01-02 00:00:00|  120|
2019-01-03 00:00:00|  120|
2019-01-04 00:00:00|  120|
2019-01-05 00:00:00|  125|
2019-01-06 00:00:00|  128|
2019-01-07 00:00:00|  128|
2019-01-08 00:00:00|  128|
2019-01-09 00:00:00|  128|
2019-01-10 00:00:00|  130|

Other RDBMS

This solution made use of some Oracle specific features such as CONNECT BY. In other RDBMS, the same idea can be implemented by using a different way of generating data. This article focuses only on using IGNORE NULLS. If you’re interested, feel free to post an alternative solution in the comments for your RDBMS.
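
As a starting point for such alternatives: where IGNORE NULLS is not available, the same effect can often be emulated with a counting trick. A sketch, reusing the Oracle date generation from above (the technique itself is portable):

select
  value_date,
  max(value) over (partition by grp) as value
from (
  select
    d.value_date,
    t.value,
    -- count(t.value) skips nulls, so each generated gap row gets the same
    -- group number as the last row that actually carried a value
    count(t.value) over (order by d.value_date) as grp
  from (
    select (select min(t.value_date) from t) + level - 1 as value_date
    from dual
    connect by level <= (
      select max(t.value_date) - min(t.value_date) + 1 from t
    )
  ) d
  left join t
  on d.value_date = t.value_date
) filled
order by value_date;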

Calling an Oracle Function with PL/SQL BOOLEAN Type from SQL

One of the most wanted features in the Oracle database is the BOOLEAN type. The SQL standard specified it a while ago, and RDBMS like PostgreSQL show how powerful it can be, e.g. when using the EVERY() aggregate function.

The PL/SQL language already has support for boolean types. We can write:

CREATE OR REPLACE FUNCTION number_to_boolean (i NUMBER) 
RETURN BOOLEAN 
IS
BEGIN
  RETURN NOT i = 0;
END number_to_boolean;
/

CREATE OR REPLACE FUNCTION boolean_to_number (b BOOLEAN) 
RETURN NUMBER 
IS
BEGIN
  RETURN CASE WHEN b THEN 1 WHEN NOT b THEN 0 END;
END boolean_to_number;
/

From PL/SQL, we can now easily call the above functions:

SET SERVEROUTPUT ON
BEGIN
  IF number_to_boolean(1) THEN
    dbms_output.put_line('1 is true');
  END IF;
  IF NOT number_to_boolean(0) THEN
    dbms_output.put_line('0 is false');
  END IF;
  IF number_to_boolean(NULL) IS NULL THEN
    dbms_output.put_line('null is null');
  END IF;
END;
/

The above prints

1 is true
0 is false
null is null

But we cannot do the same from the SQL engine:

SELECT 
  number_to_boolean(1), 
  number_to_boolean(0), 
  number_to_boolean(null) 
FROM dual;

This yields:

ORA-00902: invalid datatype

Eventually, Oracle will fix this by supporting boolean types in the SQL engine (show your love to Oracle here).

The WITH clause

Until then, we can make use of a nice workaround using new functionality from Oracle 12c. We can declare functions in the WITH clause! Run this:

WITH
  FUNCTION f RETURN NUMBER IS 
  BEGIN 
    RETURN 1; 
  END f;
SELECT f
FROM dual;

You’ll get

 F
---
 1

That’s wonderful, and what’s even better, this part of the WITH clause is written in PL/SQL, where we can use the BOOLEAN type again. So we can define bridge functions for each function call. Instead of this:

SELECT 
  number_to_boolean(1), 
  number_to_boolean(0), 
  number_to_boolean(null) 
FROM dual;

We can write this:

WITH
  FUNCTION number_to_boolean_(i NUMBER)
  RETURN NUMBER
  IS
    b BOOLEAN;
  BEGIN
    -- Actual function call
    b := number_to_boolean(i);
    
    -- Translation to numeric result
    RETURN CASE b WHEN TRUE THEN 1 WHEN FALSE THEN 0 END;
  END number_to_boolean_;
SELECT 
  number_to_boolean_(1) AS a, 
  number_to_boolean_(0) AS b, 
  number_to_boolean_(null) AS c
FROM dual;

This now yields:

 A   B   C
-------------
 1   0   null

Of course, we don’t get an actual boolean type back in the result set, as the SQL engine cannot process that. But if you’re calling this function from JDBC, 1/0/null can be translated transparently to true/false/null.

It also works for chaining. Instead of the following, which still yields ORA-00902:

SELECT 
  boolean_to_number(number_to_boolean(1)), 
  boolean_to_number(number_to_boolean(0)), 
  boolean_to_number(number_to_boolean(null))
FROM dual;

We can write this:

WITH
  FUNCTION number_to_boolean_(i NUMBER)
  RETURN NUMBER
  IS
    b BOOLEAN;
  BEGIN
    -- Actual function call
    b := number_to_boolean(i);
    
    -- Translation to numeric result
    RETURN CASE b WHEN TRUE THEN 1 WHEN FALSE THEN 0 END;
  END number_to_boolean_;
  
  FUNCTION boolean_to_number_(b NUMBER)
  RETURN NUMBER
  IS
  BEGIN
    -- Actual function call
    RETURN boolean_to_number(NOT b = 0);
  END boolean_to_number_;
SELECT 
  boolean_to_number_(number_to_boolean_(1)) AS a, 
  boolean_to_number_(number_to_boolean_(0)) AS b, 
  boolean_to_number_(number_to_boolean_(null)) AS c
FROM dual;

… which again yields

 A   B   C
-------------
 1   0   null

And now, the 1/0/null integers are the actual desired result types.

This technique can be automated for any type of PL/SQL function that accepts and/or returns a PL/SQL BOOLEAN type, or even for functions that accept %ROWTYPE parameters, which we’ll work into jOOQ in the near future.

A more real world example can be seen in this Stack Overflow question.

jOOQ 3.12 support

In jOOQ 3.12, we will add native support for using such functions in SQL through #8522. We have already supported PL/SQL boolean types in standalone procedure calls since jOOQ 3.8. With the next version, we can call a function like this one:

FUNCTION f_bool (i BOOLEAN) RETURN BOOLEAN;

From anywhere within a jOOQ statement, e.g.

Record1<Integer> r =
create()
    .select(one())
    .where(PlsObjects.fBool(false))
    .fetchOne();

assertNull(r);

When the above is called, the following SQL statement is generated by jOOQ 3.12, behind the scenes:

with
  function "F_BOOL_"(I integer)
  return integer
  is
    "r" boolean;
  begin
    "r" := "TEST"."PLS_OBJECTS"."F_BOOL"(not I = 0);
    return case when "r" then 1 when not "r" then 0 end;
  end "F_BOOL_";
  select 1
from dual
where (F_BOOL_(0) = 1)

Notice how the boolean expression reads like a true boolean / predicate?

Calculate Percentiles to Learn About Data Set Skew in SQL

B-Tree indexes are perfect when your data is uniformly distributed. They are not really useful when you have skewed data. I’ll explain later why this is the case, but let’s first learn how to detect “skew”.

What is skew?

Skew is a term from statistics describing a distribution that is not symmetric. The example given on Wikipedia shows such an asymmetric distribution.

In RDBMS, we sometimes use the term skew colloquially to mean the same thing as non-uniform distribution, i.e. a normal distribution would also be skewed. We simply mean that some values appear more often than others. Thus, I will put the term “skew” in double quotes in this article. While your RDBMS’s statistics contain this information once they are calculated, we can also detect such “skew” manually in ad-hoc queries using percentiles, which are defined in the SQL standard and supported in a variety of databases, as ordinary aggregate functions, including:

  • Oracle
  • PostgreSQL
  • SQL Server (regrettably, only as window functions)

Uniform distribution

Let’s look at the FILM_ID values in the Sakila database:

SELECT
  percentile_disc(0.0) WITHIN GROUP (ORDER BY film_id) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY film_id) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY film_id) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY film_id) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY film_id) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY film_id) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY film_id) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY film_id) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY film_id) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY film_id) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY film_id) AS "100%"
FROM film;

What are we calculating here? We’re trying to find 11 different values for which we can say that:

  • 0% of the film_ids are lower than the “0%” value
  • 10% of the film_ids are lower than the “10%” value

Or in other words:

  • 0% is the MIN(film_id) value
  • 50% is the MEDIAN(film_id) value
  • 100% is the MAX(film_id) value

The result shows an unsurprisingly uniform distribution:

0% |10% |20% |30% |40% |50% |60% |70% |80% |90% |100% |
---|----|----|----|----|----|----|----|----|----|-----|
1  |100 |200 |300 |400 |500 |600 |700 |800 |900 |1000 |

We can plot this in Microsoft Excel or some other tool to get a nice, linear curve.

This is not surprising, as the IDs are just consecutive values, which is a desired property of surrogate keys.
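
As a quick sanity check of those semantics, the boundary percentiles coincide with the ordinary MIN and MAX aggregates (a sketch against the same table):

SELECT
  min(film_id) AS min_id,
  percentile_disc(0.0) WITHIN GROUP (ORDER BY film_id) AS "0%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY film_id) AS "100%",
  max(film_id) AS max_id
FROM film;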

“Skewed” distribution

It’s a different story when we look at the distribution of amounts in the payment table:

SELECT
  percentile_disc(0.0) WITHIN GROUP (ORDER BY amount) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY amount) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY amount) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY amount) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY amount) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY amount) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY amount) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY amount) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY amount) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY amount) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY amount) AS "100%"
FROM payment;

We’re now getting:

0%   |10%  |20%  |30%  |40%  |50%  |60%  |70%  |80%  |90%  |100% 
-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----
0.00 |0.99 |1.99 |2.99 |2.99 |3.99 |4.99 |4.99 |5.99 |6.99 |11.99

This looks … “skewed”, although clearly the bias is mainly caused by the fact that this data is generated. When we plot the above, we’re getting:

The slope is less steep at the beginning of this curve, which essentially means that more values exist at the lower end of the range than at the upper end. We can validate this with another query:

SELECT amount, count(*)
FROM (
  SELECT trunc(amount) AS amount
  FROM payment
) t 
GROUP BY amount
ORDER BY amount;

… which yields:

amount |count |
-------|------|
0      |3003  |
1      |641   |
2      |3542  |
3      |1117  |
4      |3789  |
5      |1306  |
6      |1119  |
7      |675   |
8      |486   |
9      |257   |
10     |104   |
11     |10    |

When plotting this, we can see that there are more amounts in the lower half of the range than in the upper half, which leads to the percentiles growing more slowly.

Correlations

This technique can also be applied to detect correlations in data. We can, for instance, try to find the percentiles of the length of films, and group data sets by rating. I’m using a GROUPING SETS function here, the ROLLUP() function, to calculate the grand total as well. Just check out the query and its results, and you’ll see:

SELECT
  rating,
  count(*),
  percentile_disc(0.0) WITHIN GROUP (ORDER BY length) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY length) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY length) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY length) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY length) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY length) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY length) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY length) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY length) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY length) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY length) AS "100%"
FROM film
GROUP BY ROLLUP(rating);

This yields:

rating |count |0% |10% |20% |30% |40% |50% |60% |70% |80% |90% |100% |
-------|------|---|----|----|----|----|----|----|----|----|----|-----|
G      |178   |47 |57  |67  |80  |93  |107 |121 |138 |156 |176 |185  |
PG     |194   |46 |58  |72  |85  |99  |113 |122 |137 |151 |168 |185  |
PG-13  |223   |46 |61  |76  |92  |110 |125 |138 |150 |162 |176 |185  |
R      |195   |49 |68  |82  |90  |104 |115 |129 |145 |160 |173 |185  |
NC-17  |210   |46 |58  |74  |84  |97  |112 |125 |138 |153 |174 |184  |
       |1000  |46 |60  |74  |86  |102 |114 |128 |142 |156 |173 |185  |

So, the GROUP BY clause produced one row per rating, plus an additional grand total row at the bottom. For illustration purposes, I’ve added the COUNT(*) column, to show how many films are in each group. The counts of the first 5 rows sum up to 1000, which is again the grand total at the bottom.

Let’s plot the percentiles now as line and bar charts:

We can “see” that there is no strong correlation between the two variables. The lengths are distributed quite similarly, independently of the rating, with the exception of PG-13, which is just slightly skewed towards longer film lengths.

Again, this isn’t terribly interesting as the data set was generated, probably using some randomness to avoid perfectly uniform distribution. In real world scenarios, the above data would have been more “skewed”.

How does this help with performance?

A balanced tree index is very useful when data is quite uniformly distributed, because in that case, it can help access data points or ranges of data in O(log(N)) time. This is quite a useful property for queries that look for film_id values, e.g.

SELECT *
FROM film
WHERE film_id = 1

When accessing “skewed” data, some values are more equal than others. This means that for example if we’re looking for amounts in the payment table, these two queries are not the same:

-- A lot of rows returned (3644)
SELECT * FROM payment WHERE amount BETWEEN 0 AND 2;

-- Few rows returned (361)
SELECT * FROM payment WHERE amount BETWEEN 9 AND 11;

An index on the amount column could have been useful for the second query, but maybe not for the first one.
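
The index in question would be an ordinary single column B-Tree index (the name is illustrative):

CREATE INDEX payment_amount_idx ON payment (amount);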

There are several things we can do to make sure optimal index usage is being applied for all sorts of queries. In case of uniformly distributed data, we usually don’t have to do anything as SQL developers. In case of “skewed” data sets, it may be worth thinking about:

  • Using histogram statistics (see the sketch after this list)
  • Hinting the optimiser (in Oracle or SQL Server)
  • Avoiding bind variables (only in extreme cases)
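
In Oracle, for example, a histogram on the amount column can be requested explicitly when gathering statistics (a sketch; the SIZE value is just an illustrative maximum bucket count):

BEGIN
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'PAYMENT',
    method_opt => 'FOR COLUMNS amount SIZE 254'
  );
END;
/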

Conclusion

Not all data sets are equal. They are often “skewed”. By “skewed”, in SQL, we don’t mean the statistical meaning of a normal distribution being skewed asymmetrically. We mean that a distribution is not uniform, so even a normal distribution is “skewed”. When it is, then some values appear way more often than others. Some examples are:

Uniform distribution

  • Surrogate keys generated from sequences (consecutive)
  • Surrogate keys generated from UUIDs (random)
  • Foreign keys on one-to-one relationships

Slight “skew”

Possibly significant “skew”

This really depends on the actual data set, but do expect significant “skew” in these data types:

  • Foreign keys on to-many relationships (e.g. some customers have more assets than others)
  • Numeric values (e.g. amount)
  • Codes and other discrete values (e.g. film rating, payment settlement codes, etc.)

This article has shown how we can use simple SQL aggregate functions, including the percentiles, to calculate and visualise such “skew”.

How to Use SQL UPDATE .. RETURNING to Run DML More Efficiently

At a customer site, I recently refactored a “slow-by-slow” PL/SQL loop and turned that into an efficient set based UPDATE statement saving many lines of code and running much faster. In this blog post, I will show how that can be done. The blog post will focus on Oracle and UPDATE, but rest assured, this technique can be implemented in other databases too, and also with other DML statements, such as INSERT, DELETE, and depending on the vendor, even MERGE.

The Schema

The original logic that needed refactoring worked on the following data set (simplified for this blog post):

-- Table definition
CREATE TABLE t (
  id NUMBER(10) GENERATED ALWAYS AS IDENTITY NOT NULL PRIMARY KEY,
  category NUMBER(10) NOT NULL,
  counter NUMBER(10),
  text VARCHAR2(10) NOT NULL
);

-- Sample data
INSERT INTO t (category, text)
SELECT dbms_random.value(1, 10), dbms_random.string('a', 10)
FROM dual
CONNECT BY level <= 100;

-- Output of data
SELECT *
FROM t
ORDER BY counter DESC NULLS LAST, category, id;

The sample data generated above might look like this:

ID   CATEGORY   COUNTER   TEXT
16   1                    UIXSzJxDez
25   1                    hkvvrTRbTC
29   1                    IBOJYveDgf
44   1                    VhcwOugrWB
46   1                    gBJFJrPQYy
47   1                    bVzfHznOUj
10   2                    KpHHgsRXwR
11   2                    vpkhTrkaaU
14   2                    fDlNtRdvBE

So, there were certain records belonging to some category, and there’s a counter indicating how often each record has been encountered in some system.

The “slow-by-slow” PL/SQL Logic

(“slow-by-slow” rhymes with “row-by-row”. You get the idea)

Every now and then, there was a message from another system that should:

  • Fetch all the rows of a category
  • Increase the counter on each element of that category
  • Concatenate all the texts of that category and return those

Sounds like something that can be done very easily using a loop. In PL/SQL (but imagine you could be doing this in Java just the same):

SET SERVEROUTPUT ON
DECLARE
  v_text VARCHAR2(2000);
  v_updated PLS_INTEGER := 0;
BEGIN
  FOR r IN (
    SELECT * FROM t WHERE category = 1
  ) LOOP
    v_updated := v_updated + 1;
    
    IF v_text IS NULL THEN
      v_text := r.text;
    ELSE
      v_text := v_text || ', ' || r.text;
    END IF;
    
    IF r.counter IS NULL THEN
      UPDATE t SET counter = 1 WHERE id = r.id;
    ELSE
      UPDATE t SET counter = counter + 1 WHERE id = r.id;
    END IF;
  END LOOP;
  
  COMMIT;
  dbms_output.put_line('Rows updated: ' || v_updated);
  dbms_output.put_line('Returned:     ' || v_text);
END;
/

The result of this block would be:

Rows updated: 6
Returned:     UIXSzJxDez, hkvvrTRbTC, IBOJYveDgf, VhcwOugrWB, gBJFJrPQYy, bVzfHznOUj

And the data is now:

ID   CATEGORY   COUNTER   TEXT
16   1          1         UIXSzJxDez
25   1          1         hkvvrTRbTC
29   1          1         IBOJYveDgf
44   1          1         VhcwOugrWB
46   1          1         gBJFJrPQYy
47   1          1         bVzfHznOUj
10   2                    KpHHgsRXwR
11   2                    vpkhTrkaaU
14   2                    fDlNtRdvBE

Wonderful. What’s wrong with this? The logic is straightforward and runs quite quickly. Until you run this many many many times per second – then it suddenly starts to hurt.

Thinking Set Based

Whenever you work with an RDBMS, try to think in terms of data sets and run bulk operations on them. (Exceptions exist – see the caveats below.) The modification of the data can be written in a single SQL statement, instead of updating the same table many times.

Here’s the SQL statement in Oracle that does precisely the same thing:

SET SERVEROUTPUT ON
DECLARE
  v_text VARCHAR2(2000);
  v_updated PLS_INTEGER := 0;
BEGIN
  UPDATE t
  SET counter = nvl(counter, 0) + 1
  WHERE category = 1
  RETURNING
    listagg (text, ', ') WITHIN GROUP (ORDER BY text),
    count(*)
  INTO
    v_text,
    v_updated;
    
  COMMIT;
  dbms_output.put_line('Rows updated: ' || v_updated);
  dbms_output.put_line('Returned:     ' || v_text);
END;
/

Again, the same output:

Rows updated: 6
Returned:     UIXSzJxDez, hkvvrTRbTC, IBOJYveDgf, VhcwOugrWB, gBJFJrPQYy, bVzfHznOUj

And the data set is now:

ID   CATEGORY   COUNTER   TEXT
16   1          2         UIXSzJxDez
25   1          2         hkvvrTRbTC
29   1          2         IBOJYveDgf
44   1          2         VhcwOugrWB
46   1          2         gBJFJrPQYy
47   1          2         bVzfHznOUj
10   2                    KpHHgsRXwR
11   2                    vpkhTrkaaU
14   2                    fDlNtRdvBE

Below, you can see each piece of logic of the original PL/SQL block, and the corresponding logic in the revised SQL statement.

There are 4 areas of interest:

  1. Red: The category predicate
    In the PL/SQL version, this predicate is a simple access predicate for the SELECT statement, over whose implicit cursor we’re iterating. In the set based SQL version, that predicate has been moved into the single bulk UPDATE statement. Thus: we’re modifying the exact same set of rows.
  2. Blue: The number of updated rows
    Before, we had a count variable that counted the number of iterations over the implicit cursor. Now, we can simply count the number of rows being updated in the bulk update statement, conveniently in the RETURNING clause. An alternative (in Oracle) would have been to use SQL%ROWCOUNT, which is available for free after a single bulk UPDATE statement (see the sketch below).
  3. Orange: The string concatenation
    The requirement was to concatenate all the texts which are being updated. In the “slow-by-slow” PL/SQL approach, we’re again keeping around a local variable and concatenating new values to it, with some initial NULL handling. In the set based SQL version, we can simply use LISTAGG() in the RETURNING clause. Notice, there seems to be a bug with this usage of LISTAGG: the ORDER BY clause has no effect.
  4. Green: The actual update
    In the “slow-by-slow” version, we run 1 UPDATE statement per row, which can turn out to be devastating, if we’re updating a lot of rows. Besides, in this particular case, the developer(s) have been unaware of the possibility of NULL handling using NVL() (or COALESCE() or similar). There is really only one UPDATE statement necessary here.

That already looks a lot neater.
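
For completeness, the SQL%ROWCOUNT alternative mentioned in point 2 could look like this sketch (it only covers the row count; the concatenated text would still need the RETURNING clause or a separate query):

BEGIN
  UPDATE t
  SET counter = nvl(counter, 0) + 1
  WHERE category = 1;

  dbms_output.put_line('Rows updated: ' || SQL%ROWCOUNT);
END;
/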

How does it perform?

In a quick test script, which I’ve linked here, I could observe the following times for the above test data set, when running each approach 5 x 10000 times:

Run 1, Statement 1 : 2.63841 (avg : 2.43714)
Run 1, Statement 2 : 1.11019 (avg : 1.04562)
Run 2, Statement 1 : 2.35626 (avg : 2.43714)
Run 2, Statement 2 : 1.05716 (avg : 1.04562)
Run 3, Statement 1 : 2.38004 (avg : 2.43714)
Run 3, Statement 2 : 1.05153 (avg : 1.04562)
Run 4, Statement 1 : 2.47451 (avg : 2.43714)
Run 4, Statement 2 : 1.00921 (avg : 1.04562)
Run 5, Statement 1 : 2.33649 (avg : 2.43714)
Run 5, Statement 2 : 1.00000 (avg : 1.04562)

As always, I’m not publishing actual benchmark times, but relative times compared to the fastest run. The set based approach is consistently 2.5x faster on my machine (Oracle 18c on Docker on Windows 10 / SSD). This is updating 6 rows per execution.

When we remove the WHERE category = 1 predicate, updating the entirety of the 100 rows each time, we get even more drastic results. I’m now running this 5 x 2000 times to get:

Run 1, Statement 1 : 10.21833 (avg : 11.98154)
Run 1, Statement 2 : 1.219130 (avg : 1.739260)
Run 2, Statement 1 : 10.17014 (avg : 11.98154)
Run 2, Statement 2 : 3.027930 (avg : 1.739260)
Run 3, Statement 1 : 9.444620 (avg : 11.98154)
Run 3, Statement 2 : 1.000000 (avg : 1.739260)
Run 4, Statement 1 : 20.54692 (avg : 11.98154)
Run 4, Statement 2 : 1.193560 (avg : 1.739260)
Run 5, Statement 1 : 9.527690 (avg : 11.98154)
Run 5, Statement 2 : 2.255680 (avg : 1.739260)

At this point, no one needs to be convinced anymore that a set based approach is much better for updating your data than a row-by-row approach in a language like PL/SQL or Java, etc.

Caveats

Bulk updates are much better than row-by-row (remember: “slow-by-slow”) updates, regardless of whether you’re using PL/SQL or Java or whatever client language. This is because the optimiser can plan the update much more efficiently when it knows which rows will be updated in bulk, rather than seeing each individual row update afresh, not being able to plan ahead for the remaining number of updates.

However, in situations where a lot of other processes are reading the same data while you’re bulk updating them, you need to be more careful. In such cases, a bulk update can cause trouble keeping locks and log files busy while you’re updating and while the other processes need to access the data prior to your update.

One size never fits all, but at least, in every situation where you loop over a result set to update some data (or fetch additional data), ask yourself: Could I have written that logic in a single SQL statement? The answer is very often: Yes.

Other databases

A few other databases support similar language features – for example, PostgreSQL and Firebird with UPDATE .. RETURNING, SQL Server with its OUTPUT clause, and DB2 with SELECT .. FROM FINAL TABLE.

The DB2 syntax is quite noteworthy, because:

  • It is very elegant
  • It corresponds to the SQL standard

The UPDATE statement would have been nested in a SELECT statement:

SELECT 
  listagg (text, ', ') WITHIN GROUP (ORDER BY id),
  count(*)
FROM FINAL TABLE (
  UPDATE t
  SET counter = nvl(counter, 0) + 1
  WHERE category = 1
)

Beware of Hidden PL/SQL to SQL Context Switches

I recently stumbled upon a curious query on a customer’s production Oracle database:

SELECT USER FROM SYS.DUAL

Two things caught my attention:

  • The query was executed many billions of times per month, accounting for about 0.3% of that system’s load. That’s 0.3% for something extremely silly!
  • I don’t think that customer would ever qualify the DUAL table as SYS.DUAL, which hints at some system functionality.

I found it in Oracle Enterprise Manager, but you could also find it using a query like this one:

SELECT 
  sql_id, 
  executions, 
  elapsed_time, 
  ratio_to_report(elapsed_time) over() p, 
  sql_text
FROM v$sql
ORDER BY p DESC;

Why was this query being run so often? In Enterprise Manager, the query’s statistics overview displayed that the query originated from a function called STANDARD.USER (I don’t know yet where I could find this information in the dictionary views, manually).

Naively, I had always thought that the USER pseudo column or pseudo constant is some value from the context, but like many other functions, it’s really just a function in that package.

What does STANDARD.USER() do?

Now, I’m not 100% sure if that source code is something that I am allowed to reproduce from a legal perspective, this being Oracle and all. But if you run this query here, which I am freely allowing you to:

WITH s AS (
  SELECT s.*,
    MIN(CASE 
      WHEN upper(text) LIKE '%FUNCTION USER%' 
      THEN line END
    ) OVER () s
  FROM all_source s
  WHERE owner = 'SYS' 
  AND name = 'STANDARD'
  AND type = 'PACKAGE BODY'
)
SELECT text
FROM s
WHERE line >= s AND line < s + 6;

Then you might be able to see something like this:

  function USER return varchar2 is
  c varchar2(255);
  begin
        select user into c from sys.dual;
        return c;
  end;

This is just the result of some SQL query I’ve shown you. Any correspondence with actual source code is merely coincidental.

Let’s assume this were the actual source code of the STANDARD.USER() function. We can now clearly see that this very SQL query that I’ve observed before is being executed! Want to verify this?

Let’s benchmark

As always, I’m using the benchmark technique described here. The full benchmark logic is at the end of the article.

In essence, I’m comparing the performances of 500000 executions of this loop:

FOR i IN 1 .. v_repeat LOOP
  v := USER;
END LOOP;

With this one:

FOR i IN 1 .. v_repeat LOOP
  SELECT USER INTO v FROM dual;
END LOOP;

And this one:

FOR i IN 1 .. v_repeat LOOP
  -- Note: According to the doc, use 'SESSION_USER' instead
  v := sys_context('USERENV', 'CURRENT_USER');
END LOOP;

The result of this benchmark is:

Run 1, Statement 1 : 2.40509 (avg : 2.43158)
Run 1, Statement 2 : 2.13208 (avg : 2.11816)
Run 1, Statement 3 : 1.01452 (avg : 1.02081)

Run 2, Statement 1 : 2.41889 (avg : 2.43158)
Run 2, Statement 2 : 2.09753 (avg : 2.11816)
Run 2, Statement 3 : 1.00203 (avg : 1.02081)

Run 3, Statement 1 : 2.45384 (avg : 2.43158)
Run 3, Statement 2 : 2.09060 (avg : 2.11816)
Run 3, Statement 3 : 1.02239 (avg : 1.02081)

Run 4, Statement 1 : 2.39516 (avg : 2.43158)
Run 4, Statement 2 : 2.14140 (avg : 2.11816)
Run 4, Statement 3 : 1.06512 (avg : 1.02081)

Run 5, Statement 1 : 2.48493 (avg : 2.43158)
Run 5, Statement 2 : 2.12922 (avg : 2.11816)
Run 5, Statement 3 : 1.00000 (avg : 1.02081)

How to read this benchmark result? These aren’t actual times, which are not interesting, but relative times compared to the fastest run (run 5, statement 3 = 1). The explicit SELECT USER FROM DUAL is about half as fast as the SYS_CONTEXT call, and the USER call is even a bit slower than that.

When re-running this query:

SELECT 
  sql_id, 
  executions, 
  ratio_to_report(elapsed_time) over() p, 
  sql_text
FROM v$sql
ORDER BY p DESC;

We can see:

SQL_ID          EXECUTIONS  P     SQL_TEXT
6r9s58qfu339c   1           0.26  DECLARE ...
1v717nvrhgbn9   2500000     0.14  SELECT USER FROM SYS.DUAL
...

So, this query has definitely been run way too many times, including the PL/SQL to SQL context switch that is involved.

I’m running this benchmark in Oracle 18.0.0.0.0 in Docker on a Windows machine. More close-to-the-metal and less virtualised setups might achieve more drastic results. See, for example, how Connor McDonald got a much better improvement from using SYS_CONTEXT.

In this particular case, the STANDARD.USER() reference was used very often in triggers to fill in audit columns of many tables. Very easy to fix. Just use sys_context('USERENV', 'CURRENT_USER') instead.
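
For illustration, such an audit trigger might then look like this (table and column names are hypothetical):

CREATE OR REPLACE TRIGGER some_table_audit
BEFORE INSERT OR UPDATE ON some_table
FOR EACH ROW
BEGIN
  :new.modified_by := sys_context('USERENV', 'CURRENT_USER');
  :new.modified_at := systimestamp;
END;
/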

Full benchmark logic

SET SERVEROUTPUT ON

ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;

CREATE TABLE results (
  run     NUMBER(2),
  stmt    NUMBER(2),
  elapsed NUMBER
);

DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 500000;
  v NUMBER;
BEGIN

  -- Repeat the whole benchmark several times to 
  -- avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1 .. v_repeat LOOP
      v := v + length(USER);
    END LOOP;
  
    INSERT INTO results VALUES (r, 1, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1 .. v_repeat LOOP
      SELECT v + length(USER) INTO v FROM dual;
    END LOOP;
      
    INSERT INTO results VALUES (r, 2, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1 .. v_repeat LOOP
      -- Note: According to the doc, use 'SESSION_USER' instead
      v := v + length(sys_context('USERENV', 'CURRENT_USER'));
    END LOOP;
      
    INSERT INTO results VALUES (r, 3, 
      SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
  END LOOP;
  
  FOR rec IN (
    SELECT 
      run, stmt, 
      CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio,
      CAST(AVG(elapsed) OVER (PARTITION BY stmt) / 
           MIN(elapsed) OVER() AS NUMBER(10, 5)) avg_ratio
    FROM results
    ORDER BY run, stmt
  )
  LOOP
    dbms_output.put_line('Run ' || rec.run || 
      ', Statement ' || rec.stmt || 
      ' : ' || rec.ratio || ' (avg : ' || rec.avg_ratio || ')');
  END LOOP;
  
  dbms_output.put_line('');
  dbms_output.put_line('Copyright Data Geekery GmbH');
  dbms_output.put_line('https://www.jooq.org/benchmark');
END;
/

DROP TABLE results;