Using IGNORE NULLS With SQL Window Functions to Fill Gaps

I found a very interesting SQL question on Twitter recently:

Rephrasing the question: We have a set of sparse data points:

+------------+-------+
| VALUE_DATE | VALUE |
+------------+-------+
| 2019-01-01 |   100 |
| 2019-01-02 |   120 |
| 2019-01-05 |   125 |
| 2019-01-06 |   128 |
| 2019-01-10 |   130 |
+------------+-------+

Since dates can be listed as discrete, continuous data points, why not fill in the gaps between 2019-01-02 and 2019-01-05 or 2019-01-06 and 2019-01-10? The desired output would be:

+------------+-------+
| VALUE_DATE | VALUE |
+------------+-------+
| 2019-01-01 |   100 |
| 2019-01-02 |   120 | <-+
| 2019-01-03 |   120 |   | -- Generated
| 2019-01-04 |   120 |   | -- Generated
| 2019-01-05 |   125 |
| 2019-01-06 |   128 | <-+
| 2019-01-07 |   128 |   | -- Generated
| 2019-01-08 |   128 |   | -- Generated
| 2019-01-09 |   128 |   | -- Generated
| 2019-01-10 |   130 |
+------------+-------+

In the generated columns, we’ll just repeat the most recent value.

How to do this with SQL?

For the sake of this example, I’m using Oracle SQL, as the OP was expecting to do this with Oracle. The idea is to do this in two steps:

  1. Generate all the dates between the first and the last data points
  2. For each date, find either the current data point, or the most recent one

But first, let’s create the data:

create table t (value_date, value) as
  select date '2019-01-01', 100 from dual union all
  select date '2019-01-02', 120 from dual union all
  select date '2019-01-05', 125 from dual union all
  select date '2019-01-06', 128 from dual union all
  select date '2019-01-10', 130 from dual;

1. Generating all the dates

In Oracle, we can use the convenient CONNECT BY syntax for this. We could also use some other tool to generate dates to fill the gaps, including SQL standard recursion using WITH, or some PIPELINED function, but I like CONNECT BY for this purpose.

We’ll write:

select (
  select min(t.value_date) 
  from t
) + level - 1 as value_date
from dual
connect by level <= (
  select max(t.value_date) - min(t.value_date) + 1
  from t
)

This produces:

VALUE_DATE|
----------|
2019-01-01|
2019-01-02|
2019-01-03|
2019-01-04|
2019-01-05|
2019-01-06|
2019-01-07|
2019-01-08|
2019-01-09|
2019-01-10|

Now we wrap the above query in a derived table and left join the actual data set:

select 
  d.value_date,
  t.value
from (
  select (
    select min(t.value_date) 
    from t
  ) + level - 1 as value_date
  from dual
  connect by level <= (
    select max(t.value_date) - min(t.value_date) + 1
    from t
  )
) d
left join t
on d.value_date = t.value_date
order by d.value_date;

The date gaps are now filled, but our values column is still sparse:

VALUE_DATE|VALUE|
----------|-----|
2019-01-01|  100|
2019-01-02|  120|
2019-01-03|     |
2019-01-04|     |
2019-01-05|  125|
2019-01-06|  128|
2019-01-07|     |
2019-01-08|     |
2019-01-09|     |
2019-01-10|  130|

2. Fill the value gaps

On each row, the VALUE column should either contain the actual value, or the “last_value” preceding the current row, ignoring all the nulls. Note that I specifically wrote this requirement using specific English language. We can now translate that sentence directly to SQL:

last_value (t.value) ignore nulls over (order by d.value_date)

Since we have added an ORDER BY clause to the window function, the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW applies, which colloquially means “all the preceding rows”. (Technically, that’s not accurate. It means all rows with values less than or equal to the value of the current row – see Kim Berg Hansen’s comment)

Convenient! We’re trying to find the last value in the window of all the preceding rows, ignoring the nulls.

This is standard SQL, but unfortunately not all RDBMS support IGNORE NULLS. Among the ones supported by jOOQ, currently these ones support the syntax:

  • DB2
  • H2
  • Informix
  • Oracle
  • Redshift
  • Sybase SQL Anywhere
  • Teradata

Sometimes, not the exact standard syntax is supported, but the standard feature. Use https://www.jooq.org/translate to see different syntax variants.

The full query now reads:

select 
  d.value_date,
  last_value (t.value) ignore nulls over (order by d.value_date)
from (
  select (
    select min(t.value_date) 
    from t
  ) + level - 1 as value_date
  from dual
  connect by level <= (
    select max(t.value_date) - min(t.value_date) + 1
    from t
  )
) d
left join t
on d.value_date = t.value_date
order by d.value_date;

… and it yields the desired result:

VALUE_DATE         |VALUE|
-------------------|-----|
2019-01-01 00:00:00|  100|
2019-01-02 00:00:00|  120|
2019-01-03 00:00:00|  120|
2019-01-04 00:00:00|  120|
2019-01-05 00:00:00|  125|
2019-01-06 00:00:00|  128|
2019-01-07 00:00:00|  128|
2019-01-08 00:00:00|  128|
2019-01-09 00:00:00|  128|
2019-01-10 00:00:00|  130|

Other RDBMS

This solution made use of some Oracle specific features such as CONNECT BY. In other RDBMS, the same idea can be implemented by using a different way of generating data. This article focuses only on using IGNORE NULLS. If you’re interested, feel free to post an alternative solution in the comments for your RDBMS.

Calling an Oracle Function with PL/SQL BOOLEAN Type from SQL

One of the most wanted features in the Oracle database is the BOOLEAN type. The SQL standard specified it a while ago, and RDBMS like PostgreSQL show how powerful it can be, e.g. when using the EVERY() aggregate function.

The PL/SQL language already has support for boolean types. We can write:

CREATE OR REPLACE FUNCTION number_to_boolean (i NUMBER) 
RETURN BOOLEAN 
IS
BEGIN
  RETURN NOT i = 0;
END number_to_boolean;
/

CREATE OR REPLACE FUNCTION boolean_to_number (b BOOLEAN) 
RETURN NUMBER 
IS
BEGIN
  RETURN CASE WHEN b THEN 1 WHEN NOT b THEN 0 END;
END boolean_to_number;
/

From PL/SQL, we can now easily call the above functions:

SET SERVEROUTPUT ON
BEGIN
  IF number_to_boolean(1) THEN
    dbms_output.put_line('1 is true');
  END IF;
  IF NOT number_to_boolean(0) THEN
    dbms_output.put_line('0 is false');
  END IF;
  IF number_to_boolean(NULL) IS NULL THEN
    dbms_output.put_line('null is null');
  END IF;
END;
/

The above prints

1 is true
0 is false
null is null

But we cannot do the same from the SQL engine:

SELECT 
  number_to_boolean(1), 
  number_to_boolean(0), 
  number_to_boolean(null) 
FROM dual;

This yields:

ORA-00902: invalid datatype

Eventually, Oracle will fix this by supporting boolean types in the SQL engine (show your love to Oracle here).

The WITH clause

Until then, we can make use of a nice workaround using new functionality from Oracle 12c. We can declare functions in the WITH clause! Run this:

WITH
  FUNCTION f RETURN NUMBER IS 
  BEGIN 
    RETURN 1; 
  END f;
SELECT f
FROM dual;

You’ll get

 F
---
 1

That’s wonderful, and what’s even better, this part of the WITH clause is written in PL/SQL, where we can use the BOOLEAN type again. So we can define bridge functions for each function call. Instead of this:

SELECT 
  number_to_boolean(1), 
  number_to_boolean(0), 
  number_to_boolean(null) 
FROM dual;

We can write this:

WITH
  FUNCTION number_to_boolean_(i NUMBER)
  RETURN NUMBER
  IS
    b BOOLEAN;
  BEGIN
    -- Actual function call
    b := number_to_boolean(i);
    
    -- Translation to numeric result
    RETURN CASE b WHEN TRUE THEN 1 WHEN FALSE THEN 0 END;
  END number_to_boolean_;
SELECT 
  number_to_boolean_(1) AS a, 
  number_to_boolean_(0) AS b, 
  number_to_boolean_(null) AS c
FROM dual;

This now yields:

 A   B   C
-------------
 1   0   null

Of course, we don’t get an actual boolean type back in the result set, as the SQL engine cannot process that. But if you’re calling this function from JDBC, 1/0/null can be translated transparently to true/false/null.

It also works for chaining. Instead of the following, which still yields ORA-00902:

SELECT 
  boolean_to_number(number_to_boolean(1)), 
  boolean_to_number(number_to_boolean(0)), 
  boolean_to_number(number_to_boolean(null))
FROM dual;

We can write this:

WITH
  FUNCTION number_to_boolean_(i NUMBER)
  RETURN NUMBER
  IS
    b BOOLEAN;
  BEGIN
    -- Actual function call
    b := number_to_boolean(i);
    
    -- Translation to numeric result
    RETURN CASE b WHEN TRUE THEN 1 WHEN FALSE THEN 0 END;
  END number_to_boolean_;
  
  FUNCTION boolean_to_number_(b NUMBER)
  RETURN NUMBER
  IS
  BEGIN
    -- Actual function call
    RETURN boolean_to_number(NOT b = 0);
  END boolean_to_number_;
SELECT 
  boolean_to_number_(number_to_boolean_(1)) AS a, 
  boolean_to_number_(number_to_boolean_(0)) AS b, 
  boolean_to_number_(number_to_boolean_(null)) AS c
FROM dual;

… which again yields

 A   B   C
-------------
 1   0   null

And now, the 1/0/null integers are the actual desired result types.

This technique can be automated for any type of PL/SQL function that accepts and/or returns a PL/SQL BOOLEAN type, or even for functions that accept %ROWTYPE parameters, which we’ll work into jOOQ soon, in the near future.

A more real world example can be seen in this Stack Overflow question.

jOOQ 3.12 support

In jOOQ 3.12, we will add native support for using such functions in SQL through #8522. We have already supported PL/SQL boolean types in standalone procedure calls since jOOQ 3.8. With the next version, we can call a function like this one:

FUNCTION f_bool (i BOOLEAN) RETURN BOOLEAN;

From anywhere within a jOOQ statement, e.g.

Record1<Integer> r =
create()
    .select(one())
    .where(PlsObjects.fBool(false))
    .fetchOne();

assertNull(r);

When the above is called, the following SQL statement is generated by jOOQ 3.12, behind the scenes:

with
  function "F_BOOL_"(I integer)
  return integer
  is
    "r" boolean;
  begin
    "r" := "TEST"."PLS_OBJECTS"."F_BOOL"(not I = 0);
    return case when "r" then 1 when not "r" then 0 end;
  end "F_BOOL_";
  select 1
from dual
where (F_BOOL_(0) = 1)

Notice how the boolean expression codes like a true boolean / predicate?

The Difference Between SQL’s JOIN .. ON Clause and the Where Clause

A question that is frequently occurring among my SQL training‘s participants is:

What’s the difference between putting a predicate in the JOIN .. ON clause and the WHERE clause?

I can definitely see how that’s confusing some people, as there seems to be no difference at first sight, when running queries like these, e.g. in Oracle. I’m using the Sakila database, as always:

-- First query
SELECT a.actor_id, a.first_name, a.last_name, count(fa.film_id)
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
WHERE fa.film_id < 10
GROUP BY a.actor_id, a.first_name, a.last_name
ORDER BY count(fa.film_id) DESC;

This will yield something like:

ACTOR_ID  FIRST_NAME  LAST_NAME  COUNT
--------------------------------------
108       WARREN      NOLTE      3
162       OPRAH       KILMER     3
19        BOB         FAWCETT    2
10        CHRISTIAN   GABLE      2
53        MENA        TEMPLE     2
137       MORGAN      WILLIAMS   1
2         NICK        WAHLBERG   1

Of course, we could have written this instead, and received the same result:

-- Second query
SELECT a.actor_id, a.first_name, a.last_name, count(fa.film_id)
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
  AND fa.film_id < 10
GROUP BY a.actor_id, a.first_name, a.last_name
ORDER BY count(fa.film_id) DESC;

Now, I’ve moved the FILM_ID < 10 filter from the WHERE clause to the ON clause. But the execution plan is the same for both queries:

---------------------------------------------------------
| Id  | Operation               | Name          | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT        |               |    49 |
|   1 |  SORT ORDER BY          |               |    49 |
|   2 |   HASH GROUP BY         |               |    49 |
|*  3 |    HASH JOIN            |               |    49 |
|*  4 |     INDEX FAST FULL SCAN| PK_FILM_ACTOR |    49 |
|   5 |     VIEW                | VW_GBF_7      |   200 |
|   6 |      TABLE ACCESS FULL  | ACTOR         |   200 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("ITEM_1"="FA"."ACTOR_ID")
   4 - filter("FA"."FILM_ID"<10)

It does not seem matter at all. Both queries yield the same result as well as the same plan. So…

Are ON and WHERE really the same thing?

They are when you run an inner join. But they are not when you run an outer join.

And now, let’s compare these two queries here:

-- First query
SELECT a.actor_id, a.first_name, a.last_name, count(fa.film_id)
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
WHERE fa.film_id < 10
GROUP BY a.actor_id, a.first_name, a.last_name
ORDER BY count(fa.film_id) ASC;

Yielding

ACTOR_ID  FIRST_NAME  LAST_NAME  COUNT
--------------------------------------
194       MERYL       ALLEN      1
198       MARY        KEITEL     1
30        SANDRA      PECK       1
85        MINNIE      ZELLWEGER  1
123       JULIANNE    DENCH      1

Notice that with this syntax, we’re not getting any actors that have no films with FILM_ID < 10. We should get dozens! How about this:

-- Second query
SELECT a.actor_id, a.first_name, a.last_name, count(fa.film_id)
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
  AND fa.film_id < 10
GROUP BY a.actor_id, a.first_name, a.last_name
ORDER BY count(fa.film_id) ASC;

This used to produce the same result for an (INNER) JOIN, but given the LEFT JOIN, we’re now not getting Susan Davis in the result:

ACTOR_ID  FIRST_NAME  LAST_NAME     COUNT
-----------------------------------------
3         ED          CHASE         0
4         JENNIFER    DAVIS         0
5         JOHNNY      LOLLOBRIGIDA  0
6         BETTE       NICHOLSON	    0
...
1         PENELOPE    GUINESS       1
200       THORA       TEMPLE        1
2         NICK        WAHLBERG      1
198       MARY        KEITEL        1

The plans are also different:

---------------------------------------------------------
| Id  | Operation               | Name          | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT        |               |    49 |
|   1 |  SORT ORDER BY          |               |    49 |
|   2 |   HASH GROUP BY         |               |    49 |
|*  3 |    HASH JOIN            |               |    49 |
|*  4 |     INDEX FAST FULL SCAN| PK_FILM_ACTOR |    49 |
|   5 |     VIEW                | VW_GBF_7      |   200 |
|   6 |      TABLE ACCESS FULL  | ACTOR         |   200 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("ITEM_1"="FA"."ACTOR_ID")
   4 - filter("FA"."FILM_ID"<10)

No outer join here! Versus

---------------------------------------------------------------
| Id  | Operation                     | Name          | Rows  |
---------------------------------------------------------------
|   0 | SELECT STATEMENT              |               |   200 |
|   1 |  SORT ORDER BY                |               |   200 |
|   2 |   MERGE JOIN OUTER            |               |   200 |
|   3 |    TABLE ACCESS BY INDEX ROWID| ACTOR         |   200 |
|   4 |     INDEX FULL SCAN           | PK_ACTOR      |   200 |
|*  5 |    SORT JOIN                  |               |    44 |
|   6 |     VIEW                      | VW_GBC_5      |    44 |
|   7 |      HASH GROUP BY            |               |    44 |
|*  8 |       INDEX FAST FULL SCAN    | PK_FILM_ACTOR |    49 |
---------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - access("A"."ACTOR_ID"="ITEM_1"(+))
       filter("A"."ACTOR_ID"="ITEM_1"(+))
   8 - filter("FILM_ID"(+)<10)

The first query did not produce an outer join operation, the second one did!

What’s the difference?

The difference is:

  • An INNER JOIN produces all the actors who played in at least one film, filtering out the actors who did not play in a film. That’s the very definition of an inner join. If we filter the films with FILM_ID < 10, that simply means we don’t want any actors without such films either.
  • A LEFT JOIN will produce all the rows from the left side of the join, regardless if there is a matching row on the right side of the join.

In both cases, the matching rows are determined by the ON clause. If two rows don’t match, then:

  • The INNER JOIN removes them both from the result
  • The LEFT JOIN retains the left row in the result

But regardless what the JOIN produces, the WHERE clause will again remove rows that do not satisfy the filter. So,

  • In the INNER JOIN case, it does not matter if we remove actors with no films, and then actors without films with FILM_ID < 10, OR if we remove actors with no films with FILM_ID < 10 directly. They’re going to be removed anyway.
  • In the LEFT JOIN case, it does matter if we retain actors with no films, and then remove actors without films with FILM_ID < 10 (in case of which actors without films will be removed again), OR if we retain actors without films with FILM_ID < 10, and then not apply any further filters.

Conclusion

For INNER JOIN, WHERE predicates and ON predicates have the same effect.

For OUTER JOIN, WHERE predicates and ON predicates have a different effect.

In general, it is always best to put a predicate where it belongs, logically. If the predicate is related to a JOIN operation, it belongs in the ON clause. If a predicate is related to a filter applied to the entire FROM clause, it belongs in the WHERE clause.