I found a very interesting SQL question on Twitter recently:
Rephrasing the question: We have a set of sparse data points:
+------------+-------+
| VALUE_DATE | VALUE |
+------------+-------+
| 2019-01-01 | 100 |
| 2019-01-02 | 120 |
| 2019-01-05 | 125 |
| 2019-01-06 | 128 |
| 2019-01-10 | 130 |
+------------+-------+
Since dates form a discrete, evenly spaced domain, why not fill in the gaps between 2019-01-02 and 2019-01-05, or between 2019-01-06 and 2019-01-10? The desired output would be:
+------------+-------+
| VALUE_DATE | VALUE |
+------------+-------+
| 2019-01-01 | 100 |
| 2019-01-02 | 120 | <-+
| 2019-01-03 | 120 | | -- Generated
| 2019-01-04 | 120 | | -- Generated
| 2019-01-05 | 125 |
| 2019-01-06 | 128 | <-+
| 2019-01-07 | 128 | | -- Generated
| 2019-01-08 | 128 | | -- Generated
| 2019-01-09 | 128 | | -- Generated
| 2019-01-10 | 130 |
+------------+-------+
In the generated rows, we'll simply repeat the most recent value. How can we do this with SQL?
For the sake of this example, I’m using Oracle SQL, as the OP was expecting to do this with Oracle. The idea is to do this in two steps:
- Generate all the dates between the first and the last data points
- For each date, find either the current data point, or the most recent one
But first, let’s create the data:
create table t (value_date, value) as
select date '2019-01-01', 100 from dual union all
select date '2019-01-02', 120 from dual union all
select date '2019-01-05', 125 from dual union all
select date '2019-01-06', 128 from dual union all
select date '2019-01-10', 130 from dual;
1. Generating all the dates
In Oracle, we can use the convenient CONNECT BY syntax for this. We could also use some other tool to generate the dates that fill the gaps, including standard SQL recursion using WITH, or some PIPELINED function, but I like CONNECT BY for this purpose.
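Just to illustrate the recursive alternative, here is a minimal sketch of what the WITH variant might look like (a sketch, not the main solution; carrying the maximum date through the recursion keeps the recursive member free of subqueries):

with dates (value_date, max_date) as (
  -- anchor row: the first and last dates of the data set
  select min(value_date), max(value_date) from t
  union all
  -- step one day at a time until the maximum date is reached
  select value_date + 1, max_date
  from dates
  where value_date < max_date
)
select value_date
from dates
order by value_date;

But let's stick with CONNECT BY here.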
We’ll write:
select (
select min(t.value_date)
from t
) + level - 1 as value_date
from dual
connect by level <= (
select max(t.value_date) - min(t.value_date) + 1
from t
)
This produces:
VALUE_DATE|
----------|
2019-01-01|
2019-01-02|
2019-01-03|
2019-01-04|
2019-01-05|
2019-01-06|
2019-01-07|
2019-01-08|
2019-01-09|
2019-01-10|
Now we wrap the above query in a derived table and left join the actual data set:
select
d.value_date,
t.value
from (
select (
select min(t.value_date)
from t
) + level - 1 as value_date
from dual
connect by level <= (
select max(t.value_date) - min(t.value_date) + 1
from t
)
) d
left join t
on d.value_date = t.value_date
order by d.value_date;
The date gaps are now filled, but our values column is still sparse:
VALUE_DATE|VALUE|
----------|-----|
2019-01-01| 100|
2019-01-02| 120|
2019-01-03| |
2019-01-04| |
2019-01-05| 125|
2019-01-06| 128|
2019-01-07| |
2019-01-08| |
2019-01-09| |
2019-01-10| 130|
2. Filling the value gaps
On each row, the VALUE column should contain either the actual value, or the last value preceding the current row, ignoring all the nulls. Note that I deliberately worded this requirement so that it translates directly to SQL:
last_value (t.value) ignore nulls over (order by d.value_date)
Since we have added an ORDER BY clause to the window function, the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW applies, which colloquially means “all the preceding rows”. (Technically, that's not accurate. It means all rows with values less than or equal to the value of the current row; see Kim Berg Hansen's comment below.)
Convenient! We’re trying to find the last value in the window of all the preceding rows, ignoring the nulls.
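If we want to make the intended frame explicit, and independent of the RANGE subtlety, we can spell it out. Since each value_date appears only once in our joined result, both variants behave the same here:

last_value (t.value) ignore nulls over (
  order by d.value_date
  -- explicit frame; equivalent to the default in this query
  rows between unbounded preceding and current row
)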
This is standard SQL, but unfortunately not all RDBMS support IGNORE NULLS. Among the ones supported by jOOQ, these currently support the syntax:
- DB2
- H2
- Informix
- Oracle
- Redshift
- Sybase SQL Anywhere
- Teradata
Sometimes a database supports the standard feature, but not the exact standard syntax. Use https://www.jooq.org/translate to see the different syntax variants.
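For example, if I recall correctly, Redshift expects the IGNORE NULLS clause inside the function's parentheses rather than after them:

-- Redshift dialect variant (not the standard placement)
last_value (t.value ignore nulls) over (
  order by d.value_date
  rows between unbounded preceding and current row
)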
The full query now reads:
select
d.value_date,
last_value (t.value) ignore nulls over (order by d.value_date) as value
from (
select (
select min(t.value_date)
from t
) + level - 1 as value_date
from dual
connect by level <= (
select max(t.value_date) - min(t.value_date) + 1
from t
)
) d
left join t
on d.value_date = t.value_date
order by d.value_date;
… and it yields the desired result:
VALUE_DATE |VALUE|
-------------------|-----|
2019-01-01 00:00:00| 100|
2019-01-02 00:00:00| 120|
2019-01-03 00:00:00| 120|
2019-01-04 00:00:00| 120|
2019-01-05 00:00:00| 125|
2019-01-06 00:00:00| 128|
2019-01-07 00:00:00| 128|
2019-01-08 00:00:00| 128|
2019-01-09 00:00:00| 128|
2019-01-10 00:00:00| 130|
Other RDBMS
This solution made use of some Oracle-specific features such as CONNECT BY. In other RDBMS, the same idea can be implemented with a different way of generating the dates. This article focuses only on using IGNORE NULLS. If you're interested, feel free to post an alternative solution for your RDBMS in the comments.
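To give just one example of a different way of generating the dates: in PostgreSQL, a sketch of that step could use generate_series() (the IGNORE NULLS step would still need a workaround there, as PostgreSQL doesn't support it; see the comments below):

-- generate one row per day between the first and last data points
select g.value_date::date
from generate_series(
  (select min(value_date) from t),
  (select max(value_date) from t),
  interval '1 day'
) as g (value_date);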
I know that in this case it matters not, but I dislike this explanation:
“…the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW applies, which means “all the preceding rows”.”
Nope, ROWS BETWEEN means “all the preceding rows” – RANGE BETWEEN means “all rows with the value of the ORDER BY column less than or equal to this row” (which might include following rows.)
I’m sure you know this – it’s just a pet peeve of mine that I always stomp on whenever I see it, so that newcomers who read it won’t be deceived ;-)
I indeed omitted it for the sake of not distracting, but you're right. The wording should be accurate.
SQLite is my normal weapon of choice. To do something like this, I use a recursive CTE to create the dates and a correlated subquery to pick the values.
Nice. Works well with these two fixes: '+1 day' (no space between + and 1), and ORDER BY innertbl.date DESC.
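Since the original snippet isn't reproduced above, here is a minimal reconstruction of that SQLite approach with both fixes applied, assuming the t table from the post with ISO-formatted text dates (the innertbl alias is taken from the reply above):

with recursive dates (value_date, max_date) as (
  -- anchor: first and last dates from t
  select min(value_date), max(value_date) from t
  union all
  -- '+1 day' without a space, as per the fix above
  select date(value_date, '+1 day'), max_date
  from dates
  where value_date < max_date
)
select
  d.value_date,
  -- correlated subquery: most recent value on or before this date
  (select innertbl.value
   from t as innertbl
   where innertbl.value_date <= d.value_date
   order by innertbl.value_date desc
   limit 1) as value
from dates as d
order by d.value_date;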
Below are two Postgres SQL solutions. I think the Twitter question is best answered by the more complex one, using CASE WHEN logic with a subquery and an outer join. I am reading the Twitter comment ("from 3rd Jan to 24 Jan, our value should be 120") to mean that EACH day from Jan 3 through Jan 24 has value = 120. I could easily set the value in this range to null, zero, -999, etc.
Albert! :-) You just hard coded the values that you wanted to fill in the gaps. That doesn’t count!
lukaseder: sure, but the logic can be manipulated to be more dynamic.
See below: the range of dates in the first query can be anything; it is NOT hardcoded now. I only fixed one end in the where clause (value_dtx <= value_dt), but that can also be changed. The other subquery is the same as creating a table, which is also hardcoded in the example.
select value_dtx, min(value1) as finalvalue from (
SELECT 1 as grp_id, 120 as "value2", date '1900-12-31' as _last_dt_min2, date '2050-12-31' as _last_dt_max2, value_dtx::date
FROM generate_series(
timestamp without time zone '2019-01-01',
timestamp without time zone '2019-01-31',
'1 day') as value_dtx
order by 1
)z
left join lateral
(select * from (
SELECT 1 as grp_id, 100 as "value1", value_dt::date
FROM generate_series(
timestamp without time zone '2019-01-01',
timestamp without time zone '2019-01-01',
'1 day') as value_dt
union all
SELECT 1 as grp_id, 120 as "value1", value_dt::date
FROM generate_series(
timestamp without time zone '2019-01-02',
timestamp without time zone '2019-01-02',
'1 day') as value_dt
union all
SELECT 1 as grp_id, 125 as "value1", value_dt::date
FROM generate_series(
timestamp without time zone '2019-01-25',
timestamp without time zone '2019-01-25',
'1 day') as value_dt
)x
left outer join (select 1 as grp_id,
min(value_dt) as last_dt_min,
max(value_dt) as last_dt_max
from (
SELECT 100 as "value1", value_dt::date
FROM generate_series(
timestamp without time zone '2019-01-01',
timestamp without time zone '2019-01-01',
'1 day') as value_dt
union all
SELECT 120 as "value1", value_dt::date
FROM generate_series(
timestamp without time zone '2019-01-02',
timestamp without time zone '2019-01-02',
'1 day') as value_dt
union all
SELECT 125 as "value1", value_dt::date
FROM generate_series(
timestamp without time zone '2019-01-25',
timestamp without time zone '2019-01-25',
'1 day') as value_dt
)y
group by 1
order by 1
)p on true
where value_dtx <= value_dt
)qq on true
where value1 is not null
group by 1
order by 1
In case it is not obvious, here is the full dynamic code.
Lukas, thanks for this inspiring post.
Albert, PostgreSQL aficionados: this is a solution for Postgres, based on chunkyks' solution.
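The commenter's code isn't reproduced above; a sketch of such a Postgres port of the recursive approach might look like this (a reconstruction, not the commenter's exact query):

with recursive dates (value_date, max_date) as (
  -- anchor: first and last dates from t
  select min(value_date), max(value_date) from t
  union all
  -- date + 1 adds one day in Postgres
  select value_date + 1, max_date
  from dates
  where value_date < max_date
)
select
  d.value_date,
  -- correlated subquery: most recent value on or before this date
  (select t.value
   from t
   where t.value_date <= d.value_date
   order by t.value_date desc
   limit 1) as value
from dates d
order by d.value_date;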
Recursive solution is interesting; unnecessary, but a nice trick.
— Microsoft SQL Server
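The SQL Server code itself isn't shown above either; a sketch of one possible approach follows. IGNORE NULLS only arrived with SQL Server 2022, so this version uses OUTER APPLY to fetch the most recent value instead, and OPTION (MAXRECURSION 0) lifts the default 100-iteration cap on the recursive CTE:

with bounds as (
  -- compute the date range once, to keep the recursive member simple
  select min(value_date) as min_date, max(value_date) as max_date
  from t
),
dates (value_date, max_date) as (
  select min_date, max_date from bounds
  union all
  select dateadd(day, 1, value_date), max_date
  from dates
  where value_date < max_date
)
select d.value_date, x.value
from dates d
outer apply (
  -- most recent value on or before this date
  select top 1 t.value
  from t
  where t.value_date <= d.value_date
  order by t.value_date desc
) x
order by d.value_date
option (maxrecursion 0);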
Lukas, thank you for the post.
A Postgres solution might build on an existing window function.
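The query isn't shown above; one common window-based trick in PostgreSQL (a sketch, not necessarily what the commenter posted) forms groups with a cumulative count of non-null values, then pulls each group's single non-null value:

select value_date,
       -- each group contains exactly one non-null value; max() picks it
       max(value) over (partition by grp) as value
from (
  select g.value_date::date,
         t.value,
         -- count(t.value) ignores nulls, so grp increments only on data points
         count(t.value) over (order by g.value_date) as grp
  from generate_series(
    (select min(value_date) from t),
    (select max(value_date) from t),
    interval '1 day'
  ) as g (value_date)
  left join t on t.value_date = g.value_date::date
) s
order by value_date;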
Vitaly: we've solved this above with Postgres, and your solution isn't offering anything new. SF Keller at least added recursive SQL, which is a neat trick. You don't need to use the recursion, as my solution produces the correct output.
BigQuery doesn't include WITH RECURSIVE or CONNECT BY, so instead I generated the dates using UNNEST(GENERATE_DATE_ARRAY()). Thankfully, LAST_VALUE and IGNORE NULLS both work fine, so I used that part of Lukas's query unchanged. In the query that follows, the BigQuery table myproj.mydataset.mytable is assumed to have a column of dates called date_orig and a column of floats called value. I suspect that some of the ORDER BY clauses are superfluous, but I threw them in anyway, because BigQuery sometimes surprises me by returning data out of order.
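The query itself didn't survive above; here is a sketch of what it might have looked like, using the table and column names from that description, with the date range hardcoded to the post's data for brevity:

SELECT
  d AS value_date,
  -- BigQuery accepts IGNORE NULLS inside the function call
  LAST_VALUE(t.value IGNORE NULLS) OVER (ORDER BY d) AS value
-- table and column names as described in the comment above
FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2019-01-01', DATE '2019-01-10')) AS d
LEFT JOIN `myproj.mydataset.mytable` t
  ON t.date_orig = d
ORDER BY d;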
The equivalent of LEAD/LAG with IGNORE NULLS, by using the windowed ARRAY_AGG function, skipping NULL values and accessing the last element of the cumulative array. Below is the PostgreSQL version:
Live demo: https://dbfiddle.uk/etvmKktH
Original post: https://stackoverflow.com/a/75634777/5070879
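Since that query isn't reproduced above either, a sketch of the ARRAY_AGG technique, assuming the t table from the post, might look like this:

select value_date,
       -- last element of the cumulative array = latest non-null value
       vals[cardinality(vals)] as value
from (
  select g.value_date::date,
         -- cumulative array of non-null values up to this date
         array_agg(t.value) filter (where t.value is not null)
           over (order by g.value_date) as vals
  from generate_series(
    (select min(value_date) from t),
    (select max(value_date) from t),
    interval '1 day'
  ) as g (value_date)
  left join t on t.value_date = g.value_date::date
) s
order by value_date;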