Using IGNORE NULLS With SQL Window Functions to Fill Gaps

I found a very interesting SQL question on Twitter recently:
Rephrasing the question: We have a set of sparse data points:
+------------+-------+
| VALUE_DATE | VALUE |
+------------+-------+
| 2019-01-01 |   100 |
| 2019-01-02 |   120 |
| 2019-01-05 |   125 |
| 2019-01-06 |   128 |
| 2019-01-10 |   130 |
+------------+-------+
Since dates can be listed as discrete, continuous data points, why not fill in the gaps between 2019-01-02 and 2019-01-05 or 2019-01-06 and 2019-01-10? The desired output would be:
+------------+-------+
| VALUE_DATE | VALUE |
+------------+-------+
| 2019-01-01 |   100 |
| 2019-01-02 |   120 | <-+
| 2019-01-03 |   120 |   | -- Generated
| 2019-01-04 |   120 |   | -- Generated
| 2019-01-05 |   125 |
| 2019-01-06 |   128 | <-+
| 2019-01-07 |   128 |   | -- Generated
| 2019-01-08 |   128 |   | -- Generated
| 2019-01-09 |   128 |   | -- Generated
| 2019-01-10 |   130 |
+------------+-------+
In the generated columns, we’ll just repeat the most recent value.

How to do this with SQL?

For the sake of this example, I’m using Oracle SQL, as the OP was expecting to do this with Oracle. The idea is to do this in two steps:
  1. Generate all the dates between the first and the last data points
  2. For each date, find either the current data point, or the most recent one
But first, let’s create the data:

create table t (value_date, value) as
  select date '2019-01-01', 100 from dual union all
  select date '2019-01-02', 120 from dual union all
  select date '2019-01-05', 125 from dual union all
  select date '2019-01-06', 128 from dual union all
  select date '2019-01-10', 130 from dual;

1. Generating all the dates In Oracle, we can use the convenient CONNECT BY syntax for this. We could also use some other tool to generate dates to fill the gaps, including SQL standard recursion using WITH, or some PIPELINED function, but I like CONNECT BY for this purpose. We’ll write:

select (
  select min(t.value_date) 
  from t
) + level - 1 as value_date
from dual
connect by level <= (
  select max(t.value_date) - min(t.value_date) + 1
  from t
)

This produces:
VALUE_DATE|
----------|
2019-01-01|
2019-01-02|
2019-01-03|
2019-01-04|
2019-01-05|
2019-01-06|
2019-01-07|
2019-01-08|
2019-01-09|
2019-01-10|
Now we wrap the above query in a derived table and left join the actual data set:

select 
  d.value_date,
  t.value
from (
  select (
    select min(t.value_date) 
    from t
  ) + level - 1 as value_date
  from dual
  connect by level <= (
    select max(t.value_date) - min(t.value_date) + 1
    from t
  )
) d
left join t
on d.value_date = t.value_date
order by d.value_date;

The date gaps are now filled, but our values column is still sparse:
VALUE_DATE|VALUE|
----------|-----|
2019-01-01|  100|
2019-01-02|  120|
2019-01-03|     |
2019-01-04|     |
2019-01-05|  125|
2019-01-06|  128|
2019-01-07|     |
2019-01-08|     |
2019-01-09|     |
2019-01-10|  130|
2. Fill the value gaps On each row, the VALUE column should either contain the actual value, or the “last_value” preceding the current row, ignoring all the nulls. Note that I specifically wrote this requirement using specific English language. We can now translate that sentence directly to SQL:

last_value (t.value) ignore nulls over (order by d.value_date)

Since we have added an ORDER BY clause to the window function, the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW applies, which colloquially means “all the preceding rows”. (Technically, that’s not accurate. It means all rows with values less than or equal to the value of the current row – see Kim Berg Hansen’s comment) Convenient! We’re trying to find the last value in the window of all the preceding rows, ignoring the nulls. This is standard SQL, but unfortunately not all RDBMS support IGNORE NULLS. Among the ones supported by jOOQ, currently these ones support the syntax:
  • DB2
  • H2
  • Informix
  • Oracle
  • Redshift
  • Sybase SQL Anywhere
  • Teradata
Sometimes, not the exact standard syntax is supported, but the standard feature. Use https://www.jooq.org/translate to see different syntax variants. The full query now reads:

select 
  d.value_date,
  last_value (t.value) ignore nulls over (order by d.value_date)
from (
  select (
    select min(t.value_date) 
    from t
  ) + level - 1 as value_date
  from dual
  connect by level <= (
    select max(t.value_date) - min(t.value_date) + 1
    from t
  )
) d
left join t
on d.value_date = t.value_date
order by d.value_date;

… and it yields the desired result:
VALUE_DATE         |VALUE|
-------------------|-----|
2019-01-01 00:00:00|  100|
2019-01-02 00:00:00|  120|
2019-01-03 00:00:00|  120|
2019-01-04 00:00:00|  120|
2019-01-05 00:00:00|  125|
2019-01-06 00:00:00|  128|
2019-01-07 00:00:00|  128|
2019-01-08 00:00:00|  128|
2019-01-09 00:00:00|  128|
2019-01-10 00:00:00|  130|

Other RDBMS

This solution made use of some Oracle specific features such as CONNECT BY. In other RDBMS, the same idea can be implemented by using a different way of generating data. This article focuses only on using IGNORE NULLS. If you’re interested, feel free to post an alternative solution in the comments for your RDBMS.

15 thoughts on “Using IGNORE NULLS With SQL Window Functions to Fill Gaps

  1. I know that in this case it matters not, but I dislike this explanation:

    “…the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW applies, which means “all the preceding rows”.”

    Nope, ROWS BETWEEN means “all the preceding rows” – RANGE BETWEEN means “all rows with the value of the ORDER BY column less than or equal to this row” (which might include following rows.)

    I’m sure you know this – it’s just a pet peeve of mine that I always stomp on whenever I see it, so that newcomers who read it won’t be deceived ;-)

  2. SQLite is my normal weapon of choice. To do something like this, I use a recursive CTE to create the dates and a correlated subquery to pick the values:

    WITH RECURSIVE
      daterange(s,e) AS (SELECT min(date), max(date) FROM tbl),
      alldates(d) AS (SELECT s FROM daterange UNION ALL
        SELECT date(d, '+ 1 day') FROM alldates, daterange WHERE d<=e)
      SELECT alldates.d, outertbl.value FROM alldates, tbl outertbl
        WHERE outertbl.date=(SELECT innertbl.date FROM tbl innertbl WHERE innertbl.date<=alldates.d ORDER BY innertbl.date LIMIT 1)
    
    1. Nice. Works well with these two fixes:

      • Interval should be '+1 day' (no space between + and 1)
      • Correlated subquery shopuld ORDER BY innertbl.date DESC
  3. Below are two Postgres SQL solutions — i think twitter question is best answered by the complex using case when logic with subquery and outer join; I am reading the twitter comment– “from 3rd Jan 24 Jan, our value should be 120” – to mean that EACH day Jan 3,4,5,6 – Jan 24 all have value= 120; I could easily set the value in this range to null, zero, -999, etc. ;

    --how to deal with gaps simple--
     SELECT 100 as "value1", value_dt::date 
                    FROM generate_series(
                  timestamp without time zone '2019-01-01',
                  timestamp without time zone '2019-01-01',
                  '1 day') as value_dt
    union all
     SELECT 120 as "value1", value_dt::date 
                    FROM generate_series(
                  timestamp without time zone '2019-01-02',
                  timestamp without time zone '2019-01-02',
                  '1 day') as value_dt
    union all
     SELECT 125 as "value1", value_dt::date 
                    FROM generate_series(
                  timestamp without time zone '2019-01-25',
                  timestamp without time zone '2019-01-25',
                  '1 day') as value_dt
    --gaps above
    --fill gaps
    union all
     SELECT 120 as "value1", value_dt::date 
                    FROM generate_series(
                  timestamp without time zone '2019-01-03',
                  timestamp without time zone '2019-01-24',
                  '1 day') as value_dt
    order by value_dt
    --end how to deal with gaps simple--;
    
    --how to deal with gaps complex--
    select   value_dtx, case when value1 is null then value2 else value1 end as finalval  from (
             SELECT 100 as "value1", value_dt::date 
                            FROM generate_series(
                          timestamp without time zone '2019-01-01',
                          timestamp without time zone '2019-01-01',
                          '1 day') as value_dt
            union all
             SELECT 120 as "value1", value_dt::date 
                            FROM generate_series(
                          timestamp without time zone '2019-01-02',
                          timestamp without time zone '2019-01-02',
                          '1 day') as value_dt
            union all
             SELECT 125 as "value1", value_dt::date 
                            FROM generate_series(
                          timestamp without time zone '2019-01-25',
                          timestamp without time zone '2019-01-25',
                          '1 day') as value_dt
        )x
    --table above has date gaps
    right outer join (
         SELECT 120 as "value2", value_dtx::date 
                        FROM generate_series(
                      timestamp without time zone '2019-01-01',
                      timestamp without time zone '2019-01-25',
                      '1 day') as value_dtx 
        )lj on lj.value_dtx = value_dt
    --fill date gaps with right outer join
    order by value_dtx
    --end how to deal with gaps complex--;
    
      1. lukaseder- sure, but the logic can be manipulated to be more dynamic.
        see below – the range of dates in the first query can be anything; IT IS NOT hardcoded now. I only coded one end in where clause value_dtx <= value_dt but that can also be changed. The other subquery is the same as creating a table which is also hardcoded in the example

        select value_dtx, min (value1) as finalvalue from (
        SELECT 1 as grp_id, 120 as “value2”, date’1900-12-31′ as _last_dt_min2, date’2050-12-31′ as _last_dt_max2, value_dtx::date
        FROM generate_series(
        timestamp without time zone ‘2019-01-01’,
        timestamp without time zone ‘2019-01-31’,
        ‘1 day’) as value_dtx
        order by 1
        )z
        left join lateral
        (select * from (
        SELECT 1 as grp_id, 100 as “value1”, value_dt::date
        FROM generate_series(
        timestamp without time zone ‘2019-01-01’,
        timestamp without time zone ‘2019-01-01’,
        ‘1 day’) as value_dt
        union all
        SELECT 1 as grp_id, 120 as “value1”, value_dt::date
        FROM generate_series(
        timestamp without time zone ‘2019-01-02’,
        timestamp without time zone ‘2019-01-02’,
        ‘1 day’) as value_dt
        union all
        SELECT 1 as grp_id, 125 as “value1”, value_dt::date
        FROM generate_series(
        timestamp without time zone ‘2019-01-25’,
        timestamp without time zone ‘2019-01-25’,
        ‘1 day’) as value_dt
        )x
        left outer join (select 1 as grp_id,
        min(value_dt) as last_dt_min,
        max(value_dt) as last_dt_max
        from (
        SELECT 100 as “value1”, value_dt::date
        FROM generate_series(
        timestamp without time zone ‘2019-01-01’,
        timestamp without time zone ‘2019-01-01’,
        ‘1 day’) as value_dt
        union all
        SELECT 120 as “value1”, value_dt::date
        FROM generate_series(
        timestamp without time zone ‘2019-01-02’,
        timestamp without time zone ‘2019-01-02’,
        ‘1 day’) as value_dt
        union all
        SELECT 125 as “value1”, value_dt::date
        FROM generate_series(
        timestamp without time zone ‘2019-01-25’,
        timestamp without time zone ‘2019-01-25’,
        ‘1 day’) as value_dt
        )y
        group by 1
        order by 1
        )p on true
        where value_dtx <= value_dt
        )qq on true
        where value1 is not null
        group by 1
        order by 1

  4. in case it is not obvious here is the full dynamic code

    select value_dtx, min (value1) as finalvalue  from (
    SELECT   1 as grp_id, 120 as "value2", date'1900-12-31' as _last_dt_min2, date'2050-12-31' as _last_dt_max2, value_dtx::date 
                        FROM generate_series(
                      timestamp without time zone '2018-12-21',
                      timestamp without time zone '2019-01-31',
                      '1 day') as value_dtx 
    order by 1
    )z
    --the dates in z can be any range the important requirement is to fill in gaps --
    left join lateral
    (select * from (
                 SELECT 1 as grp_id, 100 as "value1", value_dt::date 
                                FROM generate_series(
                              timestamp without time zone '2019-01-01',
                              timestamp without time zone '2019-01-01',
                              '1 day') as value_dt
                union all
                 SELECT  1 as grp_id, 120 as "value1", value_dt::date 
                                FROM generate_series(
                              timestamp without time zone '2019-01-02',
                              timestamp without time zone '2019-01-02',
                              '1 day') as value_dt
                union all
                 SELECT  1 as grp_id, 125 as "value1", value_dt::date 
                                FROM generate_series(
                              timestamp without time zone '2019-01-25',
                              timestamp without time zone '2019-01-25',
                              '1 day') as value_dt
        )x
        left outer join (select    1 as grp_id,
        min(value_dt) as last_dt_min,
        max(value_dt) as last_dt_max
        from (
                 SELECT 100 as "value1", value_dt::date 
                                FROM generate_series(
                              timestamp without time zone '2019-01-01',
                              timestamp without time zone '2019-01-01',
                              '1 day') as value_dt
                union all
                 SELECT 120 as "value1", value_dt::date 
                                FROM generate_series(
                              timestamp without time zone '2019-01-02',
                              timestamp without time zone '2019-01-02',
                              '1 day') as value_dt
                union all
                 SELECT 125 as "value1", value_dt::date 
                                FROM generate_series(
                              timestamp without time zone '2019-01-25',
                              timestamp without time zone '2019-01-25',
                              '1 day') as value_dt
            )y
        group by 1
        order by 1
        )p on true
    where value_dtx <= value_dt  and value_dtx >=last_dt_min and value_dtx <=last_dt_max
    )qq on true
    where value1 is not null
    group by 1
    order by 1
    
  5. Lukas, Thanks for this inspiring post.
    Albert, PostgreSQL aficionados: This is a solution for Postgres based on chunkyks solution:

      WITH RECURSIVE 
      daterange(vdate_min, vdate_max) AS (
        SELECT min(vdate), max(vdate) FROM tbl
      ),
      alldates(vdate2) AS (
        SELECT vdate_min FROM daterange 
        UNION ALL
        SELECT vdate2 + interval '1' day FROM alldates, daterange WHERE vdate2 <= vdate_max
      )
      SELECT alldates.vdate2, outertbl.v AS v2
      FROM alldates, tbl AS outertbl
      WHERE outertbl.vdate = (
        SELECT innertbl.vdate FROM tbl AS innertbl WHERE innertbl.vdate <= alldates.vdate2 
        ORDER BY innertbl.vdate DESC
        LIMIT 1
      )
      ORDER BY vdate2;
    
  6. — Microsoft SQL Server

    drop table if exists #t;
    create table #t (value_date datetime, [value] int, index t_index_vd clustered (value_date desc));
    insert #t (value_date, [value]) output inserted.*
    select * from (values ('20190101',100),('20190102', 120),('20190105', 125),('20190106', 128),('20190110', 130)) a(d,i);
    GO
    drop table if exists #tWithoutIndex;
    create table #tWithoutIndex (value_date datetime, [value] int);
    insert #tWithoutIndex (value_date, [value]) output inserted.*
    select * from (values ('20190101',100),('20190102', 120),('20190105', 125),('20190106', 128),('20190110', 130)) a(d,i);
    GO
    declare @date_min date = (select (select min(value_date) from #t)), @date_max date = (select (select max(value_date) from #t));
    with d(value_date,i) as (select @date_min,1 union all select dateadd(day,1,d.value_date),i+1 from d where d.value_date < @date_max and d.i < 100) 
    select
      d.value_date, t.value
    from d cross apply (select top 1 [value] from #t where d.value_date >= value_date order by value_date desc) t;
    with d(value_date,i) as (select @date_min,1 union all select dateadd(day,1,d.value_date),i+1 from d where d.value_date < @date_max and d.i < 100) 
    select
      d.value_date, t.value
    from d cross apply (select top 1 [value] from #tWithoutIndex where d.value_date >= value_date order by value_date desc) t;
    GO
    
  7. Lukas, thank you for the post.
    Postgres solution might build on existing window function:

    select generate_series::date as continuous_date, value
    from (
        SELECT daterange(value_date, lead(value_date) over (order by value_date)) span, value
        FROM t
      ) foo
      , generate_series(lower(span), coalesce((upper(span) - 'P1D'::interval)::date, lower(span)) , 'P1D'::interval)
    
    1. Vitally : we’ve solved above with Postgres and your solution isn’t offering anything new. SF Keller at least added recursive sql which is a neat trick. You don’t need to use the recursion as my solution provides the correct answer when output;

  8. BigQuery doesn’t include WITH RECURSIVE or CONNECT BY , so instead I generated the dates using UNNEST(GENERATE_DATE_ARRAY()). Thankfully LAST_VALUE and IGNORE NULLS both work fine, so I used that part of Luka’s query unchanged. In the query that follows, the BiqQuery table myproj.mydataset.mytable is assumed to have a column of dates called date_orig and a column of floats called value. I suspect that some of the ORDER BY clauses are superfluous, but I threw them in anyway because BigQuery sometimes surprises me by returning data out of order.

    WITH
      dt AS (
      SELECT
        date,
        mt.value
      FROM
        UNNEST(GENERATE_DATE_ARRAY((
            SELECT
              MIN(date_orig)
            FROM
              myproj.mydataset.mytable), (
            SELECT
              MAX(date_orig)
            FROM
              myproj.mydataset.mytable))) AS date
      JOIN
        myproj.mydataset.mytable AS mt
      ON
        date = mt.date_orig
      ORDER BY
        date)
    SELECT
      dt.date,
      LAST_VALUE(dt.value IGNORE NULLS) OVER (ORDER BY dt.date)
    FROM
      dt
    ORDER BY
      dt.date;
    
  9. The equivalent of LEAD/LAG IGNORE NULLS by using windowed ARRAY_AGG function, skipping NULL values and accessing last element of cumulative ARRAY

    Below PostgreSQL version:

    WITH source_data(id, value) AS (
    VALUES (1, 100), (2, NULL), (3, NULL),
           (4, NULL), (5, NULL), (6, NULL),
           (7, NULL), (8, 200), (9, NULL)
    )
    , cte AS (
      SELECT *,
        ARRAY_REMOVE((ARRAY_AGG(value) OVER(/*PARTITION BY any_group*/ ORDER BY id)), NULL) 
           AS value_arr
      FROM source_data
    )
    SELECT id, value,
        value_arr[array_upper(value_arr, 1)] AS lag_ignore_nulls,
        value_arr -- debug
    FROM cte
    ORDER BY id;
    

    Live demo: https://dbfiddle.uk/etvmKktH
    Original post: https://stackoverflow.com/a/75634777/5070879

Leave a Reply