I recently stumbled across this fun SQL question on reddit. The question looked at a time series of data points in which events happened. For each event, we have the start time and the end time:
timestamp            start  end
-----------------------------------
2018-09-03 07:00:00  1      null
2018-09-03 08:00:00  null   null
2018-09-03 09:00:00  null   null
2018-09-03 10:00:00  null   1
2018-09-03 12:00:00  null   null
2018-09-03 13:00:00  null   null
2018-09-03 14:00:00  1      null
2018-09-03 15:00:00  null   1
The desired output of the query should be this additional count column:
timestamp            start  end   count
-------------------------------------------
2018-09-03 07:00:00  1      null  4
2018-09-03 08:00:00  null   null  null
2018-09-03 09:00:00  null   null  null
2018-09-03 10:00:00  null   1     null
2018-09-03 12:00:00  null   null  null
2018-09-03 13:00:00  null   null  null
2018-09-03 14:00:00  1      null  2
2018-09-03 15:00:00  null   1     null
So, the rule is simple. Whenever an event starts, we would like to know how many consecutive entries it takes until the event stops again. We can visually see how that makes sense:
timestamp start end count
-------------------------------------------
2018-09-03 07:00:00 1 null 4 -- 4 Rows in this event
2018-09-03 08:00:00 null null null
2018-09-03 09:00:00 null null null
2018-09-03 10:00:00 null 1 null
2018-09-03 12:00:00 null null null -- No event here
2018-09-03 13:00:00 null null null
2018-09-03 14:00:00 1 null 2 -- 2 Rows in this event
2018-09-03 15:00:00 null 1 null
Some observations and assumptions about the problem at hand:
- No two events will ever overlap
- The time series does not progress in uniform steps, i.e. even if most data points are 1h apart, there can be larger or smaller gaps between data points
- There are, however, no two identical timestamps in the series
How can we solve this problem?
Create the data set first
We’re going to be using PostgreSQL for this example, but it will work with any database that supports window functions, which is most databases these days.
In PostgreSQL, we can use the VALUES() clause to generate data in memory easily. For the sake of simplicity, we’re not going to use timestamps, but integer representations of the timestamps. I’ve included the same out-of-the-ordinary gap between 4 and 6:
values (1, 1, null),
       (2, null, null),
       (3, null, null),
       (4, null, 1),
       (6, null, null),
       (7, null, null),
       (8, 1, null),
       (9, null, 1)
If we run this statement (yes, this is a standalone statement in PostgreSQL!), then the database will simply echo back the values we’ve sent it:
column1 |column2 |column3 |
--------|--------|--------|
1       |1       |        |
2       |        |        |
3       |        |        |
4       |        |1       |
6       |        |        |
7       |        |        |
8       |1       |        |
9       |        |1       |
How to deal with non-uniformly growing series
The fact that column1 does not grow in uniform, gapless steps means that we cannot use it to calculate the length of an event. We need to calculate an additional column that is guaranteed to contain a gapless, monotonically increasing sequence of integers. The ROW_NUMBER() window function is perfect for that.
Consider this SQL statement:
with d(a, b, c) as (
  values (1, 1, null),
         (2, null, null),
         (3, null, null),
         (4, null, 1),
         (6, null, null),
         (7, null, null),
         (8, 1, null),
         (9, null, 1)
), t as (
  select row_number() over (order by a) as rn, a, b, c
  from d
)
select * from t;
The new rn column contains a row number that is calculated for each row, based on the ordering of a. For simplicity, I’ve aliased:
a = timestamp
b = start
c = end
The result of this query is:
rn |a |b |c |
---|--|--|--|
1  |1 |1 |  |
2  |2 |  |  |
3  |3 |  |  |
4  |4 |  |1 |
5  |6 |  |  |
6  |7 |  |  |
7  |8 |1 |  |
8  |9 |  |1 |
Nothing fancy yet.
Now, how to use this rn column to find the length of an event?
Visually, we can get the idea quickly: an event’s length can be calculated using the formula RN2 - RN1 + 1:
rn |a |b |c |
---|--|--|--|
1 |1 |1 | | RN1 = 1
2 |2 | | |
3 |3 | | |
4 |4 | |1 | RN2 = 4
5 |6 | | |
6 |7 | | |
7 |8 |1 | | RN1 = 7
8 |9 | |1 | RN2 = 8
We have two events:
- 4 - 1 + 1 = 4
- 8 - 7 + 1 = 2
So, all we have to do is, for each starting point of an event at RN1, find the corresponding RN2 and do the arithmetic. This is quite a bit of syntax, but it isn’t so hard, so bear with me while I explain:
with d(a, b, c) as (
  values (1, 1, null),
         (2, null, null),
         (3, null, null),
         (4, null, 1),
         (6, null, null),
         (7, null, null),
         (8, 1, null),
         (9, null, 1)
), t as (
  select row_number() over (order by a) as rn, a, b, c
  from d
)
-- Interesting bit here:
select
  a, b, c,
  case
    when b is not null then
      min(case when c is not null then rn end)
        over (order by rn
              rows between 1 following and unbounded following)
      - rn + 1
  end cnt
from t;
Let’s look at this new cnt column, step by step. First, the easy part:
The CASE expression
There’s a case expression that goes like this:
case when b is not null then ... end cnt
All this does is check whether b is not null, and if that is true, calculate something. Remember, b = start, so we’re putting a calculated value in the row where an event started. That was the requirement.
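In isolation, a case expression without an else branch yields null for non-matching rows, which is exactly what keeps the cnt column empty outside of start rows. A minimal, hypothetical sketch (with the inner calculation replaced by a constant for clarity):

```sql
-- Only the first row, where b is not null, gets a value;
-- the other rows fall through to null implicitly:
select b,
       case when b is not null then 42 end as cnt
from (values (1), (null), (null)) as v(b);
```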
The new window function
So, what do we calculate there?
min(...) over (...) ...
A window function that finds the minimum value over a window of data. That minimum value is RN2, the next row number where the event ends. So, what do we put in the min() function to get that value?
min(case when c is not null then rn end) over (...) ...
Another case expression. When c is not null, we know the event has ended (remember, c = end). And if the event has ended, we want to find that row’s rn value. The value we’re looking for is thus the minimum of that case expression over all the rows after the row that started the event. Visually:
rn |a |b |c | case expr | minimum "next" value
---|--|--|--|-----------|---------------------
1  |1 |1 |  | null      | 4
2  |2 |  |  | null      | null
3  |3 |  |  | null      | null
4  |4 |  |1 | 4         | null
5  |6 |  |  | null      | null
6  |7 |  |  | null      | null
7  |8 |1 |  | null      | 8
8  |9 |  |1 | 8         | null
Now, we only need to specify the OVER() clause such that it forms a window of all the rows that follow the current row:
min(case when c is not null then rn end) over (order by rn rows between 1 following and unbounded following) ...
The window is ordered by rn, and it starts 1 row after the current row (1 following) and extends to the end of the partition (unbounded following).
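To see the frame in isolation, here’s a minimal sketch (the alias names t, rn, and next_rn are made up) that computes the same "smallest following row number" over four rows, using PostgreSQL’s generate_series():

```sql
-- For rn = 1..4, next_rn is the smallest rn strictly after the
-- current row: 2, 3, 4, and null for the last row (empty frame):
select rn,
       min(rn) over (order by rn
                     rows between 1 following
                              and unbounded following) as next_rn
from generate_series(1, 4) as t(rn);
```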
The only thing left to do now is do the arithmetic:
min(case when c is not null then rn end) over (order by rn rows between 1 following and unbounded following) - rn + 1
This is a verbose way of calculating RN2 - RN1 + 1, and we’re doing that only in those rows that start an event. The result of the complete query above is now:
a |b |c |cnt |
--|--|--|----|
1 |1 |  |4   |
2 |  |  |    |
3 |  |  |    |
4 |  |1 |    |
6 |  |  |    |
7 |  |  |    |
8 |1 |  |2   |
9 |  |1 |    |
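As a side note: because MIN() here is an ordinary aggregate function used as a window function, PostgreSQL also accepts a FILTER clause in place of the inner case expression. A sketch of that variant (PostgreSQL-specific; not all databases support FILTER on window aggregates):

```sql
with d(a, b, c) as (
  values (1, 1, null),
         (2, null, null),
         (3, null, null),
         (4, null, 1),
         (6, null, null),
         (7, null, null),
         (8, 1, null),
         (9, null, 1)
), t as (
  select row_number() over (order by a) as rn, a, b, c
  from d
)
select
  a, b, c,
  case
    when b is not null then
      -- FILTER replaces "case when c is not null then rn end":
      min(rn) filter (where c is not null)
        over (order by rn
              rows between 1 following and unbounded following)
      - rn + 1
  end cnt
from t;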
Great riddle. Here is my approach:
Data:
Query:
Demo: https://dbfiddle.uk/?rdbms=postgres_11&fiddle=0775e1cac55b860dc452e6e8fb31c059
Very good :-)
I am a huge fan of additional partitioning (SUM combined with CASE) when it comes to solving “gaps and islands” class problems. For me it is easier to understand and maintain than a calculation using multiple ROW_NUMBER()/MAX()/MIN(). Another example, “Increment column for streaks”: https://stackoverflow.com/a/51954348/5070879
typo: min(case when c is not null then rn en)
*end
Thanks, fixed