I found a very interesting SQL question on Twitter recently:
Rephrasing the question: We have a set of sparse data points:
+------------+-------+
| VALUE_DATE | VALUE |
+------------+-------+
| 2019-01-01 | 100 |
| 2019-01-02 | 120 |
| 2019-01-05 | 125 |
| 2019-01-06 | 128 |
| 2019-01-10 | 130 |
+------------+-------+
Since dates form a discrete, evenly spaced domain, why not fill in the gaps between 2019-01-02 and 2019-01-05, or between 2019-01-06 and 2019-01-10? The desired output would be:
+------------+-------+
| VALUE_DATE | VALUE |
+------------+-------+
| 2019-01-01 | 100 |
| 2019-01-02 | 120 | <-+
| 2019-01-03 | 120 | | -- Generated
| 2019-01-04 | 120 | | -- Generated
| 2019-01-05 | 125 |
| 2019-01-06 | 128 | <-+
| 2019-01-07 | 128 | | -- Generated
| 2019-01-08 | 128 | | -- Generated
| 2019-01-09 | 128 | | -- Generated
| 2019-01-10 | 130 |
+------------+-------+
In the generated rows, we'll simply repeat the most recent value. How can we do this with SQL?
For the sake of this example, I’m using Oracle SQL, as the OP was expecting to do this with Oracle. The idea is to do this in two steps:
- Generate all the dates between the first and the last data points
- For each date, find either the current data point, or the most recent one
But first, let’s create the data:
create table t (value_date, value) as
select date '2019-01-01', 100 from dual union all
select date '2019-01-02', 120 from dual union all
select date '2019-01-05', 125 from dual union all
select date '2019-01-06', 128 from dual union all
select date '2019-01-10', 130 from dual;
1. Generating all the dates
In Oracle, we can use the convenient CONNECT BY syntax for this. We could also use some other tool to generate the dates that fill the gaps, including standard SQL recursion using WITH, or some PIPELINED function, but I like CONNECT BY for this purpose.
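Just to illustrate the recursive alternative, here is a minimal sketch of what the WITH variant might look like (a sketch, not the main solution; carrying the maximum date through the recursion keeps the recursive member free of subqueries):

with dates (value_date, max_date) as (
  -- anchor row: the first and last dates of the data set
  select min(value_date), max(value_date) from t
  union all
  -- step one day at a time until the maximum date is reached
  select value_date + 1, max_date
  from dates
  where value_date < max_date
)
select value_date
from dates
order by value_date;

But let's stick with CONNECT BY here.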
We’ll write:
select (
select min(t.value_date)
from t
) + level - 1 as value_date
from dual
connect by level <= (
select max(t.value_date) - min(t.value_date) + 1
from t
)
This produces:
VALUE_DATE|
----------|
2019-01-01|
2019-01-02|
2019-01-03|
2019-01-04|
2019-01-05|
2019-01-06|
2019-01-07|
2019-01-08|
2019-01-09|
2019-01-10|
Now we wrap the above query in a derived table and left join the actual data set:
select
d.value_date,
t.value
from (
select (
select min(t.value_date)
from t
) + level - 1 as value_date
from dual
connect by level <= (
select max(t.value_date) - min(t.value_date) + 1
from t
)
) d
left join t
on d.value_date = t.value_date
order by d.value_date;
The date gaps are now filled, but our values column is still sparse:
VALUE_DATE|VALUE|
----------|-----|
2019-01-01| 100|
2019-01-02| 120|
2019-01-03| |
2019-01-04| |
2019-01-05| 125|
2019-01-06| 128|
2019-01-07| |
2019-01-08| |
2019-01-09| |
2019-01-10| 130|
2. Filling the value gaps
On each row, the VALUE column should contain either the actual value, or the last value preceding the current row, ignoring all the nulls. Note that I deliberately worded this requirement so that it translates directly to SQL:
last_value (t.value) ignore nulls over (order by d.value_date)
Since we have added an ORDER BY clause to the window function, the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW applies, which colloquially means “all the preceding rows”. (Technically, that's not accurate. It means all rows with values less than or equal to the value of the current row; see Kim Berg Hansen's comment below.)
Convenient! We’re trying to find the last value in the window of all the preceding rows, ignoring the nulls.
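If we want to make the intended frame explicit, and independent of the RANGE subtlety, we can spell it out. Since each value_date appears only once in our joined result, both variants behave the same here:

last_value (t.value) ignore nulls over (
  order by d.value_date
  -- explicit frame; equivalent to the default in this query
  rows between unbounded preceding and current row
)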
This is standard SQL, but unfortunately not all RDBMS support IGNORE NULLS. Among the ones supported by jOOQ, these currently support the syntax:
- DB2
- H2
- Informix
- Oracle
- Redshift
- Sybase SQL Anywhere
- Teradata
Sometimes a database supports the standard feature, but not the exact standard syntax. Use https://www.jooq.org/translate to see the different syntax variants.
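For example, if I recall correctly, Redshift expects the IGNORE NULLS clause inside the function's parentheses rather than after them:

-- Redshift dialect variant (not the standard placement)
last_value (t.value ignore nulls) over (
  order by d.value_date
  rows between unbounded preceding and current row
)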
The full query now reads:
select
d.value_date,
last_value (t.value) ignore nulls over (order by d.value_date) as value
from (
select (
select min(t.value_date)
from t
) + level - 1 as value_date
from dual
connect by level <= (
select max(t.value_date) - min(t.value_date) + 1
from t
)
) d
left join t
on d.value_date = t.value_date
order by d.value_date;
… and it yields the desired result:
VALUE_DATE |VALUE|
-------------------|-----|
2019-01-01 00:00:00| 100|
2019-01-02 00:00:00| 120|
2019-01-03 00:00:00| 120|
2019-01-04 00:00:00| 120|
2019-01-05 00:00:00| 125|
2019-01-06 00:00:00| 128|
2019-01-07 00:00:00| 128|
2019-01-08 00:00:00| 128|
2019-01-09 00:00:00| 128|
2019-01-10 00:00:00| 130|
Other RDBMS
This solution made use of some Oracle-specific features such as CONNECT BY. In other RDBMS, the same idea can be implemented with a different way of generating the dates. This article focuses only on using IGNORE NULLS. If you're interested, feel free to post an alternative solution for your RDBMS in the comments.
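To give just one example of a different way of generating the dates: in PostgreSQL, a sketch of that step could use generate_series() (the IGNORE NULLS step would still need a workaround there, as PostgreSQL doesn't support it; see the comments below):

-- generate one row per day between the first and last data points
select g.value_date::date
from generate_series(
  (select min(value_date) from t),
  (select max(value_date) from t),
  interval '1 day'
) as g (value_date);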
I know that in this case it matters not, but I dislike this explanation:
“…the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW applies, which means “all the preceding rows”.”
Nope, ROWS BETWEEN means “all the preceding rows” – RANGE BETWEEN means “all rows with the value of the ORDER BY column less than or equal to this row” (which might include following rows.)
I’m sure you know this – it’s just a pet peeve of mine that I always stomp on whenever I see it, so that newcomers who read it won’t be deceived ;-)
I indeed omitted it for the sake of not distracting, but you're right. The wording should be accurate.
SQLite is my normal weapon of choice. To do something like this, I use a recursive CTE to create the dates and a correlated subquery to pick the values.
Nice. Works well with these two fixes: '+1 day' (no space between + and 1), and ORDER BY innertbl.date DESC.
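Since the original snippet isn't reproduced above, here is a minimal reconstruction of that SQLite approach with both fixes applied, assuming the t table from the post with ISO-formatted text dates (the innertbl alias is taken from the reply above):

with recursive dates (value_date, max_date) as (
  -- anchor: first and last dates from t
  select min(value_date), max(value_date) from t
  union all
  -- '+1 day' without a space, as per the fix above
  select date(value_date, '+1 day'), max_date
  from dates
  where value_date < max_date
)
select
  d.value_date,
  -- correlated subquery: most recent value on or before this date
  (select innertbl.value
   from t as innertbl
   where innertbl.value_date <= d.value_date
   order by innertbl.value_date desc
   limit 1) as value
from dates as d
order by d.value_date;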
Below are two Postgres SQL solutions. I think the Twitter question is best answered by the more complex one, using CASE WHEN logic with a subquery and an outer join. I am reading the Twitter comment ("from 3rd Jan to 24 Jan, our value should be 120") to mean that EACH day from Jan 3 through Jan 24 has value = 120. I could easily set the value in this range to null, zero, -999, etc.
Albert! :-) You just hard coded the values that you wanted to fill in the gaps. That doesn’t count!
lukaseder: sure, but the logic can be manipulated to be more dynamic.
See below: the range of dates in the first query can be anything; it is NOT hardcoded now. I only fixed one end in the where clause (value_dtx <= value_dt), but that can also be changed. The other subquery is the same as creating a table, which is also hardcoded in the example.
select value_dtx, min(value1) as finalvalue from (
SELECT 1 as grp_id, 120 as "value2", date '1900-12-31' as _last_dt_min2, date '2050-12-31' as _last_dt_max2, value_dtx::date
FROM generate_series(
timestamp without time zone '2019-01-01',
timestamp without time zone '2019-01-31',
'1 day') as value_dtx
order by 1
)z
left join lateral
(select * from (
SELECT 1 as grp_id, 100 as "value1", value_dt::date
FROM generate_series(
timestamp without time zone '2019-01-01',
timestamp without time zone '2019-01-01',
'1 day') as value_dt
union all
SELECT 1 as grp_id, 120 as "value1", value_dt::date
FROM generate_series(
timestamp without time zone '2019-01-02',
timestamp without time zone '2019-01-02',
'1 day') as value_dt
union all
SELECT 1 as grp_id, 125 as "value1", value_dt::date
FROM generate_series(
timestamp without time zone '2019-01-25',
timestamp without time zone '2019-01-25',
'1 day') as value_dt
)x
left outer join (select 1 as grp_id,
min(value_dt) as last_dt_min,
max(value_dt) as last_dt_max
from (
SELECT 100 as "value1", value_dt::date
FROM generate_series(
timestamp without time zone '2019-01-01',
timestamp without time zone '2019-01-01',
'1 day') as value_dt
union all
SELECT 120 as "value1", value_dt::date
FROM generate_series(
timestamp without time zone '2019-01-02',
timestamp without time zone '2019-01-02',
'1 day') as value_dt
union all
SELECT 125 as "value1", value_dt::date
FROM generate_series(
timestamp without time zone '2019-01-25',
timestamp without time zone '2019-01-25',
'1 day') as value_dt
)y
group by 1
order by 1
)p on true
where value_dtx <= value_dt
)qq on true
where value1 is not null
group by 1
order by 1
In case it is not obvious, here is the full dynamic code.
Lukas, thanks for this inspiring post.
Albert, PostgreSQL aficionados: this is a solution for Postgres, based on chunkyks' solution.
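The commenter's code isn't reproduced above; a sketch of such a Postgres port of the recursive approach might look like this (a reconstruction, not the commenter's exact query):

with recursive dates (value_date, max_date) as (
  -- anchor: first and last dates from t
  select min(value_date), max(value_date) from t
  union all
  -- date + 1 adds one day in Postgres
  select value_date + 1, max_date
  from dates
  where value_date < max_date
)
select
  d.value_date,
  -- correlated subquery: most recent value on or before this date
  (select t.value
   from t
   where t.value_date <= d.value_date
   order by t.value_date desc
   limit 1) as value
from dates d
order by d.value_date;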
Recursive solution is interesting; unnecessary, but a nice trick.
— Microsoft SQL Server
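The SQL Server code itself isn't shown above either; a sketch of one possible approach follows. IGNORE NULLS only arrived with SQL Server 2022, so this version uses OUTER APPLY to fetch the most recent value instead, and OPTION (MAXRECURSION 0) lifts the default 100-iteration cap on the recursive CTE:

with bounds as (
  -- compute the date range once, to keep the recursive member simple
  select min(value_date) as min_date, max(value_date) as max_date
  from t
),
dates (value_date, max_date) as (
  select min_date, max_date from bounds
  union all
  select dateadd(day, 1, value_date), max_date
  from dates
  where value_date < max_date
)
select d.value_date, x.value
from dates d
outer apply (
  -- most recent value on or before this date
  select top 1 t.value
  from t
  where t.value_date <= d.value_date
  order by t.value_date desc
) x
order by d.value_date
option (maxrecursion 0);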
Lukas, thank you for the post.
A Postgres solution might build on an existing window function.
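The query isn't shown above; one common window-based trick in PostgreSQL (a sketch, not necessarily what the commenter posted) forms groups with a cumulative count of non-null values, then pulls each group's single non-null value:

select value_date,
       -- each group contains exactly one non-null value; max() picks it
       max(value) over (partition by grp) as value
from (
  select g.value_date::date,
         t.value,
         -- count(t.value) ignores nulls, so grp increments only on data points
         count(t.value) over (order by g.value_date) as grp
  from generate_series(
    (select min(value_date) from t),
    (select max(value_date) from t),
    interval '1 day'
  ) as g (value_date)
  left join t on t.value_date = g.value_date::date
) s
order by value_date;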
Vitaly: we've solved this above with Postgres, and your solution isn't offering anything new. SF Keller at least added recursive SQL, which is a neat trick. You don't need to use the recursion, as my solution produces the correct output.
BigQuery doesn't include WITH RECURSIVE or CONNECT BY, so instead I generated the dates using UNNEST(GENERATE_DATE_ARRAY()). Thankfully, LAST_VALUE and IGNORE NULLS both work fine, so I used that part of Lukas's query unchanged. In the query that follows, the BigQuery table myproj.mydataset.mytable is assumed to have a column of dates called date_orig and a column of floats called value. I suspect that some of the ORDER BY clauses are superfluous, but I threw them in anyway, because BigQuery sometimes surprises me by returning data out of order.
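The query itself didn't survive above; here is a sketch of what it might have looked like, using the table and column names from that description, with the date range hardcoded to the post's data for brevity:

SELECT
  d AS value_date,
  -- BigQuery accepts IGNORE NULLS inside the function call
  LAST_VALUE(t.value IGNORE NULLS) OVER (ORDER BY d) AS value
-- table and column names as described in the comment above
FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2019-01-01', DATE '2019-01-10')) AS d
LEFT JOIN `myproj.mydataset.mytable` t
  ON t.date_orig = d
ORDER BY d;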
The equivalent of LEAD/LAG with IGNORE NULLS, by using the windowed ARRAY_AGG function, skipping NULL values and accessing the last element of the cumulative array. Below is the PostgreSQL version:
Live demo: https://dbfiddle.uk/etvmKktH
Original post: https://stackoverflow.com/a/75634777/5070879
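Since that query isn't reproduced above either, a sketch of the ARRAY_AGG technique, assuming the t table from the post, might look like this:

select value_date,
       -- last element of the cumulative array = latest non-null value
       vals[cardinality(vals)] as value
from (
  select g.value_date::date,
         -- cumulative array of non-null values up to this date
         array_agg(t.value) filter (where t.value is not null)
           over (order by g.value_date) as vals
  from generate_series(
    (select min(value_date) from t),
    (select max(value_date) from t),
    interval '1 day'
  ) as g (value_date)
  left join t on t.value_date = g.value_date::date
) s
order by value_date;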