The following is a very common problem in all data related technologies and we’re going to look into two very lean, SQL-based solutions for it:

How do I fill the cells of a sparse data set with the “previous non-empty value”?

## The problem

The problem is really simple and I’m reusing the example provided by Stack Overflow user aljassi in this question:

We have a table containing “sparse” data:

Col1 Col2 Col3 Col4 ---------------------- A 0 1 5 B 0 4 0 C 2 0 0 D 0 0 0 E 3 5 0 F 0 3 0 G 0 3 1 H 0 1 5 I 3 5 0

The above data set contains a couple of interesting data points that are non-zero, and some gaps modelled by the value zero. In other examples, we could replace zero by `NULL`

, but it would still be the same problem. The desired result is the following:

Col1 Col2 Col3 Col4 ---------------------- A 0 15B 045C24 5D2 4 5E355F335G33 1 H315I 3 55

Note that all the generated values are highlighted in red, and they correspond to the most recent blue value.

How to do it with SQL? We’ll be looking at two solutions:

## A solution using window functions

This is the solution you should be looking for, and there are two answers in the linked Stack Overflow question that both make use of window functions:

Both solutions are roughly equivalent. Here’s how they work (using Oracle syntax):

WITH t(col1, col2, col3, col4) AS ( SELECT 'A', 0, 1, 5 FROM DUAL UNION ALL SELECT 'B', 0, 4, 0 FROM DUAL UNION ALL SELECT 'C', 2, 0, 0 FROM DUAL UNION ALL SELECT 'D', 0, 0, 0 FROM DUAL UNION ALL SELECT 'E', 3, 5, 0 FROM DUAL UNION ALL SELECT 'F', 0, 3, 0 FROM DUAL UNION ALL SELECT 'G', 0, 3, 1 FROM DUAL UNION ALL SELECT 'H', 0, 1, 5 FROM DUAL UNION ALL SELECT 'I', 3, 5, 0 FROM DUAL ) SELECT col1, nvl(last_value(nullif(col2, 0)) IGNORE NULLS OVER (ORDER BY col1), 0) col2, nvl(last_value(nullif(col3, 0)) IGNORE NULLS OVER (ORDER BY col1), 0) col3, nvl(last_value(nullif(col4, 0)) IGNORE NULLS OVER (ORDER BY col1), 0) col4 FROM t

Now, let’s decompose these window functions:

**NULLIF(colx, 0)**

This is just an easy way of producing `NULL`

values whenever we have what is an accepted “empty” value in our data set. So, instead of zeros, we just get `NULL`

. Applying this function to our data, we’re getting:

Col1 Col2 Col3 Col4 ---------------------- A NULL 1 5 B NULL 4 NULL C 2 NULL NULL D NULL NULL NULL E 3 5 NULL F NULL 3 NULL G NULL 3 1 H NULL 1 5 I 3 5 NULL

We’re doing this because now, we can make use of the useful `IGNORE NULLS`

clause that is available to some ranking functions, specifically `LAST_VALUE()`

, or `LAG()`

. We can now write:

last_value(...) IGNORE NULLS OVER (ORDER BY col1)

Where we take the last non-`NULL`

value that precedes the current row when ordering rows by `col1`

:

- If the current row contains a non-
`NULL`

value, we’re taking that value. - If the current row contains a
`NULL`

value, we’re going “up” until we reach a non-`NULL`

value - If we’re going “up” and we haven’t reached any non-
`NULL`

value, well, we get`NULL`

This leads to the following result:

Col1 Col2 Col3 Col4 ---------------------- A NULL 1 5 B NULL 4 5 C 2 4 5 D 2 4 5 E 3 5 5 F 3 3 5 G 3 3 1 H 3 1 5 I 3 5 5

Note that with most window functions, once you specify an `ORDER BY`

clause, then the following frame clause is taken as a default:

last_value(...) IGNORE NULLS OVER ( ORDER BY col1 ROWS BETWEEN UNBOUNDED PRECEEDING AND CURRENT ROW )

That’s a lot of keywords, but their meaning is not really that obscure once you get a hang of window functions. We suggest reading the following blog posts to learn more about them:

- The Difference Between ROW_NUMBER(), RANK(), and DENSE_RANK()
- Don’t Miss out on Awesome SQL Power with FIRST_VALUE(), LAST_VALUE(), LEAD(), and LAG()

Finally, because we don’t want those `NULL`

values to remain in our results, we simply remove them using `NVL()`

(or `COALESCE()`

in other databases):

nvl(last_value(...) IGNORE NULLS OVER (...), 0)

Easy, isn’t it? Note, that in this particular case, `LAG()`

and `LAST_VALUE()`

will have the same effect.

## A solution using the MODEL clause

Whenever you have a problem in (Oracle) SQL, that starts getting hard to solve with window functions, the Oracle `MODEL`

clause might offer an “easy” solution to it. I’m using quotes on “easy”, because the syntax is a bit hard to remember, but the essence of it is really not that hard.

The `MODEL`

clause is nothing else than an Oracle-specific dialect for implementing spreadsheet-like logic in the database. I highly recommend reading the relevant Whitepaper by Oracle, which explains the functionality very well:

http://www.oracle.com/technetwork/middleware/bi-foundation/10gr1-twp-bi-dw-sqlmodel-131067.pdf

Here’s how you could tackle the problem with `MODEL`

(and bear with me):

WITH t(col1, col2, col3, col4) AS ( SELECT 'A', 0, 1, 5 FROM DUAL UNION ALL SELECT 'B', 0, 4, 0 FROM DUAL UNION ALL SELECT 'C', 2, 0, 0 FROM DUAL UNION ALL SELECT 'D', 0, 0, 0 FROM DUAL UNION ALL SELECT 'E', 3, 5, 0 FROM DUAL UNION ALL SELECT 'F', 0, 3, 0 FROM DUAL UNION ALL SELECT 'G', 0, 3, 1 FROM DUAL UNION ALL SELECT 'H', 0, 1, 5 FROM DUAL UNION ALL SELECT 'I', 3, 5, 0 FROM DUAL ) SELECT * FROM t MODEL DIMENSION BY (row_number() OVER (ORDER BY col1) rn) MEASURES (col1, col2, col3, col4) RULES ( col2[any] = DECODE(col2[cv(rn)], 0, NVL(col2[cv(rn) - 1], 0), col2[cv(rn)]), col3[any] = DECODE(col3[cv(rn)], 0, NVL(col3[cv(rn) - 1], 0), col3[cv(rn)]), col4[any] = DECODE(col4[cv(rn)], 0, NVL(col4[cv(rn) - 1], 0), col4[cv(rn)]) )

There are three clauses that are of interest here:

**The DIMENSION BY clause**

Like in a Microsoft Excel spreadsheet, the `DIMENSION`

corresponds to the consecutive, distinct index of each spreadsheet cell, by which we want to access the cell. In Excel, there are always two dimensions (one written with letters A..Z, AA..ZZ, …) and the other one written with numbers (1..infinity).

Using `MODEL`

, you can specify as many dimensions as you want. In our example, we’ll only use one, the row number of each row, ordered by `col1`

(another use case for a window function).

**The MEASURES clause**

The `MEASURES`

clause specifies the individual cell values for each “cell”. In Microsoft Excel, a cell can have only one value. In Oracle’s `MODEL`

clause, we can operate on many values at once, within a “cell”.

In this case, we’ll just make all the columns our cells.

**The RULES clause**

This is the really interesting part in the `MODEL`

clause. Here, we specify by what rules we want to calculate the values of each individual cell. The syntax is simple:

RULES ( <rule 1>, <rule 2>, ..., <rule N> )

Each individual rule can implement an assignment of the form:

RULES ( cell[dimension(s)] = rule )

In our case, we’ll repeat the same rule for cells `col2`

, `col3`

, and `col4`

, and for any value of the dimension `rn`

(for row number). So, the left-hand side of the assignment is

RULES ( col2[any] = rule, col3[any] = rule, col4[any] = rule, )

The right hand side is a trivial (but not trivial-looking) expression:

DECODE(col2[cv(rn)], 0, NVL(col2[cv(rn) - 1], 0), col2[cv(rn)])

Let’s decompose again.

**DECODE**

`DECODE`

is a simple and useful Oracle function that takes a first argument, compares it with argument 2, and if they’re the same, returns argument 3, otherwise argument 4. It works like a `CASE`

, which is a bit more verbose:

DECODE(A, B, C, D) -- The same as: CASE A WHEN B THEN C ELSE D END

**cv(rn)**

`cv()`

is a `MODEL`

specific “function” that means “current value”. On the left-hand side of the assignment, we used `"any"`

as the dimension specifier, so we’re applying this rule for “any” value of `rn`

. In order to access a specific `rn`

value, we’ll simply write `cv(rn)`

, or the “current value of rn”.

**recursiveness**

The `RULES`

of the `MODEL`

clause are allowed to span a recursive tree (although not a graph, so no cycles are allowed), where each cell can be defined based on a previous cell, which is again defined based on its predecessor. We’re doing this via `col2[cv(rn) - 1]`

, where `cv(rn) - 1`

means the “current row number minus one”.

Easy, right? Granted. The syntax isn’t straight-forward and we’re only scratching the surface of what’s possible with `MODEL`

.

## Conclusion

SQL provides cool ways to implementing data-driven, declarative specifications of what your data should be like. The `MODEL`

clause is a bit eerie, but at the same time extremely powerful. Much easier and also a bit faster are window functions, a tool that should be in the tool chain of every developer working with SQL.

In this article, we’ve shown how to fill gaps in sparse data using window functions or `MODEL`

. A similar use-case are running totals. If this article has triggered your interest, I suggest reading about different approaches of calculating a running total in SQL.

LAST_VALUE(y.total_wall ignore nulls) OVER (ORDER BY x.priority desc ROWS BETWEEN UNBOUNDED preceding AND CURRENT ROW) AS total_wall_new,

Thanks for your comment, Jon. What were you trying to hint at? The syntax from the post is correct in the databases I’ve been using. What database is your syntax from?

Thanks for the reference to my solution on SO. I was to write this post on my blog, since it indeed is quite a common SQL problem, but you know – life, kids, stuff,… Did not have the time to do it yet. 🙂