Use this Neat Window Function Trick to Calculate Time Differences in a Time Series

Whenever you feel that itch…

Can’t I calculate this with SQL?

The answer is: Yes you can! And you should! Let’s see how…

Calculating time differences between rows

Let’s consider the following database containing timestamps (e.g. in a log database). We’re using PostgreSQL syntax for this:


CREATE TABLE timestamps (
  ts timestamp
);

INSERT INTO timestamps VALUES
  ('2015-05-01 12:15:23.0'),
  ('2015-05-01 12:15:24.0'),
  ('2015-05-01 12:15:27.0'),
  ('2015-05-01 12:15:31.0'),
  ('2015-05-01 12:15:40.0'),
  ('2015-05-01 12:15:55.0'),
  ('2015-05-01 12:16:01.0'),
  ('2015-05-01 12:16:03.0'),
  ('2015-05-01 12:16:04.0'),
  ('2015-05-01 12:16:04.0');

Obviously, you’ll be adding constraints and indexes, etc. Now, let’s assume that each individual timestamp represents an event in your system, and you’d like to keep track of how long ago the previous event has happened. I.e. you’d like the following result:

ts                   delta
-------------------------------
2015-05-01 12:15:23
2015-05-01 12:15:24  00:00:01
2015-05-01 12:15:27  00:00:03
2015-05-01 12:15:31  00:00:04
2015-05-01 12:15:40  00:00:09
2015-05-01 12:15:55  00:00:15
2015-05-01 12:16:01  00:00:06
2015-05-01 12:16:03  00:00:02
2015-05-01 12:16:04  00:00:01
2015-05-01 12:16:04  00:00:00

In other words

ts1 (12:15:23) + delta (00:00:01) = ts2 (12:15:24)
ts2 (12:15:24) + delta (00:00:03) = ts3 (12:15:27)
…

This can be achieved very easily with the LAG() window function:


SELECT 
  ts, 
  ts - lag(ts, 1) OVER (ORDER BY ts) delta
FROM timestamps
ORDER BY ts;

The above reads simply:

Give me the difference between the ts value of the current row and the ts value of the row that “lags” behind this row by one, with rows ordered by ts.

Easy, right? With LAG() you can actually access any row from another row within a “sliding window” by simply specifying the lag index. We’ve already described this wonderful window function in a previous blog post.

Bonus: A running total interval

In addition to the difference between this timestamp and the previous one, we might be interested in the total difference between this timestamp and the first timestamp. This may sound like a running total (see our previous article about running totals using SQL), but it can be calculated much more easily using FIRST_VALUE() – a “cousin” of LAG()


SELECT 
  ts, 
  ts - lag(ts, 1) OVER w delta,
  ts - first_value(ts) OVER w total
FROM timestamps
WINDOW w AS (ORDER BY ts)
ORDER BY ts;

… the above query then yields

ts                   delta     total
---------------------------------------
2015-05-01 12:15:23            00:00:00
2015-05-01 12:15:24  00:00:01  00:00:01
2015-05-01 12:15:27  00:00:03  00:00:04
2015-05-01 12:15:31  00:00:04  00:00:08
2015-05-01 12:15:40  00:00:09  00:00:17
2015-05-01 12:15:55  00:00:15  00:00:32
2015-05-01 12:16:01  00:00:06  00:00:38
2015-05-01 12:16:03  00:00:02  00:00:40
2015-05-01 12:16:04  00:00:01  00:00:41
2015-05-01 12:16:04  00:00:00  00:00:41

Extra bonus: The total since a “reset” event

We can take this as far as we want. Let’s assume that we want to reset the total from time to time:


CREATE TABLE timestamps (
  ts timestamp,
  event varchar(50)
);

INSERT INTO timestamps VALUES
  ('2015-05-01 12:15:23.0', null),
  ('2015-05-01 12:15:24.0', null),
  ('2015-05-01 12:15:27.0', 'reset'),
  ('2015-05-01 12:15:31.0', null),
  ('2015-05-01 12:15:40.0', null),
  ('2015-05-01 12:15:55.0', 'reset'),
  ('2015-05-01 12:16:01.0', null),
  ('2015-05-01 12:16:03.0', null),
  ('2015-05-01 12:16:04.0', null),
  ('2015-05-01 12:16:04.0', null);

We can now run the following query:


SELECT
  ts, 
  ts - lag(ts, 1) 
       OVER (ORDER BY ts) delta,
  ts - first_value(ts) 
       OVER (PARTITION BY c ORDER BY ts) total
FROM (
  SELECT 
    COUNT(*) FILTER (WHERE EVENT = 'reset') 
             OVER (ORDER BY ts) c,
    ts
  FROM timestamps
) timestamps
ORDER BY ts;

… to produce

ts                   delta     total
---------------------------------------
2015-05-01 12:15:23            00:00:00
2015-05-01 12:15:24  00:00:01  00:00:01
2015-05-01 12:15:27  00:00:03  00:00:00 <-- reset
2015-05-01 12:15:31  00:00:04  00:00:04
2015-05-01 12:15:40  00:00:09  00:00:13
2015-05-01 12:15:55  00:00:15  00:00:00 <-- reset
2015-05-01 12:16:01  00:00:06  00:00:06
2015-05-01 12:16:03  00:00:02  00:00:08
2015-05-01 12:16:04  00:00:01  00:00:09
2015-05-01 12:16:04  00:00:00  00:00:09

The beautiful part is in the derived table


  SELECT 
    COUNT(*) FILTER (WHERE EVENT = 'reset') 
             OVER (ORDER BY ts) c,
    ts
  FROM timestamps

This derived table just adds the “partition” to each set of timestamps given the most recent “reset” event. The result of the above subquery is:

c  ts
----------------------
0  2015-05-01 12:15:23
0  2015-05-01 12:15:24
1  2015-05-01 12:15:27 <-- reset
1  2015-05-01 12:15:31
1  2015-05-01 12:15:40
2  2015-05-01 12:15:55 <-- reset
2  2015-05-01 12:16:01
2  2015-05-01 12:16:03
2  2015-05-01 12:16:04
2  2015-05-01 12:16:04

As you can see, the COUNT(*) window function counts all the previous “reset” events, ordered by timestamp. This information can then be used as the PARTITION for the FIRST_VALUE() window function in order to find the first timestamp in each partition, i.e. at the time of the most recent “reset” event:


  ts - first_value(ts) 
       OVER (PARTITION BY c ORDER BY ts) total

Conclusion

It’s almost a running gag on this blog to say that…

There was SQL before window functions and SQL after window functions

Window functions are extremely powerful and they’re a part of the SQL standard, supported in most commercial databases, in PostgreSQL, in Firebird 3.0, and in CUBRID. If you aren’t using them already, start using them today! If you’ve liked this article, find out more about window functions in any of the following articles:

8 thoughts on “Use this Neat Window Function Trick to Calculate Time Differences in a Time Series”

BJ Raval says:

December 10, 2018 at 22:49

Although this blog is 3 1/2 years old, I’d appreciate a clarification. This blog post EXACTLY explains a problem I am facing (and also provides an elegant answer) so I copied the SQL syntax with my table name and added a new column also named ‘delta’ to my table. However, the answer I got was nowhere near, Values in ‘delta’ is all zeros (00:00:00) and my equivalent column to ‘ts’ (I have a column called ‘utc_timestamp’ – formatted as ‘timestamp with time zone’ ) also shows different values – ‘2018-10-02′ 17:00:00-07’ instead of original value – ‘2018-10-03 01:47:30-07’. My PostgreSQL version is 10.5 and pgAdmin 4. The only difference I found was that the example column ‘ts’ has data type of ‘timestamp’ whereas my ‘utc_timestamp’ column has data type of ‘timestamp with timezone’
This is the SQL I adapted:

SELECT
utc_timestamp,
utc_timestamp – lag(utc_timestamp, 1) OVER (ORDER BY utc_timestamp) delta
FROM newtable1
ORDER BY utc_timestamp;

What am I doing wrong?
I’d appreciate any help. Thanks!
BJ

Loading...

1. lukaseder says:
  
  December 11, 2018 at 09:48
  
  May I ask you to ask this question on Stack Overflow: https://stackoverflow.com
  
  Loading...
  
lufte says:

February 15, 2020 at 02:44

Thank you for this really clever solution. The “Extra Bonus” solves the exact problem I was having now with a query. I knew the solution lied in window functions somehow, but I just couldn’t nail it. Thank you thank you thank you!

Loading...

1. lukaseder says:
  
  February 17, 2020 at 09:54
  
  Glad it helped! :-)
  
  Loading...
  
Lukasz says:

February 17, 2020 at 18:20

I love the last bonus part :) I had opportunity to use it today.

My version of “reset” mechanism uses SUM(CASE col = reset_value THEN 1 ELSE 0 END) OVER(PARTITION BY … ORDER BY …) but in principle it works exactly the same as your version with filtered COUNT(*).

https://stackoverflow.com/questions/51954221/increment-column-for-streaks/51954348#51954348

This pattern is one of possible approaches for solving gaps-and-island class challenges.

Loading...

1. lukaseder says:
  
  February 17, 2020 at 19:06
  
  Yes, that’s how you can do it if you don’t have FILTER available, i.e. in most RDBMS
  
  Loading...
  
Aco says:

March 3, 2020 at 03:33

This is great and solves a problem I have been facing. However, for some reason, it is recognizing my timestamps as numbers. For example, my Date field has timestamps and the field is set to the type ‘Datetime’. When I run this query, it generates the difference as if it’s a number in the new ‘delta’ column. In other words, it reads 2020-03-02 03:26:34 as 32,634. So then it uses that as the number to subtract the previous from. Thank you for the help.

Loading...

1. lukaseder says:
  
  March 3, 2020 at 09:00
  
  Surely, there’s a reason. Perhaps you could find some help on https://stackoverflow.com ?
  
  Loading...

Use this Neat Window Function Trick to Calculate Time Differences in a Time Series

Calculating time differences between rows

Bonus: A running total interval

Extra bonus: The total since a “reset” event

Conclusion

Like this:

Published by lukaseder

8 thoughts on “Use this Neat Window Function Trick to Calculate Time Differences in a Time Series”

Leave a ReplyCancel reply

Calculating time differences between rows

Bonus: A running total interval

Extra bonus: The total since a “reset” event

Conclusion

Like this:

Published by lukaseder

8 thoughts on “Use this Neat Window Function Trick to Calculate Time Differences in a Time Series”

Leave a ReplyCancel reply

Discover more from Java, SQL and jOOQ.