How to Find the Longest Consecutive Series of Events in SQL

Posted on November 7, 2015February 2, 2018 by lukaseder

A very interesting problem that can be solved very easily with SQL is to find consecutive series of events in a time series. But what is a consecutive series of events in a time series? Take Stack Overflow, for example. Stack Overflow has a cool reputation system that uses badges to reward certain behaviour. As a social website, they encourage users to visit the platform every day. As such, two distinct badges are awarded:

Informally, it is obvious what this means. You’ll have to log in on day 1. Then again on day 2. Then again (perhaps several times, it doesn’t matter) on day 3. Forgot to log in on day 4? Ooops. We’ll start counting again.

How to do this in SQL?

On this blog, every problem will find its solution in SQL. So does this. And in order to solve this problem, we’re going to use the awesome Stack Exchange Data Explorer, which exposes a lot of Stack Exchange’s publicly available usage information. Note that we won’t query the consecutive days of visits, as this information is not made available publicly. Instead, let’s query the consecutive days of posts a user has made. The backing database is SQL Server, so we can run the following statement:


SELECT DISTINCT CAST(CreationDate AS DATE) AS date
FROM Posts
WHERE OwnerUserId = ##UserId##
ORDER BY 1

… which, for my own UserId generates something like:

date          
----------
2010-11-26
2010-11-27
2010-11-29
2010-11-30
2010-12-01
2010-12-02
2010-12-03
2010-12-05
2010-12-06
2010-12-07
2010-12-08
2010-12-09
2010-12-13
2010-12-14
...
(769 rows)

(run the statement yourself, here) As we can see in the data, there have been gaps in the very early days:

date          
--------------------------------------
2010-11-26
2010-11-27 <---- Gap here after 2 days

2010-11-29
2010-11-30
2010-12-01
2010-12-02
2010-12-03 <---- Gap here after 5 days

2010-12-05
2010-12-06
2010-12-07
2010-12-08
2010-12-09 <---- Gap here after 5 days

2010-12-13
2010-12-14
...

Visually, it is very easy to see how many days in a row there were posts without any gaps. But how to do it with SQL? To simplify the problem, let’s “store” individual queries in common table expressions. The above query, we’ll call dates:


WITH 

  -- This table contains all the distinct date 
  -- instances in the data set
  dates(date) AS (
    SELECT DISTINCT CAST(CreationDate AS DATE)
    FROM Posts
    WHERE OwnerUserId = ##UserId##
  )
...

Now, the goal of the resulting query is to put all consecutive dates in the same group, such that we can aggregate over this group. The following query is what we want to write:


SELECT 
  COUNT(*) AS consecutiveDates,
  MIN(date) AS minDate,
  MAX(date) AS maxDate
FROM groups
GROUP BY grp -- This &quot;grp&quot; value will be explained later
ORDER BY 1 DESC, 2 DESC

We’d like to aggregate each group “grp” and count the number of dates in the group, as well as find the lowest and the highest date within each group.

Generating groups for consecutive dates

Let’s look at the data again, and to illustrate the idea, we’ll add consecutive row numbers, regardless of the gaps in dates:

row number   date          
--------------------------------
1            2010-11-26
2            2010-11-27

3            2010-11-29 <-- gap before this row
4            2010-11-30
5            2010-12-01
6            2010-12-02
7            2010-12-03

8            2010-12-05 <-- gap before this row

As you can see, regardless whether there is a gap between dates (two dates are not consecutive), their row numbers will still be consecutive. We can do this with the ROW_NUMBER() window function, very easily:


SELECT
  ROW_NUMBER() OVER (ORDER BY date) AS [row number],
  date
FROM dates

Now, let’s check out the following, interesting query:


WITH 

  -- This table contains all the distinct date 
  -- instances in the data set
  dates(date) AS (
    SELECT DISTINCT CAST(CreationDate AS DATE)
    FROM Posts
    WHERE OwnerUserId = ##UserId##
  ),
  
  -- Generate &quot;groups&quot; of dates by subtracting the
  -- date's row number (no gaps) from the date itself
  -- (with potential gaps). Whenever there is a gap,
  -- there will be a new group
  groups AS (
    SELECT
      ROW_NUMBER() OVER (ORDER BY date) AS rn,
      dateadd(day, -ROW_NUMBER() OVER (ORDER BY date), date) AS grp,
      date
    FROM dates
  )
SELECT *
FROM groups
ORDER BY rn

The above query yields:

rn  grp          date          
--- ----------   ----------
1   2010-11-25   2010-11-26
2   2010-11-25   2010-11-27

3   2010-11-26   2010-11-29
4   2010-11-26   2010-11-30
5   2010-11-26   2010-12-01
6   2010-11-26   2010-12-02
7   2010-11-26   2010-12-03

8   2010-11-27   2010-12-05
9   2010-11-27   2010-12-06
10  2010-11-27   2010-12-07
11  2010-11-27   2010-12-08
12  2010-11-27   2010-12-09

13  2010-11-30   2010-12-13
14  2010-11-30   2010-12-14

(run the statement yourself, here) All we did is subtract the row number from the date to get a new date “grp“. The actual date obtained this way is irrelevant. It’s just an auxiliary value. What we can guarantee, though, is that for consecutive dates, the value of grp will be the same because for all consecutive dates, the following two equations yield true:

date2 - date1 = 1 // difference in days between dates
rn2   - rn1   = 1 // difference in row numbers

Yet, for non-consecutive dates, while the difference in row numbers is still 1, the difference in days is no longer 1. The groups can now be seen easily:

rn  grp          date          
--- ----------   ----------
1   2010-11-25   2010-11-26
2   2010-11-25   2010-11-27

3   2010-11-26   2010-11-29
4   2010-11-26   2010-11-30
5   2010-11-26   2010-12-01
6   2010-11-26   2010-12-02
7   2010-11-26   2010-12-03

8   2010-11-27   2010-12-05
9   2010-11-27   2010-12-06
10  2010-11-27   2010-12-07
11  2010-11-27   2010-12-08
12  2010-11-27   2010-12-09

13  2010-11-30   2010-12-13
14  2010-11-30   2010-12-14

Thus, the complete query can now be seen here:


WITH 

  -- This table contains all the distinct date 
  -- instances in the data set
  dates(date) AS (
    SELECT DISTINCT CAST(CreationDate AS DATE)
    FROM Posts
    WHERE OwnerUserId = ##UserId##
  ),
  
  -- Generate &quot;groups&quot; of dates by subtracting the
  -- date's row number (no gaps) from the date itself
  -- (with potential gaps). Whenever there is a gap,
  -- there will be a new group
  groups AS (
    SELECT
      ROW_NUMBER() OVER (ORDER BY date) AS rn,
      dateadd(day, -ROW_NUMBER() OVER (ORDER BY date), date) AS grp,
      date
    FROM dates
  )
SELECT 
  COUNT(*) AS consecutiveDates,
  MIN(date) AS minDate,
  MAX(date) AS maxDate
FROM groups
GROUP BY grp
ORDER BY 1 DESC, 2 DESC

And it yields:

consecutiveDates minDate       maxDate       
---------------- ------------- ------------- 
14               2012-08-13    2012-08-26
14               2012-02-03    2012-02-16
10               2013-10-24    2013-11-02
10               2011-05-11    2011-05-20
9                2011-06-30    2011-07-08
7                2012-01-17    2012-01-23
7                2011-06-14    2011-06-20
6                2012-04-10    2012-04-15
6                2012-04-02    2012-04-07
6                2012-03-26    2012-03-31
6                2011-10-27    2011-11-01
6                2011-07-17    2011-07-22
6                2011-05-23    2011-05-28
...

(run the statement yourself, here)

Bonus query 1: Find consecutive weeks

The fact that we chose the granularity of days in the above query is a random choice. We simply took the timestamp from our time series and “collapsed” it to the desired granularity using a CAST function:


SELECT DISTINCT CAST(CreationDate AS DATE)

If we want to know the consecutive weeks, we’ll simply change that function to a different expression, e.g.


SELECT DISTINCT datepart(year, CreationDate) * 100 
              + datepart(week, CreationDate)

This new expression takes the year and the week and generates values like 201503 for week 03 in the year 2015. The rest of the statement remains exactly the same:


WITH 
  weeks(week) AS (
    SELECT DISTINCT datepart(year, CreationDate) * 100 
                  + datepart(week, CreationDate)
    FROM Posts
    WHERE OwnerUserId = ##UserId##
  ),
  groups AS (
    SELECT
      ROW_NUMBER() OVER (ORDER BY week) AS rn,
      dateadd(day, -ROW_NUMBER() OVER (ORDER BY week), week) AS grp,
      week
    FROM weeks
  )
SELECT 
  COUNT(*) AS consecutiveWeeks,
  MIN(week) AS minWeek,
  MAX(week) AS maxWeek
FROM groups
GROUP BY grp
ORDER BY 1 DESC, 2 DESC

And we’ll get the following result:

consecutiveWeeks minWeek maxWeek 
---------------- ------- ------- 
45               201401  201445  
29               201225  201253  
25               201114  201138  
23               201201  201223  
20               201333  201352  
16               201529  201544  
15               201305  201319  
12               201514  201525  
12               201142  201153  
9                201502  201510  
7                201447  201453  
7                201321  201327  
6                201048  201053  
4                201106  201109  
3                201329  201331  
3                201102  201104  
2                201301  201302  
2                201111  201112  
1                201512  201512

(run the statement yourself, here) Unsurprisingly, the consecutive weeks span much longer ranges, as I generally use Stack Overflow extensively.

Bonus query 2: Simplify the query using DENSE_RANK()

In a previous article, we’ve shown that SQL Trick: ROW_NUMBER() is to SELECT what DENSE_RANK() is to SELECT DISTINCT. If we go back to our consecutive days example, we can rewrite the query to find the distinct dates AND the groups in one go, using DENSE_RANK():


WITH 
  groups(date, grp) AS (
    SELECT DISTINCT 
      CAST(CreationDate AS DATE),
      dateadd(day, 
        -DENSE_RANK() OVER (ORDER BY CAST(CreationDate AS DATE)), 
        CAST(CreationDate AS DATE)) AS grp
    FROM Posts
    WHERE OwnerUserId = ##UserId##
  )
SELECT 
  COUNT(*) AS consecutiveDates,
  MIN(date) AS minDate,
  MAX(date) AS maxDate
FROM groups
GROUP BY grp
ORDER BY 1 DESC, 2 DESC

(run the statement yourself, here) If the above doesn’t make sense, I recommend reading our previous article here, which explains it:

SQL Trick: row_number() is to SELECT what dense_rank() is to SELECT DISTINCT

Published by lukaseder

I made jOOQ View all posts by lukaseder

28 thoughts on “How to Find the Longest Consecutive Series of Events in SQL”

kk says:

November 16, 2015 at 10:39
The first instance of this SQL in your article:
```
SELECT
  COUNT(*) AS consecutiveDates,
  MIN(week) AS minDate,
  MAX(week) AS maxDate
FROM groups
GROUP BY grp
ORDER BY 1 DESC, 2 DESC
```
Should use MIN(date) and MAX(date) instead of MIN(week) and MAX(week).

Loading...
Reply
1. lukaseder says:
  
  November 16, 2015 at 13:27
  
  You’re right, thanks, nice catch. Fixed.
  
  Loading...
  
  Reply
  1. Mark Oveson says:
    
    February 2, 2018 at 05:27
    
    It’s still not fixed.
    
    Loading...
    
    Reply
    1. lukaseder says:
      
      February 2, 2018 at 09:17
      
      Yeah, who would have thought that a “first instance” will be copy pasted :) Thanks. Fixed the second instance, too.
      
      Loading...
      
      Reply
Felipe Staudt says:

April 27, 2016 at 16:01

How could I make this same query to return the list of IDs with the longest consecutive series of events?

Loading...

Reply
1. lukaseder says:
  
  April 27, 2016 at 16:21
  
  You already have the GROUP BY to group each series. Now you can collect all the IDs in a comma separated string, for instance, if that’s the format you’re looking for. In PostgreSQL, this would be with STRING_AGG. In Oracle with LISTAGG. In SQL Server with STUFF (a bit more tricky). Hope this helps.
  
  Loading...
  
  Reply
Alexander says:

June 7, 2016 at 09:29

Hi,
Been implementing this to look on a minute basis to be able to report dates where our system has been reportedly “down”. I have however found gaps in the monitoring tool which sometimes misses a minute of reporting and this causes a new group to be created.

Is there any way I can increase the limit to check for gaps if there are 2 minutes between the dates?

Loading...

Reply
1. lukaseder says:
  
  June 7, 2016 at 18:03
  
  If you want to look for 2 minute gaps between timestamp, that’s a different problem statement than if you want to truncate timestamps to a 2 minute lower bound and see if those have gaps. The latter is much easier. The first is best solved with recursive SQL, I suspect, or with databases that have enhanced SQL features for time series analyses like Vertica.
  
  Loading...
  
  Reply
Sunil K (@suniljdh) says:

September 8, 2016 at 16:21

I remember its call Tabibitosan Method

Loading...

Reply
1. lukaseder says:
  
  September 8, 2016 at 16:45
  
  Excellent, I didn’t know it had a name (although, one that I won’t be able to remember, I think :) ). Thanks!
  
  Loading...
  
  Reply

Thanks for this post, it’s a creative solution.

One suggestion: the week query won’t return correct results because the weeks don’t wrap around the end of the year. You’re “unfairly” breaking streaks due to the end of the year. A wrapping query would look something like this:

WITH
  groups(week, grp) AS (
    SELECT DISTINCT
      DATEADD(d,-(DATEPART(dw,CAST(CreationDate AS DATE)) - 1),CAST(CreationDate AS DATE)),
      dateadd(day, 
        -(DENSE_RANK() OVER (ORDER BY DATEADD(d,-(DATEPART(dw,CAST(CreationDate AS DATE)) - 1),CAST(CreationDate AS DATE))) * 7), 
        DATEADD(d,-(DATEPART(dw,CAST(CreationDate AS DATE)) - 1),CAST(CreationDate AS DATE))) AS grp
    FROM Posts
    WHERE OwnerUserId = ##UserId##
  )
SELECT
  COUNT(*) AS consecutiveWeeks,
  MIN(datepart(year, week) * 100 
    + datepart(week, week)) AS minWeek,
  MAX(datepart(year, week) * 100 
    + datepart(week, week)) AS maxWeek
FROM groups
GROUP BY grp
ORDER BY 1 DESC, 2 DESC

lukaseder says:

September 15, 2016 at 21:09

Excellent, thank you very much for the feedback (and nice observation!)

Loading...

Reply

Jean-Philippe says:

February 9, 2017 at 17:45

I think I’m on the right way with your sql, but what to do if we want to group the rows if we have a time part in the date (datetime), and that we want the grouping separation when we have 10 minutes or more between ordered datetime? Thanks.

Loading...

Reply
1. lukaseder says:
  
  February 9, 2017 at 18:29
  
  Hey there,
  
  I’m not really sure what you mean. I guess these kinds of problems are best described by example input/output. And while you’re at it, I think this would be an excellent question for Stack Overflow! :)
  
  Loading...
  
  Reply
  1. Jean-Philippe says:
    
    February 9, 2017 at 20:36
    
    Thank very much! Effectively, Stack Overflow helped to answer! Here is the link
    
    Loading...
    
    Reply
    1. lukaseder says:
      
      February 9, 2017 at 20:39
      
      Oh, excellent! That was quick! Thanks for sharing here
      
      Loading...
      
      Reply
Marius Rubei says:

January 19, 2018 at 11:57

Is there a way of getting consecutive weeks without WITH clause? I don’t think MYSQL before 8 has this clause. I have some hard time figuring out a way of doing it with JOIN only. Thanks!

Loading...

Reply
1. lukaseder says:
  
  January 19, 2018 at 12:30
  
  MySQL 8 supports window functions and the WITH clause. But anyway, please ask your question on https://stackoverflow.com
  
  Loading...
  
  Reply
Mike D says:

March 7, 2018 at 23:09
A suggestion:
1) Identify the edges of the groups with the lag operator:
```
CASE 
  WHEN (creationdate - lag(creationdate) OVER(ORDER BY creationdate) < 1) 
  THEN 0 
  ELSE 1 
END as edge  
```
2) Cumulative sum on “edge”. You now have the group identifier.

Loading...
Reply
1. lukaseder says:
  
  March 8, 2018 at 09:32
  
  Sure, that identifies a row at the “edge” of a group, or a the beginning of a series. But how to take it from there? I.e. how to implement your step 2)?
  
  Loading...
  
  Reply
John says:

July 16, 2018 at 19:07

How are weekends handles in this sample. I am trying to count suspension days. With that said, a student may get suspended on a Thursday, or Friday for 4 days. I need to count Thursday, Friday, Monday, and Tuesday as consecutive days. Is that even possible?

Loading...

Reply
1. lukaseder says:
  
  July 20, 2018 at 10:42
  Of course it’s possible :-) But quite a different problem to the one I discussed in the article. Here’s a solution for PostgreSQL:
```
WITH RECURSIVE weekdays(day, n) AS (
  SELECT current_date, 1
  UNION ALL
  SELECT CASE WHEN EXTRACT (dow FROM day) = 5 THEN day + 3 ELSE day + 1 END, n + 1
  FROM weekdays
  WHERE n < 4
)
SELECT * FROM weekdays;
```
  It calculates the next 4 weekdays from now recursively. Surely there are other solutions as well, e.g. using window functions. Note that this only considers weekdays, not holidays or other special cases, in case of which you need to store your calendar in a table.
  
  Loading...
  Reply
Z C Croes says:

January 17, 2019 at 11:19

Hi there, I don’t understand the third part of your query. You’re selecting from a tabl “GROUP” and grouping by grp. Is this a seperate table that you already have or am I missing a step?

Loading...

Reply
1. lukaseder says:
  
  January 17, 2019 at 17:46
  
  The common table expression is called “groups”, and it is defined in the WITH clause of each query. Hope this helps
  
  Loading...
  
  Reply
Allan Canuto says:

January 25, 2019 at 13:54

How i can make the column “grp” in postgresql?

Loading...

Reply
1. lukaseder says:
  
  January 25, 2019 at 13:58
  
  I don’t think I understand your question
  
  Loading...
  
  Reply
  1. Allan Canuto says:
    
    January 25, 2019 at 14:45
    
    In postgreSQL don’t exist the function dateadd, the similar is “date + INTERVAL ‘1 day’ “, but I can’t use the row_number as the parameter for that funcion.
    
    Loading...
    
    Reply
    1. lukaseder says:
      
      January 25, 2019 at 14:48
      
      Use date – row_number() OVER()::INT instead
      
      Loading...
      
      Reply

How to Find the Longest Consecutive Series of Events in SQL

How to do this in SQL?

Generating groups for consecutive dates

Bonus query 1: Find consecutive weeks

Bonus query 2: Simplify the query using DENSE_RANK()

Further reading

Like this:

Published by lukaseder

28 thoughts on “How to Find the Longest Consecutive Series of Events in SQL”

Leave a ReplyCancel reply

How to do this in SQL?

Generating groups for consecutive dates

Bonus query 1: Find consecutive weeks

Bonus query 2: Simplify the query using DENSE_RANK()

Further reading

Like this:

Published by lukaseder

28 thoughts on “How to Find the Longest Consecutive Series of Events in SQL”

Leave a ReplyCancel reply

Discover more from Java, SQL and jOOQ.