Beware of Functional Programming in Java!

This isn’t going to be a rant about functional programming, which is awesome. This is a warning about some practices that you are very likely going to apply to your code, which are terribly wrong!.

Higher order functions are essential to functional programming, and thus, talking about them will help you be the center of attention at parties.

If you’re writing JavaScript, you’re doing it all the time. For instance:

setTimeout(function() {
    alert('10 Seconds passed');
}, 10000);

The above setTimeout() function is a higher-order function. It is a function that takes an anonymous function as an argument. After 10 seconds, it will call the function passed as an argument.

We can write another easy higher-order function that provides the above function as a result:

var message = function(text) {
    return function() {
        alert(text);
    }
};

setTimeout(message('10 Seconds passed'), 10000);

If you execute the above, message() will be executed, returning an anonymous function, which alerts the argument text you have passed to message()

In functional programming, the above is common practice. A function returned from a higher-order function will capture the outer scope and is able to act on this scope when called.

Why is this practice treacherous in Java?

For the same reasons. A “function” (lambda) returned from a higher-order “function” (method) will capture the outer scope and is able to act on this scope when called.

The most trivial example is given here:

class Test {
    public static void main(String[] args) {
        Runnable runnable = runnable();
        runnable.run(); // Breakpoint here
    }

    static Runnable runnable() {
        return () -> {
            System.out.println("Hello");
        };
    }
}

In the above logic, if you put a breakpoint right where the runnable.run() call is made, you can see the harmless lambda instance on the stack. A simple generated class, backing the functional interface implementation:

harmless-lambda-instance

Now let’s translate this example to your average Enterprise™ application (notice the annotations), which we’ve greatly simplified to fit this blog post:

class Test {
    public static void main(String[] args) {
        Runnable runnable = new EnterpriseBean()
            .runnable();
        runnable.run(); // Breakpoint here
    }
}

@ImportantDeclaration
@NoMoreXML({
    @CoolNewValidationStuff("Annotations"),
    @CoolNewValidationStuff("Rock")
})
class EnterpriseBean {
    Object[] enterpriseStateObject = 
        new Object[100_000_000];

    Runnable runnable() {
        return () -> {
            System.out.println("Hello");
        };
    }
}

The breakpoint is still at the same spot. What do we see on the stack?

Still a harmless little lambda instance:

harmless-lambda-instance-2

Fine. Of course. Let’s add some additional logging, just for debugging

class Test {
    public static void main(String[] args) {
        Runnable runnable = new EnterpriseBean()
            .runnable();
        runnable.run(); // Breakpoint here
    }
}

@ImportantDeclaration
@NoMoreXML({
    @CoolNewValidationStuff("Annotations"),
    @CoolNewValidationStuff("Rock")
})
class EnterpriseBean {
    Object[] enterpriseStateObject = 
        new Object[100_000_000];

    Runnable runnable() {
        return () -> {
            // Some harmless debugging here
            System.out.println("Hello from: " + this);
        };
    }
}

Ooops!

Suddenly, the “harmless” little this reference forced the Java compiler to enclose the enclosing instance of the EnterpriseBean™ in the returned Runnable class:

treacherous-lambda-with-enclosing-instance

And with it that heavy enterpriseStateObject came along, which can now no longer be garbage collected, until the call site releases the harmless little Runnable

Sure. Just be careful, know what you’re doing, and don’t reference “this” from a lambda

… you say?

How about a more subtle version?

class EnterpriseBean {
    Object[] enterpriseStateObject = 
        new Object[100_000_000];

    Runnable runnable() {
        return () -> log(); // implicit this.log()
    }

    void log() {
        // Some harmless debugging here
        System.out.println("Hello");
    }
}

OK, this is nothing new now, is it?

Indeed, it isn’t. Java 8 doesn’t have first-class functions, and that’s OK. The idea of backing lambda expressions by nominal SAM types is quite cunning, as it allowed to upgrade and lambda-y-fy all existing libraries in the Java ecosystem without changing them.

Also, with an anonymous class, this whole story would not have been surprising. The following coding style has leaked internal state via anonymous classes since good old Swing 1.0 style ActionListener et al.

class Test {
    public static void main(String[] args) {
        Runnable runnable = new EnterpriseBean()
            .runnable();
        runnable.run();
    }
}

@ImportantDeclaration
@NoMoreXML({
    @CoolNewValidationStuff("Annotations"),
    @CoolNewValidationStuff("Rock")
})
class EnterpriseBean {
    Object[] enterpriseStateObject = 
        new Object[100_000_000];

    Runnable runnable() {
        return new Runnable() {
            @Override
            public void run() {
                System.out.println("Hello from " + EnterpriseBean.this);
            }
        };
    }
}

What’s new? The lambda style will encourage using higher-order functions in Java, all over the place. Which is generally good. But only when the higher-order function is a static method, whose resulting types will not enclose any state.

With the above examples, however, we can see that we’re going to be debugging through a couple of memory leaks and problems in the near future, when we start embracing Java 8’s functional style programming.

So, be careful, and follow this rule:

(“Pure”) Higher order functions MUST be static methods in Java!

Further reading

Enclosing instances have caused issues before. Read about how the dreaded double curly braces anti pattern has caused pain and suffering among Java developers for the last two decades.

How to Find the Longest Consecutive Series of Events in SQL

A very interesting problem that can be solved very easily with SQL is to find consecutive series of events in a time series. But what is a consecutive series of events in a time series?

Take Stack Overflow, for example. Stack Overflow has a cool reputation system that uses badges to reward certain behaviour. As a social website, they encourage users to visit the platform every day. As such, two distinct badges are awarded:

stackoverflow-visits

Informally, it is obvious what this means. You’ll have to log in on day 1. Then again on day 2. Then again (perhaps several times, it doesn’t matter) on day 3. Forgot to log in on day 4? Ooops. We’ll start counting again.

How to do this in SQL?

On this blog, every problem will find its solution in SQL. So does this. And in order to solve this problem, we’re going to use the awesome Stack Exchange Data Explorer, which exposes a lot of Stack Exchange’s publicly available usage information.

Note that we won’t query the consecutive days of visits, as this information is not made available publicly. Instead, let’s query the consecutive days of posts a user has made.

The backing database is SQL Server, so we can run the following statement:

SELECT DISTINCT CAST(CreationDate AS DATE) AS date
FROM Posts
WHERE OwnerUserId = ##UserId##
ORDER BY 1

… which, for my own UserId generates something like:

date          
----------
2010-11-26
2010-11-27
2010-11-29
2010-11-30
2010-12-01
2010-12-02
2010-12-03
2010-12-05
2010-12-06
2010-12-07
2010-12-08
2010-12-09
2010-12-13
2010-12-14
...
(769 rows)

(run the statement yourself, here)

As we can see in the data, there have been gaps in the very early days:

date          
--------------------------------------
2010-11-26
2010-11-27 <---- Gap here after 2 days

2010-11-29
2010-11-30
2010-12-01
2010-12-02
2010-12-03 <---- Gap here after 5 days

2010-12-05
2010-12-06
2010-12-07
2010-12-08
2010-12-09 <---- Gap here after 5 days

2010-12-13
2010-12-14
...

Visually, it is very easy to see how many days in a row there were posts without any gaps. But how to do it with SQL?

To simplify the problem, let’s “store” individual queries in common table expressions. The above query, we’ll call dates:

WITH 

  -- This table contains all the distinct date 
  -- instances in the data set
  dates(date) AS (
    SELECT DISTINCT CAST(CreationDate AS DATE)
    FROM Posts
    WHERE OwnerUserId = ##UserId##
  )
...

Now, the goal of the resulting query is to put all consecutive dates in the same group, such that we can aggregate over this group. The following query is what we want to write:

SELECT 
  COUNT(*) AS consecutiveDates,
  MIN(date) AS minDate,
  MAX(date) AS maxDate
FROM groups
GROUP BY grp -- This "grp" value will be explained later
ORDER BY 1 DESC, 2 DESC

We’d like to aggregate each group “grp” and count the number of dates in the group, as well as find the lowest and the highest date within each group.

Generating groups for consecutive dates

Let’s look at the data again, and to illustrate the idea, we’ll add consecutive row numbers, regardless of the gaps in dates:

row number   date          
--------------------------------
1            2010-11-26
2            2010-11-27

3            2010-11-29 <-- gap before this row
4            2010-11-30
5            2010-12-01
6            2010-12-02
7            2010-12-03

8            2010-12-05 <-- gap before this row

As you can see, regardless whether there is a gap between dates (two dates are not consecutive), their row numbers will still be consecutive. We can do this with the ROW_NUMBER() window function, very easily:

SELECT
  ROW_NUMBER() OVER (ORDER BY date) AS [row number],
  date
FROM dates

Now, let’s check out the following, interesting query:

WITH 

  -- This table contains all the distinct date 
  -- instances in the data set
  dates(date) AS (
    SELECT DISTINCT CAST(CreationDate AS DATE)
    FROM Posts
    WHERE OwnerUserId = ##UserId##
  ),
  
  -- Generate "groups" of dates by subtracting the
  -- date's row number (no gaps) from the date itself
  -- (with potential gaps). Whenever there is a gap,
  -- there will be a new group
  groups AS (
    SELECT
      ROW_NUMBER() OVER (ORDER BY date) AS rn,
      dateadd(day, -ROW_NUMBER() OVER (ORDER BY date), date) AS grp,
      date
    FROM dates
  )
SELECT *
FROM groups
ORDER BY rn

The above query yields:

rn  grp          date          
--- ----------   ----------
1   2010-11-25   2010-11-26
2   2010-11-25   2010-11-27

3   2010-11-26   2010-11-29
4   2010-11-26   2010-11-30
5   2010-11-26   2010-12-01
6   2010-11-26   2010-12-02
7   2010-11-26   2010-12-03

8   2010-11-27   2010-12-05
9   2010-11-27   2010-12-06
10  2010-11-27   2010-12-07
11  2010-11-27   2010-12-08
12  2010-11-27   2010-12-09

13  2010-11-30   2010-12-13
14  2010-11-30   2010-12-14

(run the statement yourself, here)

All we did is subtract the row number from the date to get a new date “grp“. The actual date obtained this way is irrelevant. It’s just an auxiliary value.

What we can guarantee, though, is that for consecutive dates, the value of grp will be the same because for all consecutive dates, the following two equations yield true:

date2 - date1 = 1 // difference in days between dates
rn2   - rn1   = 1 // difference in row numbers

Yet, for non-consecutive dates, while the difference in row numbers is still 1, the difference in days is no longer 1. The groups can now be seen easily:

rn  grp          date          
--- ----------   ----------
1   2010-11-25   2010-11-26
2   2010-11-25   2010-11-27

3   2010-11-26   2010-11-29
4   2010-11-26   2010-11-30
5   2010-11-26   2010-12-01
6   2010-11-26   2010-12-02
7   2010-11-26   2010-12-03

8   2010-11-27   2010-12-05
9   2010-11-27   2010-12-06
10  2010-11-27   2010-12-07
11  2010-11-27   2010-12-08
12  2010-11-27   2010-12-09

13  2010-11-30   2010-12-13
14  2010-11-30   2010-12-14

Thus, the complete query can now be seen here:

WITH 

  -- This table contains all the distinct date 
  -- instances in the data set
  dates(date) AS (
    SELECT DISTINCT CAST(CreationDate AS DATE)
    FROM Posts
    WHERE OwnerUserId = ##UserId##
  ),
  
  -- Generate "groups" of dates by subtracting the
  -- date's row number (no gaps) from the date itself
  -- (with potential gaps). Whenever there is a gap,
  -- there will be a new group
  groups AS (
    SELECT
      ROW_NUMBER() OVER (ORDER BY date) AS rn,
      dateadd(day, -ROW_NUMBER() OVER (ORDER BY date), date) AS grp,
      date
    FROM dates
  )
SELECT 
  COUNT(*) AS consecutiveDates,
  MIN(week) AS minDate,
  MAX(week) AS maxDate
FROM groups
GROUP BY grp
ORDER BY 1 DESC, 2 DESC

And it yields:

consecutiveDates minDate       maxDate       
---------------- ------------- ------------- 
14               2012-08-13    2012-08-26
14               2012-02-03    2012-02-16
10               2013-10-24    2013-11-02
10               2011-05-11    2011-05-20
9                2011-06-30    2011-07-08
7                2012-01-17    2012-01-23
7                2011-06-14    2011-06-20
6                2012-04-10    2012-04-15
6                2012-04-02    2012-04-07
6                2012-03-26    2012-03-31
6                2011-10-27    2011-11-01
6                2011-07-17    2011-07-22
6                2011-05-23    2011-05-28
...

(run the statement yourself, here)

Bonus query 1: Find consecutive weeks

The fact that we chose the granularity of days in the above query is a random choice. We simply took the timestamp from our time series and “collapsed” it to the desired granularity using a CAST function:

SELECT DISTINCT CAST(CreationDate AS DATE)

If we want to know the consecutive weeks, we’ll simply change that function to a different expression, e.g.

SELECT DISTINCT datepart(year, CreationDate) * 100 
              + datepart(week, CreationDate)

This new expression takes the year and the week and generates values like 201503 for week 03 in the year 2015. The rest of the statement remains exactly the same:

WITH 
  weeks(week) AS (
    SELECT DISTINCT datepart(year, CreationDate) * 100 
                  + datepart(week, CreationDate)
    FROM Posts
    WHERE OwnerUserId = ##UserId##
  ),
  groups AS (
    SELECT
      ROW_NUMBER() OVER (ORDER BY week) AS rn,
      dateadd(day, -ROW_NUMBER() OVER (ORDER BY week), week) AS grp,
      week
    FROM weeks
  )
SELECT 
  COUNT(*) AS consecutiveWeeks,
  MIN(week) AS minWeek,
  MAX(week) AS maxWeek
FROM groups
GROUP BY grp
ORDER BY 1 DESC, 2 DESC

And we’ll get the following result:

consecutiveWeeks minWeek maxWeek 
---------------- ------- ------- 
45               201401  201445  
29               201225  201253  
25               201114  201138  
23               201201  201223  
20               201333  201352  
16               201529  201544  
15               201305  201319  
12               201514  201525  
12               201142  201153  
9                201502  201510  
7                201447  201453  
7                201321  201327  
6                201048  201053  
4                201106  201109  
3                201329  201331  
3                201102  201104  
2                201301  201302  
2                201111  201112  
1                201512  201512  

(run the statement yourself, here)

Unsurprisingly, the consecutive weeks span much longer ranges, as I generally use Stack Overflow extensively.

Bonus query 2: Simplify the query using DENSE_RANK()

In a previous article, we’ve shown that SQL Trick: ROW_NUMBER() is to SELECT what DENSE_RANK() is to SELECT DISTINCT.

If we go back to our consecutive days example, we can rewrite the query to find the distinct dates AND the groups in one go, using DENSE_RANK():

WITH 
  groups(date, grp) AS (
    SELECT DISTINCT 
      CAST(CreationDate AS DATE),
      dateadd(day, 
        -DENSE_RANK() OVER (ORDER BY CAST(CreationDate AS DATE)), 
        CAST(CreationDate AS DATE)) AS grp
    FROM Posts
    WHERE OwnerUserId = ##UserId##
  )
SELECT 
  COUNT(*) AS consecutiveDates,
  MIN(date) AS minDate,
  MAX(date) AS maxDate
FROM groups
GROUP BY grp
ORDER BY 1 DESC, 2 DESC

(run the statement yourself, here)

If the above doesn’t make sense, I recommend reading our previous article here, which explains it:

https://blog.jooq.org/2013/10/09/sql-trick-row_number-is-to-select-what-dense_rank-is-to-select-distinct/

Further reading

The above has been one very useful example of using window functions (ROW_NUMBER()) in SQL. Learn more about window functions in any of the following articles: