Emulating Window Functions in MySQL 5.7

lukaseder

3 years ago

One of MySQL 8’s biggest improvements is the support of window functions. As I always said in conferences, there’s SQL before window functions and SQL after window functions. Once you start using them, you’ll use them everywhere.

Some of you poor souls are unfortunate enough to be stuck on MySQL 5.7, either of your own choosing, or because you’re using a clone / fork that is still 5.7 compatible. While for most people, this blog post is just for your amusement, or nostalgia, for some of you this post will be quite useful.

Using local variables

A lot of Stack Overflow questions or blog posts out there show the same old trick using local variables. In a procedural context, local variables make perfect sense. For example, this statement batch.

SET @c = (SELECT COUNT(*) FROM information_schema.tables);
-- More processing
-- Return the result:
SELECT @c;

A bit hairier is the fact that these local variables can be declared within a query, and incremented procedurally within a query:

SELECT
  a, 

  -- Use and increment your variable in SELECT
  @rn := @rn + 1
FROM 
  (
    SELECT 3 AS a UNION ALL
    SELECT 4 AS a    
  ) AS t, 

  -- Declare your variable in FROM
  (SELECT @rn := 0) r
ORDER BY a;

And boom, you have a ROW_NUMBER() OVER (ORDER BY a) window function! The result being:

|a  |@rn := @rn + 1|
|---|--------------|
|3  |1             |
|4  |2             |

This works quite incidentally, because the expression incrementing the row number “happens to” be evaluated in the desired order, row by row, because of the query’s ORDER BY a clause. Revert it:

SELECT
  a, @rn := @rn + 1
FROM (
  SELECT 3 AS a UNION ALL
  SELECT 4 AS a    
) AS t, (SELECT @rn := 0) r
ORDER BY a DESC;

And you still get the desired result:

|a  |@rn := @rn + 1|
|---|--------------|
|4  |1             |
|3  |2             |

This is really hairy, because it violates the idea of SQL’s logical order of operations, which most RDBMS agree upon. It assumes ORDER BY “happens before” SELECT, just because the optimiser chooses to do things this way. You can tamper with the optimiser and break the “feature” easily, e.g. by adding DISTINCT:

SELECT DISTINCT
  a, @rn := @rn + 1
FROM (
  SELECT 3 AS a UNION ALL
  SELECT 4 AS a    
) AS t, (SELECT @rn := 0) r
ORDER BY a DESC;

Now the result is no longer what we wanted (how could it possibly be?):

|a  |@rn := @rn + 1|
|---|--------------|
|4  |2             |
|3  |1             |

The reason is that DISTINCT is typically implemented using a sort or a hashmap, both will not preserve any ordering, and according to the aforementioned logical order of operations, this is perfectly fine, because ORDER BY is supposed to “happen after” SELECT and after DISTINCT, at least logically.

But if you’re careful, and cover everything with enough tests, you could still use this trick. After all, being stuck with MySQL 5.7 is already painful enough, so why not treat yourself to an “almost window function”.

Note: Just to indicate how much of a bad idea depending on this incidental feature is, MySQL 8.x now issues a deprecation warning:

Setting user variables within expressions is deprecated and will be removed in a future release. Consider alternatives: ‘SET variable=expression, …’, or ‘SELECT expression(s) INTO variables(s)’.

The main reason I’ve seen this syntax being used on Stack Overflow so far is to emulate ROW_NUMBER, so, I’d say, good riddance (now that MySQL 8 has window function support)

PARTITION BY using ORDER BY

What I haven’t seen much on Stack Overflow or in blogs, is PARTITION BY support. Most solutions I’ve seen use ORDER BY to implement partitioning, which is fine. For example:

SELECT
  a, b,
  ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
  IF (
    @prev = a, 
    @rn := @rn + 1, 
    CASE WHEN (@prev := a) IS NOT NULL OR TRUE THEN @rn := 1 END
  ) AS rn2
FROM (
  SELECT 1 AS a, 3 AS b UNION ALL
  SELECT 2 AS a, 4 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := 0, @prev := NULL) r
ORDER BY a, b DESC;

Producing:

|a  |b  |rn1|rn2|
|---|---|---|---|
|1  |5  |1  |1  |
|1  |3  |2  |2  |
|2  |6  |1  |1  |
|2  |4  |2  |2  |

A few notes:

The desired PARTITION BY and ORDER BY clauses both have to be reflected in the top level query. If you only wanted to ORDER BY b DESC, not ORDER BY a as well, tough luck. (If you want to play around with this, try removing the ROW_NUMBER() function, which also orders stuff by a, implicitly)
I’ve tried to put all the variable assignment logic into a single expression in order to avoid any extra columns being generated. This makes the expression a bit more ugly than it needed to be.

PARTITION BY using JSON

A more robust, but perhaps slower approach to emulating PARTITION BY would be to maintain a JSON object that keeps track of each partition key’s ROW_NUMBER(), because why not?

Behold this beauty:

SELECT
  a, b,
  ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
  json_extract(
    @rn := json_set(
      @rn, @path := concat('$."', a, '"'), 
      (coalesce(json_extract(@rn, @path), 0) + 1)
    ), 
    @path
  ) AS rn2,
  @rn AS debug -- Added for debugging purposes only
FROM (
  SELECT 1 AS a, 3 AS b UNION ALL
  SELECT 2 AS a, 4 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := '{}') r
ORDER BY b DESC;

Check out the results:

|a  |b  |rn1|rn2|debug               |
|---|---|---|---|--------------------|
|2  |6  |1  |1.0|{"1": 2.0, "2": 1.0}|
|1  |5  |1  |1.0|{"1": 1.0}          |
|2  |4  |2  |2.0|{"1": 2.0, "2": 2.0}|
|1  |3  |2  |2.0|{"1": 2.0}          |

You can try this on MySQL 5.7 (removing the ROW_NUMBER(), of course), and you’ll see this works perfectly fine! How does it work?

We start with an empty object {} in the FROM clause.
On every row that is incidentally ordered by the ORDER BY b DESC clause, we’ll extract the row number value for the partition key PARTITION BY a. This is done with a dynamically created JSON path expression concat('$."', a, '"'). For example: $."1" or $."2".
At first, this value is NULL, of course, so we turn it to zero with COALESCE(<expr>, 0).
We add 1 to it
Then we JSON_SET the value back into the object, assigning the result back to @rn.
Then, we re-extract the value we’ve just calculated

This could be simplified a bit if it wasn’t just a single expression, but since I’m thinking about implementing this emulation in jOOQ (see #14529), I wanted to do the exercise of keeping the projection unchanged (imagine, the jOOQ user writes ROW_NUMBER() with jOOQ, and wants this to “just work”).

Caveats:

If the PARTITION BY clause has multiple expressions, then the composite value would have to be used as a key, e.g. using some “impossible” concatenation token (a token that can’t appear in the data set), or a hash value (risking collisions, of course), or an additional lookup, making things quite complicated.
The concat('$."', a, '"') expression doesn’t properly quote a yet, in case it contains double quotes.
If multiple distinct window function calculations with distinct ORDER BY clauses are required, then this approach won’t work as easily. It might be possible to calculate things with one derived table nest level per window function (?). However, multiple distinct PARTITION BY clauses are fine. Just generate a separate @rn variable per distinct PARTITION BY clause.
The JSON document might lose data type information. For example, in JSON, numbers may be represented as floats, so if you require decimal precision, perhaps you should work with JSON strings instead, and cast things back and forth, always valuing correctness over performance.

Do you think you’ve seen everything? Let’s do something even more hairy:

DENSE_RANK with PARTITION BY

We won’t stop here, because once we’ve chosen this crazy path, we might as well see it to the end. Let’s emulate DENSE_RANK(), which is a bit harder, making the SQL more “beautiful”:

SELECT
  a, b,
  DENSE_RANK() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
  json_extract(
    @rn := json_set(@rn, 
      @rnpath := concat('$."rn-', a, '"'), 
      (coalesce(json_extract(@rn, @rnpath), 0) + IF (
        json_extract(@rn, @prepath := concat('$."pre-v-', a, '"')) = b, 
        0, 1
      )),
      @prepath,
      b
    ), 
    @rnpath
  ) AS rn2,
  @rn AS debug
FROM (
  SELECT 1 AS a, 3 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := '{}') r
ORDER BY b DESC;

Here’s the result:

|a  |b  |rn1|rn2|debug                                                 |
|---|---|---|---|------------------------------------------------------|
|2  |6  |1  |1.0|{"rn-1": 2.0, "rn-2": 1.0, "pre-v-1": 3, "pre-v-2": 6}|
|1  |5  |1  |1.0|{"rn-1": 1.0, "pre-v-1": 5}                           |
|1  |5  |1  |1.0|{"rn-1": 1.0, "pre-v-1": 5}                           |
|1  |3  |2  |2.0|{"rn-1": 2.0, "pre-v-1": 3}                           |

How does it differ?

We now have to remember not just the previous row number value per partition ("rn-1", "rn-2"), but also the previous value of b (the ORDER BY criteria) per partition ("pre-v-1", "pre-v-2").
Then, we increment the row number per partition only if the previous value is different from the current value

Caveats:

There can still be multiple PARTITION BY expressions, as well as path escaping problems, see caveats of ROW_NUMBER for details.
If there are multiple ORDER BY columns, the "pre-v-n" values would have to remember their composite value, e.g. by nesting a JSON object. This is a bit simpler to take into account than multiple PARTITION BY expressions

Hairy enough? Let’s go deeper

RANK with PARTITION BY

Who would have thought that RANK is harder than DENSE_RANK (see this article for a direct comparison of the functions). Now, in addition to remembering the previous ordering value per partition, we also need to remember the previous rank per partition (all the while continuing to count up the row number).

Note, you can refactor this to something more readable if you remove the jOOQ imposed single expression restriction, but where’s the challenge in that, right? Here it is, bow before it in awe (or terror):

SELECT
  a, b,
  RANK() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
  coalesce(
    json_extract(
      @rn := json_set(@rn, 
        @rnpath := concat('$."rn-', a, '"'), 
        @currn := coalesce(json_extract(@rn, @rnpath), 0) + 1,
        @prevpath := concat('$."pre-v-', a, '"'),
        b,
        @prernpath := concat('$."pre-rn-', a, '"'),
        IF (json_extract(@rn, @prevpath) = b, 
          coalesce(json_extract(@rn, @prernpath), @currn) div 1,
          @currn
        )
      ), 
      @prernpath
    ), 
    @currn
  ) AS rn2,
  @rn AS debug
FROM (
  SELECT 1 AS a, 3 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := '{}') r
ORDER BY b DESC;

It produces:

|a  |b  |rn1|rn2|debug                                                                                   |
|---|---|---|---|----------------------------------------------------------------------------------------|
|2  |6  |1  |1.0|{"rn-1": 3.0, "rn-2": 1.0, "pre-v-1": 3, "pre-v-2": 6, "pre-rn-1": 3.0, "pre-rn-2": 1.0}|
|1  |5  |1  |1.0|{"rn-1": 1.0, "pre-v-1": 5, "pre-rn-1": 1.0}                                            |
|1  |5  |1  |1.0|{"rn-1": 2.0, "pre-v-1": 5, "pre-rn-1": 1.0}                                            |
|1  |3  |3  |3.0|{"rn-1": 3.0, "pre-v-1": 3, "pre-rn-1": 3.0}                                            |

How does it work? “Simply”:

We now have 3 variables:
"rn-x" is again the original ROW_NUMBER () (because RANK() behaves like ROW_NUMBER(), unless there are ties in the ORDER BY clause)
"pre-v-x" is the previous value of the ORDER BY expression. This is again required to detect ties. In case of a tie, the "pre-rn-x" needs to be returned. Otherwise, "rn-x" is returned, and "pre-rn-x" is set to its value.

Caveats:

The same as before

PERCENT_RANK and CUME_DIST

I’m not convinced that these can be emulated with the local variable based approach. In principle:

PERCENT_RANK() OVER w is just (RANK() OVER w - 1) / (COUNT(*) OVER () - 1)
CUME_DIST() OVER w is just (RANK() OVER w) / (COUNT(*) OVER ())

But as we’ll see below, it’s not possible (I think?) to emulate COUNT(*) OVER () using this local variable based approach. You could, maybe, do another round of calculations when wrapping things in a derived table, though.

LEAD, LAG, etc.

Some of these can also be emulated with the above technique, in particular the ones that are “backward looking”.

LAG: For example, with LAG, we just have to remember again the "pre-v-x" for each partition, and produce it again on the current row. Depending on the LAG‘s OFFSET, we might need to keep around a JSON array of values, always appending the current value to the array, and removing the first value, like in a FIFO queue.
LEAD: The forward looking functions just have to reverse the ORDER BY clause. For example, all LEAD functions can be implemented with LAG as well.
FIRST_VALUE: This is a bit simpler than LAG, as we don’t have to keep an entire JSON array of values. We just remember the first one, and then keep reproducing this.
LAST_VALUE is again just the inverse of FIRST_VALUE with reversed ORDER BY clause.
NTH_VALUE needs a counter per partition, to be sure we catch the Nth value. Alternatively, we can again store everything in a JSON array until it reaches size N.
IGNORE NULLS can be implemented by skipping all the NULL values from being entered into the aforementioned FIFO queue
Things get a bit trickier when there is a RANGE or ROWS clause, in case of which the JSON array / FIFO queue has to be shifted. This affects FIRST_VALUE more than LEAD, I’d say.

The actual implementation is left as an exercise to the user. (Probably about time to consider upgrading to MySQL 8, by now!)

Aggregate functions

All SQL aggregate functions can be turned into window functions by appending OVER (). For example:

SUM(x) is an aggregate function, aggregating data per group generated by the GROUP BY clause, shared by the entire query.
SUM(x) OVER () is the corresponding window function, aggregating data per partition generated by the PARTITION BY clause per window function (or rather, per explicit or implicit window specification)

Since these previously discussed local variable based approaches are row-by-row based calculations, I don’t think it’s possible to emulate partition wide aggregate window functions, because these require being able to look at the entire partition, including rows that haven’t yet been projected.

However (never give up!), some window frames can be emulated also for aggregate functions, especially the backward looking ones. For simplicity, I’ll just try emulating this:

SUM(b) OVER (
  PARTITION BY a
  ORDER BY b DESC
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)

Note: without explicit window frame, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is implicit, and that means that in order to include tied rows in the sum, we’d have to again be forward looking, which I don’t think is possible with the local variable row-by-row based approach.

However, it might be possible to emulate alternative backward looking ROWS frames. That exercise is again left to the reader.

So, let’s do this:

SELECT
  a, b,
  SUM(b) OVER w AS sum1,
  json_extract(
    @w := json_set(@w, 
      @spath := concat('$."s-', a, '"'), 
      (coalesce(json_extract(@w, @spath), 0) + b),
      @cpath := concat('$."c-', a, '"'), 
      (coalesce(json_extract(@w, @cpath), 0) + 1)
    ), 
    @spath
  ) AS sum2,
  COUNT(*) OVER w AS cnt1,
  json_extract(@w, @cpath) AS cnt2,
  AVG(b) OVER w AS avg1,
  json_extract(@w, @spath) / json_extract(@w, @cpath) AS avg2,
  @w AS debug
FROM (
  SELECT 1 AS a, 3 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 2 AS a, 6 AS b
) AS t, (SELECT @w := '{}') r
WINDOW w AS (
  PARTITION BY a 
  ORDER BY b DESC 
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
ORDER BY b DESC;

The output seems correct:

|a  |b  |sum1|sum2|cnt1|cnt2|avg1  |avg2        |debug                                            |
|---|---|----|----|----|----|------|------------|-------------------------------------------------|
|2  |6  |6   |6.0 |1   |1.0 |6     |6           |{"c-1": 3.0, "c-2": 1.0, "s-1": 13.0, "s-2": 6.0}|
|1  |5  |5   |5.0 |1   |1.0 |5     |5           |{"c-1": 1.0, "s-1": 5.0}                         |
|1  |5  |10  |10.0|2   |2.0 |5     |5           |{"c-1": 2.0, "s-1": 10.0}                        |
|1  |3  |13  |13.0|3   |3.0 |4.3333|4.3333333333|{"c-1": 3.0, "s-1": 13.0}                        |

Notes:

I’ve stored all the window calculations in the same JSON object, assuming they all share the same window specification, so they can reuse their values (e.g. AVG(x) = SUM(x) / COUNT(x), and thus AVG(x) OVER w = SUM(x) OVER w / COUNT(x) OVER w)
Other than that, things work pretty much just like for ROW_NUMBER()

Conclusion

This has been a fun blog post to write. I hope it was helpful to you either as an exercise to think about what window functions really do, or in the worst case, to help you poor soul actually implement things this way on MySQL 5.7.

There have been a lot of caveats. This emulation approach doesn’t always work and makes (heavy) assumptions about your query. For example:

You can’t use DISTINCT
You can’t use arbitrary ORDER BY clauses that don’t match the window function’s
You can’t use multiple window functions with different window specifications
You can’t use forward looking window frames (including frameless aggregate window functions that aggregate the entire partition)

There are probably more caveats that haven’t been discussed here. If you’re diligent, and test things heavily, however, you might be able to pull off using these approaches. Good luck (and don’t blame me 😅)