Site icon Java, SQL and jOOQ.

Emulating Window Functions in MySQL 5.7

One of MySQL 8’s biggest improvements is the support of window functions. As I always said in conferences, there’s SQL before window functions and SQL after window functions. Once you start using them, you’ll use them everywhere.

Some of you poor souls are unfortunate enough to be stuck on MySQL 5.7, either of your own choosing, or because you’re using a clone / fork that is still 5.7 compatible. While for most people, this blog post is just for your amusement, or nostalgia, for some of you this post will be quite useful.

Using local variables

A lot of Stack Overflow questions or blog posts out there show the same old trick using local variables. In a procedural context, local variables make perfect sense. For example, this statement batch.

SET @c = (SELECT COUNT(*) FROM information_schema.tables);
-- More processing
-- Return the result:
SELECT @c;

A bit hairier is the fact that these local variables can be declared within a query, and incremented procedurally within a query:

SELECT
  a, 

  -- Use and increment your variable in SELECT
  @rn := @rn + 1
FROM 
  (
    SELECT 3 AS a UNION ALL
    SELECT 4 AS a    
  ) AS t, 

  -- Declare your variable in FROM
  (SELECT @rn := 0) r
ORDER BY a;

And boom, you have a ROW_NUMBER() OVER (ORDER BY a) window function! The result being:

|a  |@rn := @rn + 1|
|---|--------------|
|3 |1 |
|4 |2 |

This works quite incidentally, because the expression incrementing the row number “happens to” be evaluated in the desired order, row by row, because of the query’s ORDER BY a clause. Revert it:

SELECT
  a, @rn := @rn + 1
FROM (
  SELECT 3 AS a UNION ALL
  SELECT 4 AS a    
) AS t, (SELECT @rn := 0) r
ORDER BY a DESC;

And you still get the desired result:

|a  |@rn := @rn + 1|
|---|--------------|
|4 |1 |
|3 |2 |

This is really hairy, because it violates the idea of SQL’s logical order of operations, which most RDBMS agree upon. It assumes ORDER BY “happens before” SELECT, just because the optimiser chooses to do things this way. You can tamper with the optimiser and break the “feature” easily, e.g. by adding DISTINCT:

SELECT DISTINCT
  a, @rn := @rn + 1
FROM (
  SELECT 3 AS a UNION ALL
  SELECT 4 AS a    
) AS t, (SELECT @rn := 0) r
ORDER BY a DESC;

Now the result is no longer what we wanted (how could it possibly be?):

|a  |@rn := @rn + 1|
|---|--------------|
|4 |2 |
|3 |1 |

The reason is that DISTINCT is typically implemented using a sort or a hashmap, both will not preserve any ordering, and according to the aforementioned logical order of operations, this is perfectly fine, because ORDER BY is supposed to “happen after” SELECT and after DISTINCT, at least logically.

But if you’re careful, and cover everything with enough tests, you could still use this trick. After all, being stuck with MySQL 5.7 is already painful enough, so why not treat yourself to an “almost window function”.

Note: Just to indicate how much of a bad idea depending on this incidental feature is, MySQL 8.x now issues a deprecation warning:

Setting user variables within expressions is deprecated and will be removed in a future release. Consider alternatives: ‘SET variable=expression, …’, or ‘SELECT expression(s) INTO variables(s)’.

The main reason I’ve seen this syntax being used on Stack Overflow so far is to emulate ROW_NUMBER, so, I’d say, good riddance (now that MySQL 8 has window function support)

PARTITION BY using ORDER BY

What I haven’t seen much on Stack Overflow or in blogs, is PARTITION BY support. Most solutions I’ve seen use ORDER BY to implement partitioning, which is fine. For example:

SELECT
  a, b,
  ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
  IF (
    @prev = a, 
    @rn := @rn + 1, 
    CASE WHEN (@prev := a) IS NOT NULL OR TRUE THEN @rn := 1 END
  ) AS rn2
FROM (
  SELECT 1 AS a, 3 AS b UNION ALL
  SELECT 2 AS a, 4 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := 0, @prev := NULL) r
ORDER BY a, b DESC;

Producing:

|a  |b  |rn1|rn2|
|---|---|---|---|
|1 |5 |1 |1 |
|1 |3 |2 |2 |
|2 |6 |1 |1 |
|2 |4 |2 |2 |

A few notes:

PARTITION BY using JSON

A more robust, but perhaps slower approach to emulating PARTITION BY would be to maintain a JSON object that keeps track of each partition key’s ROW_NUMBER(), because why not?

Behold this beauty:

SELECT
  a, b,
  ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
  json_extract(
    @rn := json_set(
      @rn, @path := concat('$."', a, '"'), 
      (coalesce(json_extract(@rn, @path), 0) + 1)
    ), 
    @path
  ) AS rn2,
  @rn AS debug -- Added for debugging purposes only
FROM (
  SELECT 1 AS a, 3 AS b UNION ALL
  SELECT 2 AS a, 4 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := '{}') r
ORDER BY b DESC;

Check out the results:

|a  |b  |rn1|rn2|debug               |
|---|---|---|---|--------------------|
|2 |6 |1 |1.0|{"1": 2.0, "2": 1.0}|
|1 |5 |1 |1.0|{"1": 1.0} |
|2 |4 |2 |2.0|{"1": 2.0, "2": 2.0}|
|1 |3 |2 |2.0|{"1": 2.0} |

You can try this on MySQL 5.7 (removing the ROW_NUMBER(), of course), and you’ll see this works perfectly fine! How does it work?

This could be simplified a bit if it wasn’t just a single expression, but since I’m thinking about implementing this emulation in jOOQ (see #14529), I wanted to do the exercise of keeping the projection unchanged (imagine, the jOOQ user writes ROW_NUMBER() with jOOQ, and wants this to “just work”).

Caveats:

Do you think you’ve seen everything? Let’s do something even more hairy:

DENSE_RANK with PARTITION BY

We won’t stop here, because once we’ve chosen this crazy path, we might as well see it to the end. Let’s emulate DENSE_RANK(), which is a bit harder, making the SQL more “beautiful”:

SELECT
  a, b,
  DENSE_RANK() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
  json_extract(
    @rn := json_set(@rn, 
      @rnpath := concat('$."rn-', a, '"'), 
      (coalesce(json_extract(@rn, @rnpath), 0) + IF (
        json_extract(@rn, @prepath := concat('$."pre-v-', a, '"')) = b, 
        0, 1
      )),
      @prepath,
      b
    ), 
    @rnpath
  ) AS rn2,
  @rn AS debug
FROM (
  SELECT 1 AS a, 3 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := '{}') r
ORDER BY b DESC;

Here’s the result:

|a  |b  |rn1|rn2|debug                                                 |
|---|---|---|---|------------------------------------------------------|
|2 |6 |1 |1.0|{"rn-1": 2.0, "rn-2": 1.0, "pre-v-1": 3, "pre-v-2": 6}|
|1 |5 |1 |1.0|{"rn-1": 1.0, "pre-v-1": 5} |
|1 |5 |1 |1.0|{"rn-1": 1.0, "pre-v-1": 5} |
|1 |3 |2 |2.0|{"rn-1": 2.0, "pre-v-1": 3} |

How does it differ?

Caveats:

Hairy enough? Let’s go deeper

RANK with PARTITION BY

Who would have thought that RANK is harder than DENSE_RANK (see this article for a direct comparison of the functions). Now, in addition to remembering the previous ordering value per partition, we also need to remember the previous rank per partition (all the while continuing to count up the row number).

Note, you can refactor this to something more readable if you remove the jOOQ imposed single expression restriction, but where’s the challenge in that, right? Here it is, bow before it in awe (or terror):

SELECT
  a, b,
  RANK() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
  coalesce(
    json_extract(
      @rn := json_set(@rn, 
        @rnpath := concat('$."rn-', a, '"'), 
        @currn := coalesce(json_extract(@rn, @rnpath), 0) + 1,
        @prevpath := concat('$."pre-v-', a, '"'),
        b,
        @prernpath := concat('$."pre-rn-', a, '"'),
        IF (json_extract(@rn, @prevpath) = b, 
          coalesce(json_extract(@rn, @prernpath), @currn) div 1,
          @currn
        )
      ), 
      @prernpath
    ), 
    @currn
  ) AS rn2,
  @rn AS debug
FROM (
  SELECT 1 AS a, 3 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := '{}') r
ORDER BY b DESC;

It produces:

|a  |b  |rn1|rn2|debug                                                                                   |
|---|---|---|---|----------------------------------------------------------------------------------------|
|2  |6  |1  |1.0|{"rn-1": 3.0, "rn-2": 1.0, "pre-v-1": 3, "pre-v-2": 6, "pre-rn-1": 3.0, "pre-rn-2": 1.0}|
|1  |5  |1  |1.0|{"rn-1": 1.0, "pre-v-1": 5, "pre-rn-1": 1.0}                                            |
|1  |5  |1  |1.0|{"rn-1": 2.0, "pre-v-1": 5, "pre-rn-1": 1.0}                                            |
|1  |3  |3  |3.0|{"rn-1": 3.0, "pre-v-1": 3, "pre-rn-1": 3.0}                                            |

How does it work? “Simply”:

Caveats:

PERCENT_RANK and CUME_DIST

I’m not convinced that these can be emulated with the local variable based approach. In principle:

But as we’ll see below, it’s not possible (I think?) to emulate COUNT(*) OVER () using this local variable based approach. You could, maybe, do another round of calculations when wrapping things in a derived table, though.

LEAD, LAG, etc.

Some of these can also be emulated with the above technique, in particular the ones that are “backward looking”.

The actual implementation is left as an exercise to the user. (Probably about time to consider upgrading to MySQL 8, by now!)

Aggregate functions

All SQL aggregate functions can be turned into window functions by appending OVER (). For example:

Since these previously discussed local variable based approaches are row-by-row based calculations, I don’t think it’s possible to emulate partition wide aggregate window functions, because these require being able to look at the entire partition, including rows that haven’t yet been projected.

However (never give up!), some window frames can be emulated also for aggregate functions, especially the backward looking ones. For simplicity, I’ll just try emulating this:

SUM(b) OVER (
  PARTITION BY a
  ORDER BY b DESC
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)

Note: without explicit window frame, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is implicit, and that means that in order to include tied rows in the sum, we’d have to again be forward looking, which I don’t think is possible with the local variable row-by-row based approach.

However, it might be possible to emulate alternative backward looking ROWS frames. That exercise is again left to the reader.

So, let’s do this:

SELECT
  a, b,
  SUM(b) OVER w AS sum1,
  json_extract(
    @w := json_set(@w, 
      @spath := concat('$."s-', a, '"'), 
      (coalesce(json_extract(@w, @spath), 0) + b),
      @cpath := concat('$."c-', a, '"'), 
      (coalesce(json_extract(@w, @cpath), 0) + 1)
    ), 
    @spath
  ) AS sum2,
  COUNT(*) OVER w AS cnt1,
  json_extract(@w, @cpath) AS cnt2,
  AVG(b) OVER w AS avg1,
  json_extract(@w, @spath) / json_extract(@w, @cpath) AS avg2,
  @w AS debug
FROM (
  SELECT 1 AS a, 3 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 2 AS a, 6 AS b
) AS t, (SELECT @w := '{}') r
WINDOW w AS (
  PARTITION BY a 
  ORDER BY b DESC 
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
ORDER BY b DESC;

The output seems correct:

|a  |b  |sum1|sum2|cnt1|cnt2|avg1  |avg2        |debug                                            |
|---|---|----|----|----|----|------|------------|-------------------------------------------------|
|2  |6  |6   |6.0 |1   |1.0 |6     |6           |{"c-1": 3.0, "c-2": 1.0, "s-1": 13.0, "s-2": 6.0}|
|1  |5  |5   |5.0 |1   |1.0 |5     |5           |{"c-1": 1.0, "s-1": 5.0}                         |
|1  |5  |10  |10.0|2   |2.0 |5     |5           |{"c-1": 2.0, "s-1": 10.0}                        |
|1  |3  |13  |13.0|3   |3.0 |4.3333|4.3333333333|{"c-1": 3.0, "s-1": 13.0}                        |

Notes:

Conclusion

This has been a fun blog post to write. I hope it was helpful to you either as an exercise to think about what window functions really do, or in the worst case, to help you poor soul actually implement things this way on MySQL 5.7.

There have been a lot of caveats. This emulation approach doesn’t always work and makes (heavy) assumptions about your query. For example:

There are probably more caveats that haven’t been discussed here. If you’re diligent, and test things heavily, however, you might be able to pull off using these approaches. Good luck (and don’t blame me 😅)

Exit mobile version