How to Order Versioned File Names Semantically in Java

In most cases, natural sorting by sorting lexicographically is useful as a default in Java. This includes sorting file names, which are sorted lexicographically as well.

However, when we have version numbers in our files (such as a set of SQL migration scripts), then we prefer the files to be sorted in a more intuitive ordering, where the version numbers contained in the string become “semantic”. In the following example, we have a set of versions, once sorted “naturally”, and once “semantically”:

Natural sorting

  • version-1
  • version-10
  • version-10.1
  • version-2
  • version-21

Semantic sorting

  • version-1
  • version-2
  • version-10
  • version-10.1
  • version-21

Semantic ordering, Windows style

The Windows Explorer does this as well, although there’s a slight difference as the “.” character is used to separate filename from ending, so now, we’re comparing a version sub-number (1) with a file ending (sql)…

The JDK doesn’t seem to have a built-in Comparator that implements this ordering, but we can easily roll our own. The idea is simple. We want to split a file name into several chunks, where a chunk is either a string (sorted lexicographically), or an integer number (sorted numerically). We split that file name using a regular expression:

Pattern.compile("(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)");

This expression matches the boundary between string and number, without actually capturing anything, so we can use it for split() operations. The idea was inspired by this stack exchange answer. Here’s the logic of the comparator annotated with comments:

public final class FilenameComparator
implements Comparator<String> {

    private static final Pattern NUMBERS = 
        Pattern.compile("(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)");

    @Override
    public final int compare(String o1, String o2) {

        // Optional "NULLS LAST" semantics:
        if (o1 == null || o2 == null)
            return o1 == null ? o2 == null ? 0 : -1 : 1;

        // Splitting both input strings by the above patterns
        String[] split1 = NUMBERS.split(o1);
        String[] split2 = NUMBERS.split(o2);
        int length = Math.min(split1.length, split2.length);

        // Looping over the individual segments
        for (int i = 0; i < length; i++) {
            char c1 = split1[i].charAt(0);
            char c2 = split2[i].charAt(0);
            int cmp = 0;

            // If both segments start with a digit, sort them
            // numerically using BigInteger to stay safe
            if (c1 >= '0' && c1 <= '9' && c2 >= 0 && c2 <= '9')
                cmp = new BigInteger(split1[i]).compareTo(
                      new BigInteger(split2[i]));

            // If we haven't sorted numerically before, or if
            // numeric sorting yielded equality (e.g 007 and 7)
            // then sort lexicographically
            if (cmp == 0)
                cmp = split1[i].compareTo(split2[i]);

            // Abort once some prefix has unequal ordering
            if (cmp != 0)
                return cmp;
        }

        // If we reach this, then both strings have equally
        // ordered prefixes, but maybe one string is longer than
        // the other (i.e. has more segments)
        return split1.length - split2.length;
    }
}

That’s it. Here’s an example on how to use this:

// Random order
List<String> list = asList(
    "version-10", 
    "version-2", 
    "version-21", 
    "version-1", 
    "version-10.1"
);

// Turn versions into files
List<File> l2 = list
    .stream()
    .map(s -> "C:\\temp\\" + s + ".sql")
    .map(File::new)
    .collect(Collectors.toList());

System.out.println("Natural sorting");
l2.stream()
  .sorted()
  .forEach(System.out::println);

System.out.println();
System.out.println("Semantic sorting");
l2.stream()
  .sorted(Comparator.comparing(
      File::getName, 
      new FilenameComparator()))
  .forEach(System.out::println);

The output is:

Natural sorting
C:\temp\version-1.sql
C:\temp\version-10.1.sql
C:\temp\version-10.sql
C:\temp\version-2.sql
C:\temp\version-21.sql

Semantic sorting
C:\temp\version-1.sql
C:\temp\version-2.sql
C:\temp\version-10.1.sql
C:\temp\version-10.sql
C:\temp\version-21.sql

Again, the algorithm is rather simple as it doesn’t distinguish between file endings and “segments”, so (1) is compared with (sql), which might not be the desired behaviour. This can be easily fixed by recognising actual file endings and excluding them from the comparison logic – at the price of not being able to sort files without file endings… The comparator would then look like this:

public final class FilenameComparator
implements Comparator<String> {

    private static final Pattern NUMBERS = 
        Pattern.compile("(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)");
    private static final Pattern FILE_ENDING =
        Pattern.compile("(?<=.*)(?=\\..*)");

    @Override
    public final int compare(String o1, String o2) {
        if (o1 == null || o2 == null)
            return o1 == null ? o2 == null ? 0 : -1 : 1;

        String[] name1 = FILE_ENDING.split(o1);
        String[] name2 = FILE_ENDING.split(o2);

        String[] split1 = NUMBERS.split(name1[0]);
        String[] split2 = NUMBERS.split(name2[0]);
        int length = Math.min(split1.length, split2.length);

        // Looping over the individual segments
        for (int i = 0; i < length; i++) {
            char c1 = split1[i].charAt(0);
            char c2 = split2[i].charAt(0);
            int cmp = 0;

            if (c1 >= '0' && c1 <= '9' && c2 >= 0 && c2 <= '9')
                cmp = new BigInteger(split1[i]).compareTo(
                      new BigInteger(split2[i]));

            if (cmp == 0)
                cmp = split1[i].compareTo(split2[i]);

            if (cmp != 0)
                return cmp;
        }

        int cmp = split1.length - split2.length;
        if (cmp != 0)
            return cmp;

        cmp = name1.length - name2.length;
        if (cmp != 0)
            return cmp;

        return name1[1].compareTo(name2[1]);
    }
}

The output is now:

C:\temp\version-1.sql
C:\temp\version-2.sql
C:\temp\version-10.sql
C:\temp\version-10.1.sql
C:\temp\version-21.sql

Discussion about a JDK implementation

Tagir Valeev from JetBrains was so kind to point out discussions about adding such an implementation to the JDK:

The discussion is here:

Clearly, the suggested implementation on the JDK mailing list is superior to the one from this blog post, as it:

  • Correctly handles unicode
  • Works with individual codepoint based comparisons rather than regular expressions, which has a lower memory footprint. This can be significant for sorting large lists, as sorting has O(N log N) complexity

Java 8 Friday Goodies: Lambdas and Sorting

At Data Geekery, we love Java. And as we’re really into jOOQ’s fluent API and query DSL, we’re absolutely thrilled about what Java 8 will bring to our ecosystem. We have blogged a couple of times about some nice Java 8 goodies, and now we feel it’s time to start a new blog series, the…

Java 8 Friday

Every Friday, we’re showing you a couple of nice new tutorial-style Java 8 features, which take advantage of lambda expressions, extension methods, and other great stuff. You’ll find the source code on GitHub. tweet this

Java 8 Goodie: Lambdas and Sorting

Sorting arrays, and collections is an awesome use-case for Java 8’s lambda expression for the simple reason that Comparator has always been a @FunctionalInterface all along since its introduction in JDK 1.2. We can now supply Comparators in the form of a lambda expression to various sort() methods.

For the following examples, we’re going to use this simple Person class:

static class Person {
    final String firstName;
    final String lastName;

    Person(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    @Override
    public String toString() {
        return "Person{" +
                "firstName='" + firstName + '\'' +
                ", lastName='" + lastName + '\'' +
                '}';
    }
}

Obviously, we could add natural sorting to Person as well by letting it implement Comparable, but lets focus on external Comparators. Consider the following list of Person, whose names are generated with some online random name generator:

List<Person> people =
Arrays.asList(
    new Person("Jane", "Henderson"),
    new Person("Michael", "White"),
    new Person("Henry", "Brighton"),
    new Person("Hannah", "Plowman"),
    new Person("William", "Henderson")
);

We probably want to sort them by last name and then by first name.

Sorting with Java 7

A “classic” Java 7 example of such a Comparator is this:

people.sort(new Comparator<Person>() {
  @Override
  public int compare(Person o1, Person o2) {
    int result = o1.lastName.compareTo(o2.lastName);

    if (result == 0)
      result = o1.firstName.compareTo(o2.firstName);

    return result;
  }
});
people.forEach(System.out::println);

And the above would yield:

Person{firstName='Henry', lastName='Brighton'}
Person{firstName='Jane', lastName='Henderson'}
Person{firstName='William', lastName='Henderson'}
Person{firstName='Hannah', lastName='Plowman'}
Person{firstName='Michael', lastName='White'}

Sorting with Java 8

Now, let’s translate the above to equivalent Java 8 code:

Comparator<Person> c = (p, o) ->
    p.lastName.compareTo(o.lastName);

c = c.thenComparing((p, o) ->
    p.firstName.compareTo(o.firstName));

people.sort(c);
people.forEach(System.out::println);

The result is obviously the same. How to read the above? First, we assign a lambda expression to a local Person Comparator variable:

Comparator<Person> c = (p, o) ->
    p.lastName.compareTo(o.lastName);

Unlike Scala, C#, or Ceylon which know type inference from an expression towards a local variable declaration through a val keyword (or similar), Java performs type inference from a variable (or parameter, member) declaration towards an expression that is being assigned.

In other, more informal words, type inference is performed from “left to right”, not from “right to left”. This makes chaining Comparators a bit cumbersome, as the Java compiler cannot delay type inference for lambda expressions until you pass the comparator to the sort() method.

Once we have assigned a Comparator to a variable, however, we can fluently chain other comparators through thenComparing():

c = c.thenComparing((p, o) ->
    p.firstName.compareTo(o.firstName));

And finally, we pass it to the List‘s new sort() method, which is a default method implemented directly on the List interface:

default void sort(Comparator<? super E> c) {
    Collections.sort(this, c);
}

Workaround for the above limitation

While Java’s type inference “limitations” can turn out to be a bit frustrating, we can work around type inference by creating a generic IdentityComparator:

class Utils {
    static <E> Comparator<E> compare() {
        return (e1, e2) -> 0;
    }
}

With the above compare() method, we can write the following fluent comparator chain:

people.sort(
    Utils.<Person>compare()
         .thenComparing((p, o) -> 
              p.lastName.compareTo(o.lastName))
         .thenComparing((p, o) -> 
              p.firstName.compareTo(o.firstName))
);

people.forEach(System.out::println);

Extracting keys

This can get even better. Since we’re usually comparing the same POJO / DTO value from both Comparator arguments, we can provide them to the new APIs through a “key extractor” function. This is how it works:

people.sort(Utils.<Person>compare()
      .thenComparing(p -> p.lastName)
      .thenComparing(p -> p.firstName));
people.forEach(System.out::println);

So, given a Person p we provide the API with a function extracting, for instance, p.lastName. And in fact, once we use key extractors, we can omit our own utility method, as the libraries also have a comparing() method to initiate the whole chain:

people.sort(
    Comparator.comparing((Person p) -> p.lastName)
          .thenComparing(p -> p.firstName));
people.forEach(System.out::println);

Again, we need to help the compiler as it cannot infer all types, even if in principle, the sort() method would provide enough information in this case. To learn more about Java 8’s generalized type inference, see our previous blog post.

Conclusion

As with Java 5, the biggest improvements of the upgrade can be seen in the JDK libraries. When Java 5 brought typesafety to Comparators, Java 8 makes them easy to read and write (give or take the odd type inference quirk).

Java 8 is going to revolutionise the way we program, and next week, we will see how Java 8 impacts the way we interact with SQL.

More on Java 8

In the mean time, have a look at Eugen Paraschiv’s awesome Java 8 resources page

Faster SQL Pagination with jOOQ Using the Seek Method

Last week, I have blogged about why it is important to stay in control of your SQL, as writing good SQL helps keeping your operations costs down. This is true in many ways and today, we’re going to look into another way to write good, high-performing SQL: Using the “Seek Method”.

Slow OFFSET

In order to understand the Seek Method, let’s first understand what problem it solves: SQL OFFSET clauses are slow. They’re slow for a simple reason. In order to reach a high offset from a result set, all previous records have to be skipped and counted. While a query with no OFFSET can be very fast (using MySQL syntax):

SELECT first_name, last_name, score
FROM players
WHERE game_id = 42
ORDER BY score DESC
LIMIT 10;

Skipping to page number 10’000 will be much slower:

SELECT first_name, last_name, score
FROM players
WHERE game_id = 42
ORDER BY score DESC
LIMIT 10
OFFSET 100000;

Even if the tuple (game_id, score) is indexed, we’ll have to actually traverse the whole index in order to count how many records we’ve already skipped. While this problem can be somewhat lessened by a trick, joining players to a derived table, there is an alternative, much faster approach to tackling paging: the Seek Method.

The Seek Method

While it is not quite clear who originally invented the Seek Method (some also call it “keyset paging”), a very prominent advocate for it is Markus Winand. He describes the Seek Method on his blog (and in his book):

http://use-the-index-luke.com/sql/partial-results/fetch-next-page

Essentially, the Seek Method does not skip records before an OFFSET, but it skips records until the last record previously fetched. Think about paging on Google. From a usability point of view, you hardly ever skip exactly 100’000 records. You mostly want to skip to the next page and then again, to the next page, i.e. just past the last record / search result previously fetched. Take the following top 10 players (fake names generated with name generator):

first_name | last_name | score
------------------------------
Mary       | Paige     |  1098
Tracey     | Howard    |  1087
Jasmine    | Butler    |  1053
Zoe        | Piper     |  1002
Leonard    | Peters    |   983
Jonathan   | Hart      |   978
Adam       | Morrison  |   976
Amanda     | Gibson    |   967
Alison     | Wright    |   958
Jack       | Harris    |   949

The above are the first 10 players ordered by score. This can be achieved quite quickly using LIMIT 10 only. Now, when skipping to the next page, you can either just use an OFFSET 10 clause, or you skip all users with a score higher than 949:

SELECT first_name, last_name, score
FROM players
WHERE game_id = 42
-- Let's call this the "seek predicate"
AND score < 949
ORDER BY score DESC
LIMIT 10;

This will then give you the players on the next page:

first_name | last_name | score
------------------------------
William    | Fraser    |   947
Claire     | King      |   945
Jessica    | McDonald  |   932
...        | ...       |   ...

Note that the previous query assumes that the score is unique within the players table, which is unlikely, of course. If William Fraser also had 949 points, just as Jack Harris, the last player on the first page, he would be “lost between pages”. It is thus important to create a non-ambiguous ORDER BY clause and “seek predicate”, by adding an additional unique column:

SELECT player_id, first_name, last_name, score
FROM players
WHERE game_id = 42
-- assuming 15 is Jack Harris's player_id
AND (score, player_id) < (949, 15)
ORDER BY score DESC, player_id DESC
LIMIT 10;

Now, the “seek predicate” depends on the ORDER BY clause. Here are a couple of possible, alternative configurations:

-- "consistent" ASC and DESC correspond to > and <
AND (score, player_id) > (949, 15)
ORDER BY score ASC, player_id ASC

-- "mixed" ASC and DESC complicate things a bit
AND ((score < 949)
  OR (score = 949 AND player_id > 15))
ORDER BY score DESC, player_id ASC

-- The above might be further performance-tweaked
AND (score <= 949)
AND ((score < 949)
  OR (score = 949 AND player_id > 15))
ORDER BY score DESC, player_id ASC

If columns in the ORDER BY clause are nullable, NULLS FIRST and NULLS LAST might apply and further complicate the “seek predicate”.

How is this better than OFFSET?

The Seek Method allows for avoiding expensive “skip-and-count” operations, replacing them with a simple range scan on an index that might cover the “seek predicate”. Since you’re applying ORDER BY on the columns of the “seek predicate” anyway, you might have already chosen to index them appropriately.

While the Seek Method doesn’t improve queries for low page numbers, fetching higher page numbers is significantly faster as proven in this nice benchmark:

Reproduced from use-the-index-luke.com with permission by Markus Winand

More interesting feedback on the subject can be found in this reddit.com thread, in which even Tom Kyte himself added a couple of remarks.

A side effect of the Seek Method

A side effect of the Seek Method is the fact that the paging is more “stable”. When you’re about to display page 2 and a new player has reached page 1 in the mean time, or if any player is removed entirely, you will still display the same players on page 2. In other words, when using the Seek Method, there is no guarantee that the first player on page 2 has rank 11.

This may or may not be desired. It might be irrelevant on page 10’000, though.

jOOQ 3.3 support for the Seek Method

The upcoming jOOQ 3.3 (due for late 2013) will include support for the Seek Method on a SQL DSL API level. In addition to jOOQ’s existing LIMIT .. OFFSET support, a “seek predicate” can then be specified through the synthetic SEEK clause (similar to jOOQ’s synthetic DIVIDE BY clause):

DSL.using(configuration)
   .select(PLAYERS.PLAYER_ID,
           PLAYERS.FIRST_NAME,
           PLAYERS.LAST_NAME,
           PLAYERS.SCORE)
   .from(PLAYERS)
   .where(PLAYERS.GAME_ID.eq(42))
   .orderBy(PLAYERS.SCORE.desc(),
            PLAYERS.PLAYER_ID.asc())
   .seek(949, 15) // (!)
   .limit(10)
   .fetch();

Instead of explictly phrasing the “seek predicate”, just pass the last record from the previous query, and jOOQ will see that all records before and including this record are skipped, given the ORDER BY clause.

This appears much more readable than the actual SQL rendered because the “seek predicate” is closer to the ORDER BY clause where it belongs. Also, jOOQ’s usual row value typesafety is applied here helping you find the right degree / arity and data types for your SEEK clause. In the above example, the following method calls would not compile in Java:

// Not enough arguments in seek()
   .orderBy(PLAYERS.SCORE.desc(),
            PLAYERS.PLAYER_ID.asc())
   .seek(949)

// Wrong argument types in seek()
   .orderBy(PLAYERS.SCORE.desc(),
            PLAYERS.PLAYER_ID.asc())
   .seek(949, "abc")

Get to work with the Seek Method

With native API support for a SEEK clause, you can get in control of your SQL again and implement high-performing SQL quite easily. Early adopters can already play around with the current state of jOOQ’s 3.3.0 Open Source Edition, which is available on GitHub.

And even if you don’t use jOOQ, give the Seek Method a try. You may just have a much faster application afterwards!