How to Pattern-Match Files and Display Adjacent Lines in Java


We recently published an article about the awesome window function support in jOOλ 0.9.9, which I believe is one of the best additions to the library that we’ve ever made.

Today, we’ll look into an awesome application of window functions in a use case inspired by this Stack Overflow question by Sean Nguyen:

How to get lines before and after matching from java 8 stream like grep?

I have a text files that have a lot of string lines in there. If I want to find lines before and after a matching in grep, I will do like this:

grep -A 10 -B 10 "ABC" myfile.txt

How can I implement the equivalent in java 8 using streams?

So the question is:

How can I implement the equivalent in Java 8 using streams?

Well, the Unix shell and its various “pipable” commands are about the only things that are even more awesome (and mysterious) than window functions. Being able to grep for a certain string in a file, and then display a “window” of a couple of lines is quite useful.

With jOOλ 0.9.9, however, we can do that very easily in Java 8 as well. Consider this little snippet:

Seq.seq(Files.readAllLines(Paths.get(
        new File("/path/to/Example.java").toURI())))
   .window()
   .filter(w -> w.value().contains("ABC"))
   .forEach(w -> {
       System.out.println();
       System.out.println("-1:" + w.lag().orElse(""));
       System.out.println(" 0:" + w.value());
       System.out.println("+1:" + w.lead().orElse(""));
       // ABC: Just checking
   });

This program will output:

-1: .window()
 0: .filter(w -> w.value().contains("ABC"))
+1: .forEach(w -> {

-1:     System.out.println("+1:" + w.lead().orElse(""));
 0:     // ABC: Just checking
+1: });

So, I’ve run the program on itself and found all the lines that match “ABC”, plus the previous line (“lagging” / lag()) and the following line (“leading” / lead()). These lead() and lag() functions work just like their SQL equivalents.
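For readers who want to see what lag() and lead() mean without jOOλ on the classpath, here is a minimal plain-JDK sketch over a list of lines. The class LagLeadSketch and its helper at() are hypothetical names made up for this illustration; this is not how jOOλ implements its windows.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class LagLeadSketch {

    // Returns the line `offset` positions away from index i,
    // or an empty Optional when that position is outside the list.
    static Optional<String> at(List<String> lines, int i, int offset) {
        int j = i + offset;
        return (j >= 0 && j < lines.size())
            ? Optional.of(lines.get(j))
            : Optional.empty();
    }

    // The previous line relative to index i, like SQL's LAG()
    static Optional<String> lag(List<String> lines, int i) {
        return at(lines, i, -1);
    }

    // The next line relative to index i, like SQL's LEAD()
    static Optional<String> lead(List<String> lines, int i) {
        return at(lines, i, +1);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("first", "ABC here", "last");

        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).contains("ABC")) {
                System.out.println("-1:" + lag(lines, i).orElse(""));
                System.out.println(" 0:" + lines.get(i));
                System.out.println("+1:" + lead(lines, i).orElse(""));
            }
        }
    }
}
```

The Optional return type mirrors the orElse("") calls in the snippet above: the first line has no lagging neighbour and the last line has no leading one.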

But unlike in SQL, composing functions in Java (or other general purpose languages) is a bit simpler as there is less syntax clutter involved. We can easily do aggregations over a window frame to collect an arbitrary number of lines “lagging” and “leading” a match. Consider the following alternative:

int lower = -5;
int upper =  5;
        
Seq.seq(Files.readAllLines(Paths.get(
        new File("/path/to/Example.java").toURI())))
   .window(lower, upper)
   .filter(w -> w.value().contains("ABC"))
   .map(w -> w.window()
              .zipWithIndex()
              .map(t -> tuple(t.v1, t.v2 + lower))
              .map(t -> (t.v2 > 0 
                       ? "+" 
                       : t.v2 == 0 
                       ? " " : "") 
                       + t.v2 + ":" + t.v1)
              .toString("\n"))
   .forEach(System.out::println);

And the output that we’re getting is this:

-5:int upper =  5;
-4:        
-3:Seq.seq(Files.readAllLines(Paths.get(
-2:        new File("/path/to/Example.java").toURI())))
-1:   .window(lower, upper)
 0:   .filter(w -> w.value().contains("ABC"))
+1:   .map(w -> w.window()
+2:              .zipWithIndex()
+3:              .map(t -> tuple(t.v1, t.v2 + lower))
+4:              .map(t -> (t.v2 > 0 
+5:                       ? "+" 

Could it get any more concise? I don’t think so. Most of the logic above was just generating the index next to the line.
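One caveat: when a match sits close to the start of the file, the frame holds fewer preceding lines than requested, so an offset derived from the position inside the frame plus the lower bound no longer lines up with the match. The following plain-JDK sketch (no jOOλ involved; GrepContextSketch and context() are hypothetical names for this illustration) derives each offset from the position of the match instead:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GrepContextSketch {

    // For every line containing `needle`, collects the lines from offset
    // `lower` to offset `upper` around the match, each labelled with its
    // true offset relative to the matching line.
    static List<String> context(List<String> lines, String needle, int lower, int upper) {
        List<String> out = new ArrayList<>();

        for (int i = 0; i < lines.size(); i++) {
            if (!lines.get(i).contains(needle))
                continue;

            // Clamp the frame to the bounds of the file
            int from = Math.max(0, i + lower);
            int to = Math.min(lines.size() - 1, i + upper);

            for (int j = from; j <= to; j++) {
                // The offset comes from the match position i,
                // not from the position inside the (possibly shortened) frame
                int offset = j - i;
                String sign = offset > 0 ? "+" : offset == 0 ? " " : "";
                out.add(sign + offset + ":" + lines.get(j));
            }
        }

        return out;
    }

    public static void main(String[] args) {
        // The match is on the very first line, so there are no lagging lines
        List<String> lines = Arrays.asList("// ABC", "second", "third");
        context(lines, "ABC", -5, 5).forEach(System.out::println);
    }
}
```

With the match on the first line, the labels start at 0 rather than at -5, and the following lines are labelled +1 and +2.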

Conclusion

Window functions are extremely powerful. The recent discussion on Reddit about our previous article on jOOλ’s window function support has shown that other languages also support primitives to build similar functionality. But usually, these building blocks aren’t as concise as the ones exposed in jOOλ, which are inspired by SQL.

With jOOλ mimicking SQL’s window functions, there is very little cognitive friction when composing powerful operations on in-memory data streams.


10 thoughts on “How to Pattern-Match Files and Display Adjacent Lines in Java”

  1. Hi Lukas,
    I like the window functions, sliding has no lag/lead.
    Is it right that the second example prints

    
    -5 // ABC
    -4 ...
    -3 ...
    -2 ...
    -1 ...
     0 ...
    

    when I place ‘// ABC’ in the first line. I expected lines from -5 to +5, but I don’t know the SQL behavior exactly.
    Greets,
    Daniel

    • You’re right, those indexes are incorrect. The window frame bounds (lower, upper) are implemented correctly, but using the lower frame bound for the index calculation is wrong.

      Exercise: How to do it right? 🙂

      • I think I have to fully understand window() first. I already looked at jOOL’s source code. I currently don’t have the whole picture. E.g. the partition part within Window… Do you have good docs on this topic? Google is not very helpful – the search results require background knowledge.

        • A window defines a subset of the data produced by the table specification (usually FROM, WHERE, GROUP BY, HAVING) to calculate rankings or aggregations upon, in the context of the current value. Translated to a Stream, this means the whole stream upon which window() is called.

          Now the window specification has three parts:

          1. The partition. Only rows that are in the same partition as the current row are considered for the window. This is always optional. If left blank, the whole table specification / stream is a single partition.
          2. The order. The partition is ordered by this specification. The order is mandatory for rankings and optional for aggregations.
          3. The frame. The partition can be limited to certain offsets relative to the current row given the previous ordering. This clause is always optional. If left blank, the frame spans the whole partition if no order is provided, or it spans all the “previous” rows if an order is provided.

          It takes a bit of practice to fully grasp, but then, it’s really easy and extremely powerful. A lot of FP concepts can be expressed much more simply using window functions.

          This is also a good article:
          http://tapoueh.org/blog/2013/08/20-Window-Functions
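The three parts of a window specification described above can be sketched with the plain JDK. This is only an illustration of the concept, not jOOλ’s implementation: the Row type, the class WindowSpecSketch and the method frameSums() are hypothetical names, and the aggregation is a simple sum over the frame.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class WindowSpecSketch {

    // A hypothetical row type, used only for this illustration
    static final class Row {
        final String group;
        final int value;

        Row(String group, int value) {
            this.group = group;
            this.value = value;
        }
    }

    // For each row, sums `value` over the frame [lower, upper] around that row,
    // within the row's partition (by group), ordered by value.
    static Map<String, List<Integer>> frameSums(List<Row> rows, int lower, int upper) {
        // 1. The partition: only rows of the same group see each other
        Map<String, List<Row>> partitions = rows.stream()
            .collect(Collectors.groupingBy(r -> r.group, TreeMap::new, Collectors.toList()));

        Map<String, List<Integer>> result = new TreeMap<>();
        partitions.forEach((group, partition) -> {
            // 2. The order: sort the partition before looking at neighbours
            List<Row> ordered = new ArrayList<>(partition);
            ordered.sort(Comparator.comparingInt((Row r) -> r.value));

            // 3. The frame: offsets relative to the current row, clamped to the partition
            List<Integer> sums = new ArrayList<>();
            for (int i = 0; i < ordered.size(); i++) {
                int from = Math.max(0, i + lower);
                int to = Math.min(ordered.size() - 1, i + upper);
                int sum = 0;
                for (int j = from; j <= to; j++)
                    sum += ordered.get(j).value;
                sums.add(sum);
            }
            result.put(group, sums);
        });

        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row("a", 3), new Row("a", 1), new Row("b", 10), new Row("a", 2));

        // Frame of "previous row and current row" within each group
        System.out.println(frameSums(rows, -1, 0));
    }
}
```

With a frame of (-1, 0), group "a" is ordered as 1, 2, 3 and yields the sums 1, 3, 5, while group "b" has a single row and yields 10.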

  2. I think I understood 🙂 Lovely feature!

    This works for me:

    
    int lower = -5;
    int upper = 5;
    
    Seq.seq(Files.readAllLines(Paths.get(
        new File("/path/to/Example.java").toURI())))
        .window(lower, upper)
        .filter(w -> w.value().contains("ABC"))
        .forEach(w -> {
            System.out.println();
            for (int i = lower; i <= upper; i++) {
                String index = ((i > 0) ? "+" : (i == 0) ? " " : "") + i;
                String value = (i < 0) ? w.lag(-i).orElse("") :
                               (i > 0) ? w.lead(i).orElse("") : w.value();
                System.out.println(index + ":" + value);
            }
            // ABC: Just checking
        });
    
    • Excellent. Or, we could even skip displaying those lines…

      Anyway, indeed. SQL window functions are something really really nice. One of my favourite examples:
      https://blog.jooq.org/2015/11/07/how-to-find-the-longest-consecutive-series-of-events-in-sql/

      I understand that FP people don’t appreciate them very much, because they’re not idiomatic (see also comments here: https://redd.it/3zqcdu and here: https://redd.it/3zqcb4). “Not idiomatic” is, of course, just another way of saying: “This is not my religion, and I don’t / refuse to understand it, so it must be wrong.” 🙂

      Note, you can also pass several window specifications to window() using window(WindowSpecification.of(...), WindowSpecification.of(...)). This allows for performing operations on different partitions / orderings, etc. at the same time.

      Note: The current implementation in jOOλ is not optimal. A lot of caching still needs to be done, so this won’t work well yet on large data sets (unlike SQL window functions, which perform really really nicely).

        • That would be a great opportunity to meet you in real life! However, as you may have heard, my employer will be privatized soon and is currently not investing in innovation, it is rather sanitizing…
