Something is very wrong with the above example. Can you see it? No? Let me rename those variables for you.
someCollection .stream() .map(e -> someFunction(e)) .collect(Collectors.toList()) .subList(0, 2);
Better now? Exactly. The above algorithm is O(N) when it could be O(1):
hugeCollection .stream() .map(e -> superExpensiveMapping(e)) .collect(Collectors.toList()) .subList(0, 2);
(Let’s assume the lack of explicit ordering is irrelevant) I’m working mostly with SQL and helping companies tune their SQL (check out our training, btw) and as such, I’m always very eager to reduce the algorithmic complexity of queries. To some people, this seems like wizardry, but honestly, most of the time, it just boils down to adding a well placed index. Why? Because an index reduces the algorithmic complexity of a query algorithm from O(N) (scanning the entire table for matches) to O(log N) (scanning a B-tree index for the same matches).
hugeCollection .stream() .limit(2) .map(e -> superExpensiveMapping(e)) .collect(Collectors.toList());
Sidenote: Reducing algorithmic complexity is almost never premature optimisation. As your data set grows, bad complexity will always bite you!The same is true for the above examples. Why would you ever traverse and transform an entire list of a huge amount of elements (N) with an expensive operation (superExpensiveMapping), when you really only need to do this only for the first two values?
ConclusionSQL is a declarative language where the query optimiser gets this right automatically for you: It will (almost) always filter the data first (
WHEREclause), and only then transform it (
JOIN, GROUP BY, SELECT, etc). Just as in SQL, when we hand-write our queries using Streams (and also in imperative programming), do always:
Filter First, Map LaterYour production system will thank you. Note, another interesting case with
flatMap()is documented here.
4 thoughts on “A Basic Programming Pattern: Filter First, Map Later”
You can filter after the map and have the same results since streams evaluate lazily. It just has to be filtered before the collect.
Filtering after the collect is like using HAVING instead of WHERE in SQL it works but you should use WHERE if possible.
Yes sure, that optimisation works in this case, but it might not work in other cases, (e.g. the current
flatMap()implementation may cause issues).
I simply didn’t want to rely on any such optimisation in this post, which would just be a distraction from the main argument.
Your example only shows that bad programming (i.e. using subList instead of stream method to filter), leads to bad performance. It does not show that you should “Filter First, Map Later”. If flatMap has this issue, then use it as an example.
The flatMap example is now available in this follow up blog post: https://blog.jooq.org/2017/07/03/are-java-8-streams-truly-lazy-not-completely