LINQ and Java

LINQ has been quite a successful, but also controversial addition to the .NET ecosystem. Many people are looking for a comparable solution in the Java world. To better understand what a comparable solution could be, let’s have a look at the main problem that LINQ solves: Query languages are often declarative programming languages with many keywords. They offer few control-flow elements, yet they are highly descriptive. The most popular query language is SQL, the ISO/IEC standardised Structured Query Language, mostly used for relational databases. Declarative programming means that programmers do not explicitly phrase out their algorithms. Instead, they describe the result they would like to obtain, leaving algorithmic calculus to their implementing systems. Some databases have become very good at interpreting large SQL statements, applying SQL language transformation rules based on language syntax and metadata. An interesting read is Tom Kyte’s metadata matters, hinting at the incredible effort that has been put into Oracle’s Cost-Based Optimiser. Similar papers can be found for SQL Server, DB2 and other leading RDBMS.

LINQ-to-SQL is not SQL

LINQ is an entirely different query language that allows to embed declarative programming aspects into .NET languages, such as C#, or ASP. The nice part of LINQ is the fact that a C# compiler can compile something that looks like SQL in the middle of C# statements. In a way, LINQ is to .NET what SQL is to PL/SQL, pgplsql or what jOOQ is to Java (see my previous article about PL/Java). But unlike PL/SQL, which embeds the actual SQL language, LINQ-to-SQL does not aim for modelling SQL itself within .NET. It is a higher-level abstraction that keeps an open door for attempting to unify querying against various heterogeneous data stores in a single language. This unification will create a similar impedance mismatch as ORM did before, maybe an even bigger one. While similar languages can be transformed into each other to a certain extent, it can become quite difficult for an advanced SQL developer to predict what actual SQL code will be generated from even very simple LINQ statements.

LINQ Examples

This gets more clear when looking at some examples given by the LINQ-to-SQL documentation. For example the Count() aggregate function:

System.Int32 notDiscontinuedCount =
    (from prod in db.Products
    where !prod.Discontinued
    select prod)
    .Count();

Console.WriteLine(notDiscontinuedCount);

In the above example, it is not immediately clear if the .Count() function is transformed into a SQL count(*) aggregate function within the parenthesised query (then why not put it into the projection?), or if it will be applied only after executing the query, in the application memory. The latter would be prohibitive, if a large number or records would need to be transferred from the database to memory. Depending on the transaction model, they would even need to be read-locked! Another example is given here where grouping is explained:

var prodCountQuery =
    from prod in db.Products
    group prod by prod.CategoryID into grouping
    where grouping.Count() >= 10
    select new
    {
        grouping.Key,
        ProductCount = grouping.Count()
    };

In this case, LINQ models its language aspects entirely different from SQL. The above LINQ where clause is obviously a SQL HAVING clause. into grouping is an alias for what will be a grouped tuple, which is quite a nice idea. This does not directly map to SQL, though, and must be used by LINQ internally, to produce typed output. What’s awesome, of course, are the statically typed projections that can be reused afterwards, directly in C#! Let’s look at another grouping example:

var priceQuery =
    from prod in db.Products
    group prod by prod.CategoryID into grouping
    select new
    {
        grouping.Key,
        TotalPrice = grouping.Sum(p => p.UnitPrice)
    };

In this example, C#’s functional aspects are embedded into LINQ’s Sum(p => p.UnitPrice) aggregate expression. TotalPrice = ... is just simple column aliasing. The above leaves me with lots of open questions. How can I control, which parts are really going to be translated to SQL, and which parts will execute in my application, after a SQL query returns a partial result set? How can I predict whether a lambda expression is suitable for a LINQ aggregate function, and when it will cause a huge amount of data to be loaded into memory for in-memory aggregation? And also: Will the compiler warn me that it couldn’t figure out how to generate a C#/SQL algorithm mix? Or will this simply fail at runtime?

To LINQ or not to LINQ

Don’t get me wrong. Whenever I look inside the LINQ manuals for some inspiration, I have a deep urge to try it in a project. It looks awesome, and well-designed. There are also lots of interesting LINQ questions on Stack Overflow. I wouldn’t mind having LINQ in Java, but I want to remind readers that LINQ is NOT SQL. If you want to stay in control of your SQL, LINQ or LINQesque APIs may be a bad choice for two reasons:
  1. Some SQL mechanisms cannot be expressed in LINQ. Just as with JPA, you may need to resort to plain SQL.
  2. Some LINQ mechanisms cannot be expressed in SQL. Just as with JPA, you may suffer from severe performance issues, and will thus resort again to plain SQL.
Beware of the above when choosing LINQ, or a “Java implementation” thereof! You may be better off, using SQL (i.e. JDBC, jOOQ, or MyBatis) for data fetching and Java APIs (e.g. Java 8’s Stream API) for in-memory post-processing

LINQ-like libraries modelling SQL in Java, Scala

LINQ-like libraries abstracting SQL syntax and data stores in Java, Scala

15 thoughts on “LINQ and Java

  1. You are correct that initially it may not be obvious what SQL syntax the LINQ expressions will translate into.

    However, LINQ to Entities has a very clear demarcation: Expressions or sub-expressions that contain free variables *must* consist of methods/operators that translate into SQL constructs. If a (sub)expression consists entirely of bound variables (i.e. it can be evaluated once for the entire LINQ expression), it is substituted for the evaluated value.

    For instance db.Orders.Where(x => x.TotalAmount > threshold+1000) is allowed when threshold is a variable with a bound value (x.TotalAmount is a free variable)

    db.Orders.Where(x => x.TotalAmount > CalculateThreshold(x.Customer)) is not allowed because CalculateThreshold does not convert to a SQL function and it is not fully bound (it takes a free variable as argument and thus can only be evaluated for each order).

    When using LINQ you quickly learn to recognize these patterns. If you want to filter using C# defined functions there’s no way around pulling data into memory and perform filter there. A number of LINQ operators causes LINQ to Entities to evaluate the expression. A common operator is .ToList(). Inserting a .ToList() like db.Orders(x => x.Customer.Type==2).ToList().Where(x => MeetsThreshold(x)) is allowed because following ToList() it is not LINQ to Entities anymore, rather it is LINQ to Objects.

    And yes, x => x.Customer.Type==2 is a legal LINQ to Entities expression – which will generate a subquery.

    Support for most operators that map to SQL operators have been implemented, as well as a number of functions. String predicate functions like Contains(…), StartsWith(), EndsWith() will translate to “like” expressions in the database specific syntax (using * or % or whatever for wildcards).

    Likewise (and very useful) the collection method Contains() method can be used to select entities where e,g, the ID is in a given list of IDs. No more unwieldy long OR sequences or strange string encodings. To select customers with specific IDs: db.Customers.Where(x => new[]{1,2,3,5,8,13}.Contains(x.ID))

    1. Thanks for all that insight. No doubt, there is a lot of usefulness in LINQ and I’m sure that one can easily learn how to query efficiently. The added typesafety is awesome, and as I said, sometimes I wish I had LINQ in Java.

      Just as with JPQL, you can learn what SQL syntax a query will translate to. But you don’t always want that… Sometimes, you want to write typesafe SQL from the beginning. That was the point I was trying to make.

      As always, it’s not a black-and-white situation.

      1. Yes. And you do want to inspect the generated SQL if you make complicated queries. While the LINQ to Entities provider is quite clever (it can tell if you late in the fluent method chain create projections that only use a subset of the columns), sometimes you need to help it a little.

        Vice versa, sometimes with a little help, LINQ to Entities will write really clever and highly optimized SQL that you didn’t want to write yourself.

        As an example of this is the db.Orders.Include(x => x.LineItems).Where(x => x.OrderID == 42). The “Include” method will prompt L2E to create a left outer join and ensure that the order and all of the line items are fetched at once. Now, if you wrote something like that by hand you would need to sort/group afterwards to untangle the order columns/rows from the lineitem columns/rows.

    2. Yeah, automatic left outer joins. Obscured by an Include().Where() method chain. Thanks for mentioning this. It’s precisely why I wrote this article. Include().Where() is just not SQL. Why not just explicitly write the left outer join?

      1. Because in cases like that I do not want to write outer joins!

        I want to be able to express that I am going to need the lineitems as well but I do not want to dip twice. In that case the outer join becomes a trick, an implementation detail that I want to hide in the same way as SQL hides how joins are performed (nested loops, merge join, hash joins).

        Some outer joins I want to be explicit about. In that case I would use the LINQ SelectMany() method. But in other cases I want to express the desired outcome and have some control over the strategy (I want to *include* the query for lineitems in the same query). What I do NOT want is having to spell out exactly *how* that should be done. Focusing on what and not how is actually much more in the spirit of SQL (a declarative language). Spelling out the left outer join in this case would be comparable to controlling the query plan using hints. You can do it, but you really should not.

        As I said, if you wrote the outer join yourself you would then *also* have to untangle the rectangular result afterwards, finding the distinct orders and sorting/grouping the lineitems. I hate writing code like that. It is unwieldy and error prone and cloaks the real prupose: A simple hint about optimization.

        Indeed the Include() method *need not* result in an outer join. What I state by using Include() is that I want it included within the same query. If the underlying database supports command batching, LINQ could very well use that to query first the orders and then the lineitems for those orders in the same batch and return two result sets.

        It comes down to this: I want to express *intent*. How the underlying system meets that intent is detail. Of course, it has to be robust, stable and not surprising. Having worked a lot with LINQ I can tell you that I have very much confidence in how LINQ to Entities work for me.

        You are correct that LINQ is not SQL. That is a very important point. But you use the LINQ abstraction in the same way as you use the SQL abstraction, and for many intents and purposes they are equivalent. The point is, though, that I don’t really care. Once you pick up LINQ you do not want to write your queries in SQL anymore. You should be able to *read* SQL to inspect that the generated SQL is not suboptimal, like you should be able to read a query plan.

        In many ways LINQ I feel that LINQ is what SQL should have been (the way outer joins are expressed, the way you can introduce and use local variables instead of endlessly repeating the same expression, as just two examples).

    3. OK, you do have a point there. Well, we can hope that LINQ’s good parts have an influence on SQL in the way Scala has on Java 8.

      Specifically, being able to define a variable for an expression is quite nice. CTE are a start, but having to repeat column sublists in GROUP BY and SELECT clauses is a bit of a pain, indeed.

  2. Do also note that while LINQ is enabled by a number of language features (specifically extension methods, expression trees, lambdas, anonymous structural types), the *implementation* of LINQ applications are library features.

    LINQ to Entities, LINQ to SQL, LINQ to Objects, LINQ to XML are all implemented as library features. This is important because you can *extend* them if you so wish.

    You can also use the fact that LINQ to Entities is based on expression trees. You can create functions like

    private Expression<Func> GoodCustomer() {
    return x=>x.Orders.Total.Sum() > 100000;
    }

    and use it to compose customer queries like db.Customers.Where( x => x.Type==42).Where(GoodCustomer())

    The community has produced a number of other LINQ providers:
    LINQ to Twitter
    LINQ to Facebook Query Language
    LINQ to Google
    LINQ to Bing
    LINQ to Amazon
    LINQ to EBay
    LINQ to JSON
    LINQ to Excel

    1. Yes, yes, I understand the marketing and community building efforts done by Microsoft, and they sure do have their merits. In fact, I cannot stress enough how impressed I am by these features. But LINQ is still not SQL :-) … just as it isn’t XQuery / XPath / XSL for XML, for instance.

      Just as JPQL in Java has its place, where you want to “decently” query various relational data stores. But JPQL is not SQL.

      1. Sorry if I’m late to this thread, but I just want to ask why you seem to be hung up on the fact that LINQ is just not SQL? Does it matter? The whole point of abstraction is that you obscure the inner workings of something and make it simpler to use. yes LINQ is not SQL, but so what? Sorry, I just needed to ask!

        1. In my opinion, SQL and collections querying (or XML querying for that matter) are entirely different beasts. Every abstraction is to some extent leaky. An API that tries to unify querying of so many “backends” will be leaky as well.

          This may not matter to many people, but it does to those who want to write SQL including all the awesome / quirky SQL features that have been added over time. And they cannot do this with LINQ.

          At the bottom of this article, you’ll find an alternative solution that cleanly separates the SQL querying from the collections querying / transformation using the Java 8 Streams API, which was designed with a collections-only focus.

            1. Exactly. What is mainly being criticised here is that the closeness to SQL is a “false friend”. I personally believe the API might’ve been better if it wasn’t so close to SQL. People get drawn into thinking that it replaces SQL, which it doesn’t.

Leave a Reply