jOOQ Tuesdays: Mario Fusco Talks About Functional and Declarative Programming

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.


I’m very excited to feature Mario Fusco today: the author of Lambdaj, a developer on Red Hat’s Drools, a Java Champion, and a frequent speaker at Java conferences on all topics related to functional programming.

Mario, a long time ago I stumbled upon your name when looking up the author of Lambdaj – a library that went to extremes to bring lambdas to Java 5 and earlier. How does it work? And what’s the most peculiar hack you implemented to make it work?

When I started developing Lambdaj in 2007, I thought of it just as a proof of concept to check how far I could push Java 5. I never expected that it could become something that anybody other than myself might actually want to use. In reality, given the limited, or I should say non-existent, capabilities of Java 5 as a functional language, Lambdaj was entirely a big hack. Despite this, people started using and somewhat loving it, and this made me (and possibly somebody else) realize that Java developers, or at least part of them, were tired of the purely imperative paradigm imposed by the language and ready to experiment with something more functional.

The main feature of Lambdaj, and what made its DSL quite nice to use, was the ability to reference a method of a class in a static and type safe way and pass it to another method. In this way you could, for example, sort a list of persons by their age with something like:

sort(persons, on(Person.class).getAge());

As anticipated, what happened under the hood was a big hack: the on() method created a proxy of the Person class, so you could safely call the getAge() method on it. The proxy didn’t do anything useful other than registering the method call. However, it had to return something of the same type as the value returned by the actual method, to avoid a ClassCastException. For this purpose it had a mechanism to generate a reasonably unique instance of that type, an int in my example. Before returning that value, it also associated it, using a WeakHashMap, with the invoked method. In this way the sort() method was actually invoked with the list and the value generated by my proxy. It then retrieved from the map the Java method associated with that value and invoked it on all the items of the list, performing the operation, a sorting in this case, that it was supposed to execute.
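For the curious, here’s a minimal sketch of that mechanism (the names are hypothetical, not the actual Lambdaj internals). For simplicity it assumes Person is an interface with an int getter, so a JDK dynamic proxy suffices; the real library used bytecode generation to proxy concrete classes and could generate placeholder values for other return types:

import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class OnDemo {

    interface Person {
        int getAge();
    }

    // Placeholder values, mapped back to the method call they stand for
    static final Map<Object, Method> RECORDED = new WeakHashMap<>();

    // Source of "reasonably unique" int placeholders
    static final AtomicInteger SEQUENCE = new AtomicInteger(1_000_000);

    // on() returns a proxy whose only job is to record the invoked method
    // and return a placeholder of a compatible type (an int in this sketch)
    @SuppressWarnings("unchecked")
    static <T> T on(Class<T> type) {
        return (T) Proxy.newProxyInstance(
            type.getClassLoader(),
            new Class<?>[] { type },
            (proxy, method, args) -> {
                Integer placeholder = SEQUENCE.incrementAndGet();
                RECORDED.put(placeholder, method);
                return placeholder; // unboxed to int by the proxy
            });
    }

    // sort() receives the placeholder, looks up the recorded getter,
    // and sorts the list by invoking that getter on every element
    @SuppressWarnings("unchecked")
    static <T> void sort(List<T> list, Object placeholder) {
        Method getter = RECORDED.get(placeholder);
        list.sort(Comparator.comparing((T item) -> {
            try {
                return (Comparable<Object>) getter.invoke(item);
            }
            catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }));
    }
}

// Usage, exactly as in the article:
// sort(persons, on(Person.class).getAge());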

That’s crazy :) I’m sure you’re happy that a lot of Lambdaj features are now deprecated. You’re now touring the world with your functional programming talks. What makes you so excited about this topic?

The whole Lambdaj project is now deprecated and abandoned. The new functional features introduced with Java 8 simply made it obsolete. Nevertheless, it not only had the merit of making developers curious about and interested in functional programming, but also of letting them experiment with new patterns and ideas that in the end also influenced the Java 8 syntax. Take for instance how you can sort a list of persons by age using a method reference:

persons.sort(Comparator.comparing(Person::getAge));

It seems evident that method references were at least partly inspired by Lambdaj’s on() method.

There are a number of things that I love about functional programming:

  1. The readability: a snippet of code written in functional style reads like a story, while too often the equivalent code in imperative style resembles a puzzle.
  2. The declarative nature: in functional programming it is enough to declare the result that you want to achieve, rather than specifying the steps to obtain it. You only care about the what, without getting lost in the details of the how.
  3. The possibility of treating data and behaviors uniformly: functional programming allows you to pass to a method both data (the list of persons to be sorted) and computation (the function to be applied to each person in the list). This idea is fundamental for many algorithms, like for example map/reduce: since data and computation are the same thing, and the computation is typically orders of magnitude smaller than the data, you are free to send it to the machine holding the data instead of the opposite.
  4. The higher level of abstraction: the possibility of encapsulating computations in functions and passing them around to other functions allows both a dramatic reduction of code duplication and the design of more generic and expressive APIs (see the sketch after this list).
  5. Immutability and referential transparency: using immutable values and writing side-effect-free programs makes it far easier to reason about your code, test it, and ensure its correctness.
  6. The parallelism friendliness: all the features listed above also enable the parallelization of your software in a simpler and more reliable way. It is no coincidence that functional programming started becoming more popular around 10 years ago, which is also when multicore CPUs began to be available on commodity hardware.
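As a tiny illustration of points 3 and 4 (with a hypothetical Order type, not from the interview), the totalling algorithm below is written once, and the behavior that varies is passed in as if it were data:

import java.util.List;
import java.util.function.ToDoubleFunction;

class Totals {

    // One generic totalling algorithm; the varying behavior is a parameter
    static <T> double total(List<T> items, ToDoubleFunction<? super T> value) {
        double sum = 0;
        for (T item : items)
            sum += value.applyAsDouble(item);
        return sum;
    }

    // total(orders, Order::netAmount) and total(orders, Order::taxAmount)
    // reuse the same implementation with different behaviors plugged in.
}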

Our readers love SQL (or at least, they use it frequently). How does functional programming compare to SQL?

The most evident thing that FP and SQL have in common is their declarative paradigm. To some extent SQL, or at least the data selection part, can be seen as a functional language specialized to manipulate data in tabular format.

The data modification part is a totally different story though. Most SQL users normally change data in a destructive way, overwriting or even deleting the existing data. This is clearly in contrast with the immutability mantra of functional programming. However, this is only how SQL is most commonly used; nothing dictates that it couldn’t also be employed in a non-destructive, append-only way. I wish to see SQL used more often in this way in the future.

In your day job, you’re working for Red Hat, on Drools. Business rules sound enterprisey. How does that get along with your fondness for functional programming?

From a user’s point of view, a rule engine in general, and Drools in particular, is the most extreme form of declarative programming, second only to Prolog. For this reason developers who are only familiar with the imperative paradigm struggle to use it, because they try to force it to work in an imperative way. Conversely, programmers used to thinking in functional (and thus declarative) terms are more often able to use it correctly when they approach it for the first time.

As for me, my work as a developer on both the core engine and the compiler of Drools allows me to experiment every day in the fields of language design and algorithmic invention and optimization. To cut it short, it’s a challenging job and there’s lots of fun in it: don’t tell this to my employer, but I cannot stop being surprised that they allow me to play with this every day and even pay me for it.

You’re also on the board of VoxxedDays Ticino, Zurich, and CERN (wow, how geeky is that? A Large Hadron Collider Java conference!). Why is Voxxed such a big success for you?

I must admit that, before being involved in this, I didn’t imagine the amount of work that organizing a conference requires. However, this effort is totally rewarded. In particular, the great advantage of VoxxedDays is that they are local one-day events made by developers for developers, which practically anybody can afford.

I remember that the most common feedback I received after the first VoxxedDays Ticino, which we did 2 years ago, was something like: “This has been the very first conference I have attended in my life, and I didn’t imagine it could be such an amazing experience, from both a technical and, even more, a social point of view. Thanks a lot for that, I can’t wait to attend again next year”. Can you imagine something more rewarding for a conference organizer?

The other important thing for me is giving speakers who aren’t rock stars (yet) the chance to talk in public and share their experience with a competent audience. I know that for at least some of them this is only the first step in letting themselves and others discover their capabilities as public speakers, launching them toward bigger conferences like Devoxx.

Thank you very much, Mario!

If you want to learn more about Mario’s insights on functional programming, please do visit his interesting talks at Devoxx from the recent past.

SQL, Streams, For Comprehension… It’s All the Same

Recently, at Devoxx, I saw this beautiful slide in a talk by Kevlin Henney.

In his talk, he was displaying a variety of approaches to solve the FizzBuzz “problem”, including a couple of very elegant solutions in completely declarative approaches and languages.

In this particular slide, Kevlin used a notation that is derived from maths. The set builder notation. Here’s an example from Wikipedia:

{ n ∈ ℤ | ∃ k ∈ ℤ : n = 2k }

The example reads: for all n in ℤ (the set of all integers), take those for which there exists (∃) another integer k for which the following equation is satisfied: n = 2k.

Or in plain English: All even integers. (because for even integers, there exists another integer that is half the even integer)

Beautiful, eh? In imperative programming, we’d probably do something like this instead:

List<Integer> even = new ArrayList<>();
for (int i = /* hmm...? */; i < /* what to put here */; i++)
    even.add(i * 2);

Or this:

List<Integer> even = new ArrayList<>();
for (int i = /* hmm...? */; i < /* what to put here */; i = i + 2)
    even.add(i);

But there are several problems with the imperative approach:

  • We have to realistically start somewhere
  • We have to realistically end somewhere
  • We have to store all values in an intermediate collection

Sure, those aren’t severe limitations in everyday use-cases, because we’re probably solving a real-world problem where we don’t actually need an infinite number of even integers, and storing them in an intermediate collection won’t consume all of our memory. But still, the declarative, mathematical approach is much leaner: we can answer the questions about where to start and where to end later, and we never need to materialise any intermediate collection before we make those final decisions.

For instance, we can declare X to be that set, and then declare Y to be a set that is derived from X, and finally materialise Z, which is a very tiny set derived from Y. For this, we may have never needed to materialise all the (even) integers.
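Here’s a sketch of the same laziness with the Java 8 Streams API (the variable names x, y, z mirror the sets above): nothing is computed or stored until z is finally materialised.

import java.util.Arrays;
import java.util.stream.IntStream;

public class LazySets {
    public static void main(String[] args) {
        IntStream x = IntStream.iterate(0, n -> n + 2); // X: the even integers, lazy and unbounded
        IntStream y = x.map(n -> n * n);                // Y: derived from X, still lazy
        int[] z = y.limit(5).toArray();                 // Z: only these five values are ever produced
        System.out.println(Arrays.toString(z));         // [0, 4, 16, 36, 64]
    }
}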

How this compares to SQL

Kevlin made a cunning comparison. Of course, all functional programming aficionados will immediately recognise that languages like Scala have something called a “for comprehension”, which models precisely the mathematical set-builder notation.

Java 8 now has the Streams API, which allows us, to some extent, to model something similar (although not as powerful). But Kevlin didn’t use those “modern” languages. He used SQL as a comparison. That “arcane” declarative programming language that has been around forever, and that we love so much. Yes, here’s how we can declare all the even numbers in SQL:

SELECT n
FROM integers
WHERE EXISTS (
  SELECT k
  FROM integers
  WHERE n = 2 * k
)

If optimisers were perfect, this semi-self-join between the two references of the integers “table” could be optimised perfectly. In most databases, we’d probably manually transform the above notation to this equivalent one:

SELECT n
FROM integers
WHERE MOD(n, 2) = 0

Yes, indeed. The set-builder notation and the SQL language are very similar beasts. The former prefers using mathematical symbols for brevity and conciseness, the latter prefers using English words to connect the different operators, but it’s the same thing. And if you squint hard enough, you’ll see that Java 8 Streams, for instance, are also pretty much the same thing:


I’ve blogged about this recently where all the Java 8 Streams operations are compared to their SQL clause counterparts:
https://blog.jooq.org/2015/08/13/common-sql-clauses-and-their-equivalents-in-java-8-streams

How is this better?

It’s simple. Both the set-builder notation and the SQL language (and in principle, other languages’ for comprehensions) are declarative. They are expressions, which can be composed into other, more complex expressions, without necessarily executing them.

Remember the imperative approach? We tell the machine exactly what to do:

  • Start counting from this particular minimal integer value
  • Stop counting at this particular maximal integer value
  • Store all even integers in between in this particular intermediate collection

What if we don’t actually need negative integers? What if we just wanted to have a utility that calculates even integers and then reuse that to list all positive integers? Or, all positive integers less than 100? Etc.

In the imperative approach, we have to refactor constantly, to avoid the overhead of

  • Producing too many integers
  • Storing too many integers (or storing them at all)

In truly declarative languages like SQL, we’re just describing “even integers” with an expression, possibly assigning the expression a name:

CREATE VIEW even_integers AS
SELECT n
FROM integers
WHERE EXISTS (
  SELECT k
  FROM integers
  WHERE n = 2 * k
)

So, when we actually use and materialise the even integers, e.g. the positive integers up to 100, the optimiser can optimise away the double access to the integers table and produce only the exact number of values that we’re requesting (without materialising them in intermediate collections):

SELECT n
FROM even_integers
WHERE n BETWEEN 0 AND 100

Conclusion

Thinking in terms of sets, in terms of declaring sets, has always been our dream as software engineers. The approach is extremely compelling and elegant. We can delegate a lot of boring algorithmic work to the implementation engine of the declarative programming language. In the case of SQL, that would be the SQL database optimiser, which figures out a great deal of optimisations that we might not have thought of.

The above example is trivial. We can perfectly live in a world where we manually iterate over a local integer variable that goes from 0 to 100:

for (int i = 0; i <= 100; i++)
  doSomething(i);

But stuff gets hairy quite quickly. Compare the two versions of the same algorithm in Mario Fusco‘s famous tweet:

This also applies to SQL, and what’s even better in SQL than with Streams: The SQL statement is a declarative expression tree, not a formally ordered set of stream pipeline operations. The optimiser can freely reorder / transform the expression tree into something that it thinks is more optimal. This isn’t just a promise. This works in modern SQL databases every day, for very complex queries, which you can write in a matter of seconds, rather than hours.

Stay tuned for a short series of blog posts on the jOOQ blog illustrating what modern cost-based optimisation can do for you, when you’re using the SQL language.

Warning: Don’t oversimplify

This article just illustrates the roots of the SQL mindset in mathematics and functional programming. Do note that modern SQL is vastly more sophisticated than its roots, and has moved away from this original paradigm to embrace other paradigms for practical reasons.

Don’t limit your SQL usage to what for comprehensions offer. There’s much more to SQL!

jOOQ 4.0’s New API Will Use Annotations Only for Truly Declarative Java/SQL Programming

SQL is the only really popular and mature 4GL (Fourth Generation Programming Language). I.e. it is the only popular declarative language.

At the same time, SQL has proven that Turing completeness is not reserved to lesser languages like C, C++, or Java. Since SQL:1999 and its hierarchical common table expressions, SQL can safely be considered “Turing complete”. This means that any program can be written in SQL. Don’t believe it? Take, for instance, this SQL Mandelbrot set calculation, as can be seen in this Stack Overflow question.


Source: User Elie on http://stackoverflow.com/q/314864/521799

Wonderful! No more need for procedural and object-oriented cruft.

How we’ve been wrong so far…

At Data Geekery (the company behind jOOQ), we love SQL. And we love Java. But one thing has always bothered us in the past. Java is not really a purely declarative language. A lot of Java language constructs are real anti-patterns for the enlightened declarative programmer. For instance:

// This is bad
for (String string : strings)
    System.out.println(string);

// This is even worse
try {
    someSQLStatements();
}
catch (SQLException e) {
    someRecovery();
}

The imperative style of the above code is hardly ever useful. Programmers need to tediously tell the Java compiler and the JVM what algorithm they meant to implement, down to the single statement, when in reality, using the JIT and other advanced optimisation techniques, they don’t really have to.

Luckily, there are annotations

Since Java 5, however, there have been farsighted people in expert groups who have added a powerful new concept to the Java language: Annotations (more info here). At first, experiments were made with only a handful of limited-use annotations, like:

  • @Override
  • @SuppressWarnings

But then, even more farsighted people proceeded to combine these annotations to form completely declarative things, like a component:

@Path("/MonsterRest")
@Stateless
@WebServlet(urlPatterns = "/MonsterServlet")
@Entity
@Table(name = "MonsterEntity")
@XmlRootElement
@XmlAccessorType(XmlAccessType.FIELD)
@NamedQuery(name = "findAll", query = "SELECT c FROM Book c")
public class Book extends HttpServlet {
 
    // ======================================
    // =             Attributes             =
    // ======================================
 
    @Id
    @GeneratedValue
    private Long id;
    private String isbn;
    private Integer nbOfPage;
    private Boolean illustrations;
    private String contentLanguage;
    @Column(nullable = false)
    @Size(min = 5, max = 50)
    @XmlElement(nillable = false)
    private String title;
    private Float price;
    @Column(length = 2000)
    @Size(max = 2000)
    private String description;
    @ElementCollection
    @CollectionTable(name = "tags")
    private List<String> tags = new ArrayList<>();
}

Look at this beauty. Credits to Antonio Goncalves.

However, we still think that there is a lot of unnecessary object-oriented bloat in the above. Luckily, recent innovations that make Java annotations Turing complete (or even sentient?) will now finally allow us to improve upon this situation, specifically for jOOQ, which aims to model the declarative SQL language in Java. Finally, annotations are a perfect fit!

Those innovations are:

These innovations allow us to completely re-implement the entire jOOQ 4.0 API in order to allow for users writing SQL as follows:

@Select({
    @Column("FIRST_NAME"),
    @Column("LAST_NAME")
})
@From(
    table = @Table("AUTHOR"),
    join = @Join("BOOK"),
    predicate = @On(
        left = @Column("AUTHOR.ID"),
        op = @Eq,
        right = @Column("BOOK.AUTHOR_ID")
    )
)
@Where(
    predicate = @Predicate(
        left = @Column("BOOK.TITLE"),
        op = @Like,
        right = @Value("%Annotations in a Nutshell%")
    )
)
class SQLStatement {}

Just like JPA, this makes jOOQ now fully transparent and declarative, by using annotations. Developers will now be able to completely effortlessly translate their medium to highly complex SQL queries into the exact equivalent in jOOQ annotations.

Don’t worry, we’ll provide migration scripts to upgrade your legacy jOOQ 3.x applications to 4.0. A working prototype is on the way and is expected to be released soon; early adopter feedback is very welcome, so stay tuned for more exciting SQL goodness!

10 Easy Steps to a Complete Understanding of SQL

Too many programmers think SQL is a bit of a beast. It is one of the few declarative languages out there, and as such, behaves in an entirely different way from imperative, object-oriented, or even functional languages (although, some say that SQL is also somewhat functional).

As a SQL trainer (do visit our training, it’s great!) I’m writing SQL every day and embracing SQL with our company’s Open Source library jOOQ.


I thus feel compelled to bring the beauty of SQL a bit closer to those of you still struggling with it. The following tutorial is intended for

  • readers who have already worked with SQL but never completely understood it
  • readers who know SQL well but have never really thought about its syntax
  • readers who want to teach SQL to others

This tutorial will focus on SELECT statements only. Other DML statements will be covered in another tutorial.


Note: This tutorial was previously published exclusively on Tech.Pro (see a historic version here). Unfortunately, Tech.Pro went offline. With the permission of Tech.Pro, we’re re-publishing this content again on the jOOQ blog.

Here are…

10 Easy Steps to a Complete Understanding of SQL.

1. SQL is declarative

Get this into your head first. Declarative. The only paradigm where you “just” declare the nature of the results that you would like to get. Not how your computer shall compute those results. Isn’t that wonderful?

SELECT first_name, last_name 
FROM employees 
WHERE salary > 100000

Easy to understand. You don’t care where employee records physically come from. You just want those that have a decent salary.

What do we learn from this?

So if this is so simple, what’s the problem? The problem is that most of us intuitively think in terms of imperative programming. As in: “machine, do this, and then do that, but before, run a check and fail if this-and-that”. This includes storing temporary results in variables, writing loops, iterating, calling functions, etc. etc.

Forget about all that. Think about how to declare things. Not about how to tell the machine to compute things.

2. SQL syntax is not “well-ordered”

A common source of confusion is the simple fact that SQL syntax elements are not ordered in the way they are executed. The lexical ordering is:

  • SELECT [ DISTINCT ]
  • FROM
  • WHERE
  • GROUP BY
  • HAVING
  • UNION
  • ORDER BY

For simplicity, not all SQL clauses are listed. This lexical ordering differs fundamentally from the logical order (which may again differ from the order of execution, depending on the optimiser choices):

  • FROM
  • WHERE
  • GROUP BY
  • HAVING
  • SELECT
  • DISTINCT
  • UNION
  • ORDER BY

There are three things to note:

  1. FROM is the first clause, not SELECT. The first thing that happens is loading data from the disk into memory, in order to operate on such data.
  2. SELECT is executed after most other clauses. Most importantly, after FROM and GROUP BY. This is important to understand when you think you can reference stuff that you declare in the SELECT clause from the WHERE clause. The following is not possible:

    SELECT A.x + A.y AS z
    FROM A
    WHERE z = 10 -- z is not available here!
    

    If you wanted to reuse z, you have two options. Either repeat the expression:

    SELECT A.x + A.y AS z
    FROM A
    WHERE (A.x + A.y) = 10
    

    … or you resort to derived tables, common table expressions, or views to avoid code repetition. See examples further down.

  3. UNION is placed before ORDER BY in both lexical and logical ordering. Many people think that each UNION subselect can be ordered, but according to the SQL standard and most SQL dialects, that is not true. While some dialects allow for ordering subqueries or derived tables, there is no guarantee that such ordering will be retained after a UNION operation.

Note, not all databases implement things the same way. Rule number 2, for instance, does not apply exactly in the above way to MySQL, PostgreSQL, and SQLite.

What do we learn from this?

Always remember both the lexical order and the logical order of SQL clauses to avoid very common mistakes. If you understand that distinction, it will become very obvious why some things work and others don’t.

Of course, it would have been nice if the language was designed in a way that the lexical order actually reflected the logical order, as it is implemented in Microsoft’s LINQ.

3. SQL is about table references

Because of the difference between lexical ordering and logical ordering, most beginners are probably tricked into thinking that column values are the first-class citizens in SQL. They are not. The most important things are table references.

The SQL standard defines the FROM clause as such:

<from clause> ::= 
    FROM <table reference> 
        [ { <comma> <table reference> }... ]

The “output” of the FROM clause is a combined table reference of the combined degree of all table references. Let’s digest this, slowly.

FROM a, b

The above produces a combined table reference of the degree of a + the degree of b. If a has 3 columns and b has 5 columns, then the “output table” will have 8 (3 + 5) columns.

The records contained in this combined table reference are those of the cross product / cartesian product of a x b. In other words, each record of a is paired with each record of b. If a has 3 records and b has 5 records, then the above combined table reference will produce 15 records (3 x 5).
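By way of analogy (not part of the original tutorial), the same cross product can be expressed with Java 8 Streams, assuming two lists of strings a and b and the usual java.util.stream imports:

// FROM a, b: combine every record of a with every record of b.
// 3 records in a and 5 records in b yield 15 combined records.
List<String> combined = a.stream()
    .flatMap(recordA -> b.stream()
        .map(recordB -> recordA + ", " + recordB))
    .collect(Collectors.toList());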

This “output” is “fed” / “piped” into the GROUP BY clause (after filtering in the WHERE clause), where it is transformed into a new “output”. We’ll deal with that later on.

If we’re looking at these things from a relational algebra / set theory perspective, a SQL table is a relation or a set of tuples. And each SQL clause will transform one or several relations in order to produce new relations.

What do we learn from this?

Always think in terms of table references to understand how data is “pipelined” through your SQL clauses.

4. SQL table references can be rather powerful

A table reference is something rather powerful. A simple example of their power is the JOIN keyword, which is actually not part of the SELECT statement, but part of a “special” table reference. The joined table, as defined in the SQL standard (simplified):

<table reference> ::=
    <table name>
  | <derived table>
  | <joined table>

If we take again the example from before:

FROM a, b

a can be a joined table as such:

a1 JOIN a2 ON a1.id = a2.id

Expanding this into the previous expression, we’d get:

FROM a1 JOIN a2 ON a1.id = a2.id, b

While it is discouraged to combine the comma-separated list of table references syntax with the joined table syntax, you can most certainly do this. The resulting, combined table reference will now have a degree of a1+a2+b.

Derived tables are even more powerful than joined tables. We’ll get to that.

What do we learn from this?

Always, always think in terms of table references. Not only is this important to understand how data is “pipelined” through your SQL clauses (see previous section), it will also help you understand how complex table references are constructed.

And, importantly, understand that JOIN is a keyword for constructing joined tables, not a part of the SELECT statement. Some databases also allow for using JOIN in INSERT, UPDATE, and DELETE statements.

5. SQL JOIN tables should be used rather than comma-separated tables

Before, we’ve seen this clause:

FROM a, b

Advanced SQL developers will probably tell you that it is discouraged to use the comma-separated list at all, and always fully express your JOINs. This will help you improve readability of your SQL statement, and thus prevent mistakes.

One very common mistake is to forget a JOIN predicate somewhere. Think about the following:

FROM a, b, c, d, e, f, g, h
WHERE a.a1 = b.bx
AND a.a2 = c.c1
AND d.d1 = b.bc
-- etc...

The join table syntax is both

  • Safer, as you can place join predicates close to the joined tables, thus preventing mistakes.
  • More expressive, as you can distinguish between OUTER JOIN, INNER JOIN, etc.

What do we learn from this?

Always use JOIN. Never use comma-separated table references in your FROM clauses.

6. SQL’s different JOIN operations

JOIN operations essentially come with five flavours:

  • EQUI JOIN
  • SEMI JOIN
  • ANTI JOIN
  • CROSS JOIN
  • DIVISION

These terms are commonly used in relational algebra. SQL uses different terms for the above concepts, if they exist at all. Let’s have a closer look:

EQUI JOIN

This is the most common JOIN operation. It has two sub-flavours:

  • INNER JOIN (or just JOIN)
  • OUTER JOIN (further sub-flavoured as LEFT, RIGHT, FULL OUTER JOIN)

The difference is best explained by example:

-- This table reference contains authors and their books.
-- There is one record for each book and its author.
-- authors without books are NOT included
author JOIN book ON author.id = book.author_id

-- This table reference contains authors and their books
-- There is one record for each book and its author.
-- ... OR there is an "empty" record for authors without books
-- ("empty" meaning that all book columns are NULL)
author LEFT OUTER JOIN book ON author.id = book.author_id
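For comparison only (hypothetical Author and Book types, with the usual java.util.stream imports), the INNER JOIN above roughly corresponds to this Java 8 Streams pipeline:

// One element per author/book combination; authors without books
// simply produce no output (just like the INNER JOIN).
List<String> joined = authors.stream()
    .flatMap(a -> books.stream()
        .filter(b -> b.getAuthorId().equals(a.getId()))
        .map(b -> a.getName() + " wrote " + b.getTitle()))
    .collect(Collectors.toList());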

SEMI JOIN


This relational concept can be expressed in two ways in SQL: using an IN predicate, or using an EXISTS predicate. “Semi” means “half” in Latin. This type of join is used to join only “half” of a table reference. What does that mean? Consider again the above joining of author and book. Let’s imagine that we don’t want author/book combinations, but just those authors who actually also have books. Then we can write:

-- Using IN
FROM author
WHERE author.id IN (SELECT book.author_id FROM book)

-- Using EXISTS
FROM author
WHERE EXISTS (SELECT 1 FROM book WHERE book.author_id = author.id)

While there is no general rule as to whether you should prefer IN or EXISTS, these things can be said:

  • IN predicates tend to be more readable than EXISTS predicates
  • EXISTS predicates tend to be more expressive than IN predicates (i.e. it is easier to express very complex SEMI JOINs)
  • There is no formal difference in performance, but there may be huge actual performance differences on some databases.
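Again by way of analogy (same hypothetical Author and Book types as above, not part of the original tutorial), the SEMI JOIN corresponds to this Streams pipeline, where each author appears at most once, no matter how many books match:

List<Author> authorsWithBooks = authors.stream()
    .filter(a -> books.stream()
        .anyMatch(b -> b.getAuthorId().equals(a.getId())))
    .collect(Collectors.toList());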

Because INNER JOIN also produces only those authors that actually have books, many beginners may think that they can then remove duplicates using DISTINCT. They think they can express a SEMI JOIN like this:

-- Find only those authors who also have books
SELECT DISTINCT first_name, last_name
FROM author
JOIN book ON author.id = book.author_id

This is very bad practice for two reasons:

  • It is very slow, as the database has to load a lot of data into memory, just to remove duplicates again.
  • It is not entirely correct, even if it produces the correct result in this simple example. But as soon as you JOIN more table references, you will have a very hard time correctly removing duplicates from your results.

Some more information about abuse of DISTINCT can be seen in this blog post.

ANTI JOIN

This relational concept is just the opposite of a SEMI JOIN. You can produce it simply by adding a NOT keyword to the IN or EXISTS predicates. An example, where we’ll select those authors who do not have any books:

-- Using IN
FROM author
WHERE author.id NOT IN (SELECT book.author_id FROM book)

-- Using EXISTS
FROM author
WHERE NOT EXISTS (SELECT 1 FROM book WHERE book.author_id = author.id)

The same rules with respect to performance, readability, and expressivity apply. However, there is a small caveat with respect to NULLs when using NOT IN, which is a bit out of scope for this tutorial.
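The Streams analogue of the ANTI JOIN (same hypothetical types as before) simply negates the match:

// Keep exactly those authors for which no matching book exists.
List<Author> authorsWithoutBooks = authors.stream()
    .filter(a -> books.stream()
        .noneMatch(b -> b.getAuthorId().equals(a.getId())))
    .collect(Collectors.toList());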

CROSS JOIN

This produces a cross product of the two joined table references, combining every record of the first table reference with every record of the second table reference. We have seen before that this can be achieved with comma-separated table references in the FROM clause. In the rare cases where this is really desired, you can also write a CROSS JOIN explicitly, in most SQL dialects:

-- Combine every author with every book
author CROSS JOIN book

DIVISION

The relational division is really a beast of its own breed. In short, if JOIN is multiplication, division is the inverse of JOIN. Relational divisions are very tough to express in SQL. As this is a beginners’ tutorial, explaining it is out of scope. For the brave among you, read on about it here, here, and here.

What do we learn from this?

A lot. Again, let’s hammer this into our heads. SQL is about table references. Joined tables are quite sophisticated table references. But there is a difference between relational-speak and SQL-speak. Not all relational join operations are also formal SQL join operations. With a bit of practice and knowledge of relational theory, you will always be able to choose the right type of relational JOIN and be able to translate it to the correct SQL.

7. SQL’s derived tables are like table variables

Before, we’ve learned that SQL is a declarative language, and as such, variables do not have a place (they do in some SQL dialects, though). But you can write something like variables. And those beasts are called derived tables.

A derived table is nothing but a subquery wrapped in parentheses.

-- A derived table
FROM (SELECT * FROM author)

Note that some SQL dialects require derived tables to have a correlation name (also known as alias).

-- A derived table with an alias
FROM (SELECT * FROM author) a

Derived tables are awesome when you want to circumvent the problems caused by the logical ordering of SQL clauses. For instance, if you want to reuse a column expression in both the SELECT and the WHERE clause, just write (Oracle dialect):

-- Get authors' first and last names, and their age in days
SELECT first_name, last_name, age
FROM (
  SELECT first_name, last_name, current_date - date_of_birth age
  FROM author
)
-- If the age is greater than 10000 days
WHERE age > 10000

Note that some databases, as well as the SQL:1999 standard, have taken derived tables to the next level, introducing common table expressions. These will allow you to reuse the same derived table several times within a single SQL SELECT statement. The above query would then translate to the (almost) equivalent:

WITH a AS (
  SELECT first_name, last_name, current_date - date_of_birth age
  FROM author
)
SELECT *
FROM a
WHERE age > 10000

Obviously, you could also externalise “a” into a standalone view for even broader reuse of common SQL subselects. Read more about views here.

What do we learn from this?

Again, again, again. SQL is mostly about table references, not columns. Make use of them. Don’t be afraid of writing derived tables or other complex table references.

8. SQL GROUP BY transforms previous table references

Let’s reconsider our previous FROM clause:

FROM a, b

And now, let’s apply a GROUP BY clause to the above combined table reference:

GROUP BY A.x, A.y, B.z

The above produces a new table reference with only three remaining columns (!). Let’s digest this again. If you apply GROUP BY, then you reduce the number of available columns in all subsequent logical clauses – including SELECT. This is the syntactical reason why you can only reference columns from the GROUP BY clause in the SELECT clause.

  • Note that other columns may still be available as arguments of aggregate functions:

    SELECT A.x, A.y, SUM(A.z)
    FROM A
    GROUP BY A.x, A.y
    
  • Note that MySQL, unfortunately, doesn’t adhere to this standard, causing nothing but confusion. Don’t fall for MySQL’s tricks. GROUP BY transforms table references. You can thus only reference columns also referenced in the GROUP BY clause.
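As a rough Streams analogue (a hypothetical Row type with getX() and getZ(), grouping by a single key for brevity), note how downstream of the grouping only the key and the aggregated value remain available:

// SELECT A.x, SUM(A.z) FROM A GROUP BY A.x, expressed with Streams:
Map<String, Double> sums = rows.stream()
    .collect(Collectors.groupingBy(
        Row::getX,                          // the GROUP BY key
        Collectors.summingDouble(Row::getZ) // the aggregate function
    ));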

What do we learn from this?

GROUP BY, again, operates on table references, transforming them into a new form.

9. SQL SELECT is called projection in relational algebra

I personally like the term “projection”, as it is used in relational algebra. Once you’ve generated your table reference, filtered it, and transformed it, you can step to projecting it to another form. The SELECT clause is like a projector: a table function making use of row value expressions to transform each record from the previously constructed table reference into the final outcome.

Within the SELECT clause, you can finally operate on columns, creating complex column expressions as parts of the record / row.

There are a lot of special rules with respect to the nature of available expressions, functions, etc. Most importantly, you should remember these:

  1. You can only use column references that can be produced from the “output” table reference
  2. If you have a GROUP BY clause, you may only reference columns from that clause, or aggregate functions.
  3. You can use window functions instead of aggregate functions, when you don’t have a GROUP BY clause.
  4. If you don’t have a GROUP BY clause, you must not combine aggregate functions with non-aggregate functions.
  5. There are some rules with respect to wrapping regular functions in aggregate functions and vice-versa.
  6. There are …

Well, there are lots of complex rules. They could fill yet another tutorial. For instance, the reason why you cannot combine aggregate functions with non-aggregate functions in the projection of a SELECT statement without GROUP BY clause (rule number 4) is this:

  1. It doesn’t make sense. Intuitively.
  2. If intuition doesn’t help (it hardly does, with a SQL beginner), then syntax rules do. SQL:1999 introduced GROUPING SETS, and SQL:2003 introduced empty grouping sets: GROUP BY (). Whenever an aggregate function is present, and there is no explicit GROUP BY clause, an implicit, empty GROUPING SET is applied (rule number 2). Hence, the original rules about logical ordering aren’t exactly true anymore, and the projection (SELECT) influences the outcome of a logically preceding, yet lexically succeeding clause (GROUP BY).

Confused? Yes. Me too. Let’s get back to simpler things.

What do we learn from this?

The SELECT clause may be one of the most complex clauses in SQL, even if it appears so simple. All other clauses just “pipe” table references from one to another. The SELECT clause messes up the beauty of these table references, by completely transforming them, applying some rules to them retroactively.

In order to understand SQL, it is important to understand everything else first, before trying to tackle SELECT. Even if SELECT is the first clause in lexical ordering, it should be the last.

10. SQL DISTINCT, UNION, ORDER BY, and OFFSET are simple again

After the complicated SELECT, we can get back to simple things again:

  • Set operations (DISTINCT and UNION)
  • Ordering operations (ORDER BY, OFFSET .. FETCH)

Set operations

Set operations operate on “sets”, which are actually nothing other than… tables. Well, almost. Conceptually, they’re easy to understand.

  • DISTINCT removes duplicates after the projection.
  • UNION concatenates two subselects and removes duplicates
  • UNION ALL concatenates two subselects retaining duplicates
  • EXCEPT removes records from the first subselect that are also contained in the second subselect (and then removes duplicates)
  • INTERSECT retains only records contained in both subselects (and then removes duplicates)

All of this removing of duplicates is usually nonsense. Most often, you should just use UNION ALL when you want to concatenate subselects.
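As a Streams analogue (assuming two lists a and b and the usual java.util.stream imports), concat() behaves like UNION ALL, and a subsequent distinct() turns it into UNION:

Stream<String> unionAll = Stream.concat(a.stream(), b.stream());            // UNION ALL
Stream<String> union    = Stream.concat(a.stream(), b.stream()).distinct(); // UNION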

Ordering operations

Ordering is not a relational feature. It is a SQL-only feature. It is applied at the very end of both lexical ordering and logical ordering of your SQL statement. Using ORDER BY and OFFSET .. FETCH is the only way to guarantee that records can be accessed by index in a reliable way. All other ordering is always arbitrary and random, even if it may appear to be reproducible.

OFFSET .. FETCH is only one syntax variant. Other variants include MySQL’s and PostgreSQL’s LIMIT .. OFFSET, or SQL Server’s and Sybase’s TOP .. START AT. A good overview of various ways to implement OFFSET .. FETCH can be seen here.
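Again as a Streams analogue (a names list is assumed), the whole ordering block maps onto three pipeline steps:

List<String> page = names.stream()
    .sorted()     // ORDER BY name
    .skip(10)     // OFFSET 10 ROWS
    .limit(5)     // FETCH NEXT 5 ROWS ONLY
    .collect(Collectors.toList());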

Let’s get to work

As with every language, SQL takes a lot of practice to master. The above 10 simple steps will help you make more sense of the everyday SQL that you’re writing. On the other hand, it is also good to learn from common mistakes. The following two articles list lots of common mistakes Java (and other) developers make when writing SQL.

LINQ and Java

LINQ has been quite a successful, but also controversial addition to the .NET ecosystem. Many people are looking for a comparable solution in the Java world. To better understand what a comparable solution could be, let’s have a look at the main problem that LINQ solves:

Query languages are often declarative programming languages with many keywords. They offer few control-flow elements, yet they are highly descriptive. The most popular query language is SQL, the ISO/IEC standardised Structured Query Language, mostly used for relational databases.

Declarative programming means that programmers do not explicitly phrase out their algorithms. Instead, they describe the result they would like to obtain, leaving algorithmic calculus to their implementing systems. Some databases have become very good at interpreting large SQL statements, applying SQL language transformation rules based on language syntax and metadata. An interesting read is Tom Kyte’s metadata matters, hinting at the incredible effort that has been put into Oracle’s Cost-Based Optimiser. Similar papers can be found for SQL Server, DB2 and other leading RDBMS.

LINQ-to-SQL is not SQL

LINQ is an entirely different query language that allows embedding declarative programming aspects into .NET languages, such as C# or VB.NET. The nice part of LINQ is the fact that a C# compiler can compile something that looks like SQL in the middle of C# statements. In a way, LINQ is to .NET what SQL is to PL/SQL or PL/pgSQL, or what jOOQ is to Java (see my previous article about PL/Java). But unlike PL/SQL, which embeds the actual SQL language, LINQ-to-SQL does not aim for modelling SQL itself within .NET. It is a higher-level abstraction that keeps an open door for attempting to unify querying against various heterogeneous data stores in a single language. This unification will create a similar impedance mismatch as ORM did before, maybe an even bigger one. While similar languages can be transformed into each other to a certain extent, it can become quite difficult for an advanced SQL developer to predict what actual SQL code will be generated from even very simple LINQ statements.

LINQ Examples

This becomes clearer when looking at some examples given by the LINQ-to-SQL documentation. For example, the Count() aggregate function:

System.Int32 notDiscontinuedCount =
    (from prod in db.Products
    where !prod.Discontinued
    select prod)
    .Count();

Console.WriteLine(notDiscontinuedCount);

In the above example, it is not immediately clear whether the .Count() function is transformed into a SQL count(*) aggregate function within the parenthesised query (then why not put it into the projection?), or whether it will be applied only after executing the query, in application memory. The latter would be prohibitive if a large number of records had to be transferred from the database to memory. Depending on the transaction model, they would even need to be read-locked!
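For contrast, here is the plain Java 8 Streams version of the same count (a hypothetical Product type is assumed). With an ordinary in-memory stream there is no such ambiguity: filtering and counting always happen in application memory.

long notDiscontinuedCount = products.stream()
    .filter(prod -> !prod.isDiscontinued())
    .count();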

Another example is given here where grouping is explained:

var prodCountQuery =
    from prod in db.Products
    group prod by prod.CategoryID into grouping
    where grouping.Count() >= 10
    select new
    {
        grouping.Key,
        ProductCount = grouping.Count()
    };

In this case, LINQ models its language aspects entirely differently from SQL. The above LINQ where clause is obviously a SQL HAVING clause. into grouping is an alias for what will be a grouped tuple, which is quite a nice idea. This does not directly map to SQL, though, and must be used by LINQ internally to produce typed output. What’s awesome, of course, are the statically typed projections that can be reused afterwards, directly in C#!

Let’s look at another grouping example:

var priceQuery =
    from prod in db.Products
    group prod by prod.CategoryID into grouping
    select new
    {
        grouping.Key,
        TotalPrice = grouping.Sum(p => p.UnitPrice)
    };

In this example, C#’s functional aspects are embedded into LINQ’s Sum(p => p.UnitPrice) aggregate expression. TotalPrice = ... is just simple column aliasing. The above leaves me with lots of open questions. How can I control which parts are really going to be translated to SQL, and which parts will execute in my application after a SQL query returns a partial result set? How can I predict whether a lambda expression is suitable for a LINQ aggregate function, and when it will cause a huge amount of data to be loaded into memory for in-memory aggregation? And also: will the compiler warn me that it couldn’t figure out how to generate a C#/SQL algorithm mix? Or will this simply fail at runtime?

To LINQ or not to LINQ

Don’t get me wrong. Whenever I look inside the LINQ manuals for some inspiration, I have a deep urge to try it in a project. It looks awesome, and well-designed. There are also lots of interesting LINQ questions on Stack Overflow. I wouldn’t mind having LINQ in Java, but I want to remind readers that LINQ is NOT SQL. If you want to stay in control of your SQL, LINQ or LINQesque APIs may be a bad choice for two reasons:

  1. Some SQL mechanisms cannot be expressed in LINQ. Just as with JPA, you may need to resort to plain SQL.
  2. Some LINQ mechanisms cannot be expressed in SQL. Just as with JPA, you may suffer from severe performance issues, and will thus resort again to plain SQL.

Beware of the above when choosing LINQ, or a “Java implementation” thereof! You may be better off using SQL (i.e. JDBC, jOOQ, or MyBatis) for data fetching and Java APIs (e.g. Java 8’s Stream API) for in-memory post-processing.

LINQ-like libraries modelling SQL in Java, Scala

LINQ-like libraries abstracting SQL syntax and data stores in Java, Scala