How to Create a Range From 1 to 10 in SQL

How do you create a range from 1 to 10 in SQL? Have you ever thought about it? This is such an easy problem to solve in any imperative language, it’s ridiculous. Take Java (or C, whatever) for instance:

for (int i = 1; i <= 10; i++)
  System.out.println(i);

This was easy, right? Things look even leaner when using functional programming. Take Scala, for instance:

(1 to 10) foreach { t => println(t) }

We could fill about 25 pages discussing the various ways to do the above in Scala, agreeing on how awesome Scala is (or what hipsters we are).

But how to create a range in SQL?

… And we’ll exclude using stored procedures, because that would be no fun. In SQL, the data source we’re operating on are tables. If we want a range from 1 to 10, we’d probably need a table containing exactly those ten values. Here are a couple of good, bad, and ugly options of doing precisely that in SQL. OK, they’re mostly bad and ugly.

By creating a table

The dumbest way to do this would be to create an actual table just for that purpose:

CREATE TABLE "1 to 10" AS
SELECT 1 value FROM DUAL UNION ALL
SELECT 2       FROM DUAL UNION ALL
SELECT 3       FROM DUAL UNION ALL
SELECT 4       FROM DUAL UNION ALL
SELECT 5       FROM DUAL UNION ALL
SELECT 6       FROM DUAL UNION ALL
SELECT 7       FROM DUAL UNION ALL
SELECT 8       FROM DUAL UNION ALL
SELECT 9       FROM DUAL UNION ALL
SELECT 10      FROM DUAL

See also this SQLFiddle

This table can then be used in any type of select. Now that’s pretty dumb but straightforward, right? I mean, how many actual records are you going to put in there?

By using a VALUES() table constructor

This solution isn’t that much better. You can create a derived table and manually add the values from 1 to 10 to that derived table using the VALUES() table constructor. In SQL Server, you could write:

SELECT V
FROM (
  VALUES (1), (2), (3), (4), (5), 
         (6), (7), (8), (9), (10)
) [1 to 10](V)

See also this SQLFiddle

By creating enough self-joins of a sufficient number of values

Another “dumb”, yet a bit more generic solution would be to create only a certain number of constant values in a table, view or CTE (e.g. two) and then self-join that table enough times to reach the desired range length (e.g. four times). Each copy of the table contributes one binary digit, so four self-joins of the values 0 and 1 encode the numbers 0 to 15. The following example will produce the values from 1 to 10, “easily”:

WITH T(V) AS (
  SELECT 0 FROM DUAL UNION ALL
  SELECT 1 FROM DUAL
)
SELECT V FROM (
  SELECT 1        + 
             T1.V + 
         2 * T2.V + 
         4 * T3.V + 
         8 * T4.V V
  FROM T T1, T T2, T T3, T T4
)
WHERE V <= 10
ORDER BY V

See also this SQLFiddle

By using grouping sets

Another way to generate large tables is by using grouping sets, or more specifically, the CUBE() function. This works in much the same way as the previous example of self-joining a table with two records: CUBE(1, 2, 3, 4) produces all 2^4 = 16 grouping sets, i.e. 16 rows, of which we keep the first ten:

SELECT ROWNUM FROM (
  SELECT 1
  FROM DUAL
  GROUP BY CUBE(1, 2, 3, 4)
)
WHERE ROWNUM <= 10

See also this SQLFiddle

By just taking random records from a “large enough” table

In Oracle, you could probably use ALL_OBJECTS. If you’re only counting to 10, you’ll certainly get enough results from that table:

SELECT ROWNUM FROM ALL_OBJECTS
WHERE ROWNUM <= 10

See also this SQLFiddle

What’s so “awesome” about this solution is that you can cross join that table several times to be sure to get enough values:

SELECT ROWNUM 
FROM ALL_OBJECTS, ALL_OBJECTS,
     ALL_OBJECTS, ALL_OBJECTS
WHERE ROWNUM <= 10

OK. Just kidding. Don’t actually do that. Or if you do, don’t blame me if your production system runs low on memory.

By using the awesome PostgreSQL GENERATE_SERIES() function

Incredibly, this isn’t part of the SQL standard. Neither is it available in most databases but PostgreSQL, which has the GENERATE_SERIES() function. This is much like Scala’s range notation: (1 to 10)

SELECT * FROM GENERATE_SERIES(1, 10)

See also this SQLFiddle

By using CONNECT BY

If you’re using Oracle, then there’s a really easy way to create such a table using the CONNECT BY clause, which is almost as convenient as PostgreSQL’s GENERATE_SERIES() function:

SELECT LEVEL FROM DUAL
CONNECT BY LEVEL <= 10

See also this SQLFiddle

By using a recursive CTE

Recursive common table expressions are cool, yet utterly unreadable. The equivalent of the above Oracle CONNECT BY clause, written using a recursive CTE, would look like this:

WITH "1 to 10"(V) AS (
  SELECT 1 FROM DUAL
  UNION ALL
  SELECT V + 1 FROM "1 to 10"
  WHERE V < 10
)
SELECT * FROM "1 to 10"

See also this SQLFiddle

By using Oracle’s MODEL clause

A decent “best of” comparison of how to do things in SQL wouldn’t be complete without at least one example using Oracle’s MODEL clause (see this awesome use-case for Oracle’s spreadsheet feature). Use this clause only if you want to make your co-workers really angry when they maintain your SQL code.

Bow before this beauty!

SELECT V
FROM (
  SELECT 1 V FROM DUAL
) T
MODEL DIMENSION BY (ROWNUM R)
      MEASURES (V)
      RULES ITERATE (10) (
        V[ITERATION_NUMBER] = CV(R) + 1
      )
ORDER BY 1

See also this SQLFiddle

Conclusion

There aren’t actually many nice solutions to do such a simple thing in SQL. Clearly, PostgreSQL’s GENERATE_SERIES() table function is the most beautiful solution. Oracle’s CONNECT BY clause comes close. For all other databases, some trickery has to be applied in one way or another.

Unfortunately.

Faster SQL Pagination with Keysets, Continued

A while ago, I blogged about how to perform keyset pagination (some also call this the “seek method”). Keyset pagination is a very powerful technique for performing constant-time pagination, even on very large result sets, where “classic” OFFSET pagination inevitably gets slow at large page numbers.

Keyset pagination is most useful for lazily loading the “next page”, much more than for jumping directly to page 235, for instance. A good example of where this is useful is social media, such as Twitter or Facebook with their streams of content. When you reach the bottom of a page, you just want the “next couple of tweets”. Not the tweets at offset 3564.

If the data being paginated stays “reasonably unchanged” while you paginate, you can even perform classical OFFSET pagination by pre-calculating (and caching) all page boundaries in one go. Imagine you have this data (I’m using PostgreSQL for the example):

create table t(
  id int,
  value int
);

insert into t (id, value)
select id, random() * 100
from generate_series(1, 1000) x(id);

Our data has 1000 records with random values between 0 and 99. Now, let’s browse this data in ascending order, ordered by VALUE and then by ID. If we wanted to get to page 6 with a page size of 5 using OFFSET pagination, we’d write:

select id, value
from t
order by value, id
limit 5
offset 25

This would then yield something like

|  ID | VALUE |
|-----|-------|
| 640 |     2 |
| 776 |     2 |
| 815 |     2 |
| 947 |     2 |
|  37 |     3 |

See also this SQLFiddle

OFFSET pagination emulation with cached page boundaries

The above can be emulated using keyset pagination if we know the page boundaries of every page. In other words, in order to jump to page 6 without an actual OFFSET, we’d have to know the value of the record immediately preceding {"ID":640,"VALUE":2}. Or better, let’s just figure out all the page boundaries with the following query:

select 
  t.id, 
  t.value,
  case row_number() 
       over(order by t.value, t.id) % 5 
    when 0 then 1 
    else 0 
  end page_boundary
from t
order by t.value, t.id

The above query yields

|   ID | VALUE | PAGE_BOUNDARY |
|------|-------|---------------|
|  ... |   ... |           ... |
|  474 |     2 |             0 |
|  533 |     2 |             1 | <-- Before page 6
|  640 |     2 |             0 |
|  776 |     2 |             0 |
|  815 |     2 |             0 |
|  947 |     2 |             0 |
|   37 |     3 |             1 | <-- Last on page 6
|  287 |     3 |             0 |
|  450 |     3 |             0 |
|  ... |   ... |           ... |

See also this SQLFiddle

As you can see, the record just before {"ID":640,"VALUE":2} is {"ID":533,"VALUE":2}, which is the page boundary that we need to jump to if we want to go to page 6. Page 7 then starts with the record just after {"ID":37,"VALUE":3}.

The above query selects too much data, of course. We’re only interested in those records where PAGE_BOUNDARY = 1. Besides, why not calculate the page numbers in the database as well, with yet another call to ROW_NUMBER()? Let’s write:

select 
  x.value, 
  x.id,
  row_number() 
    over(order by x.value, x.id) + 1 page_number
from (
  select 
    t.id, 
    t.value,
    case row_number() 
         over(order by t.value, t.id) % 5 
      when 0 then 1 
      else 0 
    end page_boundary
  from t
  order by t.value, t.id
) x
where x.page_boundary = 1

This will then yield:

| VALUE |  ID | PAGE_NUMBER |
|-------|-----|-------------|
|     0 | 786 |           2 |
|     1 | 250 |           3 |
|     1 | 959 |           4 |
|     2 | 229 |           5 |
|     2 | 533 |           6 | <-- Before page 6
|     3 |  37 |           7 |
|     3 | 768 |           8 |

See also this SQLFiddle.

We can now jump to page 6 with this simple query:

select id, value
from t
where (value, id) > (2, 533)
order by value, id
limit 5

… which will yield the same as the previous OFFSET query:

|  ID | VALUE |
|-----|-------|
| 640 |     2 |
| 776 |     2 |
| 815 |     2 |
| 947 |     2 |
|  37 |     3 |

See also this SQLFiddle
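Note that the row value predicate (value, id) > (2, 533) compares the two tuples lexicographically, which PostgreSQL supports natively. In databases that lack row value comparisons, the same predicate can be spelled out explicitly, e.g. like this:

select id, value
from t
where value > 2
or (value = 2 and id > 533)
order by value, id
limit 5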

If you’re planning on using the upcoming jOOQ 3.3, the same query can be achieved with the following SEEK syntax:

DSL.using(configuration)
   .select(T.ID, T.VALUE)
   .from(T)
   .orderBy(T.VALUE, T.ID)
   .seek(2, 533)
   .limit(5);

The advantage of this is that you don’t have to write out the SEEK predicate explicitly, which adds readability and typesafety, especially if your ORDER BY clause is a little more complex.

If window functions aren’t available

The above queries make use of window functions, which unfortunately aren’t available in all databases. If you’re using MySQL, for instance, you will have to use a different mechanism to find the PAGE_BOUNDARY. One such example is using a scalar subquery:

select 
  t.id, 
  t.value,
  case (
      select count(*)
      from t t2
      where (t2.value, t2.id) <= (t.value, t.id)
    ) % 5 
    when 0 then 1 
    else 0 
  end page_boundary
from t
order by t.value, t.id

See also this SQLFiddle

Such a scalar subquery might be quite costly if your database’s query optimiser is poor. Your best bet is to measure things and decide whether caching page boundaries in order to apply keyset pagination is really faster than classic OFFSET paging.

Conclusion

As explained in our previous blog post about keyset pagination, this technique can bring great performance improvements as pagination can be achieved in constant time, leveraging existing indexes. It is most useful when the underlying data is very stable (no records added / removed while paginating), or when pagination “stability” is desired even if records are added / removed.

Keyset pagination, or the “seek method”, should be part of every SQL developer’s tool set.

Where’s the Self-Confidence when Advertising Java 8, Oracle?

I have often wondered why the team around Brian Goetz has been heading towards a “decent compromise” so strongly from the beginning, both from a marketing AND a technical point of view, instead of advertising Java 8 more boldly. At Devoxx Belgium 2013, Brian Goetz seems to have completely undersold his accomplishments, according to this interesting article. Having extensively followed the lambda-dev mailing list, I can only stress how little the creators of Java 8 loved their new defender methods feature, for instance.

Java 8 is what we have all been waiting for, for so long! After all, the introduction of lambda expressions and defender methods (equally impactful, if not as often advertised!) is one of the most significant improvements to the Java language since its very beginning.

Given the tremendous success of LINQ in .NET, I have recently contemplated whether Java 8, lambda expressions and the Streams API might actually be an equally interesting approach to adding features to an ecosystem, compared to the “scariness” of comprehensions and monads as understood by LINQ: https://blog.jooq.org/2013/11/02/does-java-8-still-need-linq-or-is-it-better-than-linq/

While my article above certainly wasn’t well received by the .NET community (and even Erik Meijer himself smirked at it), it did get quite a bit of love from the Java community. In other words, the Java community is ready for Java 8 goodness. Let’s hope Oracle will start advertising it as the cool new thing that it is.

The Spam Manifesto: Reactive Spamming, Big Spam, Spam Reduce

Dear Spammers,

When blogging and interacting with our users on forums, dealing with your spam is part of our everyday work. Our WordPress blog luckily uses Akismet, a comment spam prevention tool, which removes lots and lots of spam, such that our moderation efforts stay very low. Google Groups, on the other hand, uses Google’s own awesome spam filter, which works impeccably in GMail already. But unfortunately, we have to perform quite a bit of moderation as there are an incredible number of Chinese spammers.

Who are these spammers? Who spends so much time developing robots for stuff that doesn’t really work, given that most mail / blog vendors have awesome counter-measures? You’re just wasting your time without any actual conversion!

To make things worse, blog software renders many links as “nofollow”, so even if a spam comment slips through, you won’t get the conversion. You spammers are obviously in the tech stone age, perhaps writing your spam software in COBOL or Delphi?

Let us help you get up to date with recent developments. I’ve had a brief discussion with Petri Kainulainen on the subject, and we think that there should be a new

Spam manifesto

We would like to introduce Reactive Spamming. We believe that Hipster Spammers could take the lead and greatly benefit from fancy tech such as Clojure, Akka and Hadoop. Imagine implementing SpamReduce or MapSpam algorithms on BigSpam. Gartner will certainly feature you on their tech radar. We’ll create SpamConf, a great conference where brilliant minds of our time can discuss the latest in spamming, and the Groovy folks will present their new SpamDSL, to help you express your algorithms in FluentSpam.

And now, for some background info about Spam.

A History of Databases in “No-tation”

We’re heading towards very exciting times in the field of databases!

At Topconf in beautiful Tallinn, Estonia, Nikita Ivanov (founder and CEO of GridGain Systems) was talking about how the ever-crumbling price of DRAM puts in-memory computing, and thus in-memory databases, within reach of even small and medium enterprises. Nikita claims that 99% of all companies have less than 10TB of transactional data. While this would have been completely impossible ten years ago, nowadays you can store that much data in memory for less than 15000 USD! Compared to the Oracle license that you might buy with the server, that’s almost nothing. Imagine being able to scale up several orders of magnitude without changing your “legacy” architecture. Without switching to something like NoSQL.

A day before, Christoph Engelbert presented Hazelcast, a competitor product to GridGain Systems’. Unfortunately, I couldn’t attend his talk, but I was lucky enough to spend a couple of hours with Christoph on the flight back home. He’s a very interesting and fun guy to talk to and gave me quite some insight into what his company is evangelising in the context of “Big Data”. Essentially, modern data processing involves moving computation towards data, instead of moving data towards computation. While Hazelcast solves this through their own storage mechanisms, this paradigm has been equally true for “legacy” OLAP systems based on relational databases. Using PL/SQL, or T-SQL, or any other procedural language, you can execute complex algorithms right where the data is: in your database.

For those of you frequently following my blog, it will come as no surprise that I am very thrilled about the above evolutions in data computing. The ever-increasing consternation with ORMs and the great amount of confusion about the future of “NoSQL” have led to a recent revival of SQL as a language.

Back to the roots.

This seems to have culminated at the recent O’Reilly Strata Conference, where Mark Madsen, a popular researcher and analyst, was walking around with a geeky T-Shirt showing the History of NoSQL. I’ve had a brief chat with him on Twitter. He might be selling this T-Shirt, if it goes viral.

History of NoSQL by Mark Madsen. Picture published by Edd Dumbill

So apparently, SQL is back, and strong as ever!

jOOQ Newsletter November 14, 2013

subscribe to the newsletter here

jOOQ Blog License now CC-BY-SA

Besides providing you with the best Java / SQL integration on the market, we’re also passionate bloggers on the matters of Java, SQL and Open Source. We think that, with our experience around jOOQ, we should be major influencers on those subjects in general.

Our blog at blog.jooq.org will have reached the 200k hits threshold by the end of the week, and we’ll most certainly celebrate that. Our topics and insights are increasingly appreciated by an ever wider audience outside of the jOOQ user base, also on our syndication partners DZone (where we’ve had around 800k readers so far), JCG (readers unknown) and Tech.Pro (100k reads so far). The recent success shows that our marketing efforts pay off. Here are some stats from the jOOQ blog:

Because our blog is reaching far beyond our user base, we have decided to license its content under the terms of the CC-BY-SA 3.0 license, a permissive license that reflects our Open Source spirit. You may thus freely use our content for commercial purposes, if you attribute authorship to us. Please contact us, if you’re not sure how to create appropriate attribution.

Dual Licensing. An Experience Report

A month ago, we started dual-licensing jOOQ. We are happy to see that our competitors follow our lead in offering commercial services around their software. This is a strong indicator that we have done something right. Here’s a little review from Data Geekery about the recent events around our new licensing model.

We have to admit that switching over from a very permissive Open Source license to more restrictive dual-licensing wasn’t exactly a walk in the park. Getting legal aspects right wasn’t easy. How many Open Source products out there do you think are neglecting due diligence with respect to copyright? Our estimate: 95%.

Yet, removing commercial database support from the jOOQ Open Source Edition has had little impact on the number of downloads. After a short break in August / September (no jOOQ 3.1 patch releases), jOOQ 3.2 is almost as strong as ever, as can be seen in this chart originating from oss.sonatype.org:

This doesn’t even count the number of downloads from www.jooq.org/download, or from SourceForge, before we removed the SourceForge download channel. The same effect can be seen on Stack Overflow and on GitHub, where jOOQ has had a significant increase of traction in the last 2-3 months!

Furthermore, with our recent discussions with the Apache GORA and Apache CloudStack guys, we’re positive that dual-licensing won’t keep jOOQ out of the professional Open Source world.

At the same time, sales talks around tailor-made agreements with medium and large customers are ongoing. We consider our work of the last 4 months a great success, and we’re confident that we’ll be able to provide you with a much better jOOQ in the near future by creating professional Open Source software built on solid financial ground, from which everyone can greatly profit.

Upcoming Events

As mentioned in the October newsletter, Lukas is going to be present at a number of events in the near future, talking about jOOQ and other database related stuff. The jOOQ presentation at Topconf in beautiful Tallinn, Estonia had around 35 attendees – well, it was hard to compete with the Google Glass presentation :-)

Upcoming events include

Stay informed about 2014 events on www.jooq.org/news.

SQL Zone – ORM (Un)Popularity

We’re personally thrilled by the fact that the ORM debate is far from over, even more than half a decade after the vietnam of computer science was first recognised. We firmly believe that ORMs are a very leaky abstraction, which is fine in “top-down” engineering approaches where the relational database is a second-class citizen.

But many companies don’t think that should be the case. Many companies want their data to be the first-class citizen, processed by more volatile entities, such as Java programs. It often just doesn’t make sense to have the data abide by the rules of the ORM. It is thus not surprising that Charles Humble from InfoQ has again detected increasing discomfort with ORMs at QCon and other conferences.

In our opinion, there’s a simple reason for this. SQL is constantly evolving, but JPA isn’t. Most importantly, JPA doesn’t do SQL as understood by the ISO / IEC standards. And it doesn’t look as though that’s going to change. Read our blog post on that subject.

SQL Zone – The History of NoSQL

A witty remark about what NoSQL really is was recently made at the O’Reilly Strata Conference in London, where Mark Madsen, a popular researcher and analyst, was walking around with a geeky T-Shirt depicting the History of NoSQL.

We sincerely hope that this awesome piece of humour will go viral. See for yourself:
https://twitter.com/edd/status/400190499585544192/photo/1

Clearly, betting on the “SQL horse” isn’t such a bad bet after all.

Popular ORMs Don’t do SQL

I’m contemplating about what has happened in the ISO / IEC SQL standard during the last 15 years. We’ve had quite a few new features added to our beloved SQL language. Check this out:

  • With the ISO/IEC SQL:1999 standard, we could take advantage of grouping sets and (recursive) common table expressions.
  • With the ISO/IEC SQL:2003 standard, we’ve had the very sophisticated window functions and the MERGE statement.
  • With the ISO/IEC SQL:2008 standard, we could perform partitioned JOINs.
  • With the ISO/IEC SQL:2011 standard, we can now interoperate with temporal databases (implemented so far in IBM DB2 and Oracle).

And obviously, there’s much more goodness hiding in the almost unreadable 1423-page-long documents.

But JPA…

Now, do any of these awesome features appear in JPA? Nope. Will the next SQL standard introduce new awesome features? I’m sure it will! I could imagine that the Oracle / CUBRID CONNECT BY clause, or the Oracle / SQL Server PIVOT / UNPIVOT clauses, are good candidates for standardisation. I’d go absolutely nuts if Oracle’s crazy MODEL clause made it, too.

While exciting things happen at these ends, the ORM impedance mismatch will further deepen and confirm Charles Humble’s recent findings from the QCon, where he’s observed an increasing number of people who are unhappy with popular ORMs’ ever increasing complexity. A complexity example: NamedEntityGraph!

@NamedEntityGraph(
    name="ExecutiveProjects"
    attributeNodes={
        @NamedAttributeNode("address"),
        @NamedAttributeNode(
            value="projects",
            subgraph="projects"
        )
    },
    subgraphs={
        @NamedSubgraph(
            name="projects",
            attributeNodes={
                @NamedAttributeNode("properties")
            }
        ),
        @NamedSubgraph(
            name="projects",
            type=LargeProject.class,
            attributeNodes={
                @NamedAttributeNode("executive")
            }
        )
    }
)

Man, did this really have to be added to JPA? Stack Overflow cannot display that many annotations on a single screen! Well, if this is JEE’s answer to SQL’s recent evolutions, then I’m glad I don’t do too much JEE these days. I’m doing SQL, which is an awesome language when let loose.

Oracle GlassFish, or Why You Should Think About Open Source Again

Oracle’s recent announcement about the discontinuation of commercial services for JEE’s reference implementation GlassFish has caused many reactions in the community involved with JEE. The reactions reach from:

This event seems to have a big impact on the whole Java ecosystem as many of the above people are key players and influencers in our community, and they neither agree nor know what this step by Oracle means for the future of JEE.

The most interesting point of view among all of the above, in my opinion, is tomitribe’s, looking at things from a mere business point of view with respect to Open Source. They’re saying:

Open Source Isn’t Free

Or in other words, “There ain’t no such thing as a free lunch”. And to quote tomitribe even more, a very interesting thought they’re setting out is this:

What this says to me is that we as an industry still do not fully understand Open Source.

We most certainly do not understand Open Source. I’m an Open Source software vendor myself. I believe that Open Source is:

An excellent marketing tool

People look at Open Source as something “generally good”. When I talked about jOOQ at conferences, back when it was an all-Open-Source piece of software (not yet dual-licensed), I got lots of opportunities for free advertising. This has rapidly changed, now that I am offering an alternative commercial license.

A good tool enabler

I get free access to

The same here. As I’m now a “commercial” software vendor, some tools are no longer accessible to me.

The truth is: Open Source is a business strategy

It really is. And it seemed to have worked well for RedHat or Pivotal in the past. Has it worked for anyone else? We don’t know yet, as most other larger companies have such huge amounts of revenue in “classic” fields that they can simply “afford” Open Source. In fact, they’re so good at investing manpower and innovation into Open Source, it keeps the commercial competition in check, as it is hard to write a better and more complete JEE implementation than Weblogic or Websphere.

Apparently, even Larry Ellison is said to agree that the future of data centers lies within using commodity machines. At the same time, RedHat suggests “trying free” to Oracle.

No matter what impact the withdrawal of commercial GlassFish support will have on JEE, we’re only at the beginning of fully understanding what kind of impact this large-scale “freemium” model will have on our world. This isn’t just about the software industry. The whole Internet has brought us “free” stuff. We get:

  • “Free” standards (compare W3C, IETF standards to ISO standards!)
  • “Free” Facebook and Twitter and GMail accounts
  • “Free” newspapers
  • “Free” music and films
  • “Free” commodity services for all sorts of work
  • “Free” work force as we can offshore anything to low-wage countries

This has been picked up recently by Tim Kreider, the author of “We Learn Nothing”, where he depicts how writing “free stuff” for the New York Times helps build “exposure”, and how that’s just nonsense, as all this hard journalistic work doesn’t pay anymore.

Does building “exposure” ring a bell?

Yes, I can build “exposure” by writing free Open Source on GitHub, and by answering complex questions for free on Stack Overflow. I personally use both tools to advertise jOOQ, no doubt. So I get a service (advertising) for a service (content). My deal appears fair to me. But loads of GitHub and Stack Overflow users contribute … just for the sake of contributing. To whom? To GitHub and Stack Overflow. And why? I don’t know.

So, should you contribute to GlassFish, if Oracle starts decreasing support and losing interest, as they have before with MySQL, Hudson, and other products inherited from Sun?

Let’s remember that Karl Marx has already taught us that our idea of capitalism will inevitably lead us to (citing from Wikipedia):

  • technological progress
  • increased productivity
  • growth
  • rationality
  • scientific revolution

Absolutely! There’s no way that productivity can get any better than by having loads of software developers world wide produce better and better tools (growth, progress) for nothing more than … for free!

So, don’t be a pawn of others’ Open Source strategies

So, instead of contemplating what Oracle’s move away from supporting the Open Source reference implementation of JEE means, become active yourself! Don’t just blindly consume Open Source, make it an option like any other option by consciously deciding in favour of Open Source or commercial software, depending on your specific needs.

Stop advertising their cool products for free at conferences, unless you derive your own advantage from such advertising. Open Source is just yet another business model.

ID Lists Aren’t the Best Solution for the N+1 Problem

In their eternal attempts to circumvent the N+1 problem, Hibernate users often resort to IN predicates with ID lists. In this post, we’ll see how those users might just be replacing a horrible thing with a bad one, which is better but not yet good. Here’s why:

The N+1 Problem

The N+1 problem is a well-understood issue, documented in various blog posts. The previously linked article shows the following set of queries to explain the nature of this problem:

SELECT id, name FROM albums
SELECT id, name FROM songs WHERE album_id = 1
SELECT id, name FROM songs WHERE album_id = 2
SELECT id, name FROM songs WHERE album_id = 3
SELECT id, name FROM songs WHERE album_id = 4
SELECT id, name FROM songs WHERE album_id = 5

This set of queries is often produced by ORMs such as Hibernate, when entities are configured to be lazy fetched.

The article also tackles the problem by replacing the second set of N=5 queries with a single query using an IN predicate:

SELECT id, title, filename FROM songs
WHERE album_id IN (1, 2, 3, 4, 5)

This will reduce the number of queries from N+1 to 1+1, which is certainly faster.

But let’s look at things from the SQL side

Is such an IN predicate with an ID List a good solution? It is certainly viable for very small lists. But when your list grows, consider these things:

IN list size

Not all databases support arbitrary lengths of IN lists. In particular the following limitations exist:

  • Oracle IN predicate: 1000 elements
  • Ingres: 1024 total bind values
  • SQLite: 999 total bind values
  • Sybase ASE: 2000 total bind values
  • SQL Server 2008 R2: 2100 total bind values

This is quite annoying, as developers will probably have to learn the above limits the hard way. If you’re using jOOQ, you can “safely” ignore the above constraints as jOOQ will rewrite your query such that:

  • Large Oracle IN predicates are split into several OR-connected IN predicates, or AND-connected NOT IN predicates
  • Large amounts of bind values are detected at SQL rendering time and replaced by inline values

Variable binding speed

There’s quite a bit of work involved with variable binding in some JDBC drivers. Essentially, some database protocols will need to transfer many values one-by-one to the database. This doesn’t happen when you inline bind values, as the only thing transferred is a single SQL string. Having too many bind values (I’m talking about 10k or more), is certainly not a good idea.

Cursor cache misses

Sophisticated databases such as Oracle maintain cursor caches, which can be leveraged for cursor sharing. This means that subsequent executions of identical SQL statements will profit from expensive execution plan calculations having been done already, along with cursor statistics being collected. Think about it this way:

-- This is the first time Oracle encounters this 
-- query. The DB has to parse the query and 
-- calculate an execution plan, which can be quite 
-- expensive if you have lots of JOINs
SELECT id, name FROM songs 
WHERE album_id IN (?, ?)

-- This is the second time Oracle encounters this 
-- same query. The DB can now re-use the previous
-- execution plan as it is likely to be optimal 
-- again
SELECT id, name FROM songs 
WHERE album_id IN (?, ?)

-- This is not the same query as the previous ones
-- A new execution plan has to be calculated
SELECT id, name FROM songs 
WHERE album_id IN (?, ?, ?)

As you can quickly see, the above example shows that an ID list in an IN predicate is a moving target, which is likely to remove the usefulness of bind values entirely, as each query is prone to produce new cursors and new execution plans in your database. You might as well have inlined your bind values, which would have even helped you prevent bind value peeking issues.

So what’s better than ID lists?

There are a number of things that are better. Note that not all of them may be suitable for your concrete problem, and not all of them will always outperform ID lists. Use common sense, and maybe a load and/or performance test, to be sure which query is best in your situation.

Explicit “eager” fetching, using JOINs

Sometimes, it is just easier to denormalise the data in the query. Instead of fetching songs one by one, just fetch them along with the albums:

SELECT
  a.id a_id, 
  a.name a_name,
  s.id s_id,
  s.name s_name
FROM albums a
JOIN songs s ON s.album_id = a.id

This will transfer more data over the wire (repeating album information) in exchange for executing only a single query (reducing N+1 to 1). This is only good for slight denormalisations. If you JOIN dozens of 1:N relationships, you might not be happy with this solution.

Semi-joining the original query

If you can access the original query’s SQL code, just semi-join it when fetching songs! It’s simple:

SELECT id, name FROM songs 
WHERE album_id IN (
  SELECT id FROM albums
)

-- Or using EXISTS
SELECT s.id, s.name FROM songs s
WHERE EXISTS (
  SELECT 1 FROM albums a
  WHERE a.id = s.album_id
)

This will require some SQL transformation. Again, using a typesafe query builder / SQL builder to compose queries, such as jOOQ, JaQu or Criteria API, you may be able to implement such SQL transformation / SQL composition more easily.

Note that this is probably the fastest solution that you can choose, at least in sophisticated databases with powerful query optimisers.

Using arrays for ID lists

If you really cannot query your songs without an ID list, then at least use a single array as a bind variable, like this (Oracle dialect):

SELECT id, name FROM songs 
WHERE album_id IN (
  SELECT * FROM TABLE(?)
)

The above syntax is Oracle-specific. Check out this Stack Overflow question for other alternatives. Note that Oracle’s VARRAY and TABLE types are strongly typed, i.e. you will have to create such a type first:

CREATE TYPE numbers AS TABLE OF NUMBER(38);

Alternatively, you can use one of these “built-in” table types:

  • ORA_MINING_NUMBER_NT
  • ORA_MINING_VARCHAR2_NT
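From JDBC, such an array might be bound as in the following hedged sketch, assuming a recent Oracle driver that exposes OracleConnection.createOracleArray(), and the NUMBERS type created above:

import java.sql.*;
import oracle.jdbc.OracleConnection;

static void printSongs(Connection con, int[] albumIds) throws SQLException {
    // Wrap the IDs in an array of the user-defined NUMBERS type
    Array ids = con.unwrap(OracleConnection.class)
                   .createOracleArray("NUMBERS", albumIds);

    try (PreparedStatement stmt = con.prepareStatement(
            "SELECT id, name FROM songs "
          + "WHERE album_id IN (SELECT * FROM TABLE(?))")) {
        stmt.setArray(1, ids);

        try (ResultSet rs = stmt.executeQuery()) {
            while (rs.next())
                System.out.println(rs.getInt(1) + ": " + rs.getString(2));
        }
    }
}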

Creating discrete-sized IN lists

If your database doesn’t support arrays, and you need to rely on ID lists, there is one last option that you may have to avoid too many cursor cache misses and hard parses. Create discrete-sized IN lists, filling up the bind values to the next discrete length. Let’s assume lengths 2, 3, 5, 8, 13. This is best explained by example:

-- Of course, this only makes sense with bind values
-- Inlining is done for the purpose of the example
-- only

-- Two IDs   fill up to 2
album_id IN (1, 2)

-- Three IDs fill up to 3
album_id IN (1, 2, 3)

-- Four IDs  fill up to 5
album_id IN (1, 2, 3, 4, 4)

-- Five IDs  fill up to 5
album_id IN (1, 2, 3, 4, 5)

-- Six IDs   fill up to 8
album_id IN (1, 2, 3, 4, 5, 6, 6, 6)

There is no rule of thumb for the steps at which your IN list sizes should increase, so you might want to actually measure this.
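For illustration, a small hypothetical helper that pads a non-empty ID list to the next discrete size by repeating the last element (matching the examples above) could look like this:

import java.util.*;

static List<Integer> padToDiscreteSize(List<Integer> ids) {
    // Discrete IN list sizes; there's no rule of thumb, so measure
    int[] sizes = { 2, 3, 5, 8, 13 };
    List<Integer> padded = new ArrayList<>(ids);

    for (int size : sizes) {
        if (padded.size() <= size) {
            // Repeat the last ID (assumes a non-empty input list)
            while (padded.size() < size)
                padded.add(padded.get(padded.size() - 1));
            break;
        }
    }
    return padded;
}

// padToDiscreteSize(Arrays.asList(1, 2, 3, 4)) yields [1, 2, 3, 4, 4]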

Note: You may use NULL to fill up the IN lists of an IN predicate, but not of a NOT IN predicate. To learn more about this, read this blog post about NULL and NOT IN predicates.

TL;DR: Get back in control of your SQL

As soon as a decent amount of data is involved with your data processing, common ORM models may not be sufficient anymore, as it is very hard to tune such ORMs. You may need to resort to SQL and explicitly express your SQL statements in the most optimal way for your problem domain.

Using SQL Injection Vulnerabilities to Dump Your Database

The threat caused by SQL injection is heavily underestimated even by many senior developers and software architects. Most people are unaware of the fact that an entire server can be at risk by a single vulnerability even in the remotest piece of logic. This article will give a frightening insight into the potential severity of SQL injection vulnerabilities.

What is SQL injection?

The Wikipedia article on SQL injection reads:

SQL injection is a code injection technique, used to attack data driven applications, in which malicious SQL statements are inserted into an entry field for execution.

In other words, if a website or some other software entity has a vulnerability, it is possible for an attacker to “inject” arbitrary pieces of SQL code for execution on the server. A straightforward and popular example of such a vulnerability is when a dynamic SQL statement directly embeds user input, as the following Java code does:

[Screenshot 01-sqli: Java code concatenating user input into a SQL query string]
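Since the screenshot isn’t reproduced here, the vulnerable pattern can be sketched roughly as follows (a hypothetical reconstruction; identifiers are illustrative):

import java.sql.*;

static void findFilms(Connection connection, String title)
throws SQLException {
    // DANGER: user input is concatenated directly into the SQL string
    try (Statement stmt = connection.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT id, title FROM film "
           + "WHERE title LIKE '%" + title + "%'")) {
        while (rs.next())
            System.out.println(rs.getString("title"));
    }
}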

The title variable could be user input, e.g. an HTML field. Nothing prevents this user input from being a SQL code snippet that would make perfect sense syntactically. For example:

[Screenshot 02-sqli: a malicious value for the title input]
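A hypothetical example of such a “title” input:

' OR 1 = 1 OR title = '

Substituted into the statement above, the SQL becomes:

SELECT id, title FROM film WHERE title LIKE '%' OR 1 = 1 OR title = '%'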

It is easy to see that when substituted into the previous SQL statement, this “title” will result in always selecting all films. Another, famous example of how this can go wrong is xkcd’s famous Little Bobby Tables strip.

Automating the search for vulnerabilities

The above introduction shows that SQL injection is possible and quite simple to exploit manually, if the server-side source code is well known. Finding such a vulnerability in a huge application with thousands of SQL statements, however, is a lot of work. Besides, there are much subtler vulnerabilities than those a simple pattern-matching search would find. sqlmap is an Open Source, GPLv2 licensed tool for automating such searches.

Let’s assume we have a simple RESTful application written in Java using Jetty and JAX-RS querying the MySQL Sakila example database, which exposes the above vulnerability at this URL:

http://localhost:8080/sql-injection-examples/vulnerable-services/films/alien

alien is a @PathParam value passed to this method, which returns all films with titles matching %alien%:

[Screenshot 03-sqli: the vulnerable JAX-RS service method]

Now, let’s download sqlmap and let it run against the above URL:

python sqlmap.py 
  -u localhost:8080/sql-injection-examples/vulnerable-services/films/

Note that sqlmap is implemented in Python 2.x. When we let this run, my server-side SQL logs show me that there are a couple of interesting SQL queries being executed:

[Screenshot 04-sqli: server-side SQL log excerpt]

sqlmap tries to inject all sorts of snippets that would help it discover if the vulnerable query is deterministic, whether the URL is stable, what database server type this is, if the vulnerability is inside a subquery, whether UNION clauses can be appended, etc. This is also nicely displayed in the sqlmap stdout log output:

[Screenshot 05-sqli: sqlmap stdout log output]

With a total of 59 HTTP requests (among which 41 resulted in HTTP 500 errors), sqlmap was capable of detecting the nature of the vulnerability of my SQL statement, and it also figured out the database server and version.

Dumping data

This is not so impressive yet. With a lot of SQL knowledge and creativity, I might have figured this out myself. But sqlmap offers a lot of other interesting operation modes, as can be seen in the user guide. Let us use the database navigation tools. For this, we’ll add the --dbs parameter to the same URL:

python sqlmap.py 
  -u localhost:8080/sql-injection-examples/vulnerable-services/films/ 
  --dbs

This will immediately dump the following databases:

[Screenshot 06-sqli: the list of databases discovered by sqlmap]

Note that sqlmap logs everything it learns in a local SQLite database, such that it can keep the number of HTTP requests as small as possible. This is important for an attacker, as frequent HTTP 404 or 500 statuses will eventually attract the attention of maintenance personnel on the server side.

But how did it find the actual database names through my vulnerable SQL statement? In three steps:

  1. It sent a first statement to check if the vulnerability still exists
  2. It sent a second statement to see how many databases there are (8)
  3. It sent a third statement to learn the name of each database

Let’s look at step 2. The query executed is this one:

[Screenshot 07-sqli: the injected query counting the databases]

Because sqlmap previously discovered that my naïve server implementation just dumps stack traces of HTTP 500 errors, it knew that it could generate the following MySQL error message and convey the information of interest hidden inside that message:

com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: 
  Duplicate entry 'qnfwq8qdmiq1' for key 'group_key'

The SQL concat function wraps the number 8 with two uniquely identifiable strings. With a similar query and the MySQL LIMIT clause, database names can then be extracted one by one, e.g. the Sakila database:

com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: 
  Duplicate entry 'qnfwqsakilaqdmiq1' for key 'group_key'

Again, the information of interest is wrapped between delimiters.
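Pieced together from these error messages, the injected expression presumably resembled the classic MySQL error-based extraction pattern (a hedged reconstruction):

-- The inner subquery conveys the information of interest (here: the
-- number of schemata). FLOOR(RAND(0) * 2) deliberately produces
-- duplicate GROUP BY keys, so MySQL raises the "Duplicate entry"
-- error seen above, with the value wrapped between the delimiters.
SELECT COUNT(*), CONCAT(
  'qnfwq',
  (SELECT COUNT(*) FROM information_schema.schemata),
  'qdmiq',
  FLOOR(RAND(0) * 2)
) x
FROM information_schema.tables
GROUP BY x

-- Replacing the inner subquery with, e.g.
-- (SELECT schema_name FROM information_schema.schemata LIMIT 0, 1)
-- extracts the database names one by one.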

Dumping data without stack traces

Obviously, dumping stack traces in a production system is a really bad idea, as you should never give an intruder any hints about how your system works. Let’s change our application such that it displays a simple 500 Internal Server Error message:

[Screenshot 08-sqli: the modified error handling returning a plain 500 page]

No problem for sqlmap. When I rerun my previous database dumping command, sqlmap generates all database names letter by letter. Now that it has no way of producing error message output in HTTP responses, it can only perform a binary search on each letter of a schema name and check whether any given search still produces the regular list of Sakila films or an empty list:

[Screenshot 09-sqli: the injected binary-search query]

The above query checks whether the first schema (LIMIT 0, 1) has a letter with an ASCII code higher than 120 (ORD) at position 13 (MID). This is a lot slower and more visible in server logs than before, but it can still produce the result.
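A hedged reconstruction of such a binary-search predicate, injected into the vulnerable films query:

SELECT id, title FROM film
WHERE title LIKE '%alien%'
AND ORD(MID(
  (SELECT schema_name FROM information_schema.schemata LIMIT 0, 1),
  13, 1)) > 120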

Dumping data without displaying UI output

What if the vulnerability is deep down in our application, never producing UI output? Let’s change our service again, executing the vulnerable query but without returning any data:

[Screenshot 10-sqli: the modified service executing the query without returning data]

My log file shows me that there are many more SQL statements executed by sqlmap, and even confirming that there is a vulnerability at the given URL takes longer, as sqlmap now resorts to binary searches using delays. Here’s how sqlmap confirms that there is a vulnerability:

[Screenshot 11-sqli: the delay-based queries confirming the vulnerability]

As can be seen, the above queries run for N or N + 5 seconds, where N is the amount of time the actual query takes. If N is sufficiently stable, delay-based binary searches can be performed to discover database names letter by letter.
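Such a delay-based probe might look like the following hedged MySQL sketch; the result set is irrelevant, as only the response time carries the signal:

SELECT id, title FROM film
WHERE title LIKE '%alien%'
AND IF(
  ORD(MID(
    (SELECT schema_name FROM information_schema.schemata LIMIT 0, 1),
    13, 1)) > 120,
  SLEEP(5),  -- condition true: the response is delayed by ~5 seconds
  0)         -- condition false: the response returns quickly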

Dumping sensitive data

The previous sections have shown that a vulnerability is almost always exploitable by automatic means. Intruders have time. They can run their script over a week to have sqlmap dump your complete database schema. But the threat doesn’t end there. Once table and column names are accessible, all of those tables can be dumped, too. Let’s dump the film table:

python sqlmap.py 
  -u localhost:8080/sql-injection-examples/vulnerable-services/films/ 
  --dump -T film

This takes a very long time, as each row and each column is fetched through at least one query. But eventually, I get a nice CSV file of the complete contents. Needless to say, this could be your application’s credit card information, or other sensitive or classified data.

Arbitrary querying

You would think you’d notice if someone dumps your entire schema? That’s not necessary for the intruder. Remember, they have already been able to extract schema, table and column names. They can now selectively execute arbitrary queries as such:

python sqlmap.py 
  -u localhost:8080/sql-injection-examples/vulnerable-services/films/ 
  --sql-query="SELECT a.first_name, a.last_name, count(*) 
               FROM film f 
               JOIN film_actor fa USING (film_id) 
               JOIN actor a USING (actor_id) 
               GROUP BY actor_id, a.first_name, a.last_name 
               ORDER BY count(*) DESC LIMIT 1"

The above query will find the actor who has played in most films:

[Screenshot 12-sqli: the result of the injected query]

Not all queries are syntactically injectable into all vulnerabilities on all RDBMS. But the Little Bobby Tables example is not what most intruders are after anyway: causing damage to your system. Most intruders do not want to be seen, so that they can silently steal your data.

Other sqlmap features

sqlmap doesn’t stop there. sqlmap allows you to switch database users, figuring out root or DBA passwords by brute force. Once you gain access to a user or role with higher grants and depending on the actual RDBMS, sqlmap will also allow you to:

  • Upload and execute a SQL script
  • Inject and execute custom user-defined functions
  • Read or write files on the database server file system
  • Open up a shell on the database server operating system
  • Manipulate the database server’s Windows registry

On a less serious note, sqlmap can be an excellent database server administration tool, should you have forgotten the credentials of your own local database development environment!

Do not leave SQL injection to chance.

Counteractions

In principle, it is almost impossible to completely prevent vulnerabilities involving SQL. SQL is a dynamically interpreted language and as such, it is much more vulnerable to code injection than static languages like Java. In particular, it is extremely vulnerable because it is common practice to construct dynamic SQL from user input, such as search criteria. Obviously, other server-side interpreted languages are equally vulnerable, but SQL also happens to be one of the most popular ones.

However, taking the right measures will make it very hard for an intruder to actually find a vulnerability and exploit it without you noticing their activities first. Remember: the less information you display, the longer sqlmap takes, and the more traces it leaves in your server logs, just to verify that a vulnerability is present.

Here are a couple of measures that you should follow and enforce in your team:

Never trust user input

First and most importantly: Never ever trust user input. Even if your application is only used by 10 users from the same company via intranet, your database might contain salaries or other sensitive data, and a malicious employee can dump such data.

Note that user input should not be trusted in other scenarios either, e.g. when accessing the file system.

Use as little user input as possible

Not only should you distrust user input, you should also use as little user input as possible directly in your SQL. For example, if users can make a choice in an HTML dropdown, then instead of simply embedding the HTTP POST parameter value into your SQL statement, parse it first and encapsulate it in an appropriate pre-defined type. In Java, a good match for HTML options could be an enum type.
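A hypothetical sketch of this idea (enum constants are illustrative):

// Only declared dropdown values survive parsing; everything else throws
enum FilmRating {
    G, PG, PG13, R, NC17;

    static FilmRating parse(String userInput) {
        // valueOf() throws IllegalArgumentException for anything that
        // isn't a declared constant, so arbitrary input never reaches SQL
        return FilmRating.valueOf(userInput.trim().toUpperCase());
    }
}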

You know your data best, and thus, you should validate all user input on the server, immediately after receiving it.

Use bind variables

Try to express your SQL statements as statically as possible. SQL string concatenation makes it very easy for junior developers to make mistakes. If you are an engineering team leader, it is your responsibility to help your team members avoid such mistakes.

The simplest way to prevent SQL injection is by using bind variables. JDBC drivers (if you’re operating with Java) and databases have very few bugs in that area, such that streaming bind variables to the database will not generate any easily exploitable vulnerability.
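For instance, the introductory example rewritten with a bind variable (a minimal sketch; identifiers are illustrative):

import java.sql.*;

static void findFilms(Connection connection, String title)
throws SQLException {
    try (PreparedStatement stmt = connection.prepareStatement(
            "SELECT id, title FROM film WHERE title LIKE ?")) {
        // The user input is transferred as pure data,
        // never interpreted as SQL
        stmt.setString(1, "%" + title + "%");

        try (ResultSet rs = stmt.executeQuery()) {
            while (rs.next())
                System.out.println(rs.getString("title"));
        }
    }
}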

Use static code analysis tools

If you are using Java and JDBC directly, you may be able to detect some vulnerabilities using static code analysis tools, such as FindBugs™ or Alvor. Obviously, such tools can only detect SQL injection when SQL statements are “relatively static”. If you distribute SQL criteria string generation over several code modules, it is not possible for a static code analysis tool to detect vulnerabilities.

Use sqlmap

sqlmap is not necessarily a tool for malicious activities. It is free and Open Source, available under the GPLv2 license. You can use it in your continuous integration test suites to detect SQL injection regressions. As you know your own software very well, you can configure sqlmap with a variety of parameters that will give it a “head start” as it will not need to figure out whether you are operating on MySQL, PostgreSQL, Oracle, etc.

Apply SQL abstraction

In addition to a very low-level API (JDBC in Java, native functions in PHP, etc.), most platforms also have a variety of higher-level APIs that abstract the SQL language in a non-string-based way. However, don’t be tricked into thinking that SQL language abstraction itself keeps you safe from code injection. Popular frameworks like Hibernate or JPA use string-based query languages like HQL or JPQL. These languages are less vulnerable than SQL because they’re less expressive. But they are still vulnerable!

In .NET, an example of a non-string-based SQL abstraction is LINQ-to-SQL. In Java, examples are SQLJ, the JPA CriteriaQuery or jOOQ. Our previous example query would translate to these:

[Screenshot: static SQL with SQLJ]
[Screenshot: statically typed LINQ-to-SQL]
[Screenshot: statically typed JPQL with the JPA Criteria Query API]
[Screenshot: statically typed SQL with jOOQ]

In addition to being much safer from SQL injection through enforcing the use of bind variables, statically typed internal domain-specific languages also help prevent syntax mistakes. In the case of jOOQ, this is further supported by jOOQ’s source code generator, which will generate Java literals for tables and columns.

Carefully craft database GRANTs and security policies

Don’t just blindly use the MySQL root user with no sensible password for everything (or the postgres user in PostgreSQL, or the dbo user in SQL Server, etc.). Create a well-designed security strategy, where users are granted access only to those elements that they really need to access.

It is often also a good idea to add an additional layer of security indirection inside your database through database views. For instance, web users are granted access to a few read-only views, never to tables directly. This will allow you to dynamically restrict who can manipulate what data.

This measure will not prevent SQL injection, but it will minimize the possible damage an intruder can cause, once they have penetrated your system.

Use firewalls

If you’re developing a web application, you can choose from a variety of firewalls that have some SQL injection protection features. Apart from simple regular-expression based pattern matching (which is not really reliable or useful), such an entry server should also support two very powerful features to help you prevent mistakes when accepting user input through HTTP GET and POST parameters:

  • URL encryption: All application URLs are encrypted and never disclosed to the client. It is thus impossible to tamper with GET parameters. It is even impossible to recognise GET parameters. The disadvantage of this feature is the fact that it can be quite difficult to implement a modern JavaScript-based rich internet application because… you cannot modify GET parameters.
  • Form Protection: All acceptable values that can be chosen in a form are known to the entry server and validated using an encrypted hash. This makes it impossible to tamper with HTML <select>, <input type="hidden">, <input type="checkbox">, <input type="radio"> values. Obviously, “threatening” user input can still originate from regular <input type="text"> or <textarea> values.

An example of an entry server implementing the above is Airlock by a Swiss company called Ergon Informatik AG.

Other measures

A lot of research has been done on the topic of countermeasures against SQL injection. Interesting publications to read are:

Conclusion

Even with recent alternative data storage and access models that have come to be known as NoSQL, SQL as a querying and database interfacing language is still hardly challenged. Major SQL vendors implement better and better features into the SQL standard. SQL is here to stay.

Yet, many development teams are unaware of the magnitude of this threat. In millions of lines of application code, a single SQL injection vulnerability, detected by a malicious entity operating with automated tools such as sqlmap, can be sufficient. Such an entity may be able to extract all data from your database, and / or execute malicious code on your servers. This can go as far as seizing control of the server.

It is a software architect’s or technical lead’s responsibility to minimise the risk of creating SQL injection vulnerabilities, as even skilled developers may accidentally create such a vulnerability. The simplest and most effective measure is to use bind variables and static SQL wherever possible. More elaborate countermeasures can be achieved by lots of training, screening, testing, and using entry servers.