Using SQL to Calculate the Popularity (on Stack Overflow) of Derby, H2, and HSQLDB

Few people know about this very very awesome feature of the Stack Exchange platform. The Stack Exchange Data Explorer

se-data-explorer

To be found here:
http://data.stackexchange.com

As you may know, much of the Stack Exchange platform runs on SQL Server (interesting architecture details here: http://stackexchange.com/performance), and the team has had the courtesy of making a lot of data publicly available through a SQL web API. Here’s the schmema that you can query:

se-data-explorer-schema

Using a running total to calculate cumulative daily questions per tag

The amount of analytics possibilities with such a public schema are infinite. Today, we’ll look into a question that has been interesting for a lot of users in the past: What is the most “popular” Java in-memory database among Derby (also known as Java DB, which ships with the JDK), the popular test database H2 (see also our interview with Thomas Müller, its creator), or HSQLDB.

What we’d like to do is sum up the number of questions per database, up to any given date. This should give us one of those nice exponential curves that managers like so much.

Here’s the SQL query that we’ll run:

SELECT 
  d,
  SUM(h2)     OVER (ORDER BY d) AS h2,
  SUM(hsqldb) OVER (ORDER BY d) AS hsqldb,
  SUM(derby)  OVER (ORDER BY d) AS derby
FROM (
  SELECT 
    CAST(CreationDate AS DATE) AS d, 
    COUNT(CASE WHEN Tags LIKE '%<h2>%'     THEN 1 END) AS h2,
    COUNT(CASE WHEN Tags LIKE '%<hsqldb>%' THEN 1 END) AS hsqldb,
    COUNT(CASE WHEN Tags LIKE '%<derby>%'  THEN 1 END) AS derby
  FROM Posts
  GROUP BY CAST(CreationDate AS DATE)
) AS DailyPosts
ORDER BY d ASC

A short explanation:

The nested select "DailyPosts" creates a PIVOT table with the aggregated number of questions per database and date. We could have used the SQL Server PIVOT clause, if the Stack Exchange platform had stored tagging information in a normalised form, but the equivalent COUNT(CASE) expressions work just as nicely (see also our article about PostgreSQL’s aggregation FILTER clause for more inspiration).

Now, that we have the number of posts per tag and day, all that’s left to do is sum up those numbers from the first day to any given day. That is often also called a “running total”, which can be calculated very easily using the SUM() OVER() window function.

Now we’re done. You can run and play around with this query here:
http://data.stackexchange.com/stackoverflow/query/469392/java-in-memory-database-popularity-by-time

The raw result is not very interesting. It’s a lot of numbers and dates. But if we plot that result in a graph / chart, we’re getting this nice-looking curve here:

se-data-explorer-derby-h2-hsqldb

As we can see, all three databases are roughly equivalent in terms of “popularity”, although H2 seems to be catching up momentum while HSQLDB is on a slight decline.

(Obviously, this “popularity” is not representative of true market share. More questions might just mean that people struggle more with the technology, or – less skilled people are using it).

Have fun further exploring the Stack Exchange Data Explorer:
http://data.stackexchange.com

(all Stack Exchange Subscriber Content is licensed under the terms of the CC BY-SA 3.0 license. For more details, see: http://stackexchange.com/legal)

Further articles

Further articles that are interesting in the context of the displayed query:

jOOQ Tuesdays: Thomas Müller Unveils How HSQLDB Evolved into the Popular H2 Database

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

tom_grayscale

We have the pleasure of talking to Thomas Müller in this fifth edition who will be telling us about the exciting history of Java’s most popular embedded database H2

Hi Thomas – Your H2 database is virtually everywhere. It has become many Java developers’ favourite integration test database. How did it grow so popular?

I guess because it’s easy to use, relatively fast, and, up to some point, compatible with popular databases.

I understand that you have a common, personal history with HSQLDB, previously also known as Hypersonic SQL. How and why did H2 evolve from HSQLDB?

Back in 1998, I wanted to learn Java. For fun, I implemented the relatively new skip list algorithm, added a SQL interface, and published it as open source. I got feedback from people who thought it’s useful, so I continued and gave it the name Hypersonic SQL. In 2000 I got a job offer from a Silicon Valley startup, PointBase Inc. The plan was to continue with Hypersonic SQL, and keep it open source. But after I started, the company decided it’s better if I stop. This was a surprise to me. I told them I can’t prevent others from continuing. And so Fred Toussi took the code and started HSQLDB with it. At around 2005, PointBase ran out of money, I wanted to go back to HSQLDB. But I felt more radical changes were needed, and it would be better to start a new project instead, which was then H2.

Very interesting historic facts! … and when will you replace Derby / JavaDB in the JDK? :-)

Hopefully never! If H2 is integrated in the JDK, Oracle would put constraints to the future of H2. I don’t want that. I want to keep H2 independent.

You’ve been in the database industry for a while. What is your most interesting anecdote that you’d like to share?

When Oracle bought MySQL, I was very surprised I got mail from the European Union with a large questionary about the merger, how it will affect the industry and competition. I don’t know how they found my work address, it is not on the web site, and they never asked me by email. And Switzerland is not even part of the European Union. H2 is a very small fish in the “database pond”, but it seems H2 does matter.

It’s a small world or small pond, I guess!

You’re one of the few developers I know who is working both on a SQL database (H2) and on a NoSQL database (JackRabbit, Adobe CRX). Tell us a little bit about how those databases compare.

They are quite different. H2 is a relational database with the traditional SQL and JDBC API, and Jackrabbit is a mix between a file system and a database, with a very different API, and a hierarchical data model. The query language is different as well: for Jackrabbit, XPath is more commonly used, even thought SQL is available as well. Both the relational and the hierarchical models have advantages and disadvantages. The hierarchical model more easily supports semi-structured, JSON style data. The relational model, on the other hand, is more “mature”, and there is more competition.

At Adobe / JackRabbit, you’re heavily involved with implementing storage algorithms in Java. Is the JVM even good at implementing low-level storage stuff?

Yes! The real advantage of Java is that a programmer can concentrate at the algorithms, and doesn’t have to spend so much time with memory management and low level stuff. That way, there is more time to improve the algorithms. And in relational databases, the most important aspect is using the best possible algorithms, for example to reduce I/O. Even for very low level CPU intensive stuff like data compression and encryption, things like concurrency and cache efficiency nowadays are more important than whether to use Java or C.

That’s an interesting thought along the lines of avoiding premature optimisation!

One last question: What problems are you working on right now?

Optimizing the database for solid state disks (SSDs) and new file system. Almost all relational databases still use algorithms optimized for rotating disks, where overwriting small blocks (for example 4 KB) was the way to go. With SSDs and Btrfs this doesn’t work well. H2 version 1.4.x (beta) already uses a new storage subsystem (MVStore) that should work well, however there is still some work needed before it is ready for production.

Other than database stuff, I’m interested in various programming topics. If I implement something that might be useful, I publish it as open source within my H2 database project, in the “tools” directory, until it is used in the database or moved to another project. I wrote a archiving utility (like zip, gzip) called “ArchiveTool” that combines de-duplication with regular compression. It is fast but compresses large directories (source code, databases) very well. As part of my work on the new storage subsystem, I came across minimal perfect hash tables. I invented a new algorithm that needs less space than all known ones (“MinimalPerfectHash”). I would like to publish a paper about that. There are plenty of interesting problems to solve.

Counting Distinct Records in SQL

The SQL language and its depths… Some of you readers might be aware of MySQL’s capability of counting distinct records through the COUNT() aggregate function. The MySQL documentation reads:

COUNT(DISTINCT expr,[expr...])

Returns a count of the number of rows with different non-NULL expr values.

In other words, you can count distinct first and last names very easily:

SELECT COUNT(DISTINCT FIRST_NAME, LAST_NAME)
FROM CUSTOMERS

That’s quite useful, but MySQL-specific (although HSQLDB also supports this particular MySQL syntax). Most other databases, however, do not offer such a feature, even if the SQL-99 standard has specified it long ago:

6.16  <set function specification>

<set function specification> ::=
    COUNT <left paren> <asterisk> <right paren>
  | <general set function>

<general set function> ::=
    <set function type>
        <left paren> [ <set quantifier> ] 
            <value expression> <right paren>

<set function type> ::=
    <computational operation>

<computational operation> ::=
    AVG | MAX | MIN | SUM
  | EVERY | ANY | SOME
  | COUNT

<set quantifier> ::= DISTINCT | ALL

The above was later on merged into 10.9 <aggregate function>, in SQL:2011, with lots of other aggregate function types and features. Now, let’s have a look at 6.23 <value expression>:

6.23 <value expression>

<value expression> ::=
<numeric value expression>
| <string value expression>
| <datetime value expression>
| <interval value expression>
| <boolean value expression>
| <user-defined type value expression>
| <row value expression>                <-- RVE!
| <reference value expression>
| <collection value expression>

Interestingly, you can put a row value expression in some of your aggregate functions. There are additional restrictions, as you cannot SUM() or AVG() records, for instance. But with COUNT() and COUNT(DISTINCT ...) this makes perfect sense. So, according to the SQL standard (and according to HSQLDB’s alternative, and PostgreSQL’s main implementation, which unfortunately isn’t really documented), the following would be the correct way to count distinct first and last names in SQL:

SELECT COUNT(DISTINCT (FIRST_NAME, LAST_NAME))
FROM CUSTOMERS

… which makes perfect sense. Watch out when reading this. This isn’t a DISTINCT() function! This is the DISTINCT keyword applied to a row value expression. See some previous blog posts about row value expressions here:

jOOQ standardises various SQL dialects and hides the above behind the DSL.countDistinct() method.

When you’re not using HSQLDB, MySQL, or PostgreSQL

In other databases, you might resort to some manual SQL transformation. Our example is easy to transform:

SELECT COUNT(*)
FROM (
  SELECT DISTINCT FIRST_NAME, LAST_NAME
  FROM CUSTOMERS
) t

But such query transformation can prove to be tricky, once you have GROUP BY clauses or other aggregate functions involved. Future versions of jOOQ might do that sort of transformation for you.

How to use H2 with jOOQ

Living through the latest developments of the H2 and HSQLDB databases has been very exciting for me as jOOQ developer. Those two databases have a common heritage and share a lot of functionality. Both are progressing at a high pace with a very active community around their respective lead developers. I have recently posted an article about how those databases treat variable binding in a similar manner, one with respect to the other. Both infer a lot of types at compile time, but only few at bind / execution time:

https://lukaseder.wordpress.com/2011/08/31/rdbms-bind-variable-casting-madness

While these details may seem like a minor lack of functionality compared to the big ones (mainly DB2, Oracle, Postgres, SQL Server, Sybase) the two databases really make it up in terms of speed and flexibility. Both H2 and HSQLDB mimick a big set of functions, syntax clauses and other specialties from the “big” databases, which means they can easily be used as test database on integration tests systems or in development environments. This is mainly true for mimicking

  • MySQL
  • Ingres
  • Oracle
  • Postgres

A bit less though, for

  • DB2 (Type system is probably too strong to mimick)
  • SQL Server (T-SQL is a bit different from SQL-92)
  • Sybase SQL Anywhere (T-SQL again…)

When running integration tests for jOOQ, I really like the fact that H2 and HSQLDB are embeddable and high-performing Java databases as well. In the near future, I want to roll out a fully-fledged ready-to-run integration combining any of these tools:

  • Play ! Framework, Wicket, Vaadin as the GUI layer
  • jOOQ as the intermediary
  • H2 / HSQLDB as the data layer

Until then, I’m proud to see the H2 tutorial section about how to use jOOQ with H2:

http://www.h2database.com/html/tutorial.html#using_jooq