Few people know about this very very awesome feature of the Stack Exchange platform. The Stack Exchange Data Explorer
To be found here:
As you may know, much of the Stack Exchange platform runs on SQL Server (interesting architecture details here: http://stackexchange.com/performance), and the team has had the courtesy of making a lot of data publicly available through a SQL web API. Here’s the schmema that you can query:
Using a running total to calculate cumulative daily questions per tag
SUM(h2) OVER (ORDER BY d) AS h2,
SUM(hsqldb) OVER (ORDER BY d) AS hsqldb,
SUM(derby) OVER (ORDER BY d) AS derby
CAST(CreationDate AS DATE) AS d,
COUNT(CASE WHEN Tags LIKE '%<h2>%' THEN 1 END) AS h2,
COUNT(CASE WHEN Tags LIKE '%<hsqldb>%' THEN 1 END) AS hsqldb,
COUNT(CASE WHEN Tags LIKE '%<derby>%' THEN 1 END) AS derby
GROUP BY CAST(CreationDate AS DATE)
) AS DailyPosts
ORDER BY d ASC
Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.
Hi Thomas – Your H2 database is virtually everywhere. It has become many Java developers’ favourite integration test database. How did it grow so popular?
I guess because it’s easy to use, relatively fast, and, up to some point, compatible with popular databases.
I understand that you have a common, personal history with HSQLDB, previously also known as Hypersonic SQL. How and why did H2 evolve from HSQLDB?
Back in 1998, I wanted to learn Java. For fun, I implemented the relatively new skip list algorithm, added a SQL interface, and published it as open source. I got feedback from people who thought it’s useful, so I continued and gave it the name Hypersonic SQL. In 2000 I got a job offer from a Silicon Valley startup, PointBase Inc. The plan was to continue with Hypersonic SQL, and keep it open source. But after I started, the company decided it’s better if I stop. This was a surprise to me. I told them I can’t prevent others from continuing. And so Fred Toussi took the code and started HSQLDB with it. At around 2005, PointBase ran out of money, I wanted to go back to HSQLDB. But I felt more radical changes were needed, and it would be better to start a new project instead, which was then H2.
Very interesting historic facts! … and when will you replace Derby / JavaDB in the JDK? :-)
Hopefully never! If H2 is integrated in the JDK, Oracle would put constraints to the future of H2. I don’t want that. I want to keep H2 independent.
You’ve been in the database industry for a while. What is your most interesting anecdote that you’d like to share?
When Oracle bought MySQL, I was very surprised I got mail from the European Union with a large questionary about the merger, how it will affect the industry and competition. I don’t know how they found my work address, it is not on the web site, and they never asked me by email. And Switzerland is not even part of the European Union. H2 is a very small fish in the “database pond”, but it seems H2 does matter.
It’s a small world or small pond, I guess!
You’re one of the few developers I know who is working both on a SQL database (H2) and on a NoSQL database (JackRabbit, Adobe CRX). Tell us a little bit about how those databases compare.
They are quite different. H2 is a relational database with the traditional SQL and JDBC API, and Jackrabbit is a mix between a file system and a database, with a very different API, and a hierarchical data model. The query language is different as well: for Jackrabbit, XPath is more commonly used, even thought SQL is available as well. Both the relational and the hierarchical models have advantages and disadvantages. The hierarchical model more easily supports semi-structured, JSON style data. The relational model, on the other hand, is more “mature”, and there is more competition.
At Adobe / JackRabbit, you’re heavily involved with implementing storage algorithms in Java. Is the JVM even good at implementing low-level storage stuff?
Yes! The real advantage of Java is that a programmer can concentrate at the algorithms, and doesn’t have to spend so much time with memory management and low level stuff. That way, there is more time to improve the algorithms. And in relational databases, the most important aspect is using the best possible algorithms, for example to reduce I/O. Even for very low level CPU intensive stuff like data compression and encryption, things like concurrency and cache efficiency nowadays are more important than whether to use Java or C.
That’s an interesting thought along the lines of avoiding premature optimisation!
One last question: What problems are you working on right now?
Optimizing the database for solid state disks (SSDs) and new file system. Almost all relational databases still use algorithms optimized for rotating disks, where overwriting small blocks (for example 4 KB) was the way to go. With SSDs and Btrfs this doesn’t work well. H2 version 1.4.x (beta) already uses a new storage subsystem (MVStore) that should work well, however there is still some work needed before it is ready for production.
Other than database stuff, I’m interested in various programming topics. If I implement something that might be useful, I publish it as open source within my H2 database project, in the “tools” directory, until it is used in the database or moved to another project. I wrote a archiving utility (like zip, gzip) called “ArchiveTool” that combines de-duplication with regular compression. It is fast but compresses large directories (source code, databases) very well. As part of my work on the new storage subsystem, I came across minimal perfect hash tables. I invented a new algorithm that needs less space than all known ones (“MinimalPerfectHash”). I would like to publish a paper about that. There are plenty of interesting problems to solve.
6.16 <set function specification>
<set function specification> ::=
COUNT <left paren> <asterisk> <right paren>
| <general set function>
<general set function> ::=
<set function type>
<left paren> [ <set quantifier> ]
<value expression> <right paren>
<set function type> ::=
<computational operation> ::=
AVG | MAX | MIN | SUM
| EVERY | ANY | SOME
<set quantifier> ::= DISTINCT | ALL
The above was later on merged into 10.9 <aggregate function>, in SQL:2011, with lots of other aggregate function types and features. Now, let’s have a look at 6.23 <value expression>:
6.23 <value expression>
<value expression> ::=
<numeric value expression>
| <string value expression>
| <datetime value expression>
| <interval value expression>
| <boolean value expression>
| <user-defined type value expression>
| <row value expression> <-- RVE!
| <reference value expression>
| <collection value expression>
Interestingly, you can put a row value expression in some of your aggregate functions. There are additional restrictions, as you cannot SUM() or AVG() records, for instance. But with COUNT() and COUNT(DISTINCT ...) this makes perfect sense. So, according to the SQL standard (and according to HSQLDB’s alternative, and PostgreSQL’s main implementation, which unfortunately isn’t really documented), the following would be the correct way to count distinct first and last names in SQL:
SELECT COUNT(DISTINCT (FIRST_NAME, LAST_NAME))
… which makes perfect sense. Watch out when reading this. This isn’t a DISTINCT() function! This is the DISTINCT keyword applied to a row value expression. See some previous blog posts about row value expressions here:
Living through the latest developments of the H2 and HSQLDB databases has been very exciting for me as jOOQ developer. Those two databases have a common heritage and share a lot of functionality. Both are progressing at a high pace with a very active community around their respective lead developers. I have recently posted an article about how those databases treat variable binding in a similar manner, one with respect to the other. Both infer a lot of types at compile time, but only few at bind / execution time:
While these details may seem like a minor lack of functionality compared to the big ones (mainly DB2, Oracle, Postgres, SQL Server, Sybase) the two databases really make it up in terms of speed and flexibility. Both H2 and HSQLDB mimick a big set of functions, syntax clauses and other specialties from the “big” databases, which means they can easily be used as test database on integration tests systems or in development environments. This is mainly true for mimicking
A bit less though, for
DB2 (Type system is probably too strong to mimick)
SQL Server (T-SQL is a bit different from SQL-92)
Sybase SQL Anywhere (T-SQL again…)
When running integration tests for jOOQ, I really like the fact that H2 and HSQLDB are embeddable and high-performing Java databases as well. In the near future, I want to roll out a fully-fledged ready-to-run integration combining any of these tools:
Play ! Framework, Wicket, Vaadin as the GUI layer
jOOQ as the intermediary
H2 / HSQLDB as the data layer
Until then, I’m proud to see the H2 tutorial section about how to use jOOQ with H2: