jOOQ Tuesdays: Richard North Makes Database Testing More Reproducible with Testcontainers

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

I’m very excited to feature today Richard North, creator of Testcontainers, a promising new database and UI testing tool for reproducible tests.

Hi Richard – you work at Skyscanner, which means you have tons of travel data to work with. What’s the most exciting thing about working with this data?

Skyscanner is a really data-led company, using data at all levels in decision-making. The volume of data we gather and process really helps us understand what travellers want and serve them better. For example, our destination recommender system helps people discover new and interesting places to go, based upon a vast amount of data, algorithms and experiments. But it’s interesting how this varies from recommendations in internet media companies – there are far fewer possible destinations than books, songs and movies, yet users’ reasons for travelling and tastes can be more nuanced and varied.

There’s yet more data-oriented work under the surface, too – for example, the infrastructure needed to gather and analyse such large amounts of data, and the running of experiments to help us improve. There are a lot of smart people working hard to make this all happen, and it’s an exciting place to be!

You’ve created an increasingly popular testing framework, Testcontainers. What made you do it? What itch does it scratch?

Well, I think it’s something that scratches several itches at the same time – things that previously we only had isolated solutions to. The common element, though, is reproducibility of test environments. Of all my time developing and writing tests for JVM-based systems, it’s always been the non-JVM dependencies that caused the most complexity, unreliability and maintenance overhead.

I remember my first day as a developer, years ago: I was given a desktop machine and 2 days’ worth of step-by-step instructions that I needed to follow – just so that I’d be able to develop and run tests with all dependencies in place. A few months later I had to repeat the same task many times over when building new CI servers.

A lot has changed since then in terms of how we deploy and manage our production infrastructure, and thankfully Docker has done a lot to further bring prod-parity to developers’ machines.

Testcontainers started out as my effort to bring the full power of Docker to integrated testing on the JVM in two areas where I’ve experienced the most pain: testing against a clean, representative database, and making browser-based Selenium testing more reproducible, both for developers and on CI.

Since I’m mostly curious about testing databases: your documentation mentions Testcontainers as an alternative to using H2 as a test database. What are the disadvantages of emulating a database, e.g. with H2? Do you have any personal experience with that?

Yes, definitely – it was one of the tipping point factors that triggered me to create Testcontainers. I do think H2 is a fantastic piece of work in what it manages to deliver, and it’s something I’ve used on a number of projects to good effect.

However, compatibility with real databases has often been a sticking point. Back in 2015, before I started Testcontainers, we were struggling with a few MySQL features that didn’t have equivalents in H2. We were facing the unpleasant prospect of having to constrain our implementation to what H2 would allow us to test against. It became fairly obvious that there was a gap in the market for an H2-like tool that was actually a facade to a Docker-based database container – and Testcontainers was born.
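
For readers who haven’t seen it, this is roughly what that facade looks like in practice – a minimal sketch, assuming the Testcontainers JDBC module and a MySQL JDBC driver are on the classpath (the hostname and database name in the tc: URL are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class TestcontainersUrlSketch {
    public static void main(String[] args) throws Exception {
        // The jdbc:tc: prefix is intercepted by Testcontainers, which spins
        // up a disposable MySQL 5.7 container and routes the connection to it.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:tc:mysql:5.7://localhost/test");
             ResultSet rs = conn.createStatement()
                     .executeQuery("select version()")) {
            rs.next();
            System.out.println(rs.getString(1)); // a real MySQL version, not an emulation
        }
    }
}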

What do you think about mocking the database at the various layers – the DAO layer, the service layer, etc.?

I’m all in favour of keeping tests small, light and layered, and using mocks to accomplish this. This might sound strange coming from somebody who has developed an integrated testing tool, but it’s true!

Still, I feel that we need to be pragmatic about how we approach automated tests and how we make sure we’re testing the right thing – especially when crossing boundaries. Are we testing how this code behaves against reality, or are we testing against our own (potentially false) understanding of how external components work?

My feeling is that it’s quite straightforward to mock layers of your system that you yourself wrote, or where you can easily jump into the source code, a spec or documentation. With an external component, you can still produce a mock that behaves how you expect, or how you witness the real thing behaving. But does that mock continue to represent the real thing, especially after accretion of other features, or the additional perils of state that a database entails – schema changes and actual data?

My ideal is to mock the data access layer for consumption by higher layers, but to be quite careful about what the data access layer itself talks to in my tests. It should probably be a real database. Hopefully Testcontainers is one tool that helps make this particular thing a little less painful – so that when you find yourself needing to do this, there’s a way to do it easily.
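
As a hedged sketch of that setup – assuming JUnit 4, the Testcontainers MySQL module, and a hypothetical DAO under test – such a data access test might bootstrap a real database like this:

import java.sql.Connection;
import java.sql.DriverManager;

import org.junit.Rule;
import org.junit.Test;
import org.testcontainers.containers.MySQLContainer;

public class UserDaoIT {

    // Starts a throwaway MySQL container before each test method and tears
    // it down afterwards, so every run sees a clean, real database.
    @Rule
    public MySQLContainer mysql = new MySQLContainer();

    @Test
    public void daoWorksAgainstARealDatabase() throws Exception {
        try (Connection conn = DriverManager.getConnection(
                mysql.getJdbcUrl(), mysql.getUsername(), mysql.getPassword())) {
            // exercise the (hypothetical) data access layer here,
            // e.g. new UserDao(conn).findAll();
        }
    }
}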

What’s the biggest challenge you’ve faced when testing databases, or other things?

It’s not databases, but I’d say that by far the hardest testing challenge I’ve faced as a developer is mobile apps, especially iOS. I’ve always enjoyed mobile development as a whole, but when switching from a Java server-side/web project to mobile, it really feels like you’re going back in time. Some of the challenges are simply harder – such as asynchronicity, and platform APIs that make it difficult to structure software in a testable way. But it also feels like the tooling is much further behind, and until quite recently it received far less attention. I feel the net result has been that developers have been discouraged from investing in automated tests, which is sad given that we know how valuable they can be.

Things are getting better, but I do greatly prefer the testing aspects of working on server-side JVM projects. For all its difficulties, we are actually quite lucky!

jOOQ Tuesdays: Chris Saxon Explains the 3 Things Every Developer Should Know About SQL

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month (today, exceptionally on a Wednesday because of technical issues) where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

I’m very excited to feature today Chris Saxon who has worked with Oracle forever, and who is one of the brains behind the famous Ask Tom website.

Chris, you’re part of the famous Ask Tom team. Everyone working with Oracle has wound up on Ask Tom’s website at least once. You’ve answered an incredible amount of questions already. What’s it like to work for such a big community as Oracle’s?

It’s an amazing experience! My first real job was as a PL/SQL developer. My only knowledge of SQL was a couple of vaguely remembered lectures at university. Ask Tom was the main place I learned about SQL and Oracle Database. So it’s a huge honor to be on the other side, helping others get the best out of the technology.

The best part has to be the positive comments when you help someone solve a problem that’s been troubling them for days. That’s why we’re here. To help developers learn more about Oracle and improve their SQL skills. When you use the database to its full extent, you can write better, faster applications with less code!

What were your three most interesting questions, so far?

Any question that has a clear definition and a complete test case is interesting! 😉 Personally, what I enjoy most is using SQL to solve complex problems. So the first two do just that:

1. Finding the previous row in a different group

The poster had a series of transactions. These were grouped into two types. For each row, they wanted to show the id of the previous transaction from the other group.

At first this sounds like a problem you can solve using LAG or LEAD. But these only search for values within the same group. So you need a different method.

I provided a solution using the model clause. Using this, you can generate columns based on complex, spreadsheet-like formulas. Rows in your table are effectively cells in the sheet. You identify them by defining dimensions which can be other columns or expressions. By setting the transaction type as a dimension, you can then easily reference – and assign – values from one type to the other.

This worked well. But commenters were quick to provide solutions using window functions and 12c’s match_recognize clause – both of which were faster than my approach!

I like this because it shows the flexibility of SQL. And it shows the value of an engaged community. No one knows everything. By sharing our knowledge and working together we can all become better developers.

2. Improving SQL that deliberately generates a cartesian product

The poster had a set of abbreviations for words. For example, Saint could also be “St.” or “St”. They wanted to take text containing these words. Then generate all combinations of strings using these abbreviations.

The “obvious” solution the user had was to split the text into words. Then for each word, join the abbreviation table, replacing the string as needed. So for a five-word string, you have five joins.

There are a couple of problems with this method. The number of joins limits the number of words. So if you have a string with seven words, but only six table joins, you won’t abbreviate the final word.

The other issue is performance. Joining the same table N times increases the work you do. If you have long sentences and/or a large number of abbreviations, the query could take a long time to run.

To overcome these, you need to ask: “how can I join to the abbreviation table just once?”

The solution starts the same way as the original. Split the sentence into a table of words. Then join this to the abbreviations to give a row for each replacement needed.

You can then recursively walk down these rows using CTEs. These build up the sentence again, replacing words with their abbreviations as needed. A scalable solution that only needs a single pass of each table!

The final question relates to performance. Tom Kyte’s mantra was always “if you can do it in SQL, do it in SQL”. The reason is that a pure SQL solution is normally faster than one which combines SQL and other code. Yet a question came in that cast doubt on this:

3. Difference in performance SQL vs PL/SQL

The poster was updating a table. The new values came from another table. He was surprised that PL/SQL using bulk processing came out faster than the pure SQL approach.

The query in question was in the form:

update table1 t1
set t1.col1 = (select t2.col2 from table2 t2 where t1.code = t2.code);

It turned out the reason was a “missing” index. Oracle executes the subquery once for every row in table1. Unless there’s an index on table2 (code), this will full scan table2 once for every row in table1!

The PL/SQL-only approach avoided this problem by reading the whole of table2 into an array. So there was only one full scan of table2.

With an index on the join condition (t1.code = t2.code) in place, Oracle instead does an index lookup of table2 for each row in table1. A massive performance improvement!

The moral being: if your SQL is “slow”, particularly compared to a combined SQL + other language method, it’s likely you have a missing index (or two).

This question again showed the strength and value of the Oracle community. Shortly after I posted the explanation, a reviewer was quick to point out the following SQL solution:

merge into table1 t1
using  table2 t2
on   (t1.code = t2.code)
when matched
  then update set t1.col1 = t2.col2;

This came out significantly faster than both the original update and PL/SQL – without needing any extra indexes!

You’re running a YouTube channel called “The Magic of SQL”. Are SQL developers magicians?

Of course they are! In fact, I’d say that all developers are magicians. As Arthur C Clarke said:

“Any sufficiently advanced technology is indistinguishable from magic”

The amount of computing power you carry around in your phone today is mind blowing. Just ask your grandparents!

I think SQL developers have a special kind of magic though :). The ability to answer hard questions with a few lines of SQL is amazing. And for it to adapt to changes in the underlying data to give great performance without you changing it is astounding.

Your Twitter account has a pinned tweet about window functions. I frequently talk to Java developers at conferences, and few of them know about window functions, even if they’ve been in databases like Oracle for a very long time. Why do you think they’re still so “obscure”?

Oracle Database has had window functions since the nineties. But many other RDBMSes have only fully supported them recently. So a combination of writing “database independent” code and people using other databases is certainly a factor.

Use of tools which hide SQL from developers is also a problem. If you’re not actively using SQL, it’s easy to overlook many of its features.

Fortunately I think this is changing. As more and more developers are realizing, SQL is a powerful language. Learning how to use it effectively is a key skill for all developers. Window functions and other SQL features mean you can write better performing applications with less code. Who doesn’t want that? 😉

What are three things that every developer should know about SQL?

1. Understand set-based processing

If you find yourself writing a cursor loop (select … from … loop), and inside that loop you run more SQL, you’re doing it wrong.

Think about it. Do you eat your cornflakes by placing one flake in your bowl, adding the milk, and eating that one piece? Then doing the same for the next. And the next. And so on? Or do you pour yourself a big bowl and eat all the flakes at once?

If you have a cursor loop with more SQL within the loop, you’re effectively doing this. There’s a lot of overhead in executing each SQL statement. This will slow you down if you have a large number of statements that each process one row. Instead you want few statements that process lots of rows where possible.

It’s also possible to do this by accident. As developers we’re taught that code reuse is A Good Thing. So if there’s an API available we’ll often use it. For example, say you’re building a batch process. This finds the unshipped orders, places them on shipments and marks them as sent.

If a ship_order function exists, you could write something like:

for unshipped in (
  select order_id from unshipped_orders
) loop
  ship_order ( unshipped.order_id );
end loop;

The problem here is ship_order almost certainly contains SQL. SQL you’ll be executing once for every order awaiting postage. If it’s only a few, this may be fine. But if there are hundreds or thousands, this process could take a long time to run.

The way to make this faster is to process all the orders in one go. You can do this with SQL like:

insert into shipments
  select … from unshipped_orders;

update unshipped_orders
set  shipment_date = sysdate;

You may counter that there’s other, non-SQL processing you need to do, such as sending emails. So you still need a query to find the order ids.

But you can overcome this! With update’s returning clause, you can get values from all the changed rows:

update unshipped_orders
set  shipment_date = sysdate
returning order_id bulk collect into order_array;

This gives you all the order ids to use as you need.

2. Learn what an execution plan is and how to create and read one

“How can I make my SQL faster” is one of the most common classes of questions posted on Ask Tom. The trouble is there’s scant one-size-fits-all advice when it comes to SQL performance. To help, we need to know what your query is, what the tables and indexes are and details about the data. But most importantly we need to know what the query is actually doing!

For example, say you want me to help you figure out a faster route to work. To do this I need to know which route you currently use and how long each part of it takes. We can then compare this against other routes, checking how far they are, expected traffic and predicted speeds. But we need the data to do this!

So when answering performance questions, the first thing we look for is an execution plan. Many confuse this with an explain plan. An explain plan is just a prediction. Often it’s wrong. And even when it’s right, we still don’t know how much work each step does.

An execution plan shows exactly what the database did. It also gives stats about how much work, how often and how long it took to process each step. Combine this with a basic understanding of indexes and join methods and you can often solve your own performance problems.
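
As a hedged illustration of how to get an actual plan out of Oracle – the connection details and the orders table below are placeholders – you can run the statement with the gather_plan_statistics hint, then ask dbms_xplan for the plan of the last statement your session executed:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ShowActualPlan {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//localhost:1521/orcl", "user", "password");
             Statement stmt = conn.createStatement()) {

            // Run the query with the gather_plan_statistics hint, so Oracle
            // records actual row counts and timings for every plan step.
            try (ResultSet rs = stmt.executeQuery(
                    "select /*+ gather_plan_statistics */ count(*) from orders")) {
                while (rs.next()) { /* consume the result */ }
            }

            // ALLSTATS LAST prints the execution plan of the statement above,
            // including estimated vs. actual rows for each step.
            try (ResultSet plan = stmt.executeQuery(
                    "select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'))")) {
                while (plan.next()) {
                    System.out.println(plan.getString(1));
                }
            }
        }
    }
}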

3. Use bind variables

Sadly data breaches are all too common. There hardly seems to be a week that goes by without news of a major company leaking sensitive data. And the root cause of these attacks is often SQL injection.

This is a simple, well-known attack vector. If you write vulnerable SQL in a web-enabled application, eventually you’ll be attacked.

And this isn’t something you can avoid by using NoSQL databases. SQL-injection-like attacks are possible there too!

Fortunately the solution is easy: use bind variables. Not only do these secure your application, they can improve performance too.

Make sure your company is not tomorrow’s data leak headline. Use bind variables!
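
To make the contrast concrete, here is a minimal JDBC sketch – the users table and username column are hypothetical:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class BindVariables {

    // Vulnerable: user input is concatenated into the SQL text. An input
    // such as ' or '1'='1 changes the meaning of the statement, and every
    // distinct input forces the database to hard parse a new statement.
    static ResultSet unsafe(Connection conn, String name) throws SQLException {
        return conn.createStatement().executeQuery(
            "select * from users where username = '" + name + "'");
    }

    // Safe: the ? placeholder is a bind variable. The input stays pure
    // data, and the database can reuse one cached plan for all executions.
    static ResultSet safe(Connection conn, String name) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
            "select * from users where username = ?");
        ps.setString(1, name);
        return ps.executeQuery();
    }
}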

Last but not least: When will Oracle have a BOOLEAN type? 🙂

We have a BOOLEAN type! It’s just that it’s only in PL/SQL ;P

There’s currently a push in the community for us to add a SQL BOOLEAN type. If this is a feature you’d like to see, you can vote for it on the Database Ideas forum. The more support there is, the more likely we are to implement it! But no promises 😉

jOOQ Tuesdays: Glenn Paulley Gives Insight into SQL’s History

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

I’m very excited to feature today Glenn Paulley who has been working with and on SQL for several decades.

Glenn, you have been a part of the database ecosystem since the very early days, having been Director of Engineering at Sybase and representing SAP on the SQL Standard committee. What is it that fascinates you so much about databases?

Data management technology has been my favourite subject within Computer Science since I was an undergraduate student at the University of Manitoba, and so it has remained throughout much of my career. I was privileged to be part of the team that implemented the first online IBM DB2 application – using IMS/TM as the transaction monitor – in Canada at Great-West Life Assurance in the late 1980s, and we hosted some of the DB2 development team from IBM’s San Jose lab – including Don Haderle – to celebrate that achievement. So yes, I guess you could say that I’ve been around for a while.

Much of my personal expertise lies in the realm of query processing and optimization, having had Paul Larson and Frank Wm. Tompa as my Ph.D. supervisors at the University of Waterloo. But I am interested in many related database subjects: query languages certainly, multidatabase systems, and information retrieval. Two closely related topics within database systems are of particular interest to me. One is scale. Companies and organizations have a lot of data; billion-row tables in a relational database are fairly routine today in many companies. Advances in other technologies, such as the Internet of Things, are going to dramatically increase the data management requirements for firms wanting to take advantage of these technologies. Implementing solutions to achieve the scale required is difficult, and is of great interest to me. The related issue, quite naturally, is processing queries over such vast collections and getting the execution time down to something reasonable. So query optimization remains a favourite topic.

We’ve had a very interesting e-mail conversation about SQL’s experiments related to Object Orientation in the late 1990s. Today, we have ORDBMS like Oracle, Informix, PostgreSQL. Why didn’t ORDBMS succeed?

ORDBMS implementations were not as successful as their developers expected, to be sure, though I would note that many SQL implementations now contain “object” features even though objects are specifically omitted from the ISO SQL standard. Oracle 12c is a good example – Oracle’s PL/SQL object implementation supports a good selection of object-oriented programming features, such as polymorphism and single inheritance, that, when coupled with collection types, provide a very rich data model that can handle very complex applications. There are others, too, of course: InterSystems’ Caché product, for example, is still available.

So, while object support in relational systems is present in many instances, to me the significant issues are (1) the performance of object constructions on larger database instances and (2) the question of where you want objects to exist in the application stack: in the server proper, or in another tier, implemented in a true object-oriented language? The latter issue is the premise behind object-relational mapping tools, though I think their usage often causes as many problems as it solves.

How do you see the future of the SQL language – e.g. with respect to Big Data or alternative models like document stores (which have N1QL) or graph stores (which have Open Cypher)?

My personal view is that SQL will continue to evolve; having an independent query language that permits one to query or manipulate a database but avoid writing a “program” in the traditional sense is a good idea. Over time that language might evolve to something different from today’s SQL, but it is likely that we will still call it S-Q-L. I do not expect a revolutionary approach to be successful; there is simply far too much invested in current applications and infrastructure. Don’t get me wrong – I’m not saying SQL doesn’t need improvements. It does.

Unlike the work of e.g. the JCP or W3C, which is public and open, SQL still seems to be a more closed environment. How do you see the interaction between SQL and the end user? Can “ordinary” folks participate in the future of SQL?

The SQL standard is published by the International Organization for Standardization (ISO), whose member countries contribute changes to the standard and vote on them on a regular basis. Member countries that contribute to the ISO SQL standard include (at least) the US, Canada, Germany, the United Kingdom, Japan, and Korea. Participation in the standards process requires individuals or companies to belong to these “national bodies” in order to view drafts of the standard and vote on proposed changes. Usually those meetings are held in person – at least in Canada and the United States – so there is a real cost to participation in the process.

While having SQL as an international ISO standard is, I think, worthwhile, the ISO’s business model is based on activities such as collecting revenue from the sale of standards documents. Unless governments or other benefactors sponsor the development of those standards, then it is difficult to see how the standard can be made more freely available. That issue is a political one, rather than a technical one.

After a brief stop at Conestoga College, you’re heading back to SAP. What will you be working on in the next few years?

I will be continuing to focus on database technology now that I’ve returned to SAP. So far I’ve had a fantastic experience being back at the SAP Waterloo lab, and I have every expectation that will continue into the future.

You’ve spent a lot of time at SAP (and before SAP) with Sybase. How do the different Sybase distributions relate to SAP’s new flagship database HANA?

The various SAP database systems (SQL Anywhere, IQ, ASE) all contain database technology pertinent to the HANA platform.

Last question: What do SQL and Curling have in common? 🙂

As a Level 3 curling coach, one of my tasks as a High Performance Coach for the Ontario Curling Council is to collect performance data on the athletes in our various programs. Naturally I use a database system to store that data, and perform various analyses on it, so the two are not as unrelated as you might think!

jOOQ Tuesdays: Rafael Winterhalter is Wrestling Byte Code with Byte Buddy

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

We have the pleasure of talking to Rafael Winterhalter in this seventh edition, who will be telling us about the depths of Java byte code, and about his library Byte Buddy, which makes working with byte code extremely easy.

Note that Byte Buddy won the 2015 Duke’s Choice award – congratulations from our side!

Hi Rafael – You’re the author of the popular Byte Buddy library. What does Byte Buddy do?

Byte Buddy is a code generation and manipulation library. It offers APIs for creating new Java classes at runtime and for changing existing classes before or after they are loaded.

At first glance, this might sound like a very esoteric thing to do, but runtime code generation is used in a large number of Java projects. Code generation tools are often used by library developers to implement aspect-oriented programming. For example, the mocking library Mockito adopted Byte Buddy to create subclasses of mocked classes at runtime. In order to implement a mock, Mockito overrides all methods of a class such that the user’s original code is not invoked when a method is called in a test. And there are plenty of other well-known users of code generation. Spring, for example, uses code generation to implement its annotation aspects such as security or transactions. And Hibernate uses code generation to lazily load properties from getter methods by overriding those getters to query the database only if they are invoked.
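
To give a feel for the API, here is the canonical “Hello World” example from the Byte Buddy documentation: it generates a subclass of Object at runtime whose toString() method returns a fixed value.

import net.bytebuddy.ByteBuddy;
import net.bytebuddy.dynamic.loading.ClassLoadingStrategy;
import net.bytebuddy.implementation.FixedValue;
import net.bytebuddy.matcher.ElementMatchers;

public class ByteBuddyHelloWorld {
    public static void main(String[] args) throws Exception {
        // Define, load and instantiate a new class at runtime.
        Class<?> dynamicType = new ByteBuddy()
            .subclass(Object.class)
            .method(ElementMatchers.named("toString"))
            .intercept(FixedValue.value("Hello World!"))
            .make()
            .load(ByteBuddyHelloWorld.class.getClassLoader(),
                  ClassLoadingStrategy.Default.WRAPPER)
            .getLoaded();

        // The generated class behaves like any handwritten Java class.
        System.out.println(dynamicType.newInstance()); // prints "Hello World!"
    }
}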

Why is there a need for Byte Buddy when there are alternatives like ASM, CGLIB, AspectJ or Javassist?

Before I started working on Byte Buddy, I was involved in several other open-source projects as a contributor. As mentioned before, code generation is a typical requirement for implementing many libraries and so I got used to working with mostly CGLIB and Javassist. However, I became constantly frustrated with those libraries’ limitations and I wanted to resolve the problems I had discovered. Eventually, I started to write an alternative library that I later published as Byte Buddy.

To understand the limitations of alternative libraries, mocks are a good example use case. Mocks in Mockito were previously created using CGLIB. CGLIB is a rather mature library. It has been around for over 15 years, and when it was originally developed, the library’s developers did of course not anticipate features such as annotations, generic types or defender methods. Annotations did however become an important part of many APIs, which would not accept a mock instance because the annotations of overridden methods were lost. In Java, annotations on methods are never inherited when they are overridden. And annotations on types are only inherited if they are explicitly declared to be. To overcome this, Byte Buddy makes it possible to copy any annotation to a subclass, which is now a feature in Mockito 2.

In contrast, Javassist allows copying annotations, but I do not personally like the approach of the library. In Javassist, all generated code is represented as Java code contained in strings. As a result, Javassist code evolves to be just as unstructured as Java code that only describes SQL as concatenated strings. Besides creating code that is difficult to maintain, this approach also opens up vulnerabilities such as Java code injection, similar to SQL injection. It is sometimes possible to attack Javassist code by letting it compile arbitrary code, which can cause severe damage to an application.

AspectJ is a powerful tool when manipulating existing code. However, Byte Buddy lets you do anything that AspectJ is capable of, but in plain and simple Java. This way, developers do not need to learn a new syntax or programming metaphor, or install tools for their build process and IDEs. Furthermore, I do not find the join-point and point-cut terminology intuitive and decided to avoid it altogether. Instead, I decided to mimic terminology that developers already know from the Java programming language to ease the first steps with Byte Buddy.

ASM on the other hand is the basis on top of which Byte Buddy is implemented. ASM is a byte code parser rather than a code generation library. ASM processes single class files and does not consider type hierarchies. ASM has no concept of class loading, nor does it include higher-level concepts on top of byte code instructions. Byte Buddy does, however, offer an adapter that exposes the ASM API to users who require the generation of very specific code.

How does one become so involved with low-level Java?

In the beginning, I set myself a goal of only creating a version of CGLIB with annotation support, which was what I originally needed. But I quickly found out that a lot of developers were looking for the solution that Byte Buddy has become today. Therefore, I started to plan to make the full feature set of the Java virtual machine accessible. To do so, learning all the gory details and corner cases of the class file format became a necessity. To be fair, the class file format is fairly trivial once you get the hang of it, and I really enjoy seeing my library mature.

Between Java byte code (2GL language) and SQL (4GL language), there are many levels of programmatic abstraction. Where do you feel at home the most?

I would want to use the right tool for the right job. Obviously, I enjoy working with byte code, but I would avoid handcrafting byte code when working in a production project. In the end, this is what higher-level abstractions such as Byte Buddy are made for.

Looking at the common use cases, Byte Buddy is however often used for implementing custom features by changing code based on annotations on methods. In a way, Byte Buddy enables developers to implement their own 4GL abstraction. Declarative programming is a great abstraction for certain tasks, SQL being one of them.

You’ve become a famous speaker and domain expert in a very short time. What’s your most exciting story, being an influencer?

Mainly, I find it exciting to meet users of my library. I have met folks who, with large teams, implemented internal frameworks based on my software, and obviously it makes me proud that Byte Buddy proves to be that useful.

Thank you very much, Rafael!

If you want to learn more about Rafael’s work, about byte code or about Byte Buddy, check out his talk at JavaZone.

jOOQ Tuesdays: Thomas Müller Unveils How HSQLDB Evolved into the Popular H2 Database

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

We have the pleasure of talking to Thomas Müller in this fifth edition, who will be telling us about the exciting history of Java’s most popular embedded database, H2.

Hi Thomas – Your H2 database is virtually everywhere. It has become many Java developers’ favourite integration test database. How did it grow so popular?

I guess because it’s easy to use, relatively fast, and, up to a certain point, compatible with popular databases.

I understand that you have a common, personal history with HSQLDB, previously also known as Hypersonic SQL. How and why did H2 evolve from HSQLDB?

Back in 1998, I wanted to learn Java. For fun, I implemented the relatively new skip list algorithm, added a SQL interface, and published it as open source. I got feedback from people who thought it was useful, so I continued and gave it the name Hypersonic SQL. In 2000 I got a job offer from a Silicon Valley startup, PointBase Inc. The plan was to continue with Hypersonic SQL and keep it open source. But after I started, the company decided it was better if I stopped. This was a surprise to me. I told them I couldn’t prevent others from continuing. And so Fred Toussi took the code and started HSQLDB with it. Around 2005, PointBase ran out of money and I wanted to go back to HSQLDB. But I felt more radical changes were needed, and that it would be better to start a new project instead, which then became H2.

Very interesting historical facts! … and when will you replace Derby / JavaDB in the JDK? 🙂

Hopefully never! If H2 were integrated into the JDK, Oracle would put constraints on the future of H2. I don’t want that. I want to keep H2 independent.

You’ve been in the database industry for a while. What is your most interesting anecdote that you’d like to share?

When Oracle bought MySQL, I was very surprised to get mail from the European Union with a large questionnaire about the merger and how it would affect the industry and competition. I don’t know how they found my work address – it is not on the web site, and they never asked me by email. And Switzerland is not even part of the European Union. H2 is a very small fish in the “database pond”, but it seems H2 does matter.

It’s a small world or small pond, I guess!

You’re one of the few developers I know who is working both on a SQL database (H2) and on a NoSQL database (JackRabbit, Adobe CRX). Tell us a little bit about how those databases compare.

They are quite different. H2 is a relational database with the traditional SQL and JDBC API, and Jackrabbit is a mix between a file system and a database, with a very different API and a hierarchical data model. The query language is different as well: for Jackrabbit, XPath is more commonly used, even though SQL is available as well. Both the relational and the hierarchical models have advantages and disadvantages. The hierarchical model more easily supports semi-structured, JSON-style data. The relational model, on the other hand, is more “mature”, and there is more competition.

At Adobe / JackRabbit, you’re heavily involved with implementing storage algorithms in Java. Is the JVM even good at implementing low-level storage stuff?

Yes! The real advantage of Java is that a programmer can concentrate on the algorithms, and doesn’t have to spend so much time with memory management and low-level stuff. That way, there is more time to improve the algorithms. And in relational databases, the most important aspect is using the best possible algorithms, for example to reduce I/O. Even for very low-level, CPU-intensive stuff like data compression and encryption, things like concurrency and cache efficiency nowadays are more important than whether you use Java or C.

That’s an interesting thought along the lines of avoiding premature optimisation!

One last question: What problems are you working on right now?

Optimizing the database for solid state disks (SSDs) and new file systems. Almost all relational databases still use algorithms optimized for rotating disks, where overwriting small blocks (for example 4 KB) was the way to go. With SSDs and Btrfs this doesn’t work well. H2 version 1.4.x (beta) already uses a new storage subsystem (MVStore) that should work well; however, there is still some work needed before it is ready for production.

Other than database stuff, I’m interested in various programming topics. If I implement something that might be useful, I publish it as open source within my H2 database project, in the “tools” directory, until it is used in the database or moved to another project. I wrote an archiving utility (like zip, gzip) called “ArchiveTool” that combines de-duplication with regular compression. It is fast, yet compresses large directories (source code, databases) very well. As part of my work on the new storage subsystem, I came across minimal perfect hash tables. I invented a new algorithm that needs less space than all known ones (“MinimalPerfectHash”). I would like to publish a paper about it. There are plenty of interesting problems to solve.

jOOQ Tuesdays: Axel Fontaine Predicts Exciting Times as our Industry is Maturing

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

We have the pleasure of talking to Axel Fontaine in this fourth edition, who will be telling us about the exciting times that are ahead as our industry is maturing.

Hi Axel – Citing your website: From “Software is what I do” to “Architecting for Continuous Delivery & Zero Downtime”. Where did your personal passion for devops come from?

Over the last 18 years in our industry I have seen a lot of projects for numerous clients in many different sectors. And one thing always struck me: the intersection between development and operations was almost invariably the place where the largest efficiency gains could be realized. Over time this became the place where I naturally started to focus my attention, initially for consultancy and training, and later also for public speaking and the product business.

Very interesting! I suspect that the Flyway download rates also confirm the increasing need for devops tooling, right?

I am glad you mentioned it. Yes, Flyway download numbers have been going through the roof. In March we had one download every 42 seconds and we are well on track to break the 1 million mark in 2015! This picture actually tells us a number of things. First of all, it confirms that the need for database migrations is a very widespread itch, and that Flyway’s no-nonsense, pragmatic and lightweight approach to scratching it resonates with the industry. The other thing this tells us is that our industry as a whole is maturing, and continuous delivery and deployment in particular are becoming more mainstream. This increased pace of delivery comes with a pressing need for a higher level of automation. And modern lightweight tools like Flyway are a great fit for this.

Now, you are converting that passion into a product called Boxfuse. Can you tell us a little more about it?

I have always been a strong believer in lightweight and open solutions that are a pleasure to use and minimize lock-in. When it comes to deploying an application to the cloud, so far you had to choose one or the other. You could either have something that is a joy to use (PaaS), but that leaves you stuck in a walled garden. Or you could have something open (IaaS) that unfortunately leaves you with a lot of operational concerns to deal with. Boxfuse aims for a fresh new middle ground: the convenience of PaaS with the freedom of IaaS. To achieve that, Boxfuse does away with a bunch of cargo cult: no more general-purpose operating system, no more traditional provisioning. Instead Boxfuse analyses your application and, in just a few seconds, fuses it into a tiny bootable image that is about 1% of the size of a regular system. Boxfuse then offers you a sophisticated blue/green deployment process for zero downtime updates. All of this, out of the box and with a single command.

We’re entrepreneurs ourselves. Marketing a product is a long road. What are your own key takeaways from your experience as both a consultant and entrepreneur?

I couldn’t agree more. What marketing really is, is competing for attention, and turning that attention into mind share through a series of arguments, both technical and emotional. It is an exercise that is often overlooked in our technical circles, where the code is front and center. But the best technology in the world isn’t worth a dime if you can’t convince someone else to use it. And that takes perseverance, drive, and dedication. What seems easy on the outside is often the fruit of lots of hard labor behind the scenes. Eventually you will succeed, but give it a little time. Overnight success takes years.

Where do you see the future of our industry, from your perspective?

Believe me when I say these are exciting times. While it still feels like a big adventure with great new frontiers to explore, our industry is maturing rapidly. Overly complex tools are going the way of the dodo. User expectations are rising fast. And the only way to meet them is through modern lightweight tools and services that integrate seamlessly with our workflow and perform the undifferentiated heavy lifting for us. We need to eliminate all unnecessary complexity so we can focus on what matters most: delighting our customers.

jOOQ Tuesdays: Vlad Mihalcea Gives Deep Insight into SQL and Hibernate

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

We have the pleasure of talking to Vlad Mihalcea in this third edition, who will be telling us about the skills developers need to acquire when working with Java, SQL, and Hibernate.

Hi Vlad – Your blog explodes with excellent posts about Hibernate. It looks like you love digging deep into the most popular persistence API in the market, right?

I really mean it when I say that “teaching is my way of learning”, and to master a certain technology, you have to go beyond the reference documentation. Hibernate has been around for 10 years now and there’s a plethora of projects built on top of it. The Hibernate Master Class focuses on some proven ORM design patterns, like concurrency control, caching and batching.

You’ve recently told me about your realisation of the lack of SQL insight in our industry. How did that come to be?

The Object-Relational mismatch is only the tip of the iceberg when it comes to accessing data. The biggest problem we face in enterprise systems is the Enterprise-Database developer mismatch.

A developer knows about programming languages, design patterns and application architecture, but database skills are always attributed to the Database Administrator role. This is a very dangerous assumption.

It’s as if we developed on Linux without ever wanting to learn how the operating system works, relying solely on the System Administrator’s knowledge. If you develop enterprise applications, there is no way around learning how a database works. Reading the excellent “SQL Performance Explained” book made me realize how little I knew about the inner workings of relational database systems. This book is meant for developers, and it’s a must-read for every enterprise developer.

What can we do to improve the situation for our industry? Is there a chance for a tighter integration of JPA and SQL? Or specifically, of Hibernate and jOOQ?

First, it’s the mindset that needs to change. We need to acknowledge that there’s no such thing as a one-size-fits-all framework, and that applies to database access as well. When I write unit tests, I don’t limit myself to JUnit. I also use Mockito and Hamcrest, because a whole testing stack is a better alternative.

JPA excels when writing data, because the INSERT/UPDATE statements are automatically adjusted whenever the persistence model changes. The implicit and explicit locking allow us to protect against lost updates, especially in long conversation workflows.

But while abstracting the SQL write statements is a doable task, when it comes to reading data, nothing can beat native SQL. The most commonly-used RDBMS have implemented non-standard data access techniques (window functions, Common Table Expressions, PIVOT) and the SQL-92 JPA abstraction layer can only focus on common functionalities. That’s why native querying is unavoidable on almost any enterprise system.

jOOQ has done a very good job promoting SQL knowledge into the Java ecosystem. Java is ruling the enterprise software development and SQL skills have always been the Achilles heel of most enterprise development teams.

While you can fire native queries from JPA, there’s no support for dynamic native query building. jOOQ allows you to build type-safe dynamic native queries, strengthening your application against SQL-injection attacks. jOOQ can be integrated with JPA, as I have already proven on my blog, and the JPA-jOOQ combo can provide a solid data access stack.
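
As a minimal sketch of such type-safe dynamic query building – the book table, its columns and the in-memory H2 connection are hypothetical – note how the user-supplied value ends up as a bind variable rather than concatenated SQL:

import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.table;
import static org.jooq.impl.DSL.using;

import java.sql.Connection;
import java.sql.DriverManager;

import org.jooq.DSLContext;
import org.jooq.Record;
import org.jooq.Result;

public class JooqDynamicQuery {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:test")) {
            DSLContext ctx = using(conn);

            // The predicate is composed dynamically from typed parts, yet
            // minYear is rendered as a bind variable, never as literal SQL.
            int minYear = 2010;
            Result<Record> books = ctx
                .select()
                .from(table("book"))
                .where(field("published_in", Integer.class).gt(minYear))
                .orderBy(field("title"))
                .fetch();

            System.out.println(books);
        }
    }
}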

Tell us a little bit about your Hibernate Master Class, and your personal blogging strategy.

The Hibernate Master Class blog series is actually a book in the making. Because I work a full-time job, it’s difficult to commit to a fixed writing schedule, so I can only write as much as my spare time allows.

Once all topics are covered, I’ll turn all this info into a book that I’m going to self-publish, following the “SQL Performance Explained” example.

[ Edit ] The book has been finished and is available here:

https://leanpub.com/high-performance-java-persistence

Where will you be in 5 years?

I enjoy both software architecture, as well as writing about it. I will continue on this journey and see where the wind will carry me.