Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month (today, exceptionally on a Wednesday because of technical issues) where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.
I’m very excited to feature today Chris Saxon who has worked with Oracle forever, and who is one of the brains behind the famous Ask Tom website.
Chris, you’re part of the famous Ask Tom team. Everyone working with Oracle has wound up on Ask Tom’s website at least once. You’ve answered an incredible amount of questions already. What’s it like to work for such a big community as Oracle’s?
It’s an amazing experience! My first real job was as a PL/SQL developer. My only knowledge of SQL was a couple of vaguely remembered lectures at university. Ask Tom was the main place I learned about SQL and Oracle Database. So it’s a huge honor to be on the other side, helping others get the best out of the technology.
The best part has to be the positive comments when you help someone solve a problem that’s been troubling them for days. That’s why we’re here. To help developers learn more about Oracle and improve their SQL skills. When you use the database to its full extent, you can write better, faster applications with less code!
What were your three most interesting questions, so far?
Any question that has a clear definition and a complete test case is interesting! ;) Personally I enjoy using SQL to solve complex problems the best. So the first two do just that:
1. Finding the previous row in a different group
The poster had a series of transactions. These were grouped into two types. For each row, they wanted to show the id of the previous transaction from the other group.
At first this sounds like a problem you can solve using LAG or LEAD. But these only search for values within the same group. So you need a different method.
I provided a solution using the model clause. Using this, you can generate columns based on complex, spreadsheet-like formulas. Rows in your table are effectively cells in the sheet. You identify them by defining dimensions which can be other columns or expressions. By setting the transaction type as a dimension, you can then easily reference – and assign – values from one type to the other.
This worked well. But commenters were quick to provide solutions using window functions and 12c’s match_recognize clause. Both of which were faster than my approach!
I like this because it shows the flexibility of SQL. And it shows the value of an engaged community. No one knows everything. By sharing our knowledge and workin together we can all become better developers.
2. Improving SQL that deliberately generates cartesian product
The poster had a set of abbreviations for words. For example, Saint could also be “St.” or “St”. They wanted to take text containing these words. Then generate all combinations of strings using these abbreviations.
The “obvious” solution the user had is to split the text into words. Then for each word, join the abbreviation table, replacing the string as needed. So for a five word string, you have five joins.
There are a couple of problems with this method. The number of joins limits the number of words. So if you have a string with seven words, but only six table joins you won’t abbreviate the final word.
The other issue is performance. Joining the same table N times increases the work you do. If you have long sentences and/or a large number of abbreviations, the query could take a long time to run.
To overcome these you need to ask: “how can I join to the abbreviation table just once?”
The solution to do this starts the same as the original. Split the sentence into a table of words. Then join this to the abbreviations to give a row for each replacement needed.
You can then recursively walk down these rows using CTEs. These build up the sentence again, replacing words with their abbreviations as needed. A scalable solution that only needs a single pass of each table!
The final question relates to performance. Tom Kyte’s mantra was always “if you can do it in SQL, do it in SQL”. The reason is because a pure SQL solution is normally faster than one which combines SQL and other code. Yet a question came in that cast doubt on this:
3. Difference in performance SQL vs PL/SQL
The poster was updating a table. The new values came from another table. He was surprised that PL/SQL using bulk processing came out faster than the pure SQL approach.
The query in question was in the form:
set col1 = (select col2 from table2 where t1.code = t2.code);
It turned out the reason was due to “missing” indexes. Oracle executes the subquery once for every row in table1. Unless there’s an index on table2 (code), this will full scan table2 once for every row in table1!
The PL/SQL only approach avoided this problem by reading the whole of table2 into an array. So there was only one full scan of table2.
The problem here is there was no index on the join condition (t1.code = t2.code). With this in place Oracle does an index lookup of table2 for each row in table1. A massive performance improvement!
The moral being if your SQL is “slow”, particularly in compared to a combined SQL + other language method, it’s likely you have a missing index (or two).
This question again showed the strength and value of the Oracle community. Shortly after I posted the explanation, a reviewer was quick to point out the following SQL solution:
merge into table1
on (t1.code = t2.code)
then update set t1.col = t2.col;
This came out significantly faster than both the original update and PL/SQL – without needing any extra indexes!
You’re running a YouTube channel called “The Magic of SQL”. Are SQL developers magicians?
Of course they are! In fact, I’d say that all developers are magicians. As Arthur C Clarke said:
“Any sufficiently advanced technology is indistinguishable from magic”
The amount of computing power you carry around in your phone today is mind blowing. Just ask your grandparents!
I think SQL developers have a special kind of magic though :). The ability to answer hard questions with a few lines of SQL is amazing. And for it to adapt to changes in the underlying data to give great performance without you changing it is astounding.
Your Twitter account has a pinned tweet about window functions. I frequently talk to Java developers at conferences, and few of them know about window functions, even if they’ve been in databases like Oracle for a very long time. Why do you think they’re still so “obscure”?
Oracle Database has had window functions has had them since the nineties. But many other RDBMSes have only fully supported them recently. So a combination of writing “database independent” code and people using other databases is certainly a factor.
Use of tools which hide SQL from developers is also a problem. If you’re not actively using SQL, it’s easy to overlook many of its features.
Fortunately I think this is changing. As more and more developers are realizing, SQL is a powerful language. Learning how to use it effectively is a key skill for all developers. Window functions and other SQL features mean you can get write better performing applications with less code. Who doesn’t want that? ;)
What are three things that every developer should know about SQL?
1. Understand set based processing
If you find yourself writing a cursor loop (select … from … loop), and inside that loop you run more SQL, you’re doing it wrong.
Think about it. Do you eat your cornflakes by placing one flake in your bowl, adding the milk, and eating that one piece? Then doing the same for the next. And the next. And so on? Or do you pour yourself a big bowl and eat all the flakes at once?
If you have a cursor loop with more SQL within the loop, you’re effectively doing this. There’s a lot of overhead in executing each SQL statement. This will slow you down if you have a large number of statements that each process one row. Instead you want few statements that process lots of rows where possible.
It’s also possible to do this by accident. As developers we’re taught that code reuse is A Good Thing. So if there’s an API available we’ll often use it. For example, say you’re building a batch process. This finds the unshipped orders, places them on shipments and marks them as sent.
If a ship_order function exists, you could write something like:
select order_id from unshipped_orders loop
ship_order ( order_id );
The problem here is ship_order almost certainly contains SQL. SQL you’ll be executing once for every order awaiting postage. If it’s only a few this may be fine. But if there’s hundreds or thousands this process could take a long time to run.
The way to make this faster is to process all the orders in one go. You can do this with SQL like:
insert into shipments
select … from unshipped_orders;
set shipment_date = sysdate;
You may counter there’s other, non-SQL, processing you need to do such as sending emails. So you still need a query to find the order ids.
But you can overcome this! With update’s returning clause, you can get values from all the changed rows:
set shipment_date = sysdate
returning order_id bulk collect into order_array;
This gives you all the order ids to use as you need.
2. Learn what an execution plan is and how to create and read one
“How can I make my SQL faster” is one of the most common classes of questions posted on Ask Tom. The trouble is there’s scant one-size-fits-all advice when it comes to SQL performance. To help we need to know what your query is, what the tables and indexes are and details about the data. But most importantly we need to know what the query is actually doing!
For example, say you want me to help you figure out a faster route to work. To do this I need to know which route you currently use and how long each part of it takes. We can then compare this against other routes, checking how far they are, expected traffic and predicted speeds. But we need the data to do this!
So when answering performance questions, the first thing we look for is an execution plan. Many confuse this with an explain plan. An explain plan is just a prediction. Often it’s wrong. And even when it’s right, we still don’t know how much work each step does.
An execution plan shows exactly what the database did. It also gives stats about how much work, how often and how long it took to process each step. Combine this with a basic understanding of indexes and join methods and you can often solve your own performance problems.
3. Use bind variables
Sadly data breaches are all too common. There hardly seems to be a week that goes by without news of a major company leaking sensitive data. And the root cause of these attacks is often SQL injection.
This is a simple, well known attack vector. If you write vulnerable SQL on a web enabled application, eventually you’ll be attacked.
And this isn’t something you can avoid by using NoSQL databases. SQL injection like attacks are possible there too!
Fortunately the solution is easy: use bind variables. Not only do these secure your application, they can improve performance too.
Make sure your company is not tomorrow’s data leak headline. Use bind variables!
Last but not least: When will Oracle have a BOOLEAN type? :)
We have a BOOLEAN type! It’s just only in PL/SQL ;P
There’s currently a push in the community to for us to add a SQL BOOLEAN type. If this is a feature you’d like to see, you can vote for it on the Database Ideas forum. The more support there is, the more likely we are to implement it! But no promises ;)