Hah! Got nerd-sniped again:
https://stackoverflow.com/questions/43099226/how-to-make-jooq-to-use-arrays-in-the-in-clause/43102102
A jOOQ user was wondering why jOOQ would generate an IN list for a predicate like this:
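The original Java and SQL snippets weren't preserved in this copy, so here is a minimal reconstruction; ctx as a jOOQ DSLContext and the generated FILM table are assumptions:

Java

// Sketch: an IN predicate built from four values
ctx.selectFrom(FILM)
   .where(FILM.FILM_ID.in(1, 2, 4, 8))
   .fetch();

SQL

-- Generated SQL (simplified): one bind variable per list element
SELECT * FROM film WHERE film_id IN (?, ?, ?, ?)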
… when in fact there could have been the following predicate being generated, instead:
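Reconstructed from the benchmark queries below, that alternative predicate binds the entire list as a single array:

-- A single array bind variable
WHERE film_id = ANY(?)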
In the second case, there would have been only a single bind variable instead of 4, and the SQL generation and parsing work would have been "much" less (maybe not for the IN list of size 4, but let's imagine a list of 50 values).
A disclaimer
First off, a disclaimer: In databases that have a cursor cache / plan cache (e.g. Oracle or SQL Server), you should be careful with long IN lists, because they will probably trigger a hard parse every time you run them. By the time you run the exact same predicate (with 371 elements in the list) again, the execution plan will have been purged from the cache, so you cannot really profit from the cache.

I'm aware of this problem, and it will be the topic of another blog post, soon. Let's stick to PostgreSQL here, whose "plan cache" isn't really that sophisticated.
Measure, don’t guess
The question was about improving the speed of parsing a SQL statement. Parsers are really fast, so parsing shouldn’t be a problem. Generating an execution plan certainly does cost more time, but again, since PostgreSQL’s plan cache isn’t very sophisticated, this won’t play into the issue here. So the question is really:
Is an IN list really that bad in PostgreSQL?
Would an array bind variable be much better?
Since our recent post about benchmarking, we now know that we shall never guess, but always measure. I'm again using the Sakila database to run these two queries:
-- IN list
SELECT *
FROM film
JOIN film_actor USING (film_id)
JOIN actor USING (actor_id)
WHERE film_id IN (?, ?, ?, ?)
-- Array
SELECT *
FROM film
JOIN film_actor USING (film_id)
JOIN actor USING (actor_id)
WHERE film_id = ANY(?)
Let’s try lists of length 4, first. The benchmark is here:
DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 1000;
  rec RECORD;
  v_e1 INT := 1;
  v_e2 INT := 2;
  v_e3 INT := 4;
  v_e4 INT := 8;
  v_any_arr INT[] := ARRAY[v_e1, v_e2, v_e3, v_e4];
BEGIN

  -- Repeat the whole benchmark 5 times, to see warm-up effects
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    -- Statement 1: IN list
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT *
        FROM film
        JOIN film_actor USING (film_id)
        JOIN actor USING (actor_id)
        WHERE film_id IN (v_e1, v_e2, v_e3, v_e4)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %',
      r, (clock_timestamp() - v_ts);
    v_ts := clock_timestamp();

    -- Statement 2: array "bind variable"
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT *
        FROM film
        JOIN film_actor USING (film_id)
        JOIN actor USING (actor_id)
        WHERE film_id = ANY(v_any_arr)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %',
      r, (clock_timestamp() - v_ts);
  END LOOP;
END$$;
The result being:
INFO: Run 1, Statement 1: 00:00:00.112195
INFO: Run 1, Statement 2: 00:00:00.450461
INFO: Run 2, Statement 1: 00:00:00.109792
INFO: Run 2, Statement 2: 00:00:00.446518
INFO: Run 3, Statement 1: 00:00:00.105413
INFO: Run 3, Statement 2: 00:00:00.44298
INFO: Run 4, Statement 1: 00:00:00.108249
INFO: Run 4, Statement 2: 00:00:00.476527
INFO: Run 5, Statement 1: 00:00:00.120229
INFO: Run 5, Statement 2: 00:00:00.448214
Interesting. So, the IN list outperforms the array bind variable every time by a factor of 4 (which happens to be the size of the array / list!). So, let's try 8 values, then. Here are the values and the adapted query 1:
-- values
v_e1 INT := 1;
v_e2 INT := 2;
v_e3 INT := 4;
v_e4 INT := 8;
v_e5 INT := 16;
v_e6 INT := 32;
v_e7 INT := 64;
v_e8 INT := 128;
v_any_arr INT[] := ARRAY[v_e1, v_e2, v_e3, v_e4, v_e5, v_e6, v_e7, v_e8];
-- adapted query 1 ...
WHERE film_id IN (v_e1, v_e2, v_e3, v_e4, v_e5, v_e6, v_e7, v_e8)
-- ...
The result is still impressive:
INFO: Run 1, Statement 1: 00:00:00.182646
INFO: Run 1, Statement 2: 00:00:00.63624
INFO: Run 2, Statement 1: 00:00:00.184814
INFO: Run 2, Statement 2: 00:00:00.685976
INFO: Run 3, Statement 1: 00:00:00.188108
INFO: Run 3, Statement 2: 00:00:00.634903
INFO: Run 4, Statement 1: 00:00:00.184933
INFO: Run 4, Statement 2: 00:00:00.626616
INFO: Run 5, Statement 1: 00:00:00.185879
INFO: Run 5, Statement 2: 00:00:00.636723
The IN list query now takes almost 2x as long (but not quite 2x), whereas the array query now takes around 1.5x as long. It looks as though arrays become the better choice when their size increases. So, let's do this! With 32 bind variables in the IN list, or 32 array elements, respectively:
INFO: Run 1, Statement 1: 00:00:00.905064
INFO: Run 1, Statement 2: 00:00:00.752819
INFO: Run 2, Statement 1: 00:00:00.760475
INFO: Run 2, Statement 2: 00:00:00.758247
INFO: Run 3, Statement 1: 00:00:00.777667
INFO: Run 3, Statement 2: 00:00:00.895875
INFO: Run 4, Statement 1: 00:00:01.308167
INFO: Run 4, Statement 2: 00:00:00.789537
INFO: Run 5, Statement 1: 00:00:00.788606
INFO: Run 5, Statement 2: 00:00:00.776159
Both are about equally fast. 64 bind values!
INFO: Run 1, Statement 1: 00:00:00.915069
INFO: Run 1, Statement 2: 00:00:01.058966
INFO: Run 2, Statement 1: 00:00:00.951488
INFO: Run 2, Statement 2: 00:00:00.906285
INFO: Run 3, Statement 1: 00:00:00.907489
INFO: Run 3, Statement 2: 00:00:00.892393
INFO: Run 4, Statement 1: 00:00:00.900424
INFO: Run 4, Statement 2: 00:00:00.903447
INFO: Run 5, Statement 1: 00:00:00.961805
INFO: Run 5, Statement 2: 00:00:00.951697
Still about the same. OK… INTERN! Get over here. I need you to “generate” 128 bind values on this query.
Yep, as expected. Finally, arrays start to outperform IN lists:
INFO: Run 1, Statement 1: 00:00:01.122866
INFO: Run 1, Statement 2: 00:00:01.083816
INFO: Run 2, Statement 1: 00:00:01.416469
INFO: Run 2, Statement 2: 00:00:01.134882
INFO: Run 3, Statement 1: 00:00:01.122723
INFO: Run 3, Statement 2: 00:00:01.087755
INFO: Run 4, Statement 1: 00:00:01.143148
INFO: Run 4, Statement 2: 00:00:01.124902
INFO: Run 5, Statement 1: 00:00:01.236722
INFO: Run 5, Statement 2: 00:00:01.113741
Using Oracle
Oracle also has array types (although you have to declare them as nominal types first, but that’s not a problem here).
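For illustration, here is what that might look like; the type name numbers is an assumption, not from the original post:

-- Hypothetical nominal collection type
CREATE TYPE numbers AS TABLE OF NUMBER;

-- The array bind variable is then unnested with the TABLE() constructor
-- (the same constructor discussed in the optimiser remarks below)
SELECT *
FROM film
JOIN film_actor USING (film_id)
JOIN actor USING (actor_id)
WHERE film_id IN (SELECT column_value FROM TABLE(?));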
Here are some benchmark results (as always, not actual benchmark results but anonymised units of measurement, i.e. these aren't seconds but… Larrys):
4 bind values
Run 1, Statement 1 : 01.911000000
Run 1, Statement 2 : 02.852000000
Run 2, Statement 1 : 01.659000000
Run 2, Statement 2 : 02.680000000
Run 3, Statement 1 : 01.628000000
Run 3, Statement 2 : 02.664000000
Run 4, Statement 1 : 01.629000000
Run 4, Statement 2 : 02.657000000
Run 5, Statement 1 : 01.636000000
Run 5, Statement 2 : 02.688000000
128 bind values
Run 1, Statement 1 : 04.010000000
Run 1, Statement 2 : 06.275000000
Run 2, Statement 1 : 03.749000000
Run 2, Statement 2 : 05.440000000
Run 3, Statement 1 : 03.985000000
Run 3, Statement 2 : 05.387000000
Run 4, Statement 1 : 03.807000000
Run 4, Statement 2 : 05.688000000
Run 5, Statement 1 : 03.782000000
Run 5, Statement 2 : 05.803000000
The number of bind values doesn't really seem to matter. There's always a constant overhead of using the array bind variable compared to the IN list, but that might as well be a benchmarking error. For instance, when I added the /*+GATHER_PLAN_STATISTICS*/ hint to both queries, interestingly, the one with the array got significantly faster, whereas the IN list one was not affected… Weird? An explanation can be found in the comments, specifically in Kim Berg Hansen's comment:
It looks like the default cardinality of the collection is assumed by the optimizer to be 8168 (at least in my 12.1.0.2 db). So probably full scan of actor with hash join.
Indeed, the TABLE() constructor (in this case) always seems to yield a constant cardinality estimate of 8168, despite the array containing much less data. So hinting approximate cardinalities might help here, to get nested loop joins for small arrays.
Conclusion
This article doesn’t go into why there’s such a big difference for small lists when the benefit is only apparent for quite large lists.
But it has once again shown that we must not optimise prematurely in SQL, but measure, measure, measure things. IN lists in dynamic SQL queries can be a big issue in production when they lead to cursor cache / plan cache saturation and a lot of "hard parsing". So, the benefit of using the array is much more drastic when the list is big, as we can recycle execution plans much more often than with IN lists.

But chances are that IN lists may be faster for single executions.
In any case: Choose carefully when following advice that you find somewhere on the Internet. Also when following this advice. I ran the benchmark on PostgreSQL 9.5 and Oracle 11gR2 XE. Neither is the latest database version. Try to measure things again on your side, to be sure that your "improvement" is really an actual improvement! And if in doubt, don't optimise until you're sure you actually have a problem.
Comments
(Mostly) out of curiosity: does JOOQ bypass, by default, the 1000 values limitation for the IN clause in Oracle?
Yes, in the case of IN predicates, the single predicate is split into multiple ones of the form:
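The original snippet isn't preserved here; this is a sketch of the rewrite, with the bind value lists elided:

-- OR-connected chunks of at most 1000 bind values each
WHERE film_id IN (?, ?, ...)  -- first 1000 values
OR film_id IN (?, ?, ...)     -- remaining values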
In the case of NOT IN, there is an equivalent splitting taking place, with the individual NOT IN predicates connected by AND instead of OR.

Awesome, thanks!
Pretty good post!
One question: Is there any limit on the number of items when using the array approach in Oracle?
I’m not aware of such a limit. Although, at some point, I’d probably prefer a temp table instead. Should have included that in the post (batch / bulk insertion into the temp table and then semi join)
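Not part of the original reply, but a quick PostgreSQL sketch of that temp table approach, for illustration:

-- Hypothetical sketch: temp table + semi join instead of a long IN list
CREATE TEMP TABLE film_ids (film_id INT PRIMARY KEY) ON COMMIT DROP;

-- Batch / bulk insert the IDs (e.g. through JDBC batching) ...
INSERT INTO film_ids VALUES (1), (2), (4), (8);

-- ... and then semi join:
SELECT *
FROM film
JOIN film_actor USING (film_id)
JOIN actor USING (actor_id)
WHERE film_id IN (SELECT film_id FROM film_ids);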
For what it's worth, a long time ago I did some testing on Postgres 9.1 (yes, a long time ago) and as soon as you went above 500 or so with the default config (I can't recall the exact number) things got really slow.
Regardless, I rarely use "in" clauses for unique ids (for enums or other nominal values I do) but instead use an unlogged table (PostgreSQL nomenclature for an in-memory table) of IDs. To save re-explaining myself you can see this SO post: https://stackoverflow.com/a/11119642/318174 . Of course I'm sure Lukas is well aware of this technique but others might not be.
Very interesting… So, the limit was with 500 IN list elements or with 500 array elements?
But yes, the key message should have been not to use IN predicates for unique ids. It’s usually a “smell”, but sometimes, it’s not avoidable. For large lists, a temp table is probably more optimal, although it has its own overheads. I should’ve benchmarked that too, though…
A guess as to the cause of /*+ gather_plan_statistics */ making a difference in Oracle:
If I do something like this:
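The snippet from the original comment wasn't preserved; a plausible reconstruction, reusing the assumed numbers type from the sketch above:

-- Hypothetical reconstruction: inspect the plan of an unnested collection
EXPLAIN PLAN FOR
SELECT *
FROM film
WHERE film_id IN (SELECT column_value FROM TABLE(numbers(1, 2, 4, 8, 16)));

SELECT * FROM TABLE(dbms_xplan.display);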
It looks like the default cardinality of the collection is assumed by the optimizer to be 8168 (at least in my 12.1.0.2 db). So probably full scan of actor with hash join.
The gather_plan_statistics would probably find the correct cardinality. Or alternatively specify the cardinality like:
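Again reconstructed, as the original snippet is missing; the CARDINALITY hint tells the optimiser how many elements to expect in the collection t:

-- Hypothetical reconstruction: hint the actual collection size (5 elements)
SELECT *
FROM film
WHERE film_id IN (
  SELECT /*+ CARDINALITY(t 5) */ column_value
  FROM TABLE(numbers(1, 2, 4, 8, 16)) t
);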
That would probably switch to nested loop with 5 index lookups.
(Whether that beats the long IN list I haven’t benchmarked ;-)
Cheerio
/Kim
Yeah, I’m aware of this problem, but I keep forgetting it. Thanks for documenting this. I can definitely reproduce this. One would think that Oracle’s optimiser is smart enough to figure out the cardinality of a constant array literal, but I guess no important customer ever complained about this bug. If we’re hinting the exact cardinality for each query, we’re back to the N SQL_ID situation for N-sized IN-lists that we wanted to avoid, so I guess the cardinality hint would have to be approximated logarithmically similar to what is suggested in this article: https://blog.jooq.org/2018/04/13/when-using-bind-variables-is-not-enough-dynamic-in-lists. Thus, any of these:
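The original list of hinted queries is missing; a sketch of such logarithmic buckets, with the CARDINALITY values and the numbers type assumed:

-- One shared plan per order of magnitude, instead of one per list size
SELECT * FROM film
WHERE film_id IN (SELECT /*+ CARDINALITY(t 1) */ column_value FROM TABLE(?) t);    -- 1..9 values

SELECT * FROM film
WHERE film_id IN (SELECT /*+ CARDINALITY(t 10) */ column_value FROM TABLE(?) t);   -- 10..99 values

SELECT * FROM film
WHERE film_id IN (SELECT /*+ CARDINALITY(t 100) */ column_value FROM TABLE(?) t);  -- 100..999 values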
Enter dynamic sampling hint:
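The comment's snippet is missing here; presumably something along these lines (reconstruction, with the numbers type assumed):

-- Hypothetical reconstruction: sample the collection instead of guessing
SELECT /*+ DYNAMIC_SAMPLING(t 2) */ *
FROM TABLE(numbers(1, 2, 4, 8, 16)) t;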
“2” is here a level (see for example http://www.oracle.com/technetwork/testcontent/o19asktom-086775.html) – I guess for a case like this it does not matter too much what level. Maybe a best practice might be level 10 to be “certain” (as certain as you can get ;-) ) that the sampling takes place and that the sampling samples everything rather than just a sample?
Egh. SQL is a 4GL, they said ;-)