FROM clause, there are no tables in scope (yet):
These things are weird, because the lexical order of operations does not match the logical order of operations. We humans may sometimes (often) intuitively understand this ordering difference. E.g. we know that we’re about to select from the customer table. But the IDE doesn’t know this.
-- Don't you wish this would be completed to first_name?
SELECT first_na...

-- Aaah, now it works:
SELECT first_na... FROM customer
GROUP BY contributes the most confusion

When a junior developer / SQL beginner starts working with SQL, quite quickly, they will find out about aggregation and GROUP BY. And they'll quickly write things like:
SELECT count(*) FROM customer

Yay, we have 200 customers! And then:

SELECT count(*) FROM customer WHERE first_name = 'Steve'

Wow, 90 of them are called Steve! Interesting. Let's find out how many we have per name…

SELECT first_name, count(*) FROM customer GROUP BY first_name
FIRST_NAME   COUNT
------------------
Steve        90
Jane         80
Joe          20
Janet        10

Very nice. But are they all the same? Let's check out the last name, too:
SELECT first_name, last_name, count(*) FROM customer GROUP BY first_name
ORA-00979: not a GROUP BY expression

Jeez, what does it mean? (Note: unfortunately, MySQL users who do not use the STRICT mode will still get a result here, with arbitrary last names! So a new MySQL user won't understand their mistake.) How do you easily explain this to a SQL newbie? It seems obvious to "pros", but is it really obvious? Is it obvious enough that you can explain it easily to a junior? Think about it. Why is each one of these statements semantically correct or wrong?
-- Wrong
SELECT first_name, count(*)
FROM customer
WHERE count(*) > 1
GROUP BY first_name

-- Correct
SELECT first_name, count(*)
FROM customer
GROUP BY first_name
HAVING count(*) > 1

-- Correct
SELECT first_name, count(*)
FROM customer
GROUP BY first_name
ORDER BY count(*) DESC

-- Wrong
SELECT first_name, last_name, count(*)
FROM customer
GROUP BY first_name

-- Correct
SELECT first_name, MAX(last_name), count(*)
FROM customer
GROUP BY first_name

-- Wrong
SELECT first_name || ' ' || last_name, count(*)
FROM customer
GROUP BY first_name

-- Correct
SELECT first_name || ' ' || MAX(last_name), count(*)
FROM customer
GROUP BY first_name

-- Correct
SELECT MAX(first_name || ' ' || last_name), count(*)
FROM customer
GROUP BY first_name
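As an aside, the non-strict MySQL behaviour mentioned above can be reproduced with SQLite, which also permits "bare" columns in an aggregate query. A minimal sketch, using a made-up three-row customer table (names and data are illustrative only):

```python
import sqlite3

# In-memory database with a tiny, made-up customer table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (first_name TEXT, last_name TEXT)")
con.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("Steve", "Miller"), ("Steve", "Smith"), ("Jane", "Doe")],
)

# SQLite, like non-strict MySQL, does NOT raise an error here.
# It silently returns *some* last_name per group of first_name,
# which is exactly the trap described above.
rows = con.execute(
    "SELECT first_name, last_name, count(*) "
    "FROM customer GROUP BY first_name ORDER BY first_name"
).fetchall()
print(rows)  # e.g. [('Jane', 'Doe', 1), ('Steve', 'Miller', 2)] -- last_name is arbitrary
```

The counts are deterministic, but the last_name picked for the Steve group is not, which is precisely why strict SQL dialects reject the query.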
The problem is syntax related

The SQL syntax works in a similar way to the English language. It is a command. We start commands with verbs. The verb is SELECT (or INSERT, UPDATE, DELETE, CREATE, DROP, etc. etc.). Unfortunately, human language is incredibly ill-suited for the much more formal world of programming. While it offers some consolation to new users (possibly non-programmers) who are absolute beginners, it just makes stuff hard for everyone else. All the different SQL clauses have extremely complex interdependencies. For instance:
- In the presence of a GROUP BY clause, only expressions built from GROUP BY expressions (or functional dependencies thereof), or aggregate functions, can be used in the HAVING, SELECT, and ORDER BY clauses.
- For simplicity reasons, let's not even talk about GROUPING SETS.
- In fact, there are even a few cases in which GROUP BY is implied. E.g. if you write a "naked" HAVING clause.
- A single aggregate function in the SELECT clause (in the absence of GROUP BY) will force aggregation into a single row.
- In fact, this can also be implied by putting that aggregate function in ORDER BY (for whatever reason).
- You can ORDER BY quite a few expressions that reference any columns from the FROM clause without SELECTing them. But that's no longer true if you write SELECT DISTINCT.
Can this ever be understood?

Luckily, yes! There's a simple trick, which I'm always explaining to the delegates that visit my SQL Masterclass. The lexical (syntactical) order of SQL operations (clauses) does not correspond at all to the logical order of operations (although, sometimes, they coincide). Thanks to modern optimisers, the order also doesn't correspond to the actual order of operations, so we really have: syntactical -> logical -> actual order. But let's leave that aside for now. The logical order of operations is the following (for "simplicity" I'm leaving out vendor-specific things like UNPIVOT and all the others):
- FROM: This is actually the first thing that happens, logically. Before anything else, we're loading all the rows from all the tables and joining them. Before you scream and get mad: again, this is what happens first logically, not actually. The optimiser will very probably not do this operation first (that would be silly), but access some index based on the WHERE clause. But again, logically, this happens first. Also: all the JOIN clauses are actually part of this FROM clause. JOIN is an operator in relational algebra, just like + and - are operators in arithmetic. It is not an independent clause, like SELECT or FROM.
- WHERE: Once we have loaded all the rows from the tables above, we can now throw them away again, using the WHERE predicate.
- GROUP BY: If you want, you can take the rows that remain after WHERE and put them in groups or buckets, where each group contains the same value for the GROUP BY expression (and all the other rows are put in a list for that group). In Java, you would get something like Map<String, List<Row>>. If you do specify a GROUP BY clause, then your actual rows contain only the group columns, no longer the remaining columns, which are now in that list. Those columns in the list are only visible to aggregate functions, which can operate upon that list. See below.
- aggregations: This is important to understand. No matter where you put your aggregate function syntactically (i.e. in the SELECT clause, or in the ORDER BY clause), this is the step where aggregate functions are calculated: right after GROUP BY. (Remember: logically. Clever databases may actually have calculated them before.) This explains why you cannot put an aggregate function in the WHERE clause: its value cannot be accessed yet. The WHERE clause logically happens before the aggregation step. Aggregate functions can access the columns that you have put in "this list" for each group, above. After aggregation, "this list" will disappear and no longer be available. If you don't have a GROUP BY clause, there will just be one big group without any key, containing all the rows.
- HAVING: … but now you can access aggregate function values. For instance, you can check that count(*) > 1 in the HAVING clause. Because HAVING happens after GROUP BY (or implies GROUP BY), we can no longer access columns or expressions that were not GROUP BY columns.
- WINDOW: If you're using the awesome window function feature, this is the step where they're all calculated. Only now. And the cool thing is: because we have already calculated (logically!) all the aggregate functions, we can nest aggregate functions in window functions. It's thus perfectly fine to write things like sum(count(*)) OVER () or row_number() OVER (ORDER BY count(*)). Window functions being logically calculated only now also explains why you can put them only in the SELECT or ORDER BY clauses. They're not available to the WHERE clause, which happened before. Note that PostgreSQL and Sybase SQL Anywhere have an actual WINDOW clause!
- SELECT: Finally. We can now use all the rows that are produced by the above clauses and create new rows / tuples from them, using SELECT. We can access all the window functions that we've calculated, all the aggregate functions that we've calculated, all the grouping columns that we've specified, or, if we didn't group/aggregate, all the columns from our FROM clause. Remember: even if it looks like we're aggregating stuff inside of SELECT, this has happened long ago, and the sweet, sweet count(*) function is nothing more than a reference to the result.
- DISTINCT: Yes! DISTINCT happens after SELECT, even if it is put before your SELECT column list, syntax-wise. But think about it. It makes perfect sense. How else could we remove duplicate rows, if we don't know all the rows (and their columns) yet?
- UNION, INTERSECT, EXCEPT: This is a no-brainer. A UNION is an operator that connects two subqueries. Everything we've talked about thus far was a subquery. The output of a union is a new query containing the same row types (i.e. the same columns) as the first subquery. Usually. Because in wacko Oracle, the penultimate subquery is the right one to define the column name. Oracle database, the syntactic troll ;)
- ORDER BY: It makes total sense to postpone the decision of ordering a result until the end, because all other operations might use hashmaps internally, so any intermediate order might be lost again. So we can now order the result. Normally, you can access a lot of rows from the ORDER BY clause, including rows (or expressions) that you did not SELECT. But when you specified DISTINCT before, you can no longer order by rows / expressions that were not selected. Why? Because the ordering would be quite undefined.
- OFFSET: Don't use offset.
- LIMIT, FETCH, TOP: Now, sane databases put the LIMIT (MySQL, PostgreSQL) or FETCH (DB2, Oracle 12c, SQL Server 2012) clause at the very end, syntactically. In the old days, Sybase and SQL Server thought it would be a good idea to have TOP as a keyword in SELECT. As if the correct ordering of SELECT DISTINCT wasn't already confusing enough.
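The bucket model from the GROUP BY and aggregation steps above can be sketched in plain Python, with a dict standing in for Java's Map<String, List<Row>> (the rows are made up for illustration):

```python
from collections import defaultdict

# Made-up customer rows: (first_name, last_name)
rows = [("Steve", "Miller"), ("Steve", "Smith"), ("Jane", "Doe")]

# GROUP BY first_name: every key maps to the list of its rows.
# Only aggregate functions may look inside this list.
groups = defaultdict(list)
for first_name, last_name in rows:
    groups[first_name].append((first_name, last_name))

# <aggregate> count(*): collapse each list into a single value
counts = {name: len(group) for name, group in groups.items()}
print(counts)  # {'Steve': 2, 'Jane': 1}
```

After the aggregation step, the per-group lists are gone; only the group keys and the aggregate results remain, which is exactly why SELECT can no longer see last_name.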
-- Doesn't work: cannot put window functions in GROUP BY
SELECT ntile(4) OVER (ORDER BY age) AS bucket, MIN(age), MAX(age)
FROM customer
GROUP BY ntile(4) OVER (ORDER BY age)

-- Works:
SELECT bucket, MIN(age), MAX(age)
FROM (
  SELECT age, ntile(4) OVER (ORDER BY age) AS bucket
  FROM customer
) c
GROUP BY bucket

Why does it work? Because:
- In the derived table, FROM happens first, and then the WINDOW is calculated, then the bucket is SELECTed.
- The outer SELECT can now treat the result of this window function calculation like any ordinary table in the FROM clause, GROUP BY an ordinary column, then aggregate, then SELECT.
-- Wrong: Because aggregate functions are calculated
-- *after* GROUP BY, and WHERE is applied *before* GROUP BY
SELECT first_name, count(*)
FROM customer
WHERE count(*) > 1
GROUP BY first_name

-- logical order            -- available columns after operation
-----------------------------------------------------------------
FROM customer               -- customer.*
WHERE ??? > 1               -- customer.* (count not yet available!)
GROUP BY first_name         -- first_name (customer.* for aggs only)
<aggregate> count(*)        -- first_name, count
SELECT first_name, count    -- first_name, count

-- Correct: Because aggregate functions are calculated
-- *after* GROUP BY but *before* HAVING, so they're
-- available to the HAVING clause.
SELECT first_name, count(*)
FROM customer
GROUP BY first_name
HAVING count(*) > 1

-- logical order            -- available columns after operation
-----------------------------------------------------------------
FROM customer               -- customer.*
GROUP BY first_name         -- first_name (customer.* for aggs only)
<aggregate> count(*)        -- first_name, count
HAVING count > 1            -- first_name, count
SELECT first_name, count    -- first_name, count

-- Correct: Both SELECT and ORDER BY are applied *after*
-- the aggregation step, so aggregate function results are
-- available
SELECT first_name, count(*)
FROM customer
GROUP BY first_name
ORDER BY count(*) DESC

-- logical order            -- available columns after operation
-----------------------------------------------------------------
FROM customer               -- customer.*
GROUP BY first_name         -- first_name (customer.* for aggs only)
<aggregate> count(*)        -- first_name, count
SELECT first_name, count    -- first_name, count
ORDER BY count DESC         -- first_name, count

-- Wrong: Because the GROUP BY clause creates groups of
-- first names, and all the remaining customer columns
-- are aggregated into a list, which is only visible to
-- aggregate functions
SELECT first_name, last_name, count(*)
FROM customer
GROUP BY first_name

-- logical order            -- available columns after operation
-----------------------------------------------------------------
FROM customer               -- customer.*
GROUP BY first_name         -- first_name (customer.* for aggs only)
<aggregate> count(*)        -- first_name, count
SELECT first_name, ???, count -- first_name, count (last_name removed)

-- Correct: Because now, we're using an aggregate function
-- to access one of the columns that have been put into that
-- list of columns that are otherwise no longer available
-- after the GROUP BY clause
SELECT first_name, MAX(last_name), count(*)
FROM customer
GROUP BY first_name

-- logical order            -- available columns after operation
-----------------------------------------------------------------
FROM customer               -- customer.*
GROUP BY first_name         -- first_name (customer.* for aggs only)
<aggregate> MAX(last_name), count(*) -- first_name, max, count
SELECT first_name, max, count        -- first_name, max, count

-- Wrong: Because we still cannot access the last name column,
-- which is in that list after the GROUP BY clause.
SELECT first_name || ' ' || last_name, count(*)
FROM customer
GROUP BY first_name

-- logical order            -- available columns after operation
-----------------------------------------------------------------
FROM customer               -- customer.*
GROUP BY first_name         -- first_name (customer.* for aggs only)
<aggregate> count(*)        -- first_name, count
SELECT first_name || ' ' || ???, count -- first_name, count (last_name removed)

-- Correct: Because we can access the last name column from
-- aggregate functions, which can see that list
SELECT first_name || ' ' || MAX(last_name), count(*)
FROM customer
GROUP BY first_name

-- logical order            -- available columns after operation
-----------------------------------------------------------------
FROM customer               -- customer.*
GROUP BY first_name         -- first_name (customer.* for aggs only)
<aggregate> MAX(last_name), count(*) -- first_name, max, count (no last_name)
SELECT first_name || ' ' || max, count -- first_name, max, count

-- Correct: Because both GROUP BY columns and aggregated
-- columns are available to aggregate functions
SELECT MAX(first_name || ' ' || last_name), count(*)
FROM customer
GROUP BY first_name

-- logical order            -- available columns after operation
-----------------------------------------------------------------
FROM customer               -- customer.*
GROUP BY first_name         -- first_name (customer.* for aggs only)
<aggregate> MAX(first_name || ' ' || last_name), count(*) -- first_name, max, count
SELECT max, count           -- max, count
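The first two cases above (aggregate in WHERE vs. HAVING) can be verified end to end, e.g. with SQLite's in-memory database; this is a sketch with made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (first_name TEXT)")
con.executemany("INSERT INTO customer VALUES (?)",
                [("Steve",), ("Steve",), ("Jane",)])

# Wrong: when WHERE runs, the aggregate has not been calculated yet
where_failed = False
try:
    con.execute("SELECT first_name, count(*) FROM customer "
                "WHERE count(*) > 1 GROUP BY first_name")
except sqlite3.OperationalError as e:
    where_failed = True
    print(e)  # e.g. "misuse of aggregate: count()"

# Correct: HAVING runs after the aggregation step
rows = con.execute("SELECT first_name, count(*) FROM customer "
                   "GROUP BY first_name HAVING count(*) > 1").fetchall()
print(rows)  # [('Steve', 2)]
```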
Always think about the logical order of operations

If you're not a frequent SQL writer, the syntax can indeed be confusing. Especially GROUP BY and aggregations "infect" the rest of the entire SELECT clause, and things get really weird. When confronted with this weirdness, we have two options:
- Get mad and scream at the SQL language designers
- Accept our fate, close our eyes, forget about the syntax, and remember the logical order of operations
For instance, it is perfectly fine to nest the daily revenue (the SUM(amount) aggregate function) inside of the cumulative revenue (the SUM(...) OVER (...) window function):

SELECT payment_date,
       SUM(SUM(amount)) OVER (ORDER BY payment_date) AS revenue
FROM payment
GROUP BY payment_date

… because aggregations logically happen before window functions.
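The two logical steps of that query can be mirrored in plain Python, assuming some made-up payment data: first the GROUP BY aggregation, then a running total standing in for the window step:

```python
from itertools import accumulate

# Made-up payments: (payment_date, amount)
payments = [("2024-01-01", 10), ("2024-01-01", 5),
            ("2024-01-02", 7), ("2024-01-03", 3)]

# Step 1 -- GROUP BY payment_date, then <aggregate> SUM(amount)
daily = {}
for date, amount in payments:
    daily[date] = daily.get(date, 0) + amount

# Step 2 -- the window step: SUM(...) OVER (ORDER BY payment_date)
# is a running total over the already-aggregated daily sums
dates = sorted(daily)
revenue = dict(zip(dates, accumulate(daily[d] for d in dates)))
print(revenue)  # {'2024-01-01': 15, '2024-01-02': 22, '2024-01-03': 25}
```

The window step never sees the individual payments, only the per-date sums, which is why nesting SUM(SUM(amount)) is legal.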
Caveat: ORDER BY clause

There are some caveats around the ORDER BY clause, which might be contributing to further confusion. By default, continue assuming that the logical order of operations is correct. But then, there are some special cases, in particular:
- In the absence of a DISTINCT clause
- In the absence of set operations like UNION

… you can put expressions in ORDER BY that are not projected by SELECT. The following query is perfectly fine in most databases:
SELECT first_name, last_name
FROM actor
ORDER BY actor_id

There's a "virtual" / implicit ACTOR_ID projection, as if we had written:

SELECT first_name, last_name, actor_id
FROM actor
ORDER BY actor_id

… but then removed the ACTOR_ID column again from the result. This is very convenient, although it might lead to some confusion about the semantics and the order of operations. Specifically, you cannot use e.g. DISTINCT in such a situation. The following query is invalid:

SELECT DISTINCT first_name, last_name
FROM actor
ORDER BY actor_id -- Oops

Because, what if there are two actors by the same name but with very different IDs? The ordering would now be undefined. With set operations, it is even more clear why this isn't permitted:

SELECT first_name, last_name
FROM actor
UNION
SELECT first_name, last_name
FROM customer
ORDER BY actor_id -- Oops

In this case, the ACTOR_ID column isn't present on the CUSTOMER table, so the query makes no sense at all.
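The undefined-ordering argument can be sketched in plain Python with made-up actor rows: after deduplication, a result row no longer corresponds to a single actor_id:

```python
# Made-up actors: (actor_id, first_name, last_name), two "Joe Swank"s
actors = [(1, "Joe", "Swank"), (42, "Joe", "Swank"), (7, "Jane", "Doe")]

# SELECT DISTINCT first_name, last_name: actor_id is projected away
distinct = {(f, l) for (_, f, l) in actors}
print(sorted(distinct))  # [('Jane', 'Doe'), ('Joe', 'Swank')]

# ORDER BY actor_id is now meaningless: the single ('Joe', 'Swank')
# row came from actor_id 1 *and* 42, so its sort key is undefined
ids = sorted(a for (a, f, l) in actors if (f, l) == ("Joe", "Swank"))
print(ids)  # [1, 42]
```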
33 thoughts on “A Beginner’s Guide to the True Order of SQL Operations”
I wonder how many man-years could have been saved by using a sane order like in your examples?
It would be just a bit longer like
or not at all if aggregation was allowed to be aggregated in SELECT like in
Nice explanation, I'll bookmark it for when I run into problems. But no matter how well I learn it, the problem of thinking A and having to write B remains.
Absolutely right. Your suggestions would be brilliant and should definitely be implemented. But, unfortunately, these minor improvements have been suggested many times before, and still haven’t made it in any mainstream database… :/
Not really surprising. Otherwise some people (including myself) might start loving SQL, but the confused ordering (and therefore missing autocomplete) and keyword-infested retro-looking case-insensitive syntax are too bad for a poor Java guy like me.
By the way, doesn’t the WHERE+HAVING combo go against the SQL “idea” of giving the user no control (and no idea) about what really happens? Couldn’t I put all the conditions into the HAVING clause and let the optimizer pick out conditions to be executed already in the WHERE phase?
“poor Java guy” – you really think that Java syntax is much better? :)
Why should WHERE+HAVING be against the SQL “idea” (i.e. against declarative programming)? The two clauses are different, semantically. But no one guarantees any order of operations.
Indeed. Java is far from perfect:
– It’s rather verbose and some things are terrible to express (e.g., BigInteger arithmetic).
– A better syntax would represent code as data just like Lisp does, which would make e.g. AOP easily expressible. But Lisp is a parenthesized mess and I know no language achieving this in a readable way.
– `private` and `final` should be the default.
– Getters and setters are a pain, but when I say Java, I mean Java+Lombok.
But these are small issues.
– Unlike SQL, a Java method gets written in the order it gets executed.
– Unlike SQL, any complex Java piece can be refactored in simple reusable parts. SQL has views and CTEs, but views are global and CTEs are not reusable.
– Unlike SQL, Java has a fantastic IDE support and basically unique formatting rules. We could argue about tabs vs. spaces, placement of braces, etc., but apart from this bikeshedding, everyone agrees.
Both are conditions. How would you explain the difference to a newbie without referring to the execution order?
Oh, that’s a big drawback imperative languages have over declarative ones. It is a good thing that the execution order can be different from the logical order. Come take my SQL Masterclass. I’ll convince you :)
Views are structured in catalogs and schemas. That ought to be sufficient, no?
But I grant you the IDE support criticism :)
No, that’s a huge advantage:
– You can exactly imagine what’s going on. If you can’t, you can debug it.
– In case of performance problems, you exactly know what to measure.
– Sometimes, you can optimize it by changing the order manually (e.g., filter first using the most effective test). The SQL engine does it automatically, but it’s not always right, and when it’s not, then you have a really hard time fixing it.
– Under the hood, the VM and the CPU change the execution order a bit, anyway, but you don’t need to care.
But what I mainly meant is that SQL is a parody of English, which leads to a wrong ordering (as you’ve shown) and a monolithic mess. Writing SQL feels like writing a whole class in a single Java line.
SQL? No way. But I might come to learn jOOQ one day. Or not, as I can imagine me learning it easily (as it seems to be like I would do if I had the time and experience).
Is there something like a “jOOQ for Hibernate users training”?
Isn’t actually jOOQ the proof of the Java syntax superiority? Or is there someone supporting the “poor SQL guys” by expressing Java code via SQL syntax? :D
No. Whenever anything gets too long, I want to refactor it into manageable parts and make them only as visible as necessary. For this, views would have to work as 1. locals to a query or 2. locals to a stored procedure or 3. locals to some packages/whatever or … or globals. Both catalogs and schemas are much too global things.
Not only views, but also conditions and ORDER-BY-expressions and list of columns and whatever parts could use some flexibility.
Why should there be. As soon as you accept SQL for what it is, you’ll understand. As long as you reject SQL based on syntax, you won’t listen. Until then, I just give up :)
Great article. One minor observation, though. I guess many readers will not get the last query without a short description:
First, the GROUP BY will divide the payments by payment_date, and the SUM(amount) will be calculated per payment_date.
Second, the WINDOW FUNCTION will kick in and will calculate the SUM of all records starting from the first payment_date until the current processing payment_date.
The query is much easier to understand if written like this:
So, while `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` is implied by default, for those readers that have never ever used WINDOW FUNCTIONS, it’s much more clear what happens during query execution.
Why is your version of the last query easier to understand than mine? Because of formatting? (I mean, the "first" / "second" distinction was explained several times in the article.)
Btw, your version is wrong. The default window frame in the presence of an ORDER BY clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, not ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. RANGE kinda works like ROWS "WITH TIES" here… Also, I'm not sure if that additional clause might be more confusing in the context of this article.
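For readers who want to see the RANGE vs. ROWS difference concretely, here is a sketch using SQLite (window functions require SQLite 3.25+; the data is made up, with a tie on the ORDER BY key):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (k INTEGER, v INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(1, 10), (1, 20), (2, 5)])

# RANGE (the default frame with ORDER BY): peers, i.e. rows sharing
# the same ORDER BY value, enter the frame together ("with ties")
range_sums = [r[0] for r in con.execute(
    "SELECT sum(v) OVER (ORDER BY k RANGE BETWEEN UNBOUNDED PRECEDING "
    "AND CURRENT ROW) FROM t ORDER BY k")]
print(range_sums)  # [30, 30, 35]

# ROWS: strictly one row at a time, so the k = 1 tie is split:
# one of the k = 1 rows gets its own v (10 or 20, tie order is
# arbitrary), the other gets 30, and the k = 2 row gets 35
rows_sums = [r[0] for r in con.execute(
    "SELECT sum(v) OVER (ORDER BY k ROWS BETWEEN UNBOUNDED PRECEDING "
    "AND CURRENT ROW) FROM t ORDER BY k")]
print(rows_sums)
```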
Good point about ROWS vs RANGE. As for this last query being obvious without explaining it, it’s up to each reader to decide ;)
Thanks — outstanding article. I’m not an SQL beginner, but this sure helps me understand some of the reasons I often struggle to write queries that work correctly.
Thanks for the feedback. I’m glad it helped!
Teradata actually has an additional evaluation stage in SQL querying that you don’t mention here that is pretty handy. Specifically, QUALIFY. It’s sort of like HAVING, but for windowing/analytic functions.
I often run into the situation where I want to filter the results of a query based on an analytic function. Analytic functions can’t be included in the WHERE clause because they’re obviously evaluated almost last in execution order. But this means that I usually have to take my original query, force it down into a subquery, and SELECT * FROM the subquery (or perhaps the WITH clause/CTE).
This makes for a lot more code to read and understand. Quite nice just to be able to directly filter results from a window function in the query. More info here: http://jrandrews.net/the-joy-of-qualify/ (yes it’s my website but I hope the article is actually worthwhile in explaining.) Do you want to include qualify in your execution order?
Also, do you know where recursive CTEs fall in this order? And Oracle CONNECT BY? Just curious, I think perhaps in the same order of execution as a JOIN but just checking.
Thank you very much for your comment. I wasn't aware of the Teradata QUALIFY clause. Very useful indeed. I wish other databases and the SQL standard adopted it; it makes total sense.

I won't add it to the order of execution here, as I wanted to avoid these edge cases / vendor-specific extensions. For instance, Oracle has a ton, like the CONNECT BY clause that you've mentioned. The manual states:
Good question about the recursive CTE. Might be worth an entire blog post on its own :)
So now Snowflake, BigQuery, and Postgres all support QUALIFY. And likely there are more databases out there as well that do, I just know these three off the top of my head. Would you consider adding it now :)?
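For databases without QUALIFY, the derived-table workaround described in the comments above looks roughly like this (a sketch using SQLite and a made-up payment table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE payment (customer TEXT, amount INTEGER)")
con.executemany("INSERT INTO payment VALUES (?, ?)",
                [("Jane", 5), ("Jane", 9), ("Steve", 7)])

# Without QUALIFY, filtering on a window function means pushing the
# query into a derived table and filtering in the outer WHERE,
# because window functions are not yet available to WHERE.
# With QUALIFY, the subquery would disappear: ... QUALIFY rn = 1
rows = con.execute("""
    SELECT customer, amount
    FROM (
      SELECT customer, amount,
             row_number() OVER (PARTITION BY customer
                                ORDER BY amount DESC) AS rn
      FROM payment
    )
    WHERE rn = 1
    ORDER BY customer
""").fetchall()
print(rows)  # [('Jane', 9), ('Steve', 7)]
```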
Thank you for this article! Memorizing the logical order of a query has been very helpful to me.
One nuance about GROUP BY not mentioned in the article: if you GROUP BY a primary key, you can still reference other columns dependent on that key without having to use aggregate functions.
Thanks for the feedback. Glad it helped. Yes, that feature was mentioned in the middle. Look for the term “functional dependencies”
This is an amazing resource that has helped me immensely, thank you!
Where exactly do calculated fields fall into this list? For example, sum(a)/sum(b). Is this division calculated in the aggregations? Would just a/b be calculated in the select part? What about something like sum(a)/sum(a) over()?
My novice assumption has always been that calculated fields were part of the select piece, but I ran into an issue that is making me question that. I had a sum(a)/sum(b) that resulted in a division by 0 error, and the solution I came up with was to exclude sum(b) = 0 records (i.e., using "having sum(b) > 0"). However, I still got a division by 0 error. I expected the logical order to be as follows: compute sum(a) and sum(b), remove records where sum(b) = 0, then compute sum(a)/sum(b). How come this didn't work?
(As a side note, I’m aware of other alternatives, such as nullif, but I really just want to understand why the above didn’t work :) )
Thanks for your nice words. I’m glad it helped.
That's an interesting question of yours. To answer it correctly, think of expressions like sum(b) as symbols that are present somewhere syntactically in your SQL query, but logically, they may appear entirely elsewhere. As mentioned in the article, the aggregation step happens logically after GROUP BY and way before SELECT. This means that logically these sums have been calculated and are made available as ordinary columns (or values) once you reach the SELECT clause. So, when you then proceed to evaluating the SELECT clause, where the division is specified, you just have to divide someValue / someOtherValue, where the two values have already been calculated.
It's the same with sum(a) / sum(b) over(). The sum(a) part is calculated in the aggregation "step", the sum(b) over() part is calculated in the window "step", and the division is calculated in the select "step".
Now, all of this is being done logically and in theory. The actual execution plan can be quite different because it may make total sense to invert the order of operations, as long as the logical semantics is preserved. If you keep getting a division by 0 error in your case, then that looks like a bug in your SQL engine. I’m assuming, MySQL?
Ah, that makes sense. So it sounds like my query was likely being executed slightly differently than I'd expected logically. One interesting note: I was able to replace the denominator 'sum(b)' with 'sum(sum(b)) over (partition by z)', and that worked as expected using a having clause. This is actually in Netezza.
Yes indeed. A nice side-effect of the logical operations order is the fact that you can nest aggregate functions (sum(b)) in window functions (sum(sum(b)) over (…)), because when window functions are calculated, the result of an aggregate function is already available.
Not that this is very readable ;-)
Amazing article. I've found myself struggling with SQL due to its declarative nature and feeling "uncomfortable" not knowing what order a query will execute in. Thanks!
Adrian, I’m glad to hear the article was useful to you. Just to be sure, always remember, the article describes the logical order of operations, not the actual order of operations. Optimisers are very likely to re-order the operations again in a compatible way, to speed up your queries.
How does SQL logically compute aggregations with case expressions inside? For example:
select sum(case when field1=something then field2 else 0 end) as field3
It would seem to me that, logically, the optimizer computes the case expression first and creates a new “temporary” field (for lack of better words) that is eventually summed with the aggregation. Do you agree? And does that jive with the ordering listed above? I would think the case expression is associated with the select portion, which as we know, is logically computed after the aggregation.
Thanks again for this resource!
You could definitely think of it this way – that there’s a “temporary” field being added prior to aggregating. This would be akin to what the SQL standard specifies in order to allow for an ORDER BY clause to reference expressions that are absent from the SELECT clause, see: https://blog.jooq.org/2018/07/13/how-sql-distinct-and-order-by-are-related
The standard calls these extended sort key columns. Another way of looking at aggregating arbitrary expressions is to simply… aggregate arbitrary expressions, right at the aggregation step.
If you think that a “temporary” field helps understand how things behave, then no, it is definitely not calculated in the SELECT clause, but right after the FROM clause, before the GROUP BY clause.
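That mental model (the CASE expression as a "temporary" column computed before aggregation) can be checked directly, e.g. in SQLite with made-up data; both formulations below return the same result:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (field1 TEXT, field2 INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("a", 3), ("a", 4), ("b", 5)])

# The CASE expression is evaluated per input row (logically right
# after FROM), and the aggregation step then sums those values
(field3,) = con.execute(
    "SELECT sum(CASE WHEN field1 = 'a' THEN field2 ELSE 0 END) FROM t"
).fetchone()
print(field3)  # 7

# The same thing with the "temporary" column made explicit
(field3b,) = con.execute(
    "SELECT sum(tmp) FROM "
    "(SELECT CASE WHEN field1 = 'a' THEN field2 ELSE 0 END AS tmp FROM t)"
).fetchone()
print(field3b)  # 7
```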
Suppose we have a query where FROM…WHERE results in 4 rows, GROUP BY…HAVING consolidates it into two rows, and SELECT clause reads “COUNT(id), USER_DEFINED_FN()”. How many times will the USER_DEFINED_FN execute? 2 or 4?
More importantly, why is it valid in the first place when it is not specified in the GROUP BY clause. I thought only columns from GROUP BY were valid inside SELECT unless wrapped inside an aggregate function.
Of course, depending on the optimiser implementation, there’s no guarantee that you will get 2 executions of USER_DEFINED_FN(), but since the only logical thing would be to execute it twice, that should be the answer.
The answer to why it is valid is the same as the answer to why using the + operator is valid. You can use any expression like count(id) + count(id) in SELECT, because the operator is just a function. The restriction you’re referring to of being able to access column expressions from GROUP BY is really applicable to column expressions only. Of course, by consequence, you cannot pass arbitrary column expressions to your USER_DEFINED_FN(). E.g. this is only possible if you GROUP BY both columns COL1, COL2: USER_DEFINED_FN(COL1, COL2). If you GROUP BY only COL1, you can still do USER_DEFINED_FN(COL1, count(COL2))
My biggest confusion as a beginner, which this article does kind of go over, is how an aggregate function that comes after the GROUP BY is able to find out the number of records needed for the calculation. For example, if there are 3 repeating records and we perform GROUP BY on them, I would think that this would only return 1 record, so performing aggregate functions like sum() or count() would be useless after that point. But in this article you mentioned that SQL does keep the list of all the repeating columns, and you also wrote "Those columns in the list are only visible to aggregate functions that can operate upon that list". Would you please tell me how this list is kept and what it looks like?
Think of it this way. Before grouping:
After group by A, there’s no longer really a column B. Instead, the values of B are collapsed into a “group”, which can be aggregated using aggregate functions:
Now, you can aggregate all elements in (B) per group of A, e.g. SUM(B), COUNT(*):
Some databases also allow for aggregating the contents of a group into an array or string, where the way this works is most visible, in my opinion:
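SQLite's group_concat is one way to make that per-group list visible, next to ordinary aggregates (table and data are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a TEXT, b INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("x", 1), ("x", 2), ("y", 3)])

# Each group of A carries a collapsed list of B values; aggregate
# functions (including string aggregation) consume that list
rows = con.execute(
    "SELECT a, group_concat(b), sum(b), count(*) "
    "FROM t GROUP BY a ORDER BY a"
).fetchall()
print(rows)  # e.g. [('x', '1,2', 3, 2), ('y', '3', 3, 1)]
```

Note that the concatenation order within a group is unspecified, which again illustrates that the "list" has no inherent order.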
This is immensely helpful as a starter. Thank you for this post!
count gets evaluated before
Interesting point, which the article didn’t mention. Within the aggregation step, such optional aggregation characteristics must be implemented before the actual aggregation – at least logically.
Think about it this way. If count was already calculated, how could we then still remove the distinct values?