In 95% of the cases,
cartesian products originate from accidental cross join operations and cause unnecessary high load on a database. Maybe the results aren’t even wrong, as someone may have applied a UNION or a DISTINCT keyword, to remove unwanted duplicates.
But there are those 5% of SQL queries, where the cartesian product is intentional, even desireable. Here’s a nice example of such a query. Let’s assume we have these two tables (for simplicity, constraints are omitted):
create table content_hits (
content_id int,
subscriber_id int
);
create table content_tags (
content_id int,
tag_id int
);
The above tables model some blog content, where content_hits shows how many subscribers are concerned with a given blog entry, and content_tags shows which tag is applied to a given blog entry. So let’s say, we want to have some statistics about how many times any given subscriber/tag combination was actually seen. We cannot just join and group by subscriber_id, tag_id if we want the complete list of combinations (including the ones that have a zero count). So a cross join comes to the rescue!
SELECT combinations.tag_id,
combinations.subscriber_id,
-- correlated subquery to count the actual hits by tag/subscriber when
-- joining the two tables using content_id
(SELECT count(*)
FROM content_hits AS h
JOIN content_tag AS t ON h.content_id = t.content_id
WHERE h.subscriber_id = combinations.subscriber_id
AND t.tag_id = combinations.tag_id) as cnt
-- Create all combinations of tag/subscribers first, before counting
-- anything. This will be necessary to have "zero-counts" for any
-- combination of tag/subscriber
FROM (
SELECT DISTINCT tag_id, subscriber_id
FROM content_tag
CROSS JOIN content_hits
) AS combinations
Instead of using a correlated subquery, for performance reasons (depending on the database) it might be more interesting to left outer join the first subquery to the “combinations” relation and then group by subscriber_id, tag_id, and count(content_id) for every group, resulting in this query:
SELECT combinations.tag_id,
combinations.subscriber_id,
count(content_relation.content_id)
-- Create all combinations of tag/subscribers first, before counting
-- anything. This will be necessary to have "zero-counts" for any
-- combination of tag/subscriber
FROM (
SELECT DISTINCT tag_id, subscriber_id
FROM content_tag
CROSS JOIN content_hits
) AS combinations
-- Now, outer join every expected combination to its
-- actual relation joined on content_id
LEFT OUTER JOIN (
SELECT h.content_id,
h.subscriber_id,
t.tag_id
FROM content_hits AS h
JOIN content_tag AS t ON h.content_id = t.content_id
) AS content_relation
ON combinations.tag_id = content_relation.tag_id
AND combinations.subscriber_id = content_relation.subscriber_id
-- And now, finally, group, in order to get the count(content_id)
GROUP BY combinations.tag_id,
combinations.subscriber_id
The original Stack Overflow question can be seen here:
https://stackoverflow.com/questions/9924433/mysql-select-query-using-countLike this:
Like Loading...
Published by lukaseder
I made jOOQ
View all posts by lukaseder