This article is Part 3 of a series on Apache Druid Query Performance.
Best Practice: Whenever possible, structure your analytical questions to fit the Timeseries or TopN patterns to leverage their specialized, high-performance execution paths.
Answer: The single most important filtering strategy in Druid is to always filter on the __time column over the smallest possible interval. The second most impactful strategy is to use equality filters (= or IN) on low-to-medium cardinality dimensions.
WHERE __time BETWEEN... clause allows the Broker to perform “segment pruning,” immediately ignoring any segments outside the query’s time range. A query without a time filter is a worst-case scenario, forcing a full scan of the entire datasource.WHERE country = 'US' can use these indexes to instantly identify matching rows without scanning the column’s values.WHERE clause (e.g., WHERE UPPER(product) = 'BIKE') prevents index usage and forces a full scan and computation on every row.!= or LIKE '%value' are expensive as they cannot use indexes efficiently.Answer: EXPLAIN PLAN is a powerful Druid SQL command that shows you how Druid will execute your query. It translates your SQL into the native JSON query format and provides a query plan, allowing you to see exactly what operations will be performed.
To use it, simply prepend EXPLAIN PLAN FOR to your SQL query:
EXPLAIN PLAN FOR SELECT "page", COUNT(*) AS "Edits" FROM "wikipedia" WHERE "countryName" IS NOT NULL GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
The output will be a JSON object containing two key pieces of information:
PLAN: A textual representation of the execution steps. This will show you if Druid is using a GroupBy, TopN, or Timeseries engine, and what operations like filtering and sorting are being applied.RESOURCES: A list of the datasources the query will touch.QUERY: The full native JSON query that Druid generated. This is invaluable for deep debugging.By inspecting the plan, you can verify if your query is being optimized as expected. For example, you can check if a GROUP BY... ORDER BY... LIMIT query is correctly being translated into a more performant TopN query.
Answer: Subqueries and UNION ALL are supported in Druid SQL, but they have performance implications that are important to understand.
JOIN). If the subquery returns a large number of rows, this can create a memory bottleneck on the Broker. Druid has configuration limits (maxSubqueryRows and maxSubqueryBytes) to prevent out-of-memory errors.UNION ALL: A UNION ALL query is executed by running a separate query for each table in the union. The Broker then merges the results. This is generally more efficient than a subquery approach but can still be. If you find yourself frequently writing UNION ALL queries over datasources with identical schemas, consider whether they should be ingested into a single datasource instead. If you must use UNION ALL, it can sometimes be faster to run the individual queries from your client application and perform the final merge there.Optimizing this layer of your Druid cluster can be complex. For expert guidance and a professional review of your Druid architecture, consider reaching out to the specialists at Apache Druid professionals at Iunera.
This post is the summary of a series on Apache Druid Query Performance. Apache Druid… Read More
This article is Part 4 of a series on Apache Druid Query Performance. Apache Druid… Read More
This article is Part 2 of a series on Apache Druid Query Performance. Apache Druid… Read More
This article is Part 1 of a series on Apache Druid Query Performance. Apache Druid… Read More
Unlock the full potential of your Apache Druid cluster. This series introduction explains why performance… Read More
This guide builds on Infrastructure Setup for Enterprise Apache Druid on Kubernetes – Building the… Read More