
This article is Part 3 of a series on Apache Druid Query Performance.
- Apache Druid Query Performance Bottlenecks: A Q&A Guide
- The Foundations of Apache Druid Performance Tuning: Data & Segments
- Apache Druid Advanced Data Modeling for Peak Performance
- Writing Performant Apache Druid Queries (You are here)
- Apache Druid Cluster Tuning & Resource Management
- Apache Druid Query Performance Bottlenecks: Series Summary
Best Practice: Whenever possible, structure your analytical questions to fit the Timeseries or TopN patterns to leverage their specialized, high-performance execution paths.
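As an illustration, a query that groups on a single dimension, orders by a metric, and applies a `LIMIT` can typically be planned as a `TopN` rather than a general `GroupBy`. This is a sketch against the `wikipedia` datasource used later in this article; the `"channel"` column and the one-day interval are illustrative assumptions:

```sql
-- One dimension in GROUP BY, ORDER BY a metric, plus a LIMIT:
-- this shape lets Druid use the TopN engine instead of a full GroupBy.
SELECT "channel", COUNT(*) AS "Edits"
FROM "wikipedia"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
```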
Question: What are the most impactful filtering strategies in Druid SQL?
Answer: The single most important filtering strategy in Druid is to always filter on the `__time` column over the smallest possible interval. The second most impactful strategy is to use equality filters (`=` or `IN`) on low-to-medium-cardinality dimensions.
- Time Filter is King: Druid is a time-series database first, and data is physically partitioned by time. A `WHERE __time BETWEEN ...` clause allows the Broker to perform "segment pruning," immediately ignoring any segments outside the query's time range. A query without a time filter is a worst-case scenario, forcing a full scan of the entire datasource.
- Use Indexes with Equality Filters: Druid builds bitmap indexes for dimension columns. An equality filter like `WHERE country = 'US'` can use these indexes to instantly identify matching rows without scanning the column's values.
- Avoid Performance Killers:
  - Functions on Columns: Applying a function in the `WHERE` clause (e.g., `WHERE UPPER(product) = 'BIKE'`) prevents index usage and forces a full scan and computation on every row.
  - Inequality and Leading Wildcards: Filters like `!=` or `LIKE '%value'` are expensive because they cannot use indexes efficiently.
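The contrast can be sketched with two queries against the `wikipedia` datasource referenced elsewhere in this article (the specific date bounds are illustrative assumptions):

```sql
-- Index-friendly: a tight time bound enables segment pruning, and the
-- equality filter on "countryName" can use the column's bitmap index.
SELECT "page", COUNT(*) AS "Edits"
FROM "wikipedia"
WHERE __time >= TIMESTAMP '2016-06-27' AND __time < TIMESTAMP '2016-06-28'
  AND "countryName" = 'United States'
GROUP BY 1;

-- Anti-pattern: the function call and leading wildcard defeat the index,
-- and the missing time filter forces a scan of every segment.
SELECT "page", COUNT(*) AS "Edits"
FROM "wikipedia"
WHERE UPPER("page") LIKE '%DRUID%'
GROUP BY 1;
```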
Question: How can I see how Druid will execute my query?
Answer: `EXPLAIN PLAN` is a powerful Druid SQL command that shows you how Druid will execute your query. It translates your SQL into the native JSON query format and provides a query plan, allowing you to see exactly what operations will be performed.
To use it, simply prepend `EXPLAIN PLAN FOR` to your SQL query:

```sql
EXPLAIN PLAN FOR
SELECT "page", COUNT(*) AS "Edits"
FROM "wikipedia"
WHERE "countryName" IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
```
The output will be a JSON object containing three key pieces of information:
- `PLAN`: A textual representation of the execution steps. This will show you whether Druid is using a `GroupBy`, `TopN`, or `Timeseries` engine, and what operations like filtering and sorting are being applied.
- `RESOURCES`: A list of the datasources the query will touch.
- `QUERY`: The full native JSON query that Druid generated. This is invaluable for deep debugging.
By inspecting the plan, you can verify whether your query is being optimized as expected. For example, you can check if a `GROUP BY ... ORDER BY ... LIMIT` query is correctly being translated into a more performant `TopN` query.
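A quick sketch of that verification step, reusing the example query above (the exact output text varies by Druid version, so treat the comment as a guide rather than literal output):

```sql
-- Run the EXPLAIN, then inspect the generated native query: a single
-- GROUP BY dimension with ORDER BY <metric> DESC LIMIT N should yield
-- "queryType": "topN" rather than "queryType": "groupBy".
EXPLAIN PLAN FOR
SELECT "page", COUNT(*) AS "Edits"
FROM "wikipedia"
WHERE "countryName" IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
```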
Question: How do subqueries and UNION ALL affect query performance?
Answer: Subqueries and `UNION ALL` are supported in Druid SQL, but they have performance implications that are important to understand.
- Subqueries: Druid often executes subqueries by materializing their full result set on the Broker before using them in the outer query (e.g., as the right-hand side of a `JOIN`). If the subquery returns a large number of rows, this can create a memory bottleneck on the Broker. Druid has configuration limits (`maxSubqueryRows` and `maxSubqueryBytes`) to prevent out-of-memory errors.
  - Best Practice: Minimize the use of subqueries. If possible, pre-calculate the subquery's result and ingest it as a separate lookup table. When a subquery is necessary, ensure it is as restrictive as possible, returning the minimum number of rows and columns needed.
- `UNION ALL`: A `UNION ALL` query is executed by running a separate query for each table in the union; the Broker then merges the results. This is generally more efficient than a subquery approach but can still be expensive. If you find yourself frequently writing `UNION ALL` queries over datasources with identical schemas, consider whether they should be ingested into a single datasource instead. If you must use `UNION ALL`, it can sometimes be faster to run the individual queries from your client application and perform the final merge there.
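A minimal sketch of the pattern, assuming two hypothetical per-year datasources with identical schemas (`wikipedia_2015` and `wikipedia_2016` are placeholder names):

```sql
-- Each branch of a top-level UNION ALL runs as its own native query;
-- the Broker only concatenates the result sets. Keep every branch as
-- selective as possible, with its own time filter.
SELECT "page", COUNT(*) AS "Edits"
FROM "wikipedia_2015"
WHERE __time >= TIMESTAMP '2015-01-01' AND __time < TIMESTAMP '2016-01-01'
GROUP BY 1
UNION ALL
SELECT "page", COUNT(*) AS "Edits"
FROM "wikipedia_2016"
WHERE __time >= TIMESTAMP '2016-01-01' AND __time < TIMESTAMP '2017-01-01'
GROUP BY 1;
```

If the per-branch results are small, issuing the two branch queries separately from your client and merging there offloads that final merge from the Broker.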
Optimizing this layer of your Druid cluster can be complex. For expert guidance and a professional review of your Druid architecture, consider reaching out to the Apache Druid professionals at Iunera.