This article is Part 2 of a series on Apache Druid Query Performance.
The most profound performance optimizations in Druid happen before a single query is ever written. Apache Druid Advanced Data Modeling is about structuring your data at ingestion time to align with Druid’s architectural strengths. Decisions made here have a lasting impact on query speed, storage costs, and overall cluster efficiency.
What is the single most important data modeling feature for Druid query performance?
Answer: Rollup. Rollup is the process of pre-aggregating raw data during ingestion based on the dimensions and queryGranularity you specify. It is the most powerful feature for reducing data size and accelerating queries in Druid.
For example, if you have thousands of click events per minute, each with the same dimensions (e.g., country, browser), rollup can combine these into a single row in Druid. The metric columns in this row would contain the aggregated values (e.g., SUM(clicks), COUNT(*)).
The performance impact is massive: fewer stored rows mean smaller segments, lower storage costs, and far less data to scan and aggregate at query time.
Best Practice: Always enable rollup unless you absolutely need to query every raw event. Set the queryGranularity to the coarsest level that satisfies your business requirements. If you only need to analyze data at a minute-level resolution, do not ingest it at the millisecond level.
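To make this concrete, here is a minimal sketch of rollup expressed through SQL-based (multi-stage) ingestion; the raw_clicks source, the clicks_rollup target, and the column names are illustrative assumptions, not taken from the original example:

```sql
-- Re-ingest raw events with rollup at one-minute granularity.
-- Datasource and column names are illustrative.
INSERT INTO "clicks_rollup"
SELECT
  TIME_FLOOR("__time", 'PT1M') AS "__time",  -- queryGranularity: one minute
  "country",
  "browser",
  COUNT(*)      AS "event_count",            -- how many raw rows were combined
  SUM("clicks") AS "clicks"                  -- pre-aggregated metric
FROM "raw_clicks"
GROUP BY 1, 2, 3
PARTITIONED BY DAY
```

Every raw row that shares the same minute, country, and browser collapses into a single stored row, which is exactly the reduction described above.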
How does secondary partitioning work, and why does it matter?
Answer: While time is Druid’s primary partition key, secondary partitioning (clusteredBy in SQL-based ingestion, or the partitionsSpec in native batch ingestion) organizes data within each time chunk. It physically sorts and groups data based on the values of one or more specified dimensions.
When a query then filters on that partitioned dimension, the performance gains are significant. The query engine can leverage the sorted nature of the data to quickly seek to the relevant blocks within a segment file, avoiding a full column scan. This improves data locality and dramatically reduces the amount of data that needs to be read from disk and processed.
Best Practice:
- Cluster on the dimensions (other than __time) that appear most frequently in WHERE clauses.
- Keep the clusteredBy (or partitionsSpec) configuration simple. It’s often best to start with a low-to-medium cardinality dimension.
- The order of dimensions in the dimensionsSpec also defines a sort order. Place your most frequently filtered dimensions first in this list to gain additional locality benefits.
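As a rough illustration, secondary partitioning maps to the CLUSTERED BY clause in SQL-based ingestion; the datasource and column names are again illustrative:

```sql
-- Time partitioning by day, plus secondary partitioning on "country".
-- Rows within each daily time chunk are sorted and grouped by country,
-- so filters on country can skip large parts of each segment.
INSERT INTO "clicks_rollup"
SELECT
  TIME_FLOOR("__time", 'PT1M') AS "__time",
  "country",
  "browser",
  SUM("clicks") AS "clicks"
FROM "raw_clicks"
GROUP BY 1, 2, 3
PARTITIONED BY DAY
CLUSTERED BY "country"
```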
How should high-cardinality dimensions be modeled?
Answer: High-cardinality dimensions are a common challenge and require specific modeling techniques to avoid performance degradation, especially during ingestion and for GROUP BY queries.
- If you only need distinct counts, ingest the dimension as an HLLSketch or thetaSketch metric instead of a regular column. These sketches allow for extremely fast and accurate approximate COUNT(DISTINCT...) calculations at query time without the overhead of a high-cardinality dimension column. This is the recommended approach for cardinality estimation in Druid.
- Where possible, decompose the dimension into lower-cardinality components: for example, product_id could be broken into product_category and product_model.
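For instance, here is a sketch of the query-time side, assuming the druid-datasketches extension is loaded and an illustrative user_id column:

```sql
-- Approximate COUNT(DISTINCT user_id) using an HLL sketch.
-- APPROX_COUNT_DISTINCT_DS_HLL accepts either a raw column or a column
-- that was pre-aggregated into an HLL sketch at ingestion time.
SELECT
  "country",
  APPROX_COUNT_DISTINCT_DS_HLL("user_id") AS "unique_users"
FROM "clicks"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY 1
```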
How well do JOINs perform in Druid?
Answer: Druid’s architecture is optimized for scanning and aggregating single, large fact tables. While Druid supports JOIN operations, its performance characteristics are different from traditional relational databases. Druid does support query-time JOINs. The most common and performant type is a broadcast join, where a smaller dimension table (or subquery result) is broadcast to all data nodes to be joined with the large, distributed fact table. This works well when the right-hand side of the join is small enough to fit in memory on each data node.
Best Practice: Denormalize whenever possible. Use query-time JOINs for enriching your fact table with smaller dimension tables (lookups) that might change over time. Avoid large distributed joins if sub-second query latency is required.
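A minimal sketch of such a query-time join follows, with illustrative datasource names; the small dim_country table stands in for any dimension table or lookup:

```sql
-- Join a large fact datasource against a small dimension datasource.
-- The right-hand side must be small enough to build into an in-memory
-- hash table on the data nodes.
SELECT
  d."country_name",
  SUM(f."clicks") AS "clicks"
FROM "clicks_rollup" AS f
JOIN "dim_country" AS d
  ON f."country" = d."country_code"
GROUP BY 1
```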
How does Druid handle nested JSON data?
Answer: Druid has native support for ingesting and querying nested JSON data using the COMPLEX<json> data type. It can automatically detect nested structures during ingestion and provides a suite of SQL functions (JSON_VALUE, JSON_QUERY, etc.) to extract values at query time.
Fields that you query frequently perform best when they are extracted into top-level columns with native types (LONG, DOUBLE, STRING). This provides the best of both worlds: the flexibility to store the raw JSON for ad-hoc exploration and the high performance of columnar storage for common query patterns.
Best Practice: For nested data, ingest the full object as a COMPLEX<json> column. For fields that you filter or group on frequently, use the flattenSpec (in native ingestion) or SQL functions during ingestion to extract them into their own top-level columns. This ensures optimal performance for your most common queries.
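As an illustration of the query-time side, here is a sketch that extracts values from a hypothetical COMPLEX<json> column named payload:

```sql
-- Extract nested fields at query time; RETURNING coerces the value
-- to a SQL type. Datasource, column, and JSON paths are illustrative.
SELECT
  JSON_VALUE("payload", '$.device.os') AS "os",
  SUM(JSON_VALUE("payload", '$.metrics.bytes' RETURNING BIGINT)) AS "total_bytes"
FROM "events"
GROUP BY 1
```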
What are virtual columns, and when should they be used?
Answer: Virtual columns are temporary columns that are calculated on-the-fly at query time. They allow you to apply transformations or calculations to one or more base columns without permanently storing the result.
Common uses include combining or transforming base columns with expressions (e.g., concat(firstName, ' ', lastName)) and filtering on a computed value in a WHERE clause.
Best Practice: Use virtual columns for ad-hoc exploration and for transformations that are not known at ingestion time. If you find yourself repeatedly using the same virtual column in performance-sensitive queries, it is a strong signal that you should materialize that column during ingestion for much better performance.
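For example, in Druid SQL any expression over base columns is planned as a virtual column and evaluated per row at query time; the names below are illustrative:

```sql
-- Both expressions become virtual columns: one in the SELECT list,
-- one in the WHERE clause. Neither result is stored in the segments.
SELECT
  CONCAT("firstName", ' ', "lastName") AS "full_name",
  COUNT(*) AS "events"
FROM "user_activity"
WHERE UPPER("country") = 'DE'
GROUP BY 1
```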