Apache Druid Advanced Data Modeling for Peak Performance

This article is Part 2 of a series on Apache Druid Query Performance.

The most profound performance optimizations in Druid happen before a single query is ever written. Advanced data modeling in Apache Druid means structuring your data at ingestion time so that it aligns with Druid’s architectural strengths. Decisions made here have a lasting impact on query speed, storage costs, and overall cluster efficiency.

What is the single most important concept in Druid data modeling for query speed?

Answer: Rollup. Rollup is the process of pre-aggregating raw data during ingestion based on the dimensions and queryGranularity you specify. It is the most powerful feature for reducing data size and accelerating queries in Druid.

For example, if you have thousands of click events per minute, each with the same dimensions (e.g., country, browser), rollup can combine these into a single row in Druid. The metric columns in this row would contain the aggregated values (e.g., SUM(clicks), COUNT(*)).
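For illustration, here is a minimal SQL-based ingestion (MSQ) sketch of that rollup. The datasource name clicks_rollup is hypothetical, and raw_clicks stands in for your input (for example, an EXTERN source or an existing datasource):

  -- Roll up raw click events to one row per minute per (country, browser).
  INSERT INTO clicks_rollup
  SELECT
    TIME_FLOOR(__time, 'PT1M') AS __time,   -- minute-level queryGranularity
    country,
    browser,
    COUNT(*)    AS event_count,             -- how many raw events were combined
    SUM(clicks) AS clicks                   -- pre-aggregated metric
  FROM raw_clicks
  GROUP BY 1, 2, 3
  PARTITIONED BY DAY

Note that PARTITIONED BY DAY controls segment time partitioning; the minute-level rollup granularity comes from the TIME_FLOOR expression (or queryGranularity in native ingestion).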

The performance impact is massive:

  • Drastic Data Reduction: Rollup can reduce the total storage footprint by one to two orders of magnitude. This is the ultimate form of compression.
  • Massively Faster Scans: Because the number of rows is significantly reduced, queries have to scan far less data. A query that would have needed to scan a billion raw rows might now only need to scan ten million rolled-up rows, typically yielding a roughly proportional decrease in query time.

Best Practice: Always enable rollup unless you absolutely need to query every raw event. Set the queryGranularity to the coarsest level that satisfies your business requirements. If you only need to analyze data at a minute-level resolution, do not ingest it at the millisecond level.
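Rollup is transparent at query time: queries simply re-aggregate the pre-aggregated metrics. Here is a sketch against the hypothetical clicks_rollup table above; the one subtlety is that raw event counts come from summing the ingested count metric rather than from COUNT(*):

  -- SUM re-aggregates the rolled-up metrics, so totals stay correct.
  -- COUNT(*) here would count rolled-up rows, not original events.
  SELECT
    country,
    SUM(clicks)      AS total_clicks,
    SUM(event_count) AS raw_events
  FROM clicks_rollup
  WHERE __time >= TIMESTAMP '2024-01-01 00:00:00'
    AND __time <  TIMESTAMP '2024-02-01 00:00:00'
  GROUP BY country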

How should I choose my partitioning keys (PARTITIONED BY, CLUSTERED BY)?

Answer: While time is Druid’s primary partition key (PARTITIONED BY in SQL-based ingestion), secondary partitioning (CLUSTERED BY in SQL-based ingestion, or a dimension-based partitionsSpec in native ingestion) organizes data within each time chunk. It physically sorts and groups data based on the values of one or more specified dimensions.

When a query then filters on that partitioned dimension, the performance gains are significant. The query engine can leverage the sorted nature of the data to quickly seek to the relevant blocks within a segment file, avoiding a full column scan. This improves data locality and dramatically reduces the amount of data that needs to be read from disk and processed.

Best Practice:

  1. Identify Primary Filter Dimensions: Analyze your most common and performance-critical queries. Identify the dimension columns (other than __time) that appear most frequently in WHERE clauses.
  2. Order by Cardinality: Choose one or more of these dimensions for your CLUSTERED BY clause or dimension-based partitionsSpec, as shown in the example after this list. It’s often best to start with a low-to-medium cardinality dimension.
  3. Align with Ingestion Sort: The order of dimensions in your dimensionsSpec also defines a sort order. Place your most frequently filtered dimensions first in this list to gain additional locality benefits.
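Continuing the hypothetical clicks_rollup ingestion, here is a minimal sketch of secondary partitioning in SQL-based ingestion; in native ingestion the equivalent is a dimension-based (range) partitionsSpec:

  -- Rows within each DAY time chunk are partitioned and sorted by
  -- (country, browser), so filters on country can skip large ranges of data.
  INSERT INTO clicks_rollup
  SELECT
    TIME_FLOOR(__time, 'PT1M') AS __time,
    country,
    browser,
    COUNT(*)    AS event_count,
    SUM(clicks) AS clicks
  FROM raw_clicks
  GROUP BY 1, 2, 3
  PARTITIONED BY DAY
  CLUSTERED BY country, browser

Listing country first reflects the guidance above: in this hypothetical schema it is the lower-cardinality dimension that appears most often in WHERE clauses.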

My data has high-cardinality dimensions (like user IDs). How do I handle them without killing performance?

Answer: High-cardinality dimensions are a common challenge and require specific modeling techniques to avoid performance degradation, especially during ingestion and for GROUP BY queries.

  • Problem: High-cardinality dimensions create very large dictionaries and indexes, which can slow down ingestion, increase segment size, and put pressure on memory. Queries that group on these dimensions can create massive intermediate result sets that bottleneck the Broker.
  • Solutions:
    1. Avoid Grouping Directly: The best solution is to avoid grouping on the raw high-cardinality dimension. Instead, ask if the business question can be answered with an approximate count-distinct.
    2. Use Sketch Aggregators: At ingestion time, instead of storing the raw user ID as a dimension, ingest it into a metric column using a sketch algorithm like HLLSketch or thetaSketch. These sketches allow for extremely fast approximate COUNT(DISTINCT ...) calculations with well-understood error bounds at query time, without the overhead of a high-cardinality dimension column (see the example after this list). This is the recommended approach for cardinality estimation in Druid.
    3. Filter, Then Group: If you must query the raw values, always apply the most restrictive filters possible before grouping on the high-cardinality column. This reduces the number of unique values the grouping engine has to process.
    4. Split the Dimension: In some cases, you can split a high-cardinality dimension into multiple lower-cardinality ones. For example, a product_id could be broken into product_category and product_model.
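Here is a minimal sketch of the sketch-aggregator approach, assuming the druid-datasketches extension is loaded; the user_activity and raw_events names are hypothetical:

  -- Ingestion time: fold user_id into an HLL sketch metric instead of
  -- storing it as a high-cardinality dimension.
  INSERT INTO user_activity
  SELECT
    TIME_FLOOR(__time, 'PT1H') AS __time,
    country,
    DS_HLL(user_id) AS unique_users_sketch
  FROM raw_events
  GROUP BY 1, 2
  PARTITIONED BY DAY

  -- Query time (run separately): merge the sketches for a fast
  -- approximate distinct count per country.
  SELECT
    country,
    APPROX_COUNT_DISTINCT_DS_HLL(unique_users_sketch) AS approx_unique_users
  FROM user_activity
  GROUP BY country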

Are JOINs slow in Druid? When should I use them versus denormalizing data?

Answer: Druid’s architecture is optimized for scanning and aggregating single, large fact tables. While Druid supports JOIN operations, its performance characteristics are different from traditional relational databases.

  • Ingestion-Time Joins (Denormalization): The fastest and most recommended approach is to denormalize your data before or during ingestion. This means joining your fact table with smaller dimension tables to create a single, wide table in Druid. This “pre-joining” ensures that all data needed for a query is co-located, allowing Druid’s query engine to operate at maximum efficiency.
  • Query-Time Joins: Druid also supports query-time JOINs. The most common and performant type is a broadcast join, where a smaller dimension table (or subquery result) is broadcast to all data nodes to be joined with the large, distributed fact table. This works well when the right-hand side of the join is small enough to fit in memory on each data node (see the example at the end of this answer).
  • Performance Considerations: Large fact-to-fact joins can be slow because they may require shuffling large amounts of data between nodes, which is not Druid’s primary strength.

Best Practice: Denormalize whenever possible. Use query-time JOINs for enriching your fact table with smaller dimension tables (lookups) that might change over time. Avoid large distributed joins if sub-second query latency is required.
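For example, a query-time enrichment sketch, assuming a registered lookup named country_names (Druid lookups expose key/value columns k and v) and the hypothetical clicks_rollup table from earlier:

  -- The lookup is small, so it is broadcast to the data nodes and joined
  -- against the large, distributed fact table.
  SELECT
    l.v           AS country_name,
    SUM(f.clicks) AS total_clicks
  FROM clicks_rollup f
  LEFT JOIN lookup.country_names l
    ON f.country = l.k
  GROUP BY 1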

How does Druid handle nested JSON data, and what are the performance implications?

Answer: Druid has native support for ingesting and querying nested JSON data using the COMPLEX<json> data type. It can automatically detect nested structures during ingestion and provides a suite of SQL functions (JSON_VALUE, JSON_QUERY, etc.) to extract values at query time.

  • Performance: While flexible, querying raw nested JSON can be slower than querying standard flat columns. For each query, Druid must parse the JSON structure to extract the requested fields.
  • Optimization: For frequently accessed nested fields, Druid offers a powerful optimization. During ingestion, you can define dimensions that automatically extract specific fields from the raw JSON into their own optimized, columnar format (e.g., LONG, DOUBLE, STRING). This provides the best of both worlds: the flexibility to store the raw JSON for ad-hoc exploration and the high performance of columnar storage for common query patterns.

Best Practice: For nested data, ingest the full object as a COMPLEX<json> column. For fields that you filter or group on frequently, use the flattenSpec (in native ingestion) or SQL functions during ingestion to extract them into their own top-level columns. This ensures optimal performance for your most common queries.
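Here is a sketch of both halves of that pattern, assuming a hypothetical events_raw source whose attributes column is COMPLEX<json>, with the JSON paths made up for illustration:

  -- Ingestion (SQL-based): keep the full nested object and promote the
  -- frequently filtered fields to their own flat, columnar columns.
  INSERT INTO events
  SELECT
    __time,
    attributes,                                             -- raw COMPLEX<json>
    JSON_VALUE(attributes, '$.device.os')   AS device_os,   -- hot field
    JSON_VALUE(attributes, '$.geo.country') AS country      -- hot field
  FROM events_raw
  PARTITIONED BY DAY

  -- Ad-hoc query (run separately): fields that were not promoted remain
  -- reachable through the nested column at query time.
  SELECT
    JSON_VALUE(attributes, '$.device.model') AS device_model,
    COUNT(*) AS events
  FROM events
  WHERE country = 'US'
  GROUP BY 1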

What are virtual columns, and when should I use them?

Answer: Virtual columns are temporary columns that are calculated on-the-fly at query time. They allow you to apply transformations or calculations to one or more base columns without permanently storing the result.

  • Use Cases:
    • Transformations: Concatenating strings, performing mathematical calculations, or extracting parts of a string (e.g., concat(firstName, ' ', lastName)).
    • On-the-fly Filtering: Creating a derived column to use in a WHERE clause.
    • Schema Flexibility: Testing out new derived dimensions before deciding to materialize them in your ingestion spec.
  • Performance: Because virtual columns are computed for every row processed by a query, they can add significant overhead. A query with a complex expression-based virtual column will be slower than a query on a pre-calculated, materialized column.

Best Practice: Use virtual columns for ad-hoc exploration and for transformations that are not known at ingestion time. If you find yourself repeatedly using the same virtual column in performance-sensitive queries, it is a strong signal that you should materialize that column during ingestion for much better performance.
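As a closing sketch, here is the same expression used both ways, with hypothetical users and users_modeled datasources:

  -- Query time: the expression is planned as a virtual column and
  -- evaluated for every row the query scans.
  SELECT
    CONCAT(firstName, ' ', lastName) AS full_name,
    COUNT(*) AS cnt
  FROM users
  GROUP BY 1

  -- Ingestion time (run separately): materialize the same expression once,
  -- so performance-sensitive queries read a pre-computed column instead.
  INSERT INTO users_modeled
  SELECT
    __time,
    CONCAT(firstName, ' ', lastName) AS full_name
  FROM users
  PARTITIONED BY DAY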