Apache Druid Advanced Data Modeling for Peak Performance

This article is Part 2 of a series on Apache Druid Query Performance.

The most profound performance optimizations in Druid happen before a single query is ever written. Advanced data modeling in Apache Druid means structuring your data at ingestion time so that it aligns with Druid’s architectural strengths. Decisions made here have a lasting impact on query speed, storage costs, and overall cluster efficiency.

What is the single most important concept in Druid data modeling for query speed?

Answer: Rollup. Rollup is the process of pre-aggregating raw data during ingestion based on the dimensions and queryGranularity you specify. It is the most powerful feature for reducing data size and accelerating queries in Druid.

For example, if you have thousands of click events per minute, each with the same dimensions (e.g., country, browser), rollup can combine these into a single row in Druid. The metric columns in this row would contain the aggregated values (e.g., SUM(clicks), COUNT(*)).
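For illustration, here is a minimal SQL-based ingestion (MSQ) sketch of that rollup. The datasource name clicks_rollup is hypothetical, and raw_clicks stands in for your input (for example, an EXTERN source or an existing datasource):

  -- Roll up raw click events to one row per minute per (country, browser).
  INSERT INTO clicks_rollup
  SELECT
    TIME_FLOOR(__time, 'PT1M') AS __time,   -- minute-level queryGranularity
    country,
    browser,
    COUNT(*)    AS event_count,             -- how many raw events were combined
    SUM(clicks) AS clicks                   -- pre-aggregated metric
  FROM raw_clicks
  GROUP BY 1, 2, 3
  PARTITIONED BY DAY

Note that PARTITIONED BY DAY controls segment time partitioning; the minute-level rollup granularity comes from the TIME_FLOOR expression (or queryGranularity in native ingestion).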

The performance impact is massive:

  • Drastic Data Reduction: Rollup can reduce the total storage footprint by one to two orders of magnitude. This is the ultimate form of compression.
  • Massively Faster Scans: Because the number of rows is significantly reduced, queries have to scan far less data. A query that would have needed to scan a billion raw rows might now only need to scan ten million rolled-up rows, typically yielding a roughly proportional decrease in query time.

Best Practice: Always enable rollup unless you absolutely need to query every raw event. Set the queryGranularity to the coarsest level that satisfies your business requirements. If you only need to analyze data at a minute-level resolution, do not ingest it at the millisecond level.
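Rollup is transparent at query time: queries simply re-aggregate the pre-aggregated metrics. Here is a sketch against the hypothetical clicks_rollup table above; the one subtlety is that raw event counts come from summing the ingested count metric rather than from COUNT(*):

  -- SUM re-aggregates the rolled-up metrics, so totals stay correct.
  -- COUNT(*) here would count rolled-up rows, not original events.
  SELECT
    country,
    SUM(clicks)      AS total_clicks,
    SUM(event_count) AS raw_events
  FROM clicks_rollup
  WHERE __time >= TIMESTAMP '2024-01-01 00:00:00'
    AND __time <  TIMESTAMP '2024-02-01 00:00:00'
  GROUP BY country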

How should I choose my partitioning keys (PARTITIONED BY, CLUSTERED BY)?

Answer: While time is Druid’s primary partition key (PARTITIONED BY in SQL-based ingestion), secondary partitioning (CLUSTERED BY in SQL-based ingestion, or a dimension-based partitionsSpec in native ingestion) organizes data within each time chunk. It physically sorts and groups data based on the values of one or more specified dimensions.

When a query then filters on that partitioned dimension, the performance gains are significant. The query engine can leverage the sorted nature of the data to quickly seek to the relevant blocks within a segment file, avoiding a full column scan. This improves data locality and dramatically reduces the amount of data that needs to be read from disk and processed.

Best Practice:

  1. Identify Primary Filter Dimensions: Analyze your most common and performance-critical queries. Identify the dimension columns (other than __time) that appear most frequently in WHERE clauses.
  2. Order by Cardinality: Choose one or more of these dimensions for your CLUSTERED BY clause or dimension-based partitionsSpec, as shown in the example after this list. It’s often best to start with a low-to-medium cardinality dimension.
  3. Align with Ingestion Sort: The order of dimensions in your dimensionsSpec also defines a sort order. Place your most frequently filtered dimensions first in this list to gain additional locality benefits.
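Continuing the hypothetical clicks_rollup ingestion, here is a minimal sketch of secondary partitioning in SQL-based ingestion; in native ingestion the equivalent is a dimension-based (range) partitionsSpec:

  -- Rows within each DAY time chunk are partitioned and sorted by
  -- (country, browser), so filters on country can skip large ranges of data.
  INSERT INTO clicks_rollup
  SELECT
    TIME_FLOOR(__time, 'PT1M') AS __time,
    country,
    browser,
    COUNT(*)    AS event_count,
    SUM(clicks) AS clicks
  FROM raw_clicks
  GROUP BY 1, 2, 3
  PARTITIONED BY DAY
  CLUSTERED BY country, browser

Listing country first reflects the guidance above: in this hypothetical schema it is the lower-cardinality dimension that appears most often in WHERE clauses.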

My data has high-cardinality dimensions (like user IDs). How do I handle them without killing performance?

Answer: High-cardinality dimensions are a common challenge and require specific modeling techniques to avoid performance degradation, especially during ingestion and for GROUP BY queries.

  • Problem: High-cardinality dimensions create very large dictionaries and indexes, which can slow down ingestion, increase segment size, and put pressure on memory. Queries that group on these dimensions can create massive intermediate result sets that bottleneck the Broker.
  • Solutions:
    1. Avoid Grouping Directly: The best solution is to avoid grouping on the raw high-cardinality dimension. Instead, ask if the business question can be answered with an approximate count-distinct.
    2. Use Sketch Aggregators: At ingestion time, instead of storing the raw user ID as a dimension, ingest it into a metric column using a sketch algorithm like HLLSketch or thetaSketch. These sketches allow for extremely fast approximate COUNT(DISTINCT ...) calculations with well-understood error bounds at query time, without the overhead of a high-cardinality dimension column (see the example after this list). This is the recommended approach for cardinality estimation in Druid.
    3. Filter, Then Group: If you must query the raw values, always apply the most restrictive filters possible before grouping on the high-cardinality column. This reduces the number of unique values the grouping engine has to process.
    4. Split the Dimension: In some cases, you can split a high-cardinality dimension into multiple lower-cardinality ones. For example, a product_id could be broken into product_category and product_model.
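Here is a minimal sketch of the sketch-aggregator approach, assuming the druid-datasketches extension is loaded; the user_activity and raw_events names are hypothetical:

  -- Ingestion time: fold user_id into an HLL sketch metric instead of
  -- storing it as a high-cardinality dimension.
  INSERT INTO user_activity
  SELECT
    TIME_FLOOR(__time, 'PT1H') AS __time,
    country,
    DS_HLL(user_id) AS unique_users_sketch
  FROM raw_events
  GROUP BY 1, 2
  PARTITIONED BY DAY

  -- Query time (run separately): merge the sketches for a fast
  -- approximate distinct count per country.
  SELECT
    country,
    APPROX_COUNT_DISTINCT_DS_HLL(unique_users_sketch) AS approx_unique_users
  FROM user_activity
  GROUP BY country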

Are JOINs slow in Druid? When should I use them versus denormalizing data?

Answer: Druid’s architecture is optimized for scanning and aggregating single, large fact tables. While Druid supports JOIN operations, its performance characteristics are different from traditional relational databases.

  • Ingestion-Time Joins (Denormalization): The fastest and most recommended approach is to denormalize your data before or during ingestion. This means joining your fact table with smaller dimension tables to create a single, wide table in Druid. This “pre-joining” ensures that all data needed for a query is co-located, allowing Druid’s query engine to operate at maximum efficiency.
  • Query-Time Joins: Druid also supports query-time JOINs. The most common and performant type is a broadcast join, where a smaller dimension table (or subquery result) is broadcast to all data nodes to be joined with the large, distributed fact table. This works well when the right-hand side of the join is small enough to fit in memory on each data node (see the example at the end of this answer).
  • Performance Considerations: Large fact-to-fact joins can be slow because they may require shuffling large amounts of data between nodes, which is not Druid’s primary strength.

Best Practice: Denormalize whenever possible. Use query-time JOINs for enriching your fact table with smaller dimension tables (lookups) that might change over time. Avoid large distributed joins if sub-second query latency is required.
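For example, a query-time enrichment sketch, assuming a registered lookup named country_names (Druid lookups expose key/value columns k and v) and the hypothetical clicks_rollup table from earlier:

  -- The lookup is small, so it is broadcast to the data nodes and joined
  -- against the large, distributed fact table.
  SELECT
    l.v           AS country_name,
    SUM(f.clicks) AS total_clicks
  FROM clicks_rollup f
  LEFT JOIN lookup.country_names l
    ON f.country = l.k
  GROUP BY 1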

How does Druid handle nested JSON data, and what are the performance implications?

Answer: Druid has native support for ingesting and querying nested JSON data using the COMPLEX<json> data type. It can automatically detect nested structures during ingestion and provides a suite of SQL functions (JSON_VALUE, JSON_QUERY, etc.) to extract values at query time.

  • Performance: While flexible, querying raw nested JSON can be slower than querying standard flat columns. For each query, Druid must parse the JSON structure to extract the requested fields.
  • Optimization: For frequently accessed nested fields, Druid offers a powerful optimization. During ingestion, you can define dimensions that automatically extract specific fields from the raw JSON into their own optimized, columnar format (e.g., LONG, DOUBLE, STRING). This provides the best of both worlds: the flexibility to store the raw JSON for ad-hoc exploration and the high performance of columnar storage for common query patterns.

Best Practice: For nested data, ingest the full object as a COMPLEX<json> column. For fields that you filter or group on frequently, use the flattenSpec (in native ingestion) or SQL functions during ingestion to extract them into their own top-level columns. This ensures optimal performance for your most common queries.
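Here is a sketch of both halves of that pattern, assuming a hypothetical events_raw source whose attributes column is COMPLEX<json>, with the JSON paths made up for illustration:

  -- Ingestion (SQL-based): keep the full nested object and promote the
  -- frequently filtered fields to their own flat, columnar columns.
  INSERT INTO events
  SELECT
    __time,
    attributes,                                             -- raw COMPLEX<json>
    JSON_VALUE(attributes, '$.device.os')   AS device_os,   -- hot field
    JSON_VALUE(attributes, '$.geo.country') AS country      -- hot field
  FROM events_raw
  PARTITIONED BY DAY

  -- Ad-hoc query (run separately): fields that were not promoted remain
  -- reachable through the nested column at query time.
  SELECT
    JSON_VALUE(attributes, '$.device.model') AS device_model,
    COUNT(*) AS events
  FROM events
  WHERE country = 'US'
  GROUP BY 1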

What are virtual columns, and when should I use them?

Answer: Virtual columns are temporary columns that are calculated on-the-fly at query time. They allow you to apply transformations or calculations to one or more base columns without permanently storing the result.

  • Use Cases:
    • Transformations: Concatenating strings, performing mathematical calculations, or extracting parts of a string (e.g., concat(firstName, ' ', lastName)).
    • On-the-fly Filtering: Creating a derived column to use in a WHERE clause.
    • Schema Flexibility: Testing out new derived dimensions before deciding to materialize them in your ingestion spec.
  • Performance: Because virtual columns are computed for every row processed by a query, they can add significant overhead. A query with a complex expression-based virtual column will be slower than a query on a pre-calculated, materialized column.

Best Practice: Use virtual columns for ad-hoc exploration and for transformations that are not known at ingestion time. If you find yourself repeatedly using the same virtual column in performance-sensitive queries, it is a strong signal that you should materialize that column during ingestion for much better performance.
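As a closing sketch, here is the same expression used both ways, with hypothetical users and users_modeled datasources:

  -- Query time: the expression is planned as a virtual column and
  -- evaluated for every row the query scans.
  SELECT
    CONCAT(firstName, ' ', lastName) AS full_name,
    COUNT(*) AS cnt
  FROM users
  GROUP BY 1

  -- Ingestion time (run separately): materialize the same expression once,
  -- so performance-sensitive queries read a pre-computed column instead.
  INSERT INTO users_modeled
  SELECT
    __time,
    CONCAT(firstName, ' ', lastName) AS full_name
  FROM users
  PARTITIONED BY DAY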