
This article is Part 1 of a series on Apache Druid Query Performance.
- Apache Druid Query Performance Bottlenecks: A Q&A Guide
- The Foundations of Apache Druid Performance Tuning: Data & Segments (You are here)
- Apache Druid Advanced Data Modeling for Peak Performance
- Writing Performant Apache Druid Queries
- Apache Druid Cluster Tuning & Resource Management
- Apache Druid Query Performance Bottlenecks: Series Summary
Welcome to the first part of our deep dive into Apache Druid performance. The physical layout of data is the bedrock of Druid’s speed. How data is partitioned, stored, and organized into segments dictates the efficiency of every subsequent operation. The most common and severe performance issues originate from sub-optimal data layout, making this the first and most critical area to investigate and optimize.
My queries are incredibly slow, and the system schema reveals tens of thousands of tiny segments. Will compaction solve this, and why is it so important?
Source Discussion: A frequently encountered issue detailed on Stack Overflow.
Answer: Yes, compaction is one of the most effective and critical solutions for this problem. A large number of small segments is a primary and severe cause of poor query performance in Druid. The architecture of Druid is designed for massively parallel processing, where queries are distributed across data servers (Historicals and Indexers) that process segments concurrently. However, this parallelism is not infinite, and every segment, regardless of its size, incurs a fixed scheduling and processing overhead. When a query’s time interval covers thousands or tens of thousands of tiny segments, this overhead accumulates and can overwhelm the system.
The query lifecycle in this scenario reveals several sources of inefficiency. First, the Broker must identify every single segment relevant to the query’s time interval from the metadata store. It then fans out sub-queries to all the data nodes holding those segments. On each data node, a processing thread must be scheduled from a fixed-size pool (defined by druid.processing.numThreads) to handle each segment. This involves opening file handles, mapping the segment data into memory, and performing the actual scan. With a proliferation of small segments, the system spends a disproportionate amount of time managing these micro-tasks rather than executing meaningful computational work. This leads to tasks waiting in the processing queue (observable via the segment/scan/pending metric) and a high query/wait/time, indicating that tasks are bottlenecked by resource scheduling, not the work itself.
Compaction directly addresses this by functioning as a background re-indexing process. It merges numerous small segments into a smaller number of larger, more optimized ones. This fundamental change offers several benefits:
- Reduced Overhead: The per-segment scheduling and I/O overhead is drastically lowered. A query that previously had to orchestrate 10,000 segment scans might now only need to manage 20.
- Efficient Resource Utilization: Each thread in the processing pool now operates on a much larger, more significant chunk of data (ideally millions of rows), making far more efficient use of CPU cycles.
- Improved Data Locality: Compaction can also improve data locality and the effectiveness of rollup, further reducing the total data size.
This issue often arises from a fundamental tension in real-time ingestion pipelines. To make data available for querying with minimal latency, streaming ingestion tasks (Peons) frequently persist their in-memory buffers to small, immutable segments on disk. This is by design. However, without a corresponding data lifecycle management strategy, this continuous creation of small segments leads to “segment proliferation.” Therefore, slow queries on recently ingested data are often not a query-time problem at all, but a symptom of an ingestion strategy that is not balanced with a robust, always-on compaction strategy. Compaction should not be viewed as an occasional cleanup task but as an integral, non-negotiable component of any Apache Druid ingestion pipeline. Failing to configure it properly is a design flaw that guarantees future performance degradation.
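To make this concrete, here is a minimal sketch of what an always-on auto-compaction configuration might look like when submitted to the Coordinator’s compaction config API. The datasource name, skip offset, and row target below are illustrative assumptions, not values taken from the discussion above:

```json
{
  "dataSource": "clickstream_events",
  "skipOffsetFromLatest": "PT2H",
  "tuningConfig": {
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 5000000
    }
  },
  "granularitySpec": {
    "segmentGranularity": "DAY"
  }
}
```

In this sketch, skipOffsetFromLatest keeps auto-compaction away from the most recent time chunks that streaming tasks are still actively writing, while maxRowsPerSegment steers the merged segments toward the row-count target discussed in the next question.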
What are the ideal segment size settings? The documentation mentions 5 million rows and 300-700MB. Which one should I prioritize?
Source Discussion: A direct follow-up based on official recommendations and community experience.
Answer: Prioritize the number of rows per segment, targeting approximately 5 million rows. The segment byte size is a useful secondary heuristic but should not be the primary target. The reasoning behind this prioritization lies in how Druid executes queries and manages work. Each segment is a fundamental unit of parallel processing, and a single thread from a data node’s processing pool is assigned to scan one segment at a time. Therefore, the number of rows in a segment directly correlates to the amount of computational work that a single thread will perform. By targeting a consistent row count across all segments, you ensure a balanced and predictable distribution of work across the cluster’s processing threads. This maximizes the effectiveness of parallelism and makes query performance more consistent.
The 300-700 MB recommendation is a general guideline that often aligns with the 5 million row target for datasets with typical schema complexity and data types. However, this correlation can easily break down. For example:
- A datasource with heavy rollup and many low-cardinality string columns will have a very small average row size. A 500 MB segment might contain 15 million rows, which could be too much work for a single thread, potentially causing that task to become a straggler and delay the entire query.
- Conversely, a datasource with no rollup, high-cardinality string columns, and complex types (like sketches) will have a large average row size. A 500 MB segment might only contain 500,000 rows, leading back to the problem of excessive per-segment overhead.
The 5 million row target represents a “sweet spot” that balances the macro-level overhead of segment management with the micro-level goal of maximizing CPU cache hits during the core computational loops.
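If you want the row count, rather than byte size, to drive segment boundaries, a partitionsSpec in the batch ingestion or compaction tuningConfig can target it directly. A minimal sketch follows; the partition dimension name is a hypothetical example:

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "forceGuaranteedRollup": true,
    "partitionsSpec": {
      "type": "range",
      "partitionDimensions": ["customer_id"],
      "targetRowsPerSegment": 5000000
    }
  }
}
```

Here targetRowsPerSegment keeps each segment close to the ~5 million row sweet spot regardless of how wide or narrow individual rows happen to be, which is exactly why the row count is the better primary target.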
How does segmentGranularity (e.g., DAY, HOUR) affect performance and storage?
Answer: segmentGranularity is a critical ingestion-time setting that defines the time window for which segments are created. It directly impacts time-based partitioning, data locality, and query performance.
- Performance Impact: When you query Druid, the Broker first uses the query’s time interval to prune the list of segments to scan. A finer segmentGranularity (like HOUR) creates more segments for a given period than a coarser one (like DAY).
  - Queries over short intervals: A finer granularity can be faster. A query for 30 minutes of data with HOUR granularity will only need to scan segments for that single hour. With DAY granularity, it would have to scan the entire day’s segment.
  - Queries over long intervals: A coarser granularity is often better. A query for 90 days of data with DAY granularity will scan ~90 segments. With HOUR granularity, it would need to scan 90 * 24 = 2,160 segments, introducing significant overhead.
- Storage and Ingestion Impact: segmentGranularity also influences how data is handed off from real-time ingestion tasks. A real-time task works on one segment for a given granularity period at a time. A DAY granularity means a single task will hold all data for that day until the segment is published. This can increase memory pressure on Middle Managers for high-volume streams. Using HOUR granularity allows tasks to hand off smaller segments more frequently.
Best Practice: Choose a segmentGranularity that aligns with your most common query patterns. If most queries are over a few hours, HOUR is a good choice. If they typically span multiple days or weeks, DAY is more appropriate.
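For reference, this is roughly where the setting lives in a native ingestion spec’s dataSchema; a minimal sketch assuming an hourly layout:

```json
{
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "HOUR",
    "queryGranularity": "MINUTE",
    "rollup": true
  }
}
```

Note that segmentGranularity controls the time-chunk boundaries of the segment files themselves, while queryGranularity controls how finely timestamps are truncated inside each segment; they are independent knobs.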
Besides merging small segments, what other benefits does compaction offer?
Answer: While merging small segments is the primary reason for compaction, it offers several other powerful optimization capabilities:
- Re-partitioning: Compaction can change the secondary partitioning scheme within a time chunk. You can add or change the partitioning configuration (a partitionsSpec in a native compaction task, or a PARTITIONED BY / CLUSTERED BY clause in SQL-based ingestion) to sort data by a frequently filtered dimension, which dramatically improves data locality and filter performance.
- Schema Changes: Compaction can be used to apply certain schema changes to existing data without a full re-ingestion. This includes adding new filtered dimensions or changing the metric definitions (e.g., switching from an approximate hyperUnique to a more accurate HLLSketch).
- Further Rollup: If your initial ingestion used a fine queryGranularity (e.g., MINUTE), you can run a compaction job with a coarser queryGranularity (e.g., HOUR) to further roll up the data, reducing storage size and speeding up queries over the historical data.
Compaction is not just a cleanup tool; it’s a powerful data lifecycle management and optimization feature that allows you to evolve your data’s physical layout as your query patterns change.
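Putting those capabilities together, a manual compaction task submitted to the Overlord might look roughly like the sketch below. The datasource, interval, and clustering dimension are illustrative assumptions:

```json
{
  "type": "compact",
  "dataSource": "clickstream_events",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2024-01-01/2024-02-01"
    }
  },
  "granularitySpec": {
    "segmentGranularity": "DAY",
    "queryGranularity": "HOUR"
  },
  "tuningConfig": {
    "type": "index_parallel",
    "forceGuaranteedRollup": true,
    "partitionsSpec": {
      "type": "range",
      "partitionDimensions": ["country"],
      "targetRowsPerSegment": 5000000
    }
  }
}
```

In this sketch, the coarser queryGranularity applies further rollup to the month being compacted, and the range partitionsSpec re-clusters the data so that filters on the chosen dimension can prune effectively, the same re-partitioning benefit described above.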