Infrastructure Setup for Enterprise Apache Druid on Kubernetes – Building the Foundation

Series: Installing a Production-Ready Apache Druid Cluster on Kubernetes — Part 1: Infrastructure Foundation

This is part of a series about installing a production-ready Apache Druid cluster on Kubernetes. This first part covers the infrastructure foundation; upcoming parts will cover deployment preparation, cluster configuration and tuning, authentication and authorization, and day‑2 operations.

Introduction

Setting up Apache Druid for enterprise production environments requires a solid infrastructure foundation that goes far beyond simply deploying the Druid cluster itself. A production-ready installation has three prerequisites: robust secrets management, a reliable GitOps workflow, and properly configured persistence layers for both metadata and deep storage.

Modern enterprises are increasingly turning to Apache Druid for their time series analytics needs, but many underestimate the infrastructure complexity required for production-ready deployments. Unlike traditional database setups, Druid’s distributed architecture demands careful orchestration of multiple storage backends, sophisticated secrets management, and enterprise-grade security practices, and it benefits greatly from proper DevOps tooling.

This comprehensive guide walks you through the essential infrastructure setup that must be completed before deploying Apache Druid on Kubernetes. We’ll cover secrets management preparation with SOPS encryption, GitOps configuration using FluxCD, and the setup of both PostgreSQL metadata storage and S3-compatible object storage for deep storage requirements.

Summary

This article provides a complete infrastructure foundation setup for enterprise Apache Druid deployments on Kubernetes. You’ll learn how to implement secure secret management using SOPS, configure GitOps workflows with FluxCD, set up PostgreSQL for metadata storage with proper authentication, and configure S3-compatible object storage (MinIO or AWS S3) for deep storage with appropriate IAM policies. The guide includes production-ready configurations, security best practices, and troubleshooting tips essential for enterprise environments.

Authorization Layer Preparation

Understanding Druid’s Authorization Architecture

Apache Druid’s enterprise deployment requires a sophisticated authorization architecture that manages secrets across multiple components. Secrets management serves as the foundation for authenticated communication between Druid services and their metadata storage, deep storage backends, and external services.

The key principle behind enterprise Druid authorization is the separation of concerns: authentication credentials, encryption keys, and access policies must be managed independently from application configurations. This approach ensures that sensitive information never appears in plain text within your GitOps repository.

SOPS Configuration for Secret Management

SOPS (Secrets OPerationS) provides the encryption backbone for your secrets management. In our enterprise setup, we utilize AGE encryption for its simplicity and security. The configuration begins with a .sops.yaml file that defines encryption rules:

# creation rules are evaluated sequentially, the first match wins
creation_rules:
  - path_regex: \.enc\.yaml$
    encrypted_regex: ^(data|stringData)$
    age: >-
      age<keyId1>,
      age<keyId2>

This configuration ensures that any file matching the .enc.yaml pattern will have its data and stringData sections encrypted with the two specified AGE public keys. The dual-key approach provides redundancy and supports key rotation scenarios. As described in the SOPS documentation, you can also use centrally managed keys for this purpose, such as AWS KMS, GCP KMS, or Azure Key Vault; in that case, make sure the decrypting Kubernetes cluster has permission to access the keys in those services.

Creating Database Credentials with GitOps

A critical secret in your Druid infrastructure is the postgres-password key, which must be created before any database deployment (e.g. with openssl rand -hex 32). This secret follows the GitOps principle by being version-controlled in encrypted form:

# Create the encrypted secret file
sops kubernetes/sops-secrets/druid/iuneradruid-metastore-postgres-secret.enc.yaml

The secret structure should follow this pattern:

apiVersion: v1
kind: Secret
metadata:
  name: iuneradruid-metastore-postgres-secret
  namespace: druid
type: Opaque
data:
  postgres-password: <base64-encoded-password>

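To illustrate the full flow, here is one way to generate the password and produce the base64 value for the data section (a minimal sketch; adapt the commands to your shell and platform):

# generate a random password and base64-encode it for the Secret's data section
PASSWORD=$(openssl rand -hex 32)
echo -n "$PASSWORD" | base64
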
This approach ensures that database credentials are managed through your GitOps workflow while maintaining security through SOPS encryption. The secret will be automatically decrypted and applied by FluxCD during deployment.

Security Best Practices for Secret Management

Enterprise secret management requires adherence to several security principles. First, implement the principle of least privilege by creating dedicated service accounts for each component. Second, use separate encryption keys for different environments (development, staging, production). Third, establish a key rotation schedule and document the rotation procedures.
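
As an illustration of the second principle, separate keys per environment can be expressed as additional SOPS creation rules (the paths and key IDs here are hypothetical):

creation_rules:
  - path_regex: environments/production/.*\.enc\.yaml$
    encrypted_regex: ^(data|stringData)$
    age: age<productionKeyId>
  - path_regex: environments/staging/.*\.enc\.yaml$
    encrypted_regex: ^(data|stringData)$
    age: age<stagingKeyId>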

Consider implementing master data management principles when organizing your secrets. Treat authentication credentials as master data that requires careful governance, versioning, and lifecycle management.

GitOps Configuration with FluxCD and SOPS

The Importance of GitOps for Enterprise Druid

GitOps represents a paradigm shift in how enterprise infrastructure is managed. For Apache Druid deployments, GitOps provides several critical advantages: declarative configuration management, automated deployment pipelines, audit trails for all changes, and the ability to quickly rollback problematic deployments.

Traditional Druid deployments often suffer from configuration drift, where production environments gradually diverge from documented configurations. GitOps eliminates this problem by ensuring that your Git repository serves as the single source of truth for all infrastructure and application configurations.

FluxCD Integration with SOPS

FluxCD serves as the GitOps operator that continuously monitors your Git repository and applies changes to your Kubernetes cluster. The integration with SOPS enables secure handling of encrypted secrets throughout the deployment pipeline.

The FluxCD configuration includes two primary kustomizations:

---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: common
  namespace: flux-system
spec:
  interval: 10m0s
  path: ./kubernetes/common
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: sops-secrets
  namespace: flux-system
spec:
  decryption:
    provider: sops
    secretRef:
      name: sops-age
  interval: 10m0s
  path: ./kubernetes/sops-secrets
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system

The common kustomization manages general infrastructure components, while the sops-secrets kustomization specifically handles encrypted secrets with SOPS decryption enabled.
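
The sops-age secret referenced in the decryption block holds the AGE private key. It can be created from the key file generated with age-keygen (see the examples below), following the standard FluxCD approach:

# store the AGE private key where FluxCD's kustomize-controller can read it
cat age-key.txt | kubectl create secret generic sops-age \
  --namespace=flux-system \
  --from-file=age.agekey=/dev/stdin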

Note: Later in this series, we will move the Druid configuration into a dedicated repository with its own kustomizations.

Kustomization Best Practices

Effective kustomization organization follows a hierarchical structure that separates concerns while maintaining flexibility. Base configurations define common settings, while overlays provide environment-specific customizations.

For Druid deployments, consider organizing kustomizations by functional area: networking, storage, security, and application layers. This approach simplifies troubleshooting and enables selective updates to specific infrastructure components.
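
One possible repository layout reflecting this separation (illustrative only; the top-level paths match the kustomizations shown above):

kubernetes/
├── flux-system/      # FluxCD bootstrap manifests
├── common/           # shared infrastructure: networking, storage, security
└── sops-secrets/
    └── druid/        # SOPS-encrypted secrets for the druid namespace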

Persistence Layer Setup

PostgreSQL for Metadata Storage

Apache Druid relies heavily on PostgreSQL for storing metadata about segments, datasources, and cluster configuration such as lookups. The metadata store serves as the central nervous system of your Druid cluster, making its reliability and performance critical to overall system health.

Production-Ready PostgreSQL Configuration

The PostgreSQL deployment uses the Bitnami Helm chart with enterprise-focused configurations:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: postgres
  namespace: druid
spec:
  interval: 10m0s
  releaseName: postgres
  targetNamespace: druid
  chart:
    spec:
      chart: postgresql
      version: ">=15.0.0"
      sourceRef:
        kind: HelmRepository
        name: bitnami-charts
        namespace: druid
  values:
    global:
      postgresql:
        auth:
          username: postgres
          existingSecret: iuneradruid-metastore-postgres-secret
          secretKeys:
            adminPasswordKey: postgres-password
    primary:
      persistence:
        enabled: true
        existingClaim: iuneradruid-metastore-postgres-pvc
      labels: 
        app.kubernetes.io/networkpolicy-group: druid
    metrics:
      enabled: true

This configuration demonstrates several enterprise best practices. The existingSecret reference connects to the SOPS-encrypted credentials created in the secrets management layer. The existingClaim ensures data persistence across pod restarts. Network policy labels enable fine-grained traffic control between Druid components. Of course, you can use any kind of PostgreSQL installation or Postgres-as-a-Service (e.g. AWS RDS), as long as it’s accessible via the standard JDBC driver.
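
Note that the referenced PVC must exist before the Helm release is installed. A minimal sketch (the size and storage class are assumptions to adapt):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: iuneradruid-metastore-postgres-pvc
  namespace: druid
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi  # assumption: size according to your metadata volume needs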

Performance Considerations and Sizing

PostgreSQL performance directly impacts Druid query response times and ingestion throughput. For enterprise deployments, consider these sizing guidelines:

  • CPU: Minimum 2 cores, recommended 4-8 cores for high-throughput environments
  • Memory: Minimum 4GB, recommended 8-16GB with proper buffer pool configuration
  • Storage: Use high-performance SSDs with at least 1000 IOPS capability
  • Network: Ensure low-latency connectivity between PostgreSQL and Druid coordinators

Monitor key PostgreSQL metrics including connection pool utilization, query execution times, and lock contention. These metrics provide early warning signs of capacity constraints that could impact Druid performance.
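
These guidelines translate into the Bitnami chart's values roughly as follows (illustrative numbers matching the recommendations above):

primary:
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "8"
      memory: 16Gi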

Object Storage Configuration

Deep storage serves as the permanent repository for Druid segments, making it essential for data durability and disaster recovery. Both MinIO and AWS S3 provide enterprise-grade object storage capabilities, though each has distinct advantages.

S3 Bucket Requirements

We recommend two dedicated S3 buckets for proper operation:

  1. Deep Storage Bucket: Used for storing Druid segments permanently
  2. Indexing Logs Bucket: Used for storing indexing task logs and temporary files

These buckets must be accessible through either an AWS IAM instance role (recommended for EC2-based deployments) or an IAM user with appropriate permissions.
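
For orientation, these two buckets later map onto Druid's common runtime properties roughly like this (bucket names are placeholders; the actual Druid configuration is covered in a later part of this series):

# common.runtime.properties (excerpt)
druid.extensions.loadList=["druid-s3-extensions"]
druid.storage.type=s3
druid.storage.bucket=<deep-storage-bucket>
druid.storage.baseKey=segments
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=<indexing-logs-bucket>
druid.indexer.logs.s3Prefix=indexing-logs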

IAM User Configuration with SOPS

When using an IAM user for S3 access, the credentials must be securely managed through SOPS encryption. Create a secret called iuneradruid-s3iam-secrets containing the following keys:

apiVersion: v1
kind: Secret
metadata:
  name: iuneradruid-s3iam-secrets
  namespace: druid
type: Opaque
stringData:
  AWS_REGION: eu-central-1
  AWS_ACCESS_KEY_ID: <access-key>
  AWS_SECRET_ACCESS_KEY: <secret-key>

This secret should be encrypted using SOPS before committing to your GitOps repository:

# Create the encrypted secret file
sops kubernetes/sops-secrets/druid/iuneradruid-s3iam-secrets.enc.yaml
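
Druid pods can later consume these keys as environment variables, for example via envFrom (an illustrative container excerpt; the full Druid spec follows in a later part of this series):

# container spec excerpt: expose all keys of the secret as environment variables
envFrom:
  - secretRef:
      name: iuneradruid-s3iam-secrets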

AWS S3 IAM Policies

When using AWS S3 for deep storage, proper IAM policies ensure secure access while maintaining the principle of least privilege. The following Terraform configuration demonstrates an enterprise-grade S3 setup:

resource "aws_iam_policy" "deepstorage" {
  name = "${local.prefix}-access-deepstorage"
  
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:ListBucket", 
          "s3:PutObject",
          "s3:DeleteObject",
          "s3:GetBucketAcl",
          "s3:PutObjectAcl"
        ]
        Resource = [
          aws_s3_bucket.deepstorage.arn,
          "${aws_s3_bucket.deepstorage.arn}/*"
        ]
      },
      {
        Effect = "Allow"
        Action = [
          "kms:Decrypt",
          "kms:GetPublicKey", 
          "kms:Encrypt",
          "kms:GenerateDataKey"
        ]
        Resource = [local.kms_arn]
      }
    ]
  })
}

This IAM policy provides the minimum permissions required for Druid deep storage operations while including KMS permissions for encryption at rest. The policy can be attached to either IAM users (with credentials managed via SOPS) or instance profiles for EC2-based deployments.
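
A policy attachment for the IAM-user variant might look like this (a sketch; the aws_iam_user resource is assumed to be defined elsewhere in your Terraform configuration):

# attach the deep storage policy to a dedicated IAM user
resource "aws_iam_user_policy_attachment" "deepstorage" {
  user       = aws_iam_user.druid.name
  policy_arn = aws_iam_policy.deepstorage.arn
}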

For detailed S3 configuration requirements, refer to the official Apache Druid documentation which provides comprehensive guidance on S3 extension configuration and troubleshooting.

Storage Backend Comparison

Feature     | MinIO               | AWS S3
Deployment  | Self-hosted         | Managed service
Cost        | Infrastructure only | Pay-per-use
Latency     | Network dependent   | Region dependent
Durability  | Configurable        | 99.999999999%
Compliance  | Self-managed        | AWS compliance
Integration | S3 API compatible   | Native S3

Infrastructure Approach Comparison

Aspect     | Traditional Setup             | Kubernetes-Native
Deployment | Manual configuration          | Declarative manifests
Scaling    | Manual intervention           | Automatic scaling
Updates    | Downtime required             | Rolling updates
Monitoring | External tools                | Integrated observability
Security   | Manual certificate management | Automated TLS
Backup     | Script-based                  | Operator-managed

GitOps vs Traditional Deployment

Factor                  | GitOps Approach          | Traditional Deployment
Change Management       | Git-based workflow       | Manual procedures
Audit Trail             | Complete Git history     | Limited documentation
Rollback                | Git revert               | Manual restoration
Environment Consistency | Guaranteed               | Configuration drift
Security                | Encrypted secrets in Git | External secret stores
Collaboration           | Pull request reviews     | Direct server access

Technical Examples

Complete SOPS Secret Creation

# Generate AGE key pair
age-keygen -o age-key.txt

# Create SOPS configuration
cat << EOF > .sops.yaml
creation_rules:
  - path_regex: \.enc\.yaml$
    encrypted_regex: ^(data|stringData)$
    age: $(cat age-key.txt | grep public | cut -d: -f2 | tr -d ' ')
EOF

# Create encrypted secret
sops kubernetes/sops-secrets/druid/postgres-secret.enc.yaml

FluxCD Bootstrap Command

# Bootstrap FluxCD with SOPS support
flux bootstrap git \
  --url=https://github.com/your-org/k8s-config \
  --branch=main \
  --path=kubernetes/flux-system \
  --components-extra=image-reflector-controller,image-automation-controller \
  --token-auth=true

PostgreSQL Connection Verification

# Test PostgreSQL connectivity
kubectl exec -it postgres-0 -n druid -- psql -U postgres -c "\l"

# Verify secret mounting
kubectl get secret iuneradruid-metastore-postgres-secret -n druid -o yaml

Troubleshooting Common Issues

SOPS Decryption Failures

When FluxCD fails to decrypt SOPS secrets, verify the AGE private key is correctly mounted:

# Check SOPS secret in flux-system namespace
kubectl get secret sops-age -n flux-system -o yaml

# Verify kustomization status
flux get kustomizations sops-secrets

PostgreSQL Connection Issues

Database connectivity problems often stem from network policies or incorrect credentials:

# Check PostgreSQL pod status
kubectl get pods -n druid -l app.kubernetes.io/name=postgresql

# Verify service endpoints
kubectl get endpoints postgres -n druid

# Test database connection
kubectl run postgres-client --rm -it --image postgres:15 -- psql -h postgres.druid.svc.cluster.local -U postgres

Conclusion

Building a robust infrastructure foundation is the cornerstone of successful enterprise Apache Druid deployments. In our experience with various deployment approaches, deploying Apache Druid on Kubernetes represents the optimal path for enterprise environments. The container orchestration capabilities, automated scaling, and declarative configuration management that Kubernetes provides are perfectly aligned with Druid’s distributed architecture requirements.

GitOps is not just recommended—it’s essential for production Druid deployments. The complexity of managing secrets, configurations, and multi-component deployments makes traditional manual approaches both error-prone and unsustainable. GitOps ensures consistency, provides complete audit trails, and enables rapid rollbacks when issues arise. The three-layer approach we’ve outlined—secrets management, GitOps workflows, and persistent storage—provides the security, reliability, and maintainability that enterprise environments demand.

The secrets management layer, built on SOPS encryption and proper credential handling, ensures that sensitive information never appears in plain text while maintaining the benefits of version-controlled infrastructure. The GitOps configuration with FluxCD provides automated, auditable deployments with the ability to quickly roll back problematic changes. The persistence layer, comprising both PostgreSQL metadata storage and S3-compatible deep storage, forms the data foundation that Druid depends on for both operational metadata and long-term segment storage.

Compared with other NoSQL database options, Apache Druid’s infrastructure requirements may seem complex, but this complexity enables the high-performance analytics capabilities that make Druid invaluable for time series workloads. The Kubernetes-native approach we’ve established here scales seamlessly from development environments to enterprise production deployments.

Remember that infrastructure setup is not a one-time activity. Regular monitoring, capacity planning, and security updates ensure that your foundation continues to support growing analytics demands. The GitOps approach facilitates these ongoing maintenance activities by providing a structured, version-controlled method for implementing changes.

With this infrastructure foundation in place, you’re ready to proceed with the actual Apache Druid deployment. In the next article of this series, we’ll dive deep into the Druid cluster configuration itself, covering the deployment of Druid services, performance tuning, and integration with the infrastructure components we’ve established here.

For additional enterprise infrastructure patterns and configurations, explore the Iunera Helm Charts repository which provides production-tested configurations for various analytics and data processing workloads.

Why is this infrastructure setup necessary before deploying Druid?

Apache Druid’s distributed architecture requires robust foundations for security, data persistence, and operational management. Without proper secrets management, GitOps workflows, and storage backends, production deployments face security vulnerabilities, configuration drift, and data loss risks. This infrastructure ensures enterprise-grade reliability and maintainability.

What are the prerequisites for implementing this infrastructure foundation?

You need a Kubernetes cluster with sufficient resources, Git repository access for GitOps, and basic familiarity with Kubernetes concepts. Additionally, you’ll need access to object storage (AWS S3 or self-hosted MinIO) and the ability to generate encryption keys for SOPS.

Why use SOPS instead of other secret management solutions?

SOPS encrypts secrets in place so they can live directly in your Git repository, preserving GitOps as the single source of truth, and FluxCD supports SOPS decryption natively. External secret stores are a valid alternative, but they add infrastructure components and move secrets outside version control. With AGE keys the setup stays simple, while centrally managed keys (AWS KMS, GCP KMS, Azure Key Vault) remain available where organizational key management requires them.

How do I secure database credentials in a GitOps workflow?

Use SOPS to encrypt secrets before committing to Git. Create Kubernetes secrets with encrypted data sections, and configure FluxCD with SOPS decryption capabilities. Never store plain-text credentials in your repository. The iuneradruid-metastore-postgres-secret example demonstrates this pattern.

What happens if I lose my SOPS encryption keys?

Lost encryption keys make encrypted secrets unrecoverable. Implement key backup strategies using multiple AGE keys or cloud-based key management services. Store backup keys securely offline and establish key recovery procedures as part of your disaster recovery plan.

What are the minimum resource requirements for PostgreSQL metadata storage?

For production environments, allocate at least 4GB RAM, 2 CPU cores, and high-performance SSD storage with 1000+ IOPS. Monitor connection pool utilization and query performance to determine scaling needs. Consider PostgreSQL clustering for high availability.

Should I use MinIO or AWS S3 for deep storage?

Choose MinIO for on-premises deployments, cost control, and data sovereignty requirements. Select AWS S3 for managed service benefits, global availability, and integration with other AWS services. Both provide enterprise-grade durability, but AWS S3 offers better managed service features.

Why do I need two separate S3 buckets?

Druid requires separate buckets for deep storage (permanent segment storage) and indexing logs (temporary task logs and metadata). This separation improves security, enables different retention policies, and simplifies access control management.

How do I handle secret rotation in a GitOps environment?

Implement automated secret rotation using external secret operators or scheduled jobs. Update encrypted secrets in Git, and use FluxCD’s reconciliation to deploy changes. Maintain backward compatibility during rotation periods and test rotation procedures regularly.
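
For SOPS-managed secrets specifically, re-encryption after a key change can be scripted with standard SOPS commands (adjust the paths to your repository):

# re-encrypt a secret after changing the recipients in .sops.yaml
sops updatekeys kubernetes/sops-secrets/druid/iuneradruid-metastore-postgres-secret.enc.yaml

# rotate the data encryption key of a secret in place
sops --rotate --in-place kubernetes/sops-secrets/druid/iuneradruid-metastore-postgres-secret.enc.yaml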

What backup strategies should I implement for metadata storage?

Configure automated PostgreSQL backups with point-in-time recovery capabilities. Store backups in separate object storage buckets with cross-region replication. Test backup restoration procedures regularly in non-production environments and document recovery time objectives.

How do I monitor infrastructure health before Druid deployment?

Deploy monitoring stack (Prometheus, Grafana) alongside infrastructure components. Monitor PostgreSQL metrics (connections, query performance), object storage metrics (throughput, error rates), and Kubernetes cluster health (resource utilization, pod status). Set up alerting for critical infrastructure failures.