Installing a Production-Ready Apache Druid Cluster on Kubernetes — Part 2: Druid Deployment Preparation

This article is part of a series on deploying a production‑ready Apache Druid cluster on Kubernetes. In Part 2, we focus on deployment preparation: selecting and installing the right Druid operator and implementing end‑to‑end TLS. Other parts cover the infrastructure foundation, cluster configuration and tuning, authentication and authorization, and day‑2 operations. See Part 1 for the infrastructure foundation.

Introduction

Deploying Apache Druid on Kubernetes is a production engineering task, not a container exercise. The objective is an enterprise‑grade, secure, and operable analytics platform. Success depends on upfront preparation—most notably, selecting a maintained Druid operator aligned with Kubernetes‑native patterns and implementing robust TLS across all traffic.

This guide details the preparation steps that reliably reduce operational risk in enterprise Druid environments. Whether you support big data time‑series applications or other high‑throughput analytics, decisions made here directly influence stability, security, and total cost of ownership.

As Adheip Singh has shown in operator‑driven Druid deployments, the right foundation turns ongoing operations from reactive maintenance into predictable automation. Proper preparation is essential.

Summary

This guide covers two critical steps to prepare a production‑ready Apache Druid deployment on Kubernetes: selecting and installing the right Druid operator and implementing end‑to‑end TLS/SSL for both north‑south and east‑west traffic. We explain why the datainfrahq operator is the preferred choice, provide step‑by‑step installation guidance, and show how to generate and package Java keystore/truststore artifacts for secure operation.

Key outcomes:

  • Install the datainfrahq Apache Druid operator on Kubernetes and verify it using a minimal cluster.
  • Set up GitOps with FluxCD for consistent, auditable deployments.
  • Generate Java keystores/truststores and package them as encrypted Kubernetes Secrets using SOPS.
  • Enable TLS for internal Druid services and user‑facing endpoints.

Prerequisites:

  • A Kubernetes cluster (v1.25+) and kubectl configured; cluster‑admin permissions.
  • FluxCD installed (or ready to install) for GitOps workflows. (See Part 1 – Infrastructure Setup for Enterprise Apache Druid on Kubernetes)
  • SOPS with AGE keys configured (see Part 1) and access to a secrets repository.
  • Java keytool and OpenSSL available locally; optionally pwgen for password generation.
  • Dedicated Kubernetes namespaces for Druid and for the operator (e.g., druid and druid-operator-system).
  • Access to a Git repository for your druid‑cluster‑config (for example, https://github.com/iunera/druid-cluster-config).
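Before starting, it can help to confirm the tooling from the list above is actually on your PATH. A minimal, purely informational preflight sketch (it reports but installs nothing):

```shell
#!/bin/sh
# Preflight check (sketch): report which prerequisite CLI tools are present.
preflight() {
  for tool in kubectl flux sops keytool openssl pwgen; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
    fi
  done
}

preflight
```

Run it once per workstation; any `missing:` line points at a prerequisite to install before continuing.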

1) Install the Apache Druid Operator (datainfrahq)

Why datainfrahq? In our production reference, it aligns with Kubernetes‑native Druid patterns (e.g., no ZooKeeper for service discovery, Kubernetes Job‑based ingestion, autoscaling patterns), provides current CRDs, and exhibits controller behavior suitable for enterprise clusters.

Note: The similarly named druid‑io operator has historically lagged in maintenance and feature coverage and may not align with the Kubernetes‑native Druid patterns used here. For production deployments, prefer the datainfrahq operator.

Installation via Manifest (GitOps‑friendly)

The repository includes a compiled manifest that creates the CRD and installs the controller into druid-operator-system. In the following examples we refer to the https://github.com/iunera/druid-cluster-config repository.

# 0) Create namespace idempotently (safe if it already exists)
kubectl create namespace druid-operator-system --dry-run=client -o yaml | kubectl apply -f -

# 1) Apply the operator bundle (CRDs + controller)
kubectl apply -f druid-cluster-config/kubernetes/druid-operator-system/druid-operator.manifest.yaml

Verify the Druid operator installation

# CRD present?
kubectl get crd druids.druid.apache.org

# Controller pod(s) running?
kubectl -n druid-operator-system get pods -l app.kubernetes.io/name=druid-operator -o wide

# Controller rollout healthy?
kubectl -n druid-operator-system rollout status deploy -l app.kubernetes.io/name=druid-operator

# Inspect logs (tail and follow)
kubectl -n druid-operator-system logs -l app.kubernetes.io/name=druid-operator --tail=200 -f

Test the operator with a tiny cluster

To confirm the operator works correctly, deploy a minimal Druid cluster using the example from the repository, verify it’s running, and then clean it up:

# Create druid namespace if it doesn't exist
kubectl create namespace druid --dry-run=client -o yaml | kubectl apply -f -

# Deploy the tiny cluster example
kubectl apply -f https://raw.githubusercontent.com/datainfrahq/druid-operator/master/examples/tiny-cluster-mmless.yaml

# Wait for the cluster to be ready (may take a few minutes)
kubectl -n druid get druid tiny-cluster -w

# Verify pods are running
kubectl -n druid get pods -l druid_cr=tiny-cluster

# Check cluster status
kubectl -n druid describe druid tiny-cluster

# Clean up the test cluster
kubectl delete -f https://raw.githubusercontent.com/datainfrahq/druid-operator/master/examples/tiny-cluster-mmless.yaml

This test confirms that the operator can successfully create and manage Druid clusters. The tiny cluster uses minimal resources and demonstrates basic functionality without requiring external dependencies like deep storage.

Common troubleshooting

  • CRD missing after apply: re‑run apply, then kubectl describe crd druids.druid.apache.org to surface errors.
  • Namespace mismatch: controller resources expect the druid-operator-system namespace. Ensure it exists and that ServiceAccount/RoleBindings point there.
  • Admission webhooks failing: verify certs in the manifest were installed; check kubectl -n druid-operator-system get validatingwebhookconfigurations.
  • Permissions: kubectl auth can-i --list --namespace druid-operator-system to confirm RBAC for the operator’s ServiceAccount.

Optional operator configuration examples

Limit the operator’s scope to your Druid namespace(s) by setting a watch namespace on the Deployment (common for kubebuilder controllers):

# Constrain operator to watch only the 'druid' namespace (example)
kubectl -n druid-operator-system set env deployment/druid-operator WATCH_NAMESPACE=druid

# Verify env is set
kubectl -n druid-operator-system describe deploy druid-operator | sed -n '/Environment:/,$p' | head -n 20
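For GitOps, the same constraint can be kept declarative rather than applied with an imperative `set env`. A sketch of a kustomize patch (the Deployment name matches the commands above; the container index and the presence of an existing `env` list are assumptions):

```yaml
# kustomization.yaml fragment (sketch): inject WATCH_NAMESPACE into the
# operator Deployment. Assumes the controller is container 0 and already
# has an env list; adjust the JSON pointer path otherwise.
patches:
  - target:
      kind: Deployment
      name: druid-operator
      namespace: druid-operator-system
    patch: |-
      - op: add
        path: /spec/template/spec/containers/0/env/-
        value:
          name: WATCH_NAMESPACE
          value: druid
```

This keeps the operator's scope visible in Git instead of living only in the live cluster state.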

Combine these manifests in your GitOps repository for final deployment.

GitOps with FluxCD: Dedicated Apache Druid Repository

For production environments, we recommend separating the Druid cluster configuration into a dedicated repository that’s managed via https://fluxcd.io/flux/. This approach provides better isolation, security, and change management for your Druid deployment.

Setting up the druid-cluster-config Repository

The GitOps approach uses a separate repository (e.g., druid-cluster-config) that contains all Druid-specific configurations, manifests, and secrets. This separation allows for:

  • Isolated access control: Different teams can manage infrastructure vs. Druid configurations
  • Independent release cycles: Druid updates don’t require changes to the main infrastructure repository
  • Enhanced security: Druid-specific secrets and configurations are contained in a dedicated repository

FluxCD Configuration for Druid Repository

Create a GitRepository source that points to your druid-cluster-config repository:

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: druid-cluster-config
  namespace: flux-system
spec:
  interval: 1m0s
  ref:
    branch: main
  timeout: 60s
  url: ssh://git@github.com/iunera/druid-cluster-config
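Because the url above uses SSH, Flux needs Git credentials: the GitRepository must reference a secret (via spec.secretRef) holding a deploy key. A sketch with a hypothetical secret name and placeholder values:

```yaml
# Hypothetical credentials secret for the SSH GitRepository above.
# Create the real one with 'flux create secret git' or from your deploy key,
# then add 'secretRef: {name: druid-cluster-config-auth}' to the
# GitRepository spec. Never commit the private key unencrypted.
apiVersion: v1
kind: Secret
metadata:
  name: druid-cluster-config-auth
  namespace: flux-system
type: Opaque
stringData:
  identity: |
    # <private deploy key>
  known_hosts: |
    # <output of: ssh-keyscan github.com>
```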

Then create a Kustomization that deploys the Druid configuration:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: druid-cluster-config
  namespace: flux-system
spec:
  interval: 10m0s
  path: ./kubernetes/
  prune: true
  sourceRef:
    kind: GitRepository
    name: druid-cluster-config

Repository Structure

The druid-cluster-config repository should follow this structure:

kubernetes/
├── druid/
│   ├── druidcluster/
│   │   └── iuneradruid-cluster.yaml
│   ├── secrets/
│   │   ├── keystores-secret.enc.yaml
│   │   └── postgres-secret.enc.yaml
│   └── kustomization.yaml
├── druid-operator-system/
│   ├── druid-operator.manifest.yaml
│   └── kustomization.yaml
└── kustomization.yaml
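The root kustomization.yaml in the tree above simply aggregates the two directories, so the Flux Kustomization pointing at ./kubernetes/ applies the operator and the cluster together. A minimal sketch:

```yaml
# kubernetes/kustomization.yaml (sketch): aggregate both subdirectories.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - druid-operator-system
  - druid
```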

If you prefer not to store the encrypted secrets in this repository, you can keep them in a separate secrets repository, as we did in the first part of this series.

This GitOps setup ensures that your Druid deployment is fully managed through code, providing the reliability and traceability required for enterprise production environments.


2) Configure TLS: Certificates, Keystore, and Truststore

Enterprise Druid requires TLS for internal traffic (brokers, coordinators, historicals, ingestors, routers) and for user access to the web console. Strong defaults plus automation make audits simpler and reduce operational risk.

Generate keystore and truststore (Java JKS)

Commands aligned with this repository’s reference (Java keytool + strong password):

# Generate a keystore with a long random password (requires pwgen).
# The password is also saved to 'keystorepassword' for later use.
keytool -keystore keystore.jks \
  -storepass "$(pwgen 64 -n1 -s | tr -d '\n' | tee keystorepassword)" \
  -keypass "$(cat keystorepassword)" \
  -genkey -alias druid -keyalg RSA -keysize 4096 -validity 3650 \
  -dname "CN=druid" -storetype JKS

# Export the certificate in PEM format
keytool -export -alias druid -keystore keystore.jks -rfc -file druid.cert -storepass $(cat keystorepassword)

# OPTIONAL: base truststore from Java defaults
cp -v "$JAVA_HOME/lib/security/cacerts" ./truststore.jks

# Trust the new Druid cert (default cacerts password is 'changeit')
keytool -import -file druid.cert -storepass changeit -alias druid -keystore truststore.jks -noprompt -trustcacerts -storetype JKS

Notes:

  • If pwgen is unavailable, use openssl rand -base64 64 | tr -d '\n' | tee keystorepassword.
  • For production, use a CN/SAN that matches how services actually connect (FQDN or Kubernetes service name), or have your enterprise PKI issue the certificate and build the keystore from it.
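To make the certificate match in‑cluster DNS names, keytool accepts a SAN extension via -ext. A sketch of assembling that value in shell (the service names and namespace are illustrative assumptions; adjust to your cluster):

```shell
#!/bin/sh
# Sketch: build a keytool SAN extension covering the DNS forms by which
# in-cluster clients may reach Druid services. Names are illustrative.
NAMESPACE="druid"
SERVICES="druid-broker druid-router druid-coordinator"

SAN=""
for svc in $SERVICES; do
  SAN="${SAN}dns:${svc},dns:${svc}.${NAMESPACE}.svc,dns:${svc}.${NAMESPACE}.svc.cluster.local,"
done
SAN="SAN=${SAN%,}"   # drop the trailing comma, add the keytool prefix
echo "$SAN"

# Use it when generating the key pair, e.g.:
#   keytool ... -ext "$SAN"
```

Clients validating hostnames (e.g., Druid's internal HTTPS clients) will then accept connections to any of these names without per‑service certificates.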

Package TLS into a Kubernetes Secret

Store artifacts in the druid namespace as a single secret that workloads can mount or reference.

# Apply directly
kubectl --namespace=druid \
  create secret generic keystores \
  --from-file=keystore.jks \
  --from-file=keystorepassword \
  --from-file=truststore.jks

# Or: render YAML for GitOps (encrypt this file with SOPS before committing)
kubectl --namespace=druid \
  create secret generic keystores \
  --from-file=keystore.jks \
  --from-file=keystorepassword \
  --from-file=truststore.jks \
  --dry-run=client -o yaml \
  > druid-jks-keystores-secret.yaml

# Encrypt the secret file with SOPS (using the .sops.yaml config from part 1)
sops -e druid-jks-keystores-secret.yaml > druid-jks-keystores-secret.enc.yaml

# Remove the unencrypted file for security
rm druid-jks-keystores-secret.yaml

# Move to your GitOps repository structure (adjust path as needed)
mv druid-jks-keystores-secret.enc.yaml kubernetes/sops-secrets/druid/

Validate your TLS artifacts

# Inspect keystore contents
keytool -list -v -keystore keystore.jks -storepass $(cat keystorepassword) | head -n 40

# Inspect the exported certificate
openssl x509 -in druid.cert -noout -text | head -n 40

# Confirm secret exists and contains keys
kubectl -n druid get secret keystores -o yaml | sed -n '1,80p'

Automation & security recommendations

  • Put the JKS generation and Secret rendering behind a Make target or script; keep it idempotent and environment‑aware (dev/stage/prod).
  • Use SOPS for at‑rest encryption of YAML secrets in Git.
  • Rotate certificates on a schedule (every 6–12 months) and script the rollout with zero‑downtime restarts.
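To reduce the risk of an unencrypted secret slipping into Git, a small guard can be wired into a pre‑commit hook. A sketch, using the heuristic assumption that SOPS‑encrypted YAML carries a sops: metadata block and ENC[...] payloads while plaintext Secrets do not:

```shell
#!/bin/sh
# Pre-commit guard (sketch): refuse secret manifests that do not look
# SOPS-encrypted. Heuristic: encrypted files contain a 'sops:' metadata
# block and ENC[...] payloads; plaintext Secrets contain neither.
check_encrypted() {
  file="$1"
  if grep -q 'sops:' "$file" && grep -q 'ENC\[' "$file"; then
    echo "OK: $file is SOPS-encrypted"
  else
    echo "REFUSED: $file looks unencrypted" >&2
    return 1
  fi
}
```

A hook could run check_encrypted over every staged *.enc.yaml file and abort the commit on the first failure.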

Comparison: Druid Operators

| Aspect | datainfrahq operator | druid‑io operator |
|---|---|---|
| Production alignment | Kubernetes‑native Druid patterns, current CRDs | Divergent feature set and cadence |
| Namespace | druid-operator-system (in this reference) | varies |
| Verification | CRD: druids.druid.apache.org | different CRDs |
| Packaging | Works with manifest or Helm; integrates with GitOps | differs |
| Helm charts ecosystem | Iunera Helm Charts | — |

Comparison: TLS certificate management approaches

| Approach | Pros | Cons | Typical use |
|---|---|---|---|
| Self‑signed JKS (per‑cluster) | Fast, fully offline, simple tooling | Requires managing truststore; rotation overhead | Internal clusters and POCs that still need TLS |
| Corporate CA‑signed (internal PKI) | Enterprise trust, auditable issuance, centralized lifecycle | Requires PKI access, approval workflows | Regulated enterprises, prod |
| ACME/Ingress termination | Automated issuance/renewal | Not ideal for pod‑to‑pod encryption; dependency on ingress | Public endpoints; combine with internal JKS for east‑west |

Comparison: Manual vs. automated certificate deployment

| Deployment | Pros | Cons | Notes |
|---|---|---|---|
| Manual secret apply | Simple, no extra tooling | Human error, drift risk | Use only for tests |
| GitOps + SOPS | Auditable, consistent, encrypted at rest | Requires GitOps setup | Preferred for prod |
| External secret operator | Integrates vault/KMS dynamically | Extra controller to manage | Great for large orgs |

Comparison: Security considerations by method

| Area | Manual | GitOps + SOPS | External secret operator |
|---|---|---|---|
| At‑rest secret encryption | Optional | Strong (SOPS) | Outsourced (vault/KMS) |
| RBAC blast radius | Higher | Scoped via kustomizations | Scoped via external policies |
| Rotation playbooks | Ad‑hoc | Documented in Git | Centralized with vault policies |

Conclusion

A production‑ready Apache Druid deployment on Kubernetes depends on two foundations: installing the maintained datainfrahq Druid operator and enabling end‑to‑end TLS from day one. With the operator running in druid‑operator‑system and a hardened JKS secret in the druid namespace, you have the prerequisites in place for a secure, enterprise‑grade cluster. In the next part of this series, we use these building blocks to deploy and validate the full cluster.


FAQ

Which Druid operator should I choose for production deployments?

Use the datainfrahq Druid operator. It aligns with Kubernetes‑native Druid patterns and the CRDs used in this guide.

How do I verify that the operator is working?

Check for the CRD (druids.druid.apache.org) and a healthy controller pod in druid-operator-system. Use the kubectl verification commands above.

kubectl apply succeeded but I don’t see the CRD—what’s wrong?

Re‑apply, then describe the CRD for errors. Ensure your RBAC allows the operator to create CRDs and that your kubectl context points at the right cluster.

How should I manage TLS certificate rotation?

Automate it. Regenerate keystores on a schedule, push the updated Secret YAML encrypted with SOPS, and roll your Druid pods with zero downtime (staggered restarts).

Can I terminate TLS at the ingress only?

You can, but for enterprise environments you should also encrypt east‑west (pod‑to‑pod) traffic with a keystore/truststore to protect internal data paths.