HA cluster and disaster-recovery guides

Igor

Two new guides for operators running PodWarden in anger: how to stand up an HA cluster, and the runbook for the day etcd quorum dies.

Two new guides went live this week for operators running PodWarden in anger: a full walk-through for standing up a high-availability cluster, and a companion runbook for recovering when a node, a disk, or the etcd quorum decides today is the day.

The HA guide covers the three things teams ask about most often: how the control plane elects a leader, what the storage backend expects when you add or remove a node, and the smallest cluster shape that still survives a single-host outage without manual intervention.
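
For a rough feel of why cluster size matters, here is a minimal sketch of the majority-quorum arithmetic that etcd's Raft consensus relies on. It assumes a plain majority quorum among voting members; the function names are hypothetical and not part of PodWarden, and the guide itself remains the authority on which cluster shape is recommended.

```python
def quorum(members: int) -> int:
    """Votes needed for a majority among `members` voting nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


def smallest_cluster_surviving(failures: int) -> int:
    """Smallest member count that keeps a majority after `failures` losses."""
    if failures < 0:
        raise ValueError("failures must be non-negative")
    return 2 * failures + 1


if __name__ == "__main__":
    # Print members, quorum size, and tolerated failures for small clusters.
    for n in range(1, 6):
        print(n, quorum(n), failures_tolerated(n))
```

Under this assumption, a two-member cluster tolerates no failures, and a fourth member adds no tolerance over three, which is why odd member counts are the usual choice for majority-quorum systems.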

When the worst happens

The disaster-recovery runbook is the one nobody wants to need and everyone is glad exists. It walks through the four failure modes we've actually seen in the wild — single-node disk loss, etcd corruption, split-brain after a network partition, and a full host replacement mid-rolling-upgrade — with the exact commands for each.

Both guides live under /docs/guides. If your scenario doesn't fit cleanly into one of the four failure modes, please open a ticket; we'd rather extend the runbook than have you improvise at 2 a.m.
