
Health system & Doctor — engineering deep dive

How PodWarden detects drift, posts system messages, dampens flapping, and applies one-click remediations safely.

This is the engineering counterpart to the System Messages, Doctor, and drift detection operator guides. It is intended for SREs, platform engineers, and contributors who want to know how the loop is wired, what each row in system_messages means at the database layer, and what guarantees the Doctor framework offers under concurrent operators. It also records what every failure mode we have hit in production looked like before we hardened it.

Read the operator guides first if you have not used the UI before — this document does not re-derive what a "Stuck PVC" alert means; it tells you what happens between the Kubernetes API saying so and your inbox getting an email about it 30 minutes later.

At a glance

PodWarden's health system is three cooperating components:

  1. Health checks are pure functions that observe the cluster and return CheckResult records. They never write to the database directly. There is one check module per category — infra_drift, pod_health, node_naming, cluster_baseline, ha_control_plane, tls_certs, longhorn_*, system_apps, ingress_drift, etc. — and each emits a stable check_key fingerprint per finding.

  2. The runner (backend/app/services/health/runner.py) is the only writer to system_messages. It runs on a 15-minute loop (HEALTH_CHECK_INTERVAL = 15 * 60, runner.py:24), upserts findings by check_key, runs the auto-resolve sweep, and emits the email digest.

  3. The Doctor framework (backend/app/services/doctor/) is a propose-only remediation layer. Each check_type may register a Recipe that exposes one or more named actions. Recipes preview a diff, the operator confirms, and the executor takes a row lock, re-previews, and applies under a strict write contract.
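To make the first component concrete, here is a minimal sketch of a check written as a pure function. The CheckResult fields, the make_check_key() helper, the key format, and the restart threshold are all assumptions made for illustration; only the names CheckResult, check_key, and the pod_health category come from the text above.

```python
from dataclasses import dataclass
from typing import Iterable, List

RESTART_THRESHOLD = 5  # hypothetical threshold, chosen for this sketch only


@dataclass(frozen=True)
class CheckResult:
    # Field names are assumptions; the real record may carry more columns.
    check_type: str
    check_key: str
    severity: str
    message: str


def make_check_key(check_type: str, *parts: str) -> str:
    """Build a fingerprint from stable identifiers only (hypothetical format).

    Nothing volatile (timestamps, counters) goes into the key, so the same
    problem maps to the same key on every cycle.
    """
    return ":".join((check_type,) + parts)


def check_pod_restarts(pods: Iterable[dict]) -> List[CheckResult]:
    """Pure check: reads a snapshot, returns findings, writes nothing."""
    results = []
    for pod in pods:
        if pod["restarts"] >= RESTART_THRESHOLD:
            ident = f'{pod["namespace"]}/{pod["name"]}'
            results.append(
                CheckResult(
                    check_type="pod_health",
                    check_key=make_check_key("pod_health", "restarts", ident),
                    severity="warning",
                    message=f"{ident} has restarted {pod['restarts']} times",
                )
            )
    return results
```

In this sketch the restart count appears in the message but not in the key, so a pod that keeps restarting updates one finding rather than creating a new one per cycle.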

The remaining pages drill into each layer:

  • Lifecycle of a system message — how check_key, clean_cycles, severity, and resolution_kind interact through a message's life from first emission to 7-day cleanup.
  • Doctor framework internals — recipe protocol, the FOR UPDATE NOWAIT lock chain, RBAC gating (min_role), the writer-owns contract, and the multi-action model.
  • Edge cases & failure modes — every flapping, stuck, orphaned, or structurally-unreachable bug we have shipped a fix for, with the issue numbers and the fingerprint of each root cause.

Why this design exists

The health system is the answer to a hard product question: how do you surface real problems without becoming the alert source operators learn to mute?

We do this with three load-bearing primitives:

  • Stable fingerprints (check_key). Every finding has a stable identifier, so the same problem updates its existing row instead of producing a new row each cycle. The ON CONFLICT (check_key) DO UPDATE upsert (runner.py:347) is the heart of the loop: without it, every cycle would flood the notification bell.

  • clean_cycles dampener. A finding that disappears for one cycle is often noise (kubelet restarted, etcd quorum blipped, DNS resolver was slow). By default we require two consecutive clean cycles (2 × 15 min ≈ 30 min) before marking the row resolved. This turns a flapping check into a reliable signal without any per-check tuning.

  • Propose-only Doctor. No recipe acts without a confirmed preview. Even when the operator clicks Apply, the executor re-runs preview() under the row lock and refuses to proceed if the diff has changed (PreviewMismatch). Two operators racing each other end up with one success and one polite 409, never two competing writes to the same etcd key.
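The first two primitives can be sketched together. The example below uses an in-memory SQLite database as a stand-in for the real store, and the table columns, the run_cycle() helper, and the choice to reopen a row when its finding reappears are assumptions for illustration rather than the actual runner.py logic.

```python
import sqlite3

CLEAN_CYCLES_TO_RESOLVE = 2  # default from the text: two consecutive clean cycles


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Hypothetical, heavily reduced version of the system_messages table.
    conn.execute(
        """
        CREATE TABLE system_messages (
            check_key    TEXT PRIMARY KEY,
            severity     TEXT NOT NULL,
            message      TEXT NOT NULL,
            clean_cycles INTEGER NOT NULL DEFAULT 0,
            resolved     INTEGER NOT NULL DEFAULT 0
        )
        """
    )
    return conn


def run_cycle(conn: sqlite3.Connection, findings: dict) -> None:
    """One loop iteration: upsert current findings, then sweep absent ones."""
    for key, (severity, message) in findings.items():
        # Same check_key -> update the existing row instead of inserting.
        conn.execute(
            """
            INSERT INTO system_messages (check_key, severity, message)
            VALUES (?, ?, ?)
            ON CONFLICT (check_key) DO UPDATE SET
                severity = excluded.severity,
                message = excluded.message,
                clean_cycles = 0,
                resolved = 0  -- assumption: a returning finding reopens the row
            """,
            (key, severity, message),
        )
    open_keys = [
        row[0]
        for row in conn.execute(
            "SELECT check_key FROM system_messages WHERE resolved = 0"
        )
    ]
    for key in open_keys:
        if key not in findings:
            # The right-hand sides see the old clean_cycles value, so the row
            # resolves once the incremented count reaches the threshold.
            conn.execute(
                """
                UPDATE system_messages
                SET clean_cycles = clean_cycles + 1,
                    resolved = (clean_cycles + 1 >= ?)
                WHERE check_key = ?
                """,
                (CLEAN_CYCLES_TO_RESOLVE, key),
            )
    conn.commit()
```

Calling run_cycle() repeatedly with the same finding leaves a single row, one clean cycle followed by a reappearance resets the counter, and only two clean cycles in a row mark the row resolved.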

If you understand those three, the rest of the system is mechanism.
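The third primitive, re-previewing under the lock before applying, can be sketched as follows. This is a simplified model, not the executor's real interface: a threading.Lock with a non-blocking acquire stands in for the database row lock, and ReplicaRecipe, RowBusy, and apply_confirmed() are hypothetical names. Only PreviewMismatch and the preview-then-apply flow come from the text above.

```python
import threading
from typing import Dict, Tuple

Diff = Dict[str, Tuple[int, int]]  # field -> (current value, proposed value)


class PreviewMismatch(Exception):
    """The diff changed between the operator's preview and the apply."""


class RowBusy(Exception):
    """Hypothetical error: another apply already holds the lock."""


class ReplicaRecipe:
    """Toy recipe that proposes setting a replica count (illustration only)."""

    def __init__(self, state: Dict[str, int], desired: int) -> None:
        self.state = state
        self.desired = desired

    def preview(self) -> Diff:
        current = self.state["replicas"]
        if current == self.desired:
            return {}
        return {"replicas": (current, self.desired)}

    def apply(self) -> None:
        self.state["replicas"] = self.desired


def apply_confirmed(
    recipe: ReplicaRecipe, confirmed: Diff, row_lock: threading.Lock
) -> None:
    """Apply only if the diff the operator confirmed is still the current diff."""
    # Fail fast instead of queueing behind another operator.
    if not row_lock.acquire(blocking=False):
        raise RowBusy("another apply is in progress for this message")
    try:
        if recipe.preview() != confirmed:
            raise PreviewMismatch("state changed since the preview was shown")
        recipe.apply()
    finally:
        row_lock.release()
```

If two operators preview the same diff and both click Apply, the first call changes the state, so the second call's re-preview no longer matches what that operator confirmed and it raises PreviewMismatch, which the text above says surfaces as a 409.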

Where to read the code

The map below is the canonical pointer. Everything in the next three pages ultimately resolves to one of these files.

Concern | File | Notes
Loop driver | backend/app/services/health/runner.py | health_check_loop() at runner.py:567
Per-category check modules | backend/app/services/health/checks/*.py | One file per category; pure functions
Suppression rules | backend/app/services/health/suppression.py | parse_check_key(), glob match
Recipe protocol | backend/app/services/doctor/recipe.py | Recipe, RecipeAction, RecipeContext, RecipeOutcome, errors
Doctor executor | backend/app/services/doctor/executor.py | run_preview(), run_apply(), lock chain
Recipe registry | backend/app/services/doctor/registry.py | check_type → Recipe map
Recipe implementations | backend/app/services/doctor/recipes/*.py | One module per recipe
HTTP surface | backend/app/api/doctor.py | Maps DoctorError subclasses to HTTP status
Frontend message page | frontend/src/app/system/messages/page.tsx | List, filter, suppress UI
Frontend Doctor modal | frontend/src/components/doctor/doctor-modal.tsx | Preview echo, confirmation, errors
Migration history | backend/migrations/0[3-9]*_*.sql | Schema evolution; see lifecycle page