Health system & Doctor — engineering deep dive
How PodWarden detects drift, posts system messages, dampens flapping, and applies one-click remediations safely.
This is the engineering counterpart to the System Messages,
Doctor, and drift detection
operator guides. It is intended for SREs, platform engineers, and contributors
who want to know how the loop is wired, what each row in system_messages
means at the database layer, what guarantees the Doctor framework offers under
concurrent operators, and what every failure mode we have hit in production
looked like before we hardened it.
Read the operator guides first if you have not used the UI before — this document does not re-derive what a "Stuck PVC" alert means; it tells you what happens between the Kubernetes API saying so and your inbox getting an email about it 30 minutes later.
At a glance
PodWarden's health system is three cooperating components:
- Health checks are pure functions that observe the cluster and return `CheckResult` records. They never write to the database directly. There is one check module per category (`infra_drift`, `pod_health`, `node_naming`, `cluster_baseline`, `ha_control_plane`, `tls_certs`, `longhorn_*`, `system_apps`, `ingress_drift`, etc.), and each emits a stable `check_key` fingerprint per finding.
- The runner (`backend/app/services/health/runner.py`) is the only writer to `system_messages`. It runs on a 15-minute loop (`HEALTH_CHECK_INTERVAL = 15 * 60`, `runner.py:24`), upserts findings by `check_key`, runs the auto-resolve sweep, and emits the email digest.
- The Doctor framework (`backend/app/services/doctor/`) is a propose-only remediation layer. Each `check_type` may register a `Recipe` that exposes one or more named actions. Recipes preview a diff, the operator confirms, and the executor takes a row lock, re-previews, and applies under a strict write contract.
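The propose-only flow in the last bullet can be sketched in a few lines. This is a minimal illustration, not the executor's actual code: `threading.Lock` with a non-blocking acquire stands in for the database row lock, `RowLocked` and the `Message` shape are hypothetical names, and the recipe is reduced to a dictionary diff. Only `DoctorError` and `PreviewMismatch` are names taken from this document.

```python
import threading
from dataclasses import dataclass, field


class DoctorError(Exception):
    """Base class for remediation errors."""


class PreviewMismatch(DoctorError):
    """The diff changed between the operator's preview and the apply."""


class RowLocked(DoctorError):
    """Hypothetical name: another operator currently holds the row lock."""


@dataclass
class Message:
    check_key: str
    state: dict  # stand-in for the live object the recipe would change
    lock: threading.Lock = field(default_factory=threading.Lock)


def preview(msg, desired):
    """Return the diff a recipe would apply, without changing anything."""
    return {k: (msg.state.get(k), v)
            for k, v in desired.items() if msg.state.get(k) != v}


def apply(msg, desired, confirmed_diff):
    # Fail immediately instead of queueing behind another operator.
    if not msg.lock.acquire(blocking=False):
        raise RowLocked(msg.check_key)
    try:
        # Re-run the preview under the lock; refuse if the world has moved.
        if preview(msg, desired) != confirmed_diff:
            raise PreviewMismatch(msg.check_key)
        msg.state.update(desired)
    finally:
        msg.lock.release()


# Two operators confirm the same preview; only the first apply succeeds.
msg = Message("ingress_drift:example", {"replicas": 1})
desired = {"replicas": 3}
diff = preview(msg, desired)
apply(msg, desired, diff)
second_operator_rejected = False
try:
    apply(msg, desired, diff)  # the confirmed diff is now stale
except PreviewMismatch:
    second_operator_rejected = True
```

The second caller is rejected rather than silently re-applying, which is the behaviour the HTTP layer would presumably surface as a conflict response.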
The remaining pages drill into each layer:
- Lifecycle of a system message: how `check_key`, `clean_cycles`, severity, and `resolution_kind` interact through a message's life from first emission to 7-day cleanup.
- Doctor framework internals: recipe protocol, the `FOR UPDATE NOWAIT` lock chain, RBAC gating (`min_role`), the writer-owns contract, and the multi-action model.
- Edge cases & failure modes: every flapping, stuck, orphaned, or structurally unreachable bug we have shipped a fix for, with the issue numbers and the fingerprint of each root cause.
Why this design exists
The health system is the answer to a hard product question: how do you surface real problems without becoming the alert source operators learn to mute?
We do this with three load-bearing primitives:
- Stable fingerprints (`check_key`). Every finding has an idempotent identifier so the same problem does not produce a new row each cycle. The `ON CONFLICT (check_key) DO UPDATE` upsert (`runner.py:347`) is the heart of the loop: without it, every refresh would flood the bell.
- `clean_cycles` dampener. A finding that disappears for one cycle is often noise (kubelet restarted, etcd quorum blipped, DNS resolver was slow). We require two consecutive clean cycles by default (≈30 min) before marking the row resolved. This converts "flapping check" into "reliable signal" without any per-check tuning.
- Propose-only Doctor. No recipe acts without a confirmed preview. Even when the operator clicks Apply, the executor re-runs `preview()` under the row lock and refuses to proceed if the diff has changed (`PreviewMismatch`). Two operators racing each other end up with one success and one polite 409, never two competing writes to the same etcd key.
If you understand those three, the rest of the system is mechanism.
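To make the first two primitives concrete, here is a small simulation of one runner cycle. It is a sketch under stated assumptions, not the code in `runner.py`: an in-memory SQLite database stands in for the real one, the table is reduced to four columns, and the `resolved` flag, the `CLEAN_CYCLES_TO_RESOLVE` constant, the key layout, and the choice to reset the counter when a finding reappears are all illustrative.

```python
import sqlite3

CLEAN_CYCLES_TO_RESOLVE = 2  # "two consecutive clean cycles"; name is hypothetical

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE system_messages (
        check_key    TEXT PRIMARY KEY,
        severity     TEXT NOT NULL,
        clean_cycles INTEGER NOT NULL DEFAULT 0,
        resolved     INTEGER NOT NULL DEFAULT 0
    )
""")


def run_cycle(findings):
    """findings maps check_key -> severity for everything seen this cycle."""
    # Upsert: a finding that is already known updates its row in place.
    for key, severity in findings.items():
        db.execute(
            """
            INSERT INTO system_messages (check_key, severity)
            VALUES (?, ?)
            ON CONFLICT (check_key) DO UPDATE
            SET severity = excluded.severity, clean_cycles = 0, resolved = 0
            """,
            (key, severity),
        )
    # Auto-resolve sweep: open rows not seen this cycle earn one clean cycle
    # and are resolved once they reach the threshold.
    open_keys = [r[0] for r in db.execute(
        "SELECT check_key FROM system_messages WHERE resolved = 0")]
    for key in open_keys:
        if key not in findings:
            db.execute(
                """
                UPDATE system_messages
                SET clean_cycles = clean_cycles + 1,
                    resolved = (clean_cycles + 1 >= ?)
                WHERE check_key = ?
                """,
                (CLEAN_CYCLES_TO_RESOLVE, key),
            )


def row(key):
    return db.execute(
        "SELECT clean_cycles, resolved FROM system_messages WHERE check_key = ?",
        (key,),
    ).fetchone()


KEY = "pod_health:stuck_pvc:default/data"  # hypothetical key layout
run_cycle({KEY: "warning"})
run_cycle({KEY: "warning"})  # same finding, same row
run_cycle({})                # one clean cycle: still open
run_cycle({KEY: "warning"})  # flap: the counter resets
run_cycle({})
run_cycle({})                # second consecutive clean cycle: resolved
```

With a 15-minute loop, the two clean cycles at the end correspond to the roughly 30 minutes mentioned above; the single clean cycle in the middle never closes the row, which is the dampening effect.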
Where to read the code
The map below is the canonical pointer. Everything in the next three pages ultimately resolves to one of these files.
| Concern | File | Notes |
|---|---|---|
| Loop driver | backend/app/services/health/runner.py | health_check_loop() at runner.py:567 |
| Per-category check modules | backend/app/services/health/checks/*.py | One file per category; pure functions |
| Suppression rules | backend/app/services/health/suppression.py | parse_check_key(), glob match |
| Recipe protocol | backend/app/services/doctor/recipe.py | Recipe, RecipeAction, RecipeContext, RecipeOutcome, errors |
| Doctor executor | backend/app/services/doctor/executor.py | run_preview(), run_apply(), lock chain |
| Recipe registry | backend/app/services/doctor/registry.py | check_type → Recipe map |
| Recipe implementations | backend/app/services/doctor/recipes/*.py | One module per recipe |
| HTTP surface | backend/app/api/doctor.py | Maps DoctorError subclasses to HTTP status |
| Frontend message page | frontend/src/app/system/messages/page.tsx | List, filter, suppress UI |
| Frontend Doctor modal | frontend/src/components/doctor/doctor-modal.tsx | Preview echo, confirmation, errors |
| Migration history | backend/migrations/0[3-9]*_*.sql | Schema evolution; see lifecycle page |
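The suppression row in the table mentions glob matching on check keys. As a rough illustration of that idea only, the sketch below matches keys against shell-style patterns with the standard library's `fnmatch`. The `check_type:subject` key layout, the `split_key` helper (a stand-in, not the real `parse_check_key()`), and the `is_suppressed` function are assumptions made for this example.

```python
from fnmatch import fnmatchcase


def split_key(check_key):
    """Split an assumed 'check_type:subject' key into its two parts."""
    check_type, _, subject = check_key.partition(":")
    return check_type, subject


def is_suppressed(check_key, patterns):
    """True if any glob pattern matches the whole key (case-sensitive)."""
    return any(fnmatchcase(check_key, pattern) for pattern in patterns)


rules = ["longhorn_*:*", "tls_certs:*.internal"]
suppressed = is_suppressed("longhorn_volume:pvc-123", rules)
still_reported = not is_suppressed("tls_certs:example.com", rules)
```

Matching the full key keeps a rule like `longhorn_*:*` scoped to one family of checks without needing per-check configuration.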