Lifecycle of a system message
From check emission to 7-day cleanup — fingerprints, the clean_cycles dampener, severity ladders, and resolution_kind.
A system_messages row is a small state machine driven by one writer (the
health-check runner) and one optional intervener (the Doctor executor). This
page walks the row through every transition it can make.
The state machine
Three terminal-ish states (auto, fix, acceptance) all eventually hit the
7-day cleanup, and any of them can flip back to Active if the underlying
condition reappears — system_messages is the source of truth for
"currently broken", not "ever broken".
The fingerprint contract: check_key
Every CheckResult carries a check_key. The runner's upsert
(runner.py:345) keys on this column:
```sql
INSERT INTO system_messages
    (check_key, check_type, severity, category, title, details, payload, ...)
VALUES (...)
ON CONFLICT (check_key) DO UPDATE SET
    severity     = EXCLUDED.severity,
    title        = EXCLUDED.title,
    payload      = EXCLUDED.payload,
    -- crucially:
    resolved_at  = NULL,
    clean_cycles = 0,
    updated_at   = NOW();
```

The fingerprint determines identity. Two findings with the same check_key
are the same row, even across runner restarts and weeks of uptime. The format
varies by category but always includes enough discriminator to distinguish
findings of the same check_type:
| Category | Example check_key |
|---|---|
| Infrastructure drift | dns_mismatch:rule:42 |
| Cluster baseline | k3s_arg:host:7:write-kubeconfig-mode |
| Pod health | crashloop:pw-prod:dubbing/manager |
| Longhorn | longhorn_setting:cluster:abc:storage-over-provisioning-percentage |
| TLS | tls_k8s:cluster:abc:default/wildcard.example.com |
| HA control plane | cp_promotion_failed:host:14 |
When the runner picks a check_key it is committing to a stable identity
across cycles. Choosing it wrong is the most common source of flapping bugs
(see Edge cases).
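The upsert behaviour can be tried out in isolation. The following is a minimal sketch, not the runner's code: it assumes SQLite as a stand-in for PostgreSQL, a table trimmed to the columns the upsert touches, and a hypothetical `emit` helper. It shows that a second emission with the same check_key updates the existing row and reopens it.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL, and the table is trimmed to
# the columns the upsert touches. The real schema has more columns.
SCHEMA = """
CREATE TABLE system_messages (
    check_key    TEXT PRIMARY KEY,
    severity     TEXT NOT NULL,
    title        TEXT NOT NULL,
    resolved_at  TEXT,
    clean_cycles INTEGER NOT NULL DEFAULT 0
)
"""

UPSERT = """
INSERT INTO system_messages (check_key, severity, title)
VALUES (?, ?, ?)
ON CONFLICT (check_key) DO UPDATE SET
    severity     = excluded.severity,
    title        = excluded.title,
    resolved_at  = NULL,
    clean_cycles = 0
"""


def emit(conn, check_key, severity, title):
    """Hypothetical helper: record one finding for the current cycle."""
    conn.execute(UPSERT, (check_key, severity, title))


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

key = "tls_k8s:cluster:abc:default/wildcard.example.com"
emit(conn, key, "warning", "Certificate expires soon")

# Simulate the sweep having resolved the row in the meantime.
conn.execute(
    "UPDATE system_messages SET resolved_at = '2024-01-01', clean_cycles = 2 "
    "WHERE check_key = ?",
    (key,),
)

# The same fingerprint reappears at a higher severity.
emit(conn, key, "error", "Certificate expires in under 7 days")

rows = conn.execute(
    "SELECT severity, resolved_at, clean_cycles FROM system_messages"
).fetchall()
```

After the second `emit`, the table still holds a single row for that fingerprint, now active again with the newer severity.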
What suppression keys off
backend/app/services/health/suppression.py parses check_key into
(check_type, target) and matches against per-rule globs. Suppression is
matched by check_key, not by row ID, so a rule keeps suppressing even if the
underlying row is resolved-and-recreated.
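A rough sketch of key-based matching follows. The real parsing in suppression.py is not shown on this page, so several things here are assumptions: that the key splits at the first colon, that globs follow `fnmatch` semantics, and that a rule is a `(check_type, target_glob)` pair. The function names are illustrative.

```python
from fnmatch import fnmatchcase


def parse_check_key(check_key):
    """Split a fingerprint into (check_type, target) at the first colon."""
    check_type, _, target = check_key.partition(":")
    return check_type, target


def is_suppressed(check_key, rules):
    """rules: iterable of (check_type, target_glob) pairs (hypothetical shape)."""
    check_type, target = parse_check_key(check_key)
    return any(
        rule_type == check_type and fnmatchcase(target, target_glob)
        for rule_type, target_glob in rules
    )


# One rule muting every crashloop finding in the pw-prod namespace.
rules = [("crashloop", "pw-prod:*")]
```

Because the match uses only the key, deleting and re-creating the row has no effect on whether the rule applies.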
Severity ladder
Severities are: critical, error, warning, info. The ladder is consulted
in three places:
- Auto-resolve sweep treats severities equally: a critical with no
  matching finding still goes through the same clean_cycles dampener.
- Email digest filters on severity >= configured_min_severity and only
  emails newly created rows (those with created_at > last_digest_at).
  Re-appearing rows that were already resolved come through as new.
- Bell badge picks the highest unread severity to color the badge.
Severity is per-emission, not per-row. A check can emit the same
check_key at warning one cycle and error the next; the upsert flips
the severity in place. This is by design — TLS expiry warnings escalate
to errors at 7 days remaining without producing two messages.
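The ladder can be modelled as a simple rank lookup. This is a sketch under assumptions: the numeric ranks, the function names, and the exact boundary of the TLS escalation (at or below 7 days) are illustrative and not taken from the runner.

```python
# Assumed ordering, lowest to highest.
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2, "critical": 3}


def badge_severity(unread_severities):
    """Return the highest unread severity, or None when nothing is unread."""
    return max(unread_severities, key=SEVERITY_RANK.__getitem__, default=None)


def passes_digest_filter(severity, configured_min_severity):
    """True when severity is at or above the configured minimum."""
    return SEVERITY_RANK[severity] >= SEVERITY_RANK[configured_min_severity]


def tls_severity(days_remaining, escalate_at_days=7):
    """Same check_key every cycle; only the emitted severity changes."""
    return "error" if days_remaining <= escalate_at_days else "warning"
```

Because the check_key stays constant while `tls_severity` changes its answer, the upsert rewrites the severity on the existing row instead of creating a second message.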
clean_cycles: the flap dampener
This is the single most important field for understanding why the bell looks
quiet. The runner increments and resolves in a two-phase sweep
(runner.py:401–464) deliberately ordered to avoid a race:
The two SQL statements run in this order: resolve first, then increment.
If we incremented first, a row at threshold-1 could be pushed to threshold
and resolved in the same cycle it became absent, defeating the whole point
of the dampener. The current order means: a row needs to be absent for
threshold distinct cycles before it can be resolved.
threshold defaults to 2 (≈30 min wall-clock) and is settable per cluster
via health_checks.clean_cycles_to_resolve in system_config. The
"immediate resolve" admin button in the UI hits the same path with
threshold = 0.
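The ordering argument can be checked with a small in-memory simulation. This is a sketch of one reading of the two-phase sweep, not the code in runner.py: the dict-based rows, the `sweep` function, and its parameters are illustrative, and category scoping is left out.

```python
def sweep(rows, seen_keys, threshold=2, now="now"):
    """Resolve first, then increment. Only active, absent rows are touched."""
    absent = [
        r for r in rows
        if r["resolved_at"] is None and r["check_key"] not in seen_keys
    ]
    # Phase 1: resolve rows that reached the threshold in earlier cycles.
    for r in absent:
        if r["clean_cycles"] >= threshold:
            r["resolved_at"] = now
            r["resolution_kind"] = "auto"
    # Phase 2: count this cycle for rows that are still active.
    for r in absent:
        if r["resolved_at"] is None:
            r["clean_cycles"] += 1


row = {
    "check_key": "dns_mismatch:rule:42",
    "clean_cycles": 0,
    "resolved_at": None,
    "resolution_kind": None,
}

# Three consecutive cycles in which the finding is absent.
history = []
for cycle in range(3):
    sweep([row], seen_keys=set(), threshold=2, now=f"cycle-{cycle + 1}")
    history.append((row["clean_cycles"], row["resolved_at"]))
```

In this model the row accumulates two clean cycles before the resolve phase can pick it up on the following cycle. With `threshold=0` the resolve phase matches on the first pass, which corresponds to the "immediate resolve" behaviour.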
The category-scope subtlety
The sweep only operates on rows whose category ran in the current
cycle. A row whose category never runs is structurally unreachable by the
sweep — it cannot be incremented and cannot be resolved.
This was the root cause of #915:
admission-bypass audit messages were inserted with category='admission_bypass',
but no check emits that category, so once written they were stuck active
forever. The fix was to insert audit rows already-resolved
(resolved_at = NOW(), resolution_kind = 'auto') so they remain visible in
timeline views without polluting the active list.
Rule of thumb: anything that lives in system_messages must be
emit-able by some check, or written already-resolved. Halfway is a bug.
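The rule of thumb lends itself to a simple lint. The sketch below is hypothetical: `unreachable_rows`, `audit_row`, the dict-based row shape, and the sample check_key are illustrative and do not come from the codebase. It flags active rows whose category no check emits, and shows the already-resolved shape used for audit-only rows.

```python
def unreachable_rows(rows, emitting_categories):
    """Active rows whose category no check emits; the sweep never reaches them."""
    return [
        r for r in rows
        if r["resolved_at"] is None and r["category"] not in emitting_categories
    ]


def audit_row(check_key, category, now):
    """Build an audit-only row that is written already-resolved."""
    return {
        "check_key": check_key,
        "category": category,
        "resolved_at": now,
        "resolution_kind": "auto",
    }


rows = [
    # Stuck: active, but nothing emits 'admission_bypass'.
    {"check_key": "bypass:example", "category": "admission_bypass",
     "resolved_at": None, "resolution_kind": None},
    # Fine: written already-resolved, so it never needs the sweep.
    audit_row("bypass:example-2", "admission_bypass", now="2024-01-01T00:00:00Z"),
    # Fine: active, and its category is emitted by a check.
    {"check_key": "dns_mismatch:rule:42", "category": "infrastructure_drift",
     "resolved_at": None, "resolution_kind": None},
]

stuck = unreachable_rows(rows, emitting_categories={"infrastructure_drift"})
```

Only the first row is reported; the other two satisfy one side of the rule each.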
Auto-resolve in detail
The sweep at runner.py:405–464 actually has four branches, not two —
because the runner separates "I have results" from "this category is empty
this cycle":
| Branch | When | Action |
|---|---|---|
| 1 (upsert) | Result emitted with this check_key | clean_cycles := 0, resolved_at := NULL |
| 2 (increment, present cats) | Active row in a category that ran, key not in results, clean_cycles < threshold | clean_cycles += 1 |
| 3 (resolve, present cats) | Same as 2 but clean_cycles >= threshold | resolved_at := NOW(), resolution_kind := 'auto' |
| 4 (empty cats) | Active row in a category that ran but emitted zero results — same logic as 2/3 | Increment or resolve |
Branches 2 and 4 look the same but are distinct because the SQL WHERE
clause matches on category list and on whether check_key != ALL(active_keys).
The 4-branch shape lets a check that detects nothing in a cycle still drive
prior findings to resolution.
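The branch table can be restated as a per-row decision function. This is an illustrative sketch, not the SQL in runner.py: `sweep_action` and the `results_by_category` mapping (category name to the set of check_keys emitted this cycle, with an empty set for a category that ran and found nothing) are assumed shapes.

```python
def sweep_action(row, results_by_category, threshold=2):
    """Return the action the four-branch table implies for one row."""
    category = row["category"]
    if category not in results_by_category:
        # The category did not run this cycle: the sweep cannot reach the row.
        return "skip"
    if row["check_key"] in results_by_category[category]:
        # Branch 1: the finding was emitted again.
        return "upsert"
    if row["resolved_at"] is not None:
        # Already resolved and not re-emitted: nothing to do.
        return "skip"
    # Branches 2 and 3 (category emitted other keys) and branch 4
    # (category ran but emitted nothing) share the same decision.
    return "resolve" if row["clean_cycles"] >= threshold else "increment"


results = {
    "pod_health": {"crashloop:pw-prod:dubbing/manager"},
    "tls": set(),  # ran, found nothing
}
```

A row in the `tls` category is still driven toward resolution here even though that category produced no results, which is the case branch 4 exists for.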
What category is a row in?
The category column is set on insert and never changes. A row's category
must match the categories the runner enumerates per cycle (see the
enabled_categories setting in Settings → Drift Detection). If you
rename a category, every existing row with the old value goes structurally
unreachable — same shape as #915. Don't do that without a migration.
resolution_kind: how was this fixed?
Migration 079 (backend/migrations/079_resolution_kind.sql) added a CHECK
constraint:
```sql
ALTER TABLE system_messages
    ADD CONSTRAINT system_messages_resolution_kind_check
    CHECK (resolution_kind IN ('auto', 'fix', 'acceptance'));
```

The three values mean:
- auto: the issue stopped being detected; the dampener resolved it. Set only by the runner sweep.
- fix: Doctor applied a real on-host change that should have removed the underlying condition (e.g. patched a Longhorn setting, restarted a pod, deleted an orphaned replica).
- acceptance: Doctor accepted the message without fixing the underlying state. Used for actions like "Abandon" on a failed CP promotion, where the operator decides "I'll handle the host manually, just clear the row."
The Doctor executor writes resolution_kind inline within the same
transaction it holds the row lock in (executor.py:491–497). Recipes do
not write system_messages themselves — see Doctor internals.
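The effect of the CHECK constraint can be reproduced in a throwaway database. The sketch below assumes SQLite in place of PostgreSQL and a table cut down to three columns; `try_resolve` is a hypothetical helper. It shows that only the three listed values are accepted, and that NULL (an unresolved row) also passes.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; the table is heavily trimmed.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE system_messages (
        check_key       TEXT PRIMARY KEY,
        resolved_at     TEXT,
        resolution_kind TEXT
            CONSTRAINT system_messages_resolution_kind_check
            CHECK (resolution_kind IN ('auto', 'fix', 'acceptance'))
    )
    """
)


def try_resolve(check_key, kind):
    """Insert a row with the given kind; report whether the constraint accepted it."""
    try:
        conn.execute(
            "INSERT INTO system_messages VALUES (?, '2024-01-01', ?)",
            (check_key, kind),
        )
        return True
    except sqlite3.IntegrityError:
        return False


accepted_fix = try_resolve("cp_promotion_failed:host:14", "fix")
accepted_typo = try_resolve("dns_mismatch:rule:42", "manual")
accepted_null = try_resolve("crashloop:pw-prod:dubbing/manager", None)
```

A misspelled or unknown kind is rejected at write time, so a bug in a writer surfaces as an error instead of a silently wrong row.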
Re-appearance: a "resolved" row that comes back
If a previously-resolved row's check_key shows up again, the upsert sets
resolved_at = NULL, clean_cycles = 0. This is intentional: from the
operator's point of view the issue is back, and they should see it in the
active list again. The email digest will treat it as a new event because
the row's created_at is older than last_digest_at, but the runner
also tracks last_seen_at (used by the digest) so re-appearances are
re-emailed.
This also means resolution_kind is not durable history — if the row
flaps back to active, the kind is reset on the next resolution. If you
need durable history, query doctor_executions (audit trail of every
Doctor apply) joined to system_messages.id.
7-day cleanup
Resolved rows older than 7 days are deleted by a daily housekeeping job. The cleanup window matters in two cases:
- Linking from issue reports. A "this happened last Tuesday" report may
  no longer have a row; fall back to doctor_executions (preserved longer)
  or the API access log.
- Suppression. Suppression rules outlive rows. Re-creating a row matching a
  still-suppressed glob will be suppressed silently on first re-appearance.
The 7-day window also bounds the size of the system_messages table — a
dense incident week can produce ~10k rows per cluster, and we have observed
~50–80k active+resolved on healthy long-running deployments. Index
maintenance on (check_key), (resolved_at), and (category, resolved_at)
all assume this scale.
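The retention rule itself is small enough to state in a few lines. The following is a sketch of the selection logic only, under assumptions: rows are plain dicts, `resolved_at` is a timezone-aware datetime or None, and the `cleanup` function is illustrative. The real housekeeping job is not shown on this page.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)


def cleanup(rows, now):
    """Keep active rows and rows resolved within the retention window."""
    cutoff = now - RETENTION
    return [
        r for r in rows
        if r["resolved_at"] is None or r["resolved_at"] >= cutoff
    ]


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
rows = [
    {"check_key": "old", "resolved_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"check_key": "recent", "resolved_at": datetime(2024, 1, 10, tzinfo=timezone.utc)},
    {"check_key": "active", "resolved_at": None},
]
kept = cleanup(rows, now)
```

Active rows are never candidates for deletion, regardless of age; only the resolved timestamp is compared against the cutoff.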
Schema evolution
If you are tracing why a column behaves a certain way, here is the migration genealogy. Every change had a reason.
| Migration | What it added | Why |
|---|---|---|
| 040_system_messages.sql | Base table | Replace ad-hoc alert dicts |
| 045_clean_cycles.sql | clean_cycles column + threshold default 2 | Stop flapping (the original sin) |
| 049_suppression.sql | system_message_suppressions table | Operators wanted to mute known-noisy findings |
| 075_check_type.sql | check_type column extracted from check_key | Doctor recipe lookup needs a stable type |
| 078_cluster_id_fk_cascade.sql | ON DELETE CASCADE on cluster_id | Deleting a cluster shouldn't leave orphan rows |
| 079_resolution_kind.sql | resolution_kind CHECK constraint | Distinguish auto vs Doctor outcomes |
check_type (075) and resolution_kind (079) together are what make
the Doctor framework possible — without check_type the
registry can't dispatch, and without resolution_kind the runner can't
distinguish a real fix from a noise-resolved row.