
Lifecycle of a system message

From check emission to 7-day cleanup — fingerprints, the clean_cycles dampener, severity ladders, and resolution_kind.

A system_messages row is a small state machine driven by one writer (the health-check runner) and one optional intervener (the Doctor executor). This page walks the row through every transition it can make.

The state machine

A row is either Active or in one of three resolved, terminal-ish states, one per resolution_kind (auto, fix, acceptance). All three eventually hit the 7-day cleanup, and any of them can flip back to Active if the underlying condition reappears: system_messages is the source of truth for "currently broken", not "ever broken".

The fingerprint contract: check_key

Every CheckResult carries a check_key. The runner's upsert (runner.py:345) keys on this column:

INSERT INTO system_messages
    (check_key, check_type, severity, category, title, details, payload, ...)
VALUES (...)
ON CONFLICT (check_key) DO UPDATE SET
    severity = EXCLUDED.severity,
    title    = EXCLUDED.title,
    payload  = EXCLUDED.payload,
    -- crucially:
    resolved_at  = NULL,
    clean_cycles = 0,
    updated_at   = NOW();

The fingerprint determines identity. Two findings with the same check_key are the same row, even across runner restarts and weeks of uptime. The format varies by category but always includes enough discriminator to distinguish findings of the same check_type:
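The identity behaviour can be tried out in isolation. The following is a minimal sketch, not the runner's code: it uses an in-memory SQLite database as a stand-in for PostgreSQL (SQLite 3.24+ supports the same ON CONFLICT ... DO UPDATE form), a cut-down table with only the columns needed, and a hypothetical emit helper.

```python
import sqlite3

# Stand-in schema: only the columns needed to show the identity behaviour.
# SQLite replaces PostgreSQL here, so CURRENT_TIMESTAMP stands in for NOW().
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE system_messages (
        id           INTEGER PRIMARY KEY,
        check_key    TEXT NOT NULL UNIQUE,
        severity     TEXT NOT NULL,
        title        TEXT NOT NULL,
        resolved_at  TEXT,
        clean_cycles INTEGER NOT NULL DEFAULT 0
    )
    """
)

UPSERT = """
    INSERT INTO system_messages (check_key, severity, title)
    VALUES (?, ?, ?)
    ON CONFLICT (check_key) DO UPDATE SET
        severity     = excluded.severity,
        title        = excluded.title,
        resolved_at  = NULL,
        clean_cycles = 0
"""


def emit(check_key, severity, title):
    """Hypothetical helper: write one finding through the upsert."""
    conn.execute(UPSERT, (check_key, severity, title))


key = "tls_k8s:cluster:abc:default/wildcard.example.com"
emit(key, "warning", "Certificate expires in 20 days")

# Simulate the sweep having resolved the row in the meantime.
conn.execute(
    "UPDATE system_messages SET resolved_at = CURRENT_TIMESTAMP, clean_cycles = 2"
)

# The same key arrives again, now at a higher severity.
emit(key, "error", "Certificate expires in 6 days")

rows = conn.execute(
    "SELECT id, severity, resolved_at, clean_cycles FROM system_messages"
).fetchall()
print(rows)  # one row, same id, reactivated with the new severity
```

The second emission does not create a second row: the existing row keeps its id, takes the new severity, and is reactivated with resolved_at cleared and clean_cycles back at 0.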

| Category | Example check_key |
| --- | --- |
| Infrastructure drift | dns_mismatch:rule:42 |
| Cluster baseline | k3s_arg:host:7:write-kubeconfig-mode |
| Pod health | crashloop:pw-prod:dubbing/manager |
| Longhorn | longhorn_setting:cluster:abc:storage-over-provisioning-percentage |
| TLS | tls_k8s:cluster:abc:default/wildcard.example.com |
| HA control plane | cp_promotion_failed:host:14 |

When the runner picks a check_key it is committing to a stable identity across cycles. Choosing it wrong is the most common source of flapping bugs (see Edge cases).

What suppression keys off

backend/app/services/health/suppression.py parses check_key into (check_type, target) and matches against per-rule globs. Suppression is matched by check_key, not by row ID, so a rule keeps suppressing even if the underlying row is resolved-and-recreated.
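As an illustration of matching by key rather than by row ID, here is a small sketch. It assumes that check_type is everything before the first colon and that a rule is a (check_type, target glob) pair; both the rule shape and the function names are hypothetical and are not taken from suppression.py.

```python
from fnmatch import fnmatchcase


def parse_check_key(check_key):
    """Split 'type:rest' into (check_type, target) at the first colon."""
    check_type, _, target = check_key.partition(":")
    return check_type, target


def is_suppressed(check_key, rules):
    """rules: iterable of (check_type, target_glob) pairs (assumed shape)."""
    check_type, target = parse_check_key(check_key)
    return any(
        rule_type == check_type and fnmatchcase(target, target_glob)
        for rule_type, target_glob in rules
    )


rules = [("longhorn_setting", "cluster:abc:*")]

print(parse_check_key("dns_mismatch:rule:42"))
print(is_suppressed(
    "longhorn_setting:cluster:abc:storage-over-provisioning-percentage", rules
))
print(is_suppressed("longhorn_setting:cluster:xyz:some-setting", rules))
```

Because nothing in the match depends on a row ID, the same rule applies to whichever row currently carries the key, including one recreated after cleanup.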

Severity ladder

Severities are, from most to least severe: critical, error, warning, info. The ladder is consulted in three places:

  • Auto-resolve sweep treats severities equally: a critical with no matching finding still goes through the same clean_cycles dampener.
  • Email digest filters severity >= configured_min_severity and only emails newly created rows (those with created_at > last_digest_at). Re-appearing rows that were already resolved also come through as new (see Re-appearance).
  • Bell badge picks the highest unread severity to color the badge.

Severity is per-emission, not per-row. A check can emit the same check_key at warning one cycle and error the next; the upsert flips the severity in place. This is by design — TLS expiry warnings escalate to errors at 7 days remaining without producing two messages.
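The ordering behind the threshold filter and the badge color can be sketched as a simple rank table. The numeric ranks and the function names below are illustrative assumptions; only the four severity names and their order come from this page.

```python
# Assumed numeric ranks; higher means more severe.
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2, "critical": 3}


def meets_threshold(severity, min_severity):
    """True when a severity passes a configured minimum (digest-style filter)."""
    return SEVERITY_RANK[severity] >= SEVERITY_RANK[min_severity]


def badge_severity(unread_severities):
    """Highest unread severity, or None when nothing is unread."""
    return max(unread_severities, key=SEVERITY_RANK.__getitem__, default=None)


print(meets_threshold("warning", "error"))
print(badge_severity(["info", "error", "warning"]))
```

Comparing ranks rather than strings matters here, since the alphabetical order of the names does not match the ladder.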

clean_cycles: the flap dampener

This is the single most important field for understanding why the bell looks quiet. The runner increments and resolves in a two-phase sweep (runner.py:401–464), deliberately ordered to avoid a race.

The two SQL statements run in this order: resolve first, then increment. If we incremented first, a row at threshold-1 could be pushed to threshold and resolved in the same cycle it became absent, defeating the whole point of the dampener. The current order means: a row needs to be absent for threshold distinct cycles before it can be resolved.

threshold defaults to 2 (≈30 min wall-clock) and is settable per cluster via health_checks.clean_cycles_to_resolve in system_config. The "immediate resolve" admin button in the UI hits the same path with threshold = 0.
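The effect of the ordering is easiest to see in a toy model. The sketch below is an in-memory simulation under stated assumptions, not the runner's SQL: a row is a plain dict, "resolve" fires when clean_cycles >= threshold, and "increment" touches only rows that are still active.

```python
def sweep(row, present, threshold, resolve_first=True):
    """Run one cycle on a dict row with 'clean_cycles' and 'resolved' keys."""
    if row["resolved"]:
        return
    if present:
        # The finding was emitted again: the upsert resets the dampener.
        row["clean_cycles"] = 0
        return

    def resolve():
        if row["clean_cycles"] >= threshold:
            row["resolved"] = True

    def increment():
        if not row["resolved"]:
            row["clean_cycles"] += 1

    steps = (resolve, increment) if resolve_first else (increment, resolve)
    for step in steps:
        step()


def cycles_until_resolved(threshold, resolve_first):
    """Count consecutive absent cycles until the row resolves."""
    row = {"clean_cycles": 0, "resolved": False}
    cycles = 0
    while not row["resolved"]:
        cycles += 1
        sweep(row, present=False, threshold=threshold, resolve_first=resolve_first)
    return cycles


for threshold in (0, 1, 2):
    print(
        threshold,
        cycles_until_resolved(threshold, resolve_first=True),
        cycles_until_resolved(threshold, resolve_first=False),
    )
```

In this model, resolve-first with threshold = 1 keeps the row active through the first absent cycle, whereas increment-first would resolve it in the very cycle it became absent. With threshold = 0 the row resolves on the first sweep either way, which matches the "immediate resolve" behaviour described above.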

The category-scope subtlety

The sweep only operates on rows whose category ran in the current cycle. A row whose category never runs is structurally unreachable by the sweep — it cannot be incremented and cannot be resolved.

This was the root cause of #915: admission-bypass audit messages were inserted with category='admission_bypass', but no check emits that category, so once written they were stuck active forever. The fix was to insert audit rows already-resolved (resolved_at = NOW(), resolution_kind = 'auto') so they remain visible in timeline views without polluting the active list.

Rule of thumb: anything that lives in system_messages must be emit-able by some check, or written already-resolved. Halfway is a bug.
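The rule of thumb lends itself to a simple audit. The sketch below assumes rows are available as dicts with category and resolved_at fields and that the set of categories emitted by checks is known; the function name and the sample category values are hypothetical.

```python
def unreachable_rows(rows, emitted_categories):
    """Return active rows whose category no check emits.

    Such rows can be neither incremented nor resolved by the sweep.
    """
    return [
        row
        for row in rows
        if row["resolved_at"] is None and row["category"] not in emitted_categories
    ]


rows = [
    {"check_key": "crashloop:pw-prod:dubbing/manager",
     "category": "pod_health", "resolved_at": None},
    {"check_key": "admission_bypass:example",
     "category": "admission_bypass", "resolved_at": None},
    {"check_key": "admission_bypass:older",
     "category": "admission_bypass", "resolved_at": "2024-01-01T00:00:00Z"},
]

stuck = unreachable_rows(rows, emitted_categories={"pod_health", "tls"})
print([row["check_key"] for row in stuck])
```

Only the active admission_bypass row is reported. The already-resolved one is fine, which is the shape the #915 fix relies on.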

Auto-resolve in detail

The sweep at runner.py:405–464 actually has four branches, not two — because the runner separates "I have results" from "this category is empty this cycle":

| Branch | When | Action |
| --- | --- | --- |
| 1 (upsert) | Result emitted with this check_key | clean_cycles := 0, resolved_at := NULL |
| 2 (increment, present cats) | Active row in a category that ran, key not in results, clean_cycles < threshold | clean_cycles += 1 |
| 3 (resolve, present cats) | Same as 2 but clean_cycles >= threshold | resolved_at := NOW(), resolution_kind := 'auto' |
| 4 (empty cats) | Active row in a category that ran but emitted zero results; same logic as 2/3 | Increment or resolve |

Branches 2/3 and 4 apply the same logic but are distinct cases, because the SQL WHERE clause matches both on the category list and on whether check_key != ALL(active_keys); for a category that emitted nothing, there are no keys to exclude. The 4-branch shape lets a check that detects nothing in a cycle still drive prior findings to resolution.
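The branch table can be restated as a small classifier. This is a sketch of the decision logic only, with hypothetical names; it adds an "untouched" outcome for rows whose category did not run, as described under the category-scope subtlety.

```python
def absent_from(check_key, active_keys):
    """Python analogue of `check_key != ALL(active_keys)`.

    Vacuously True for an empty list, which is the empty-category case.
    """
    return all(check_key != key for key in active_keys)


def sweep_action(row, ran_categories, active_keys, threshold):
    """Classify what one cycle does to an active row."""
    if not absent_from(row["check_key"], active_keys):
        return "upsert"        # branch 1: finding emitted again
    if row["category"] not in ran_categories:
        return "untouched"     # category did not run: out of the sweep's reach
    if row["clean_cycles"] >= threshold:
        return "resolve"       # branch 3, or the resolve half of branch 4
    return "increment"         # branch 2, or the increment half of branch 4


row = {"check_key": "dns_mismatch:rule:42", "category": "drift", "clean_cycles": 2}

print(sweep_action(row, {"drift"}, ["dns_mismatch:rule:42"], threshold=2))
print(sweep_action(row, {"drift"}, [], threshold=2))
print(sweep_action(row, {"tls"}, [], threshold=2))
```

The second call is the empty-category case: with no active keys at all, the row is still reachable and gets resolved.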

What category is a row in?

The category column is set on insert and never changes. A row's category must match the categories the runner enumerates per cycle (see the enabled_categories setting in Settings → Drift Detection). If you rename a category, every existing row with the old value goes structurally unreachable — same shape as #915. Don't do that without a migration.

resolution_kind: how was this fixed?

Migration 079 (backend/migrations/079_resolution_kind.sql) added a CHECK constraint:

ALTER TABLE system_messages
    ADD CONSTRAINT system_messages_resolution_kind_check
    CHECK (resolution_kind IN ('auto', 'fix', 'acceptance'));

The three values mean:

  • auto — the issue stopped being detected; the dampener resolved it. Set only by the runner sweep.
  • fix — Doctor applied a real on-host change that should have removed the underlying condition (e.g. patched a Longhorn setting, restarted a pod, deleted an orphaned replica).
  • acceptance — Doctor accepted the message without fixing the underlying state — used for actions like "Abandon" on a failed CP promotion, where the operator decides "I'll handle the host manually, just clear the row."

The Doctor executor writes resolution_kind inline within the same transaction it holds the row lock in (executor.py:491–497). Recipes do not write system_messages themselves — see Doctor internals.

Re-appearance: a "resolved" row that comes back

If a previously-resolved row's check_key shows up again, the upsert sets resolved_at = NULL, clean_cycles = 0. This is intentional: from the operator's point of view the issue is back, and they should see it in the active list again. Judged by created_at alone, the email digest would not treat it as a new event, because the row's created_at is older than last_digest_at. The runner therefore also tracks last_seen_at (used by the digest), so re-appearances are re-emailed.

This also means resolution_kind is not durable history — if the row flaps back to active, the kind is reset on the next resolution. If you need durable history, query doctor_executions (audit trail of every Doctor apply) joined to system_messages.id.

7-day cleanup

Resolved rows older than 7 days are deleted by a daily housekeeping job. The cleanup window matters in two cases:

  • Linking from issue reports. A "this happened last Tuesday" report may no longer have a row — fall back to doctor_executions (preserved longer) or the API access log.
  • Suppression. Suppression rules outlive rows. Re-creating a row matching a still-suppressed glob will be suppressed silently on first re-appearance.

The 7-day window also bounds the size of the system_messages table — a dense incident week can produce ~10k rows per cluster, and we have observed ~50–80k active+resolved on healthy long-running deployments. Index maintenance on (check_key), (resolved_at), and (category, resolved_at) all assume this scale.
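The retention rule itself is a single cutoff comparison. The sketch below shows that selection in Python under the assumption that resolved_at is a timezone-aware datetime (or None for active rows); the housekeeping job's actual implementation is not shown on this page, and the dates are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)


def rows_to_delete(rows, now):
    """Select resolved rows older than the retention window; active rows stay."""
    cutoff = now - RETENTION
    return [
        row
        for row in rows
        if row["resolved_at"] is not None and row["resolved_at"] < cutoff
    ]


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
rows = [
    {"id": 1, "resolved_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "resolved_at": datetime(2024, 1, 10, tzinfo=timezone.utc)},
    {"id": 3, "resolved_at": None},
]

expired = rows_to_delete(rows, now)
print([row["id"] for row in expired])
```

Active rows are never candidates, however old they are; only the age of the resolution counts.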

Schema evolution

If you are tracing why a column behaves a certain way, here is the migration genealogy. Every change had a reason.

| Migration | What it added | Why |
| --- | --- | --- |
| 040_system_messages.sql | Base table | Replace ad-hoc alert dicts |
| 045_clean_cycles.sql | clean_cycles column + threshold default 2 | Stop flapping (the original sin) |
| 049_suppression.sql | system_message_suppressions table | Operators wanted to mute known-noisy findings |
| 075_check_type.sql | check_type column extracted from check_key | Doctor recipe lookup needs a stable type |
| 078_cluster_id_fk_cascade.sql | ON DELETE CASCADE on cluster_id | Deleting a cluster shouldn't leave orphan rows |
| 079_resolution_kind.sql | resolution_kind CHECK constraint | Distinguish auto vs Doctor outcomes |

check_type (075) and resolution_kind (079) together are what make the Doctor framework possible — without check_type the registry can't dispatch, and without resolution_kind the runner can't distinguish a real fix from a noise-resolved row.