Lifecycle of a system message

From check emission to 7-day cleanup — fingerprints, the clean_cycles dampener, severity ladders, and resolution_kind.

A system_messages row is a small state machine driven by one writer (the health-check runner) and one optional intervener (the Doctor executor). This page walks the row through every transition it can make.

The state machine

Three terminal-ish states (auto, fix, acceptance) all eventually hit the 7-day cleanup, and any of them can flip back to Active if the underlying condition reappears — system_messages is the source of truth for "currently broken", not "ever broken".

The fingerprint contract: `check_key`

Every CheckResult carries a check_key. The runner's upsert (runner.py:345) keys on this column:

```sql
INSERT INTO system_messages
    (check_key, check_type, severity, category, title, details, payload, ...)
VALUES (...)
ON CONFLICT (check_key) DO UPDATE SET
    severity = EXCLUDED.severity,
    title    = EXCLUDED.title,
    payload  = EXCLUDED.payload,
    -- crucially:
    resolved_at  = NULL,
    clean_cycles = 0,
    updated_at   = NOW();
```

The fingerprint determines identity. Two findings with the same check_key are the same row, even across runner restarts and weeks of uptime. The format varies by category but always includes enough discriminator to distinguish findings of the same check_type:

Category	Example `check_key`
Infrastructure drift	`dns_mismatch:rule:42`
Cluster baseline	`k3s_arg:host:7:write-kubeconfig-mode`
Pod health	`crashloop:pw-prod:dubbing/manager`
Longhorn	`longhorn_setting:cluster:abc:storage-over-provisioning-percentage`
TLS	`tls_k8s:cluster:abc:default/wildcard.example.com`
HA control plane	`cp_promotion_failed:host:14`

When the runner picks a check_key it is committing to a stable identity across cycles. Choosing it wrong is the most common source of flapping bugs (see Edge cases).
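As an illustration of what "stable identity" means in practice, here is a minimal sketch of a key builder. `make_check_key` is a hypothetical helper, not a function confirmed to exist in runner.py, and the colon-separated layout is inferred from the examples in the table above.

```python
def make_check_key(check_type: str, *discriminators: object) -> str:
    """Join a check_type and its discriminators into a colon-separated key.

    Hypothetical helper: the layout mirrors the example keys in the table,
    it is not taken from the real runner.
    """
    if not check_type or ":" in check_type:
        raise ValueError("check_type must be non-empty and contain no ':'")
    parts = [str(p) for p in discriminators]
    if not parts or "" in parts:
        raise ValueError("at least one non-empty discriminator is required")
    return ":".join([check_type, *parts])


# Stable: built only from identifiers that survive across cycles.
stable = make_check_key("crashloop", "pw-prod", "dubbing/manager")

# Unstable (illustrative): a generated pod suffix changes on every restart,
# so each restart would create a new row instead of updating the old one.
unstable = make_check_key("crashloop", "pw-prod", "dubbing/manager-7f9c4-x2k1q")
```

The second key is the kind of choice that tends to flap: the old row is never emitted again and is left to the dampener, while a fresh row appears under the new key.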

What suppression keys off

backend/app/services/health/suppression.py parses check_key into (check_type, target) and matches against per-rule globs. Suppression is matched by check_key, not by row ID, so a rule keeps suppressing even if the underlying row is resolved-and-recreated.
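The matching described above can be sketched as follows. This is an assumption-laden illustration, not the contents of suppression.py: the `SuppressionRule` shape, the split at the first colon, and the use of `fnmatch`-style globs are all guesses made for the example.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Tuple


@dataclass(frozen=True)
class SuppressionRule:
    # Hypothetical rule shape; the real columns are not shown on this page.
    check_type: str
    target_glob: str


def parse_check_key(check_key: str) -> Tuple[str, str]:
    """Split a key into (check_type, target) at the first colon (assumed)."""
    check_type, _, target = check_key.partition(":")
    return check_type, target


def is_suppressed(check_key: str, rules: List[SuppressionRule]) -> bool:
    """True if any rule matches the key's type exactly and its target by glob."""
    check_type, target = parse_check_key(check_key)
    return any(
        rule.check_type == check_type and fnmatchcase(target, rule.target_glob)
        for rule in rules
    )


rules = [SuppressionRule("crashloop", "pw-prod:*")]
muted = is_suppressed("crashloop:pw-prod:dubbing/manager", rules)
```

Because the function takes only the `check_key`, it gives the same answer for a row that was resolved and later recreated, which is the property the paragraph above relies on.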

Severity ladder

Severities, from most to least severe, are critical, error, warning, and info. The ladder matters in three places:

Auto-resolve sweep treats severities equally — a critical with no matching finding still goes through the same clean_cycles dampener.
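Email digest filters severity >= configured_min_severity and only emails newly created rows (those with created_at > last_digest_at). Rows that re-appear after being resolved keep their original created_at, so the digest picks them up through last_seen_at instead and reports them as new (see Re-appearance).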
Bell badge picks the highest unread severity to color the badge.

Severity is per-emission, not per-row. A check can emit the same check_key at warning one cycle and error the next; the upsert flips the severity in place. This is by design — TLS expiry warnings escalate to errors at 7 days remaining without producing two messages.
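The ladder comparisons above can be sketched in a few lines. The rank mapping and all three function names are hypothetical; only the four severity names and the 7-day TLS escalation point come from the text.

```python
from typing import List, Optional

# Assumed numeric ranks; higher means more severe.
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2, "critical": 3}


def meets_threshold(severity: str, min_severity: str) -> bool:
    """Digest-style filter: severity >= configured minimum."""
    return SEVERITY_RANK[severity] >= SEVERITY_RANK[min_severity]


def badge_severity(unread: List[str]) -> Optional[str]:
    """Bell-style pick: the highest unread severity, or None if nothing is unread."""
    return max(unread, key=SEVERITY_RANK.__getitem__, default=None)


def tls_severity(days_remaining: int, error_at_days: int = 7) -> str:
    """Per-emission severity for one check_key: same key, different level.

    The 7-day escalation point is from the text; treating everything above
    it as a warning is an assumption of this sketch.
    """
    return "error" if days_remaining <= error_at_days else "warning"
```

A check that calls something like `tls_severity` on every cycle emits the same `check_key` throughout, so the upsert changes the severity of the one existing row rather than creating a second message.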

`clean_cycles`: the flap dampener

This is the single most important field for understanding why the bell looks quiet. The runner increments and resolves in a two-phase sweep (runner.py:401–464) deliberately ordered to avoid a race:

The two SQL statements run in this order: resolve first, then increment. If we incremented first, a row at threshold-1 would be pushed to threshold and resolved within the same sweep, one cycle sooner than intended, which weakens the dampener (and with threshold = 1 resolves a row in the very cycle it became absent). With the current order, a row must have been counted absent in threshold distinct cycles before a later sweep can resolve it.

threshold defaults to 2 (≈30 min wall-clock) and is settable per cluster via health_checks.clean_cycles_to_resolve in system_config. The "immediate resolve" admin button in the UI hits the same path with threshold = 0.
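To make the ordering argument concrete, here is a small in-memory simulation. It is a sketch under stated assumptions, not the SQL in runner.py: active rows are modelled as a dict from `check_key` to `clean_cycles`, and `sweep` and `cycles_until_resolved` are hypothetical names.

```python
from typing import Dict, List, Optional, Set


def sweep(rows: Dict[str, int], present: Set[str], threshold: int,
          resolve_first: bool = True) -> List[str]:
    """Run one sweep over active rows and return the keys resolved by it.

    rows maps check_key -> clean_cycles and is mutated in place.
    """
    def resolve() -> List[str]:
        done = [k for k, c in rows.items() if k not in present and c >= threshold]
        for k in done:
            del rows[k]  # resolved rows leave the active set
        return done

    def increment() -> None:
        for k in rows:
            if k not in present:
                rows[k] += 1

    if resolve_first:
        resolved = resolve()
        increment()
    else:
        increment()
        resolved = resolve()
    return resolved


def cycles_until_resolved(threshold: int, resolve_first: bool) -> Optional[int]:
    """Count absent cycles until a freshly absent row is resolved."""
    rows = {"dns_mismatch:rule:42": 0}
    for cycle in range(1, 50):
        if sweep(rows, present=set(), threshold=threshold, resolve_first=resolve_first):
            return cycle
    return None
```

With `threshold=2`, the resolve-first order resolves the row on its third absent cycle, after it has been counted absent twice; the increment-first order would resolve it on the second. With `threshold=0`, the row is resolved on the first sweep, which matches the "immediate resolve" behaviour described above.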

The category-scope subtlety

The sweep only operates on rows whose category ran in the current cycle. A row whose category never runs is structurally unreachable by the sweep — it cannot be incremented and cannot be resolved.

This was the root cause of #915: admission-bypass audit messages were inserted with category='admission_bypass', but no check emits that category, so once written they were stuck active forever. The fix was to insert audit rows already-resolved (resolved_at = NOW(), resolution_kind = 'auto') so they remain visible in timeline views without polluting the active list.

Rule of thumb: anything that lives in system_messages must be emit-able by some check, or written already-resolved. Halfway is a bug.
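The rule of thumb lends itself to a simple consistency check. The sketch below is hypothetical (there is no indication such a check exists in the codebase): it flags active rows whose category no check emits. The `Row` shape, the `pod_health` category name, and the audit row's `check_key` are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Set


@dataclass
class Row:
    check_key: str
    category: str
    resolved_at: Optional[datetime] = None  # None means the row is active


def unreachable_active_rows(rows: Iterable[Row], emittable: Set[str]) -> List[Row]:
    """Active rows in categories that never run: the sweep cannot touch them."""
    return [r for r in rows if r.resolved_at is None and r.category not in emittable]


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
rows = [
    Row("crashloop:pw-prod:dubbing/manager", "pod_health"),
    # The #915 shape: written active, but nothing emits this category.
    Row("admission_bypass:cluster:abc", "admission_bypass"),
    # The fixed shape: written already-resolved, so it never enters the active list.
    Row("admission_bypass:cluster:def", "admission_bypass", resolved_at=now),
]
stuck = unreachable_active_rows(rows, emittable={"pod_health"})
```

Only the second row is reported; the third is harmless because it was inserted already-resolved.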

Auto-resolve in detail

The sweep at runner.py:405–464 actually has four branches, not two — because the runner separates "I have results" from "this category is empty this cycle":

Branch	When	Action
1 (upsert)	Result emitted with this `check_key`	`clean_cycles := 0`, `resolved_at := NULL`
2 (increment, present cats)	Active row in a category that ran, key not in results, `clean_cycles < threshold`	`clean_cycles += 1`
3 (resolve, present cats)	Same as 2 but `clean_cycles >= threshold`	`resolved_at := NOW(), resolution_kind := 'auto'`
4 (empty cats)	Active row in a category that ran but emitted zero results — same logic as 2/3	Increment or resolve

Branches 2 and 4 apply the same logic but are distinct code paths: the SQL WHERE clause matches on the category list and on check_key != ALL(active_keys), and branch 4 covers the case where a category ran but contributed no keys. The 4-branch shape lets a check that detects nothing in a cycle still drive prior findings to resolution.
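The branch table can be restated as a small classifier. This is a sketch of the decision logic only, with hypothetical names; the real sweep is set-based SQL, and the "skip" outcome for categories that did not run is taken from the category-scope section rather than from the table.

```python
from typing import Dict, Optional, Set, Tuple


def classify(check_key: str, category: str, clean_cycles: int,
             results_by_category: Dict[str, Set[str]],
             threshold: int) -> Tuple[Optional[int], str]:
    """Return (branch number, action) for one active row.

    results_by_category maps every category that ran this cycle to the set
    of check_keys it emitted; a category that ran clean maps to an empty set.
    """
    if category not in results_by_category:
        return None, "skip"  # category did not run: structurally unreachable
    emitted = results_by_category[category]
    if check_key in emitted:
        return 1, "upsert"
    action = "resolve" if clean_cycles >= threshold else "increment"
    if not emitted:
        return 4, action  # category ran but emitted zero results
    return (3 if action == "resolve" else 2), action


ran = {"pod_health": {"crashloop:pw-prod:dubbing/manager"}, "tls": set()}
```

For example, `classify("tls_k8s:cluster:abc:default/wildcard.example.com", "tls", 2, ran, threshold=2)` lands in branch 4 and resolves, which is the case the empty-category path exists for.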

What `category` is a row in?

The category column is set on insert and never changes. A row's category must match the categories the runner enumerates per cycle (see the enabled_categories setting in Settings → Drift Detection). If you rename a category, every existing row with the old value goes structurally unreachable — same shape as #915. Don't do that without a migration.

`resolution_kind`: how was this fixed?

Migration 079 (backend/migrations/079_resolution_kind.sql) added a CHECK constraint:

```sql
ALTER TABLE system_messages
    ADD CONSTRAINT system_messages_resolution_kind_check
    CHECK (resolution_kind IN ('auto', 'fix', 'acceptance'));
```

The three values mean:

auto — the issue stopped being detected; the dampener resolved it. Set only by the runner sweep.
fix — Doctor applied a real on-host change that should have removed the underlying condition (e.g. patched a Longhorn setting, restarted a pod, deleted an orphaned replica).
acceptance — Doctor accepted the message without fixing the underlying state — used for actions like "Abandon" on a failed CP promotion, where the operator decides "I'll handle the host manually, just clear the row."

The Doctor executor writes resolution_kind inline, in the same transaction that holds the row lock (executor.py:491–497). Recipes do not write system_messages themselves; see Doctor internals.
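The effect of the constraint can be tried out locally. The sketch below uses an in-memory SQLite database as a stand-in for the real one, with a heavily reduced table; the `resolve` helper is hypothetical and does not model the row lock the executor takes. It assumes active rows carry a NULL `resolution_kind`, which the constraint permits because a CHECK only rejects rows for which the expression is false.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE system_messages (
        id INTEGER PRIMARY KEY,
        check_key TEXT UNIQUE NOT NULL,
        resolved_at TEXT,
        resolution_kind TEXT
            CHECK (resolution_kind IN ('auto', 'fix', 'acceptance'))
    )
    """
)
with conn:
    conn.execute(
        "INSERT INTO system_messages (check_key) VALUES (?)",
        ("cp_promotion_failed:host:14",),
    )

# An active row has no resolution_kind yet; NULL passes the CHECK.
kind_while_active = conn.execute(
    "SELECT resolution_kind FROM system_messages"
).fetchone()[0]


def resolve(db: sqlite3.Connection, check_key: str, kind: str) -> None:
    """Set resolved_at and resolution_kind together, in one transaction."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "UPDATE system_messages "
            "SET resolved_at = datetime('now'), resolution_kind = ? "
            "WHERE check_key = ?",
            (kind, check_key),
        )


resolve(conn, "cp_promotion_failed:host:14", "acceptance")

try:
    resolve(conn, "cp_promotion_failed:host:14", "ignored")  # not an allowed kind
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The second call fails at the database layer, so an unknown kind cannot reach the table even if a caller passes one by mistake.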

Re-appearance: a "resolved" row that comes back

If a previously-resolved row's check_key shows up again, the upsert sets resolved_at = NULL and clean_cycles = 0. This is intentional: from the operator's point of view the issue is back, and they should see it in the active list again. A created_at comparison alone would not flag the row as new, because its created_at is older than last_digest_at; the runner therefore also tracks last_seen_at (used by the digest), so re-appearances are re-emailed.

This also means resolution_kind is not durable history: if the row flaps back to active, the stored kind is overwritten by the next resolution. If you need durable history, query doctor_executions (the audit trail of every Doctor apply) joined to system_messages.id.
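The loss of history can be seen in a small in-memory model of one row's life. This is a sketch, not the real upsert: the table is a dict keyed by `check_key`, the function names are hypothetical, and refreshing `last_seen_at` on every upsert is an assumption. As in the SET clause shown earlier, the upsert here leaves `resolution_kind` untouched.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

Table = Dict[str, dict]


def upsert(table: Table, check_key: str, severity: str, now: datetime) -> dict:
    """Create the row on first sight; otherwise re-activate the existing one."""
    row = table.get(check_key)
    if row is None:
        row = table[check_key] = {"created_at": now, "resolution_kind": None}
    row.update(severity=severity, resolved_at=None, clean_cycles=0, last_seen_at=now)
    return row


def resolve(table: Table, check_key: str, kind: str, now: datetime) -> None:
    """Mark the row resolved, overwriting any earlier resolution_kind."""
    table[check_key].update(resolved_at=now, resolution_kind=kind)


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
table: Table = {}
key = "longhorn_setting:cluster:abc:storage-over-provisioning-percentage"

upsert(table, key, "warning", t0)                        # first detection
resolve(table, key, "fix", t0 + timedelta(hours=1))      # Doctor fixes it
upsert(table, key, "warning", t0 + timedelta(hours=5))   # condition re-appears
resolve(table, key, "auto", t0 + timedelta(hours=6))     # dampener resolves it
```

After the last call the row still has its original `created_at`, but `resolution_kind` is `auto` and nothing in the row records that a Doctor fix ever happened.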

7-day cleanup

Resolved rows older than 7 days are deleted by a daily housekeeping job. The cleanup window matters in two cases:

Linking from issue reports. A "this happened last Tuesday" report may no longer have a row — fall back to doctor_executions (preserved longer) or the API access log.
Suppression. Suppression rules outlive rows. A re-created row that matches a still-active suppression glob is suppressed silently on its first re-appearance.

The 7-day window also bounds the size of the system_messages table: a dense incident week can produce ~10k rows per cluster, and we have observed ~50–80k active+resolved rows on healthy long-running deployments. Index maintenance on (check_key), (resolved_at), and (category, resolved_at) assumes this scale.
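The retention rule itself is simple enough to sketch. The function below is a hypothetical in-memory equivalent of the housekeeping job, assuming that rows resolved longer ago than the window are deleted and that active rows are never touched, whatever their age.

```python
from datetime import datetime, timedelta, timezone
from typing import List

RETENTION = timedelta(days=7)  # the 7-day window from the text


def cleanup(rows: List[dict], now: datetime) -> List[dict]:
    """Return the rows that survive: active ones, and those resolved within the window."""
    cutoff = now - RETENTION
    return [
        r for r in rows
        if r["resolved_at"] is None or r["resolved_at"] >= cutoff
    ]


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
rows = [
    {"check_key": "dns_mismatch:rule:42", "resolved_at": None},
    {"check_key": "cp_promotion_failed:host:14", "resolved_at": now - timedelta(days=6)},
    {"check_key": "crashloop:pw-prod:dubbing/manager", "resolved_at": now - timedelta(days=8)},
]
kept = cleanup(rows, now)
```

Only the row resolved eight days ago is dropped, which is why an incident report older than a week has to be traced through `doctor_executions` instead.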

Schema evolution

If you are tracing why a column behaves a certain way, here is the migration genealogy. Every change had a reason.

Migration	What it added	Why
`040_system_messages.sql`	Base table	Replace ad-hoc alert dicts
`045_clean_cycles.sql`	`clean_cycles` column + threshold default 2	Stop flapping (the original sin)
`049_suppression.sql`	`system_message_suppressions` table	Operators wanted to mute known-noisy findings
`075_check_type.sql`	`check_type` column extracted from `check_key`	Doctor recipe lookup needs a stable type
`078_cluster_id_fk_cascade.sql`	`ON DELETE CASCADE` on `cluster_id`	Deleting a cluster shouldn't leave orphan rows
`079_resolution_kind.sql`	`resolution_kind` CHECK constraint	Distinguish auto vs Doctor outcomes

check_type (075) and resolution_kind (079) together are what make the Doctor framework possible — without check_type the registry can't dispatch, and without resolution_kind the runner can't distinguish a real fix from a noise-resolved row.