Edge cases & failure modes

Every flapping, stuck, orphan, and structurally-unreachable bug the health system has shipped a fix for — with root causes and how the design now handles them.

This page is a curated postmortem catalogue. Every entry below describes a real failure mode we hit in production or staging, the root cause, the fix shipped, and what to look for if you suspect a similar problem.

It is organised by failure shape because that is how you will encounter them in the wild — "the bell never stops re-pinging" is a flapping bug regardless of which check is involved.

1. Flapping (clean → dirty → clean → dirty)

Symptom. A finding repeatedly auto-resolves and then re-emits one or two cycles later. Email digest churn. Operators learn to ignore the bell.

Common root causes.

check_key is computed from a non-stable input — e.g. an ephemeral pod UID, a port number that varies across restarts, a slot index in state that gets reordered. Each cycle generates a fresh fingerprint, the runner believes the prior one resolved, and a "new" finding is emitted. Look for check_key strings with timestamps, UUIDs, or positional indices.
Threshold too aggressive. Setting clean_cycles_to_resolve = 1, or lowering the per-cluster setting, defeats the dampener. The default is 2 for a reason; see Lifecycle.
The check itself is non-deterministic. A check that calls a Kubernetes API with no retry on transient timeouts will report "missing" on a slow cycle and "present" on the next. Wrap external calls in a single-retry block at the check layer, not the runner layer.
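The single-retry advice in the last item above can be sketched as follows. This is a minimal illustration, not the project's actual helper: the function name call_with_single_retry, its parameters, and the choice of which exceptions count as transient are all assumptions.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_single_retry(
    fetch: Callable[[], Awaitable[T]],
    *,
    timeout_s: float = 5.0,
    retry_delay_s: float = 0.5,
) -> T:
    """Run fetch() under a timeout; on a transient failure, retry exactly once.

    Hypothetical helper: `fetch` stands in for whatever external call the
    check makes (for example a Kubernetes API read).
    """
    try:
        return await asyncio.wait_for(fetch(), timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        # Assumed to be transient: wait briefly, then try one more time.
        await asyncio.sleep(retry_delay_s)
        # Second and final attempt. Any error now propagates to the caller,
        # so the check can skip the cycle instead of reporting "missing".
        return await asyncio.wait_for(fetch(), timeout=timeout_s)
```

Keeping the retry inside the check means the runner still sees one result per cycle, and a check that fails twice surfaces as an error rather than as a spurious finding.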

How we caught one. #176 and #208 revealed that the change-type detector in MR descriptions was matching keywords across multiple H2 sections — release MRs with a "## Features" heading were being flagged as "Feature" change-type. The fix scoped the regex to the ## Change Type section only. The same pattern shape — "we matched something that mentioned the keyword, not the structural element" — appears in check_key derivation when a fingerprint string contains substrings that vary across cycles.
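To make the check_key point concrete, here is a small sketch contrasting a key built from per-restart values with one built from identity fields only. The Observation type, its fields, and both function names are hypothetical and exist only for illustration; the real key format may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Hypothetical shape of what a check sees for one resource."""
    kind: str
    namespace: str
    name: str
    pod_uid: str      # changes whenever the pod is recreated
    slot_index: int   # changes whenever the state list is reordered


def unstable_check_key(obs: Observation) -> str:
    # Anti-pattern: embeds values that differ between cycles.
    return f"{obs.kind}:{obs.pod_uid}:{obs.slot_index}"


def stable_check_key(obs: Observation) -> str:
    # Uses only fields that identify the resource across restarts.
    return f"{obs.kind}:{obs.namespace}/{obs.name}"
```

Two observations of the same resource before and after a pod restart produce different unstable keys but the same stable key, which is what lets the upsert match the existing row.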

How to confirm in prod.

-- Find rows that have flapped within the last 24h
SELECT check_key, COUNT(*) FILTER (WHERE resolution_kind = 'auto') AS auto_resolved
FROM system_messages_history -- if you don't have history, see doctor_executions
WHERE updated_at > NOW() - INTERVAL '24 hours'
GROUP BY check_key
HAVING COUNT(*) FILTER (WHERE resolution_kind = 'auto') > 2;

A check_key that auto-resolved more than twice in 24h is a flap candidate.

2. Stuck active (issue gone, row never resolves)

Symptom. A finding's underlying condition has been fixed (verified manually with kubectl/SSH) but the row stays active for days.

Common root causes.

Category mismatch. The row's category does not match any check the runner enumerates. The auto-resolve sweep operates per-category, so a row whose category never runs is structurally unreachable. (#915: admission-bypass audit messages with category='admission_bypass' and no health check emitting that category.)
check_key drift on the fix side. The check still emits the finding, but with a slightly different check_key than what is in the table. The old row stays active because nothing matches it; the new row appears alongside it.
A check was disabled at runtime. If enabled_categories excludes the row's category, the row is structurally unreachable until the category is re-enabled.

The fix pattern. Either make the row reachable by the sweep (emit the same check_key consistently, or run the corresponding check), or write the row already-resolved at insert time (resolved_at = NOW(), resolution_kind = 'auto') so it lives in the timeline but not the active list.

How to confirm in prod.

-- Active rows older than a week — most are real, some are stuck
SELECT category, check_key, created_at, clean_cycles
FROM system_messages
WHERE resolved_at IS NULL
  AND created_at < NOW() - INTERVAL '7 days'
ORDER BY created_at;

-- For each, verify the underlying condition still exists.
-- If it doesn't, you have a structurally-unreachable row.

If clean_cycles is stuck at 0 for a row whose category clearly ran multiple times since created_at, the upsert is firing every cycle — i.e. the check still emits — and the issue is real, not stuck.

If clean_cycles is stuck at 0 and the category does not appear in the recent health_check_runs log, you have a category-scope bug. Open an issue.

3. Orphans (live resource without a tracking row, or vice versa)

There are two flavours, and they are the inverse of each other:

3a. Live resource without a row. A k8s resource exists that PodWarden does not know about. This is the "unmanaged X" family of findings — Unmanaged IngressRoute, Unmanaged Ingress, Orphaned Longhorn replica, Unknown K8s node. These are correct alerts; the fix is either to import the resource into PodWarden or delete it. The delete_unmanaged_workload and orphaned_replica recipes provide one-click delete.

3b. PodWarden row without a live resource. PodWarden has a tracking row but the cluster doesn't have the corresponding resource ("Ghost ingress rule", "Stale assignment"). These are usually the residue of a manual kubectl delete or a cleanup that removed the cluster-side resource but not the DB row. The fix is to clear the PodWarden record; recipes for the ghost cases auto-resolve when the on-host scan no longer sees the resource.

The trickier one: orphans of orphans. Some resources are orphaned only transitively: a Longhorn replica that belongs to no volume, a PVC that has no consumer pod. Detecting these requires walking the ownership graph, and getting the walk wrong creates false orphans (for example, during a short-lived race window where PodWarden hasn't seen the new owner yet). Mitigation:

Two-cycle confirmation. A finding must persist across two scan cycles before it is emitted. The pvc_attached_no_consumer check in infra_drift.py uses this — it tracks "candidate" findings in a per-cluster cache and only writes a system message after the second consecutive observation.
Reverse-lookup before delete. Recipes that delete an "orphan" re-verify ownership at apply time inside preview(), so the re-preview-under-lock check catches the case where the resource acquired an owner between observation and apply.
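The two-cycle confirmation described above can be sketched as a small per-cluster candidate cache. This is an illustration under assumptions, not the code in infra_drift.py: the class name TwoCycleConfirmer and its confirm() method are hypothetical.

```python
from typing import Dict, Iterable, Set


class TwoCycleConfirmer:
    """Only report a key once it was observed in two consecutive cycles."""

    def __init__(self) -> None:
        # cluster_id -> check_keys observed in the previous cycle
        self._candidates: Dict[str, Set[str]] = {}

    def confirm(self, cluster_id: str, observed: Iterable[str]) -> Set[str]:
        observed_now = set(observed)
        previous = self._candidates.get(cluster_id, set())
        # Confirmed means: seen last cycle and seen again now.
        confirmed = observed_now & previous
        # Replace the cache wholesale, so a key that disappears for one
        # cycle has to start over as a candidate.
        self._candidates[cluster_id] = observed_now
        return confirmed
```

A caller would pass every candidate key from the current scan and write system messages only for the returned set, so a resource that gains its owner between two scans never produces a row.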

How we caught one. The Cloudflare-proxied-domain false orphan (#916): DNS resolved to a Cloudflare edge IP, drift detector concluded "this domain isn't pointed at the gateway, that's drift", emitted a proxied:domain:<id> row. But Cloudflare proxying is a deliberate operator config, not drift. The fix replaced the system_messages emission with a logger.debug() and let clean_cycles >= 2 auto-resolve the existing rows.

4. Doctor: stuck in "Applying…"

Symptom. The operator clicks Apply and the modal shows "Applying…" indefinitely. Refreshing the page shows the row still active.

Common root causes.

Recipe deadlock against its own outer lock. The recipe issued UPDATE system_messages (or any write to a row covered by the outer FOR UPDATE) on a connection that was not ctx.apply_conn. The new connection waits for the outer lock while the outer transaction waits for the inner write to return: deadlock. PostgreSQL's deadlock detector only breaks cycles of lock waits between backends. Here the outer transaction is waiting in application code, not on a lock, so the detector does not fire and the apply hangs until something outside the recipe cancels it.
A subprocess (kubectl/ssh) is hanging. The recipe shelled out to a command that doesn't exit. There is no per-recipe timeout enforced by the framework — recipes are expected to use asyncio.wait_for() or set explicit subprocess timeouts.
The asyncpg connection pool is saturated. If every connection is already in an apply(), new applies queue. With max_size=20, twenty simultaneous applies can starve the pool. Most installations are far below that and the issue is academic, but on pw_prod under load we have seen it.
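For the hanging-subprocess case above, a recipe-side timeout might look like the sketch below. The helper name run_with_timeout and the demo function are hypothetical; only asyncio.wait_for() is named by this page, and sys.executable stands in for kubectl/ssh so the example runs anywhere.

```python
import asyncio
import sys
from typing import Optional, Sequence, Tuple


async def run_with_timeout(
    argv: Sequence[str], timeout_s: float
) -> Tuple[Optional[int], bytes, bytes]:
    """Run a command; kill it and re-raise if it exceeds timeout_s seconds."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=timeout_s)
    except asyncio.TimeoutError:
        try:
            proc.kill()
        except ProcessLookupError:
            pass  # the process exited on its own just as the timeout fired
        await proc.wait()  # reap the child so it does not linger as a zombie
        raise
    return proc.returncode, stdout, stderr


async def demo() -> None:
    # Stand-in command; a recipe would pass its kubectl or ssh argv here.
    rc, out, _ = await run_with_timeout([sys.executable, "-c", "print('ok')"], 10.0)
    print(rc, out)
```

Re-raising the timeout lets the recipe turn it into a failed outcome instead of leaving the modal waiting on a process that will never exit.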

Mitigations shipped. The writer-owns-resolve contract (Doctor internals) makes deadlock-against-self impossible-by-protocol if the recipe respects ctx.apply_conn. The _single_action.py mixin packages the common path so most recipes get this for free. Long-running on-host operations (k3s restart, ansible run) are expected to take 30+ seconds — the UI shows a progress indicator, not a hang, but operators should be told in the side-effects panel that a long wait is expected.

Recovery. If you suspect a stuck apply on prod, query running transactions:

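-- Include idle-in-transaction sessions: an apply that holds its lock while
-- waiting on application code is not in state 'active'.
SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE state IN ('active', 'idle in transaction')
  AND xact_start < NOW() - INTERVAL '1 minute'
ORDER BY xact_start;

A long-running transaction with query LIKE '%FOR UPDATE NOWAIT%' is either a stuck apply or a test harness that forgot to commit. For a session that is idle in transaction, the query column shows the last statement it ran. SELECT pg_cancel_backend(<pid>) is the graceful option, but it only cancels a query that is currently running; a session that is idle in transaction has nothing to cancel and needs pg_terminate_backend(<pid>) (forced), which closes the connection and rolls the transaction back.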

5. Doctor: PreviewMismatch on every Apply

Symptom. Every operator click on Apply returns 409 PreviewMismatch. The diff in the modal looks correct but the executor refuses to proceed.

Common root causes.

preview() is non-deterministic. It returns different DiffLines on identical cluster state because of dict ordering, time-sensitive formatting, or a non-stable iteration. Compare two calls to preview() against the same fixture and diff the outputs.
A background controller is racing the recipe. The Longhorn manager, Cloudflare DNS, or k3s itself is updating the same resource between the modal-open preview and the apply-time preview. This is correctly caught — operator should refresh the modal.
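The non-determinism cause above usually comes down to iteration order. The sketch below shows one way to render a diff in a stable order; the function name render_diff_lines and the plain-string line format are assumptions standing in for the real DiffLines type.

```python
from typing import Dict, List


def render_diff_lines(desired: Dict[str, str], live: Dict[str, str]) -> List[str]:
    """Render a key-by-key diff in an order that does not depend on input order."""
    lines: List[str] = []
    # Sorting the union of keys removes any dependence on dict or set ordering.
    for key in sorted(set(desired) | set(live)):
        want, have = desired.get(key), live.get(key)
        if want == have:
            continue
        if have is None:
            lines.append(f"+ {key}={want}")
        elif want is None:
            lines.append(f"- {key}={have}")
        else:
            lines.append(f"~ {key}: {have} -> {want}")
    return lines
```

Calling this twice with the same content in different insertion orders yields identical lists, which is the property the apply-time re-preview compares against. Timestamps and other time-sensitive values should be kept out of the rendered lines for the same reason.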

Mitigations shipped. RecipePreview includes safe_to_apply: bool and blocking_reason: str | None. If preview() itself can detect "the cluster is moving", return safe_to_apply=False with a reason; the UI hides the Apply button entirely rather than letting the operator click into a guaranteed-409.
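As a rough sketch of that mitigation, a preview could compare a version marker read before and after it computes the diff. Everything here except the two field names safe_to_apply and blocking_reason is hypothetical: PreviewSketch is not the real RecipePreview, and the version-marker comparison is just one assumed way to detect a moving cluster.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class PreviewSketch:
    """Illustrative stand-in for RecipePreview; not the real class."""
    diff_lines: List[str]
    safe_to_apply: bool = True
    blocking_reason: Optional[str] = None


def build_preview(
    diff_lines: Sequence[str], version_before: str, version_after: str
) -> PreviewSketch:
    # Assumption: the caller reads some version marker of the target
    # resource before and after computing the diff.
    if version_before != version_after:
        return PreviewSketch(
            diff_lines=[],
            safe_to_apply=False,
            blocking_reason="resource changed while the preview was computed",
        )
    return PreviewSketch(diff_lines=list(diff_lines))
```

Returning an explicit reason gives the UI something to show in place of the hidden Apply button.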

6. Email digest: silent / over-loud

Symptom. The digest mails are either empty when known issues exist ("silent") or send the same set of issues every cycle ("over-loud").

Common root causes.

Silent: severity threshold too high. notification_min_severity is set to error and the issues are warning-grade. Lower to warning for noisier inboxes, leave at error for quiet ones.
Silent: SMTP failure. Check the worker container logs for connection errors. The runner logs every send attempt and outcome.
Over-loud: re-appearance counts as new. A flapping finding (see section 1) re-emails on every flip. Fix the flapping at the source; raising the digest threshold is a workaround, not a fix.
Over-loud: the dampener was disabled. clean_cycles_to_resolve = 0 (immediate resolve) makes every absent-then-present cycle produce a fresh email. This setting is intended for ad-hoc admin triggers via the UI, not as a default.

7. Auto-resolve race (resolved-and-incremented in same cycle)

Symptom. A row whose clean_cycles was at threshold - 1 is resolved in the same cycle that brings it to threshold, instead of one clean cycle later.

Status. This was the original sin and is fixed in runner.py:401–464. The two SQL statements run in the order resolve first, then increment. If you ever observe this symptom, the fix has regressed and you have a real bug — open an issue and attach the affected check_key.

The reason the order matters: if we incremented first, a row at clean_cycles = threshold - 1 would be pushed to threshold and then caught by the resolve branch in the same cycle. By resolving first against the prior state, then incrementing, we guarantee that a row is resolved only after threshold absent cycles have already been recorded against it, never in the cycle that records the last one.
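The effect of the ordering can be checked with a toy model of one active row whose finding is absent every cycle. This is a simplification under assumptions, not the SQL in runner.py: it assumes the resolve condition is clean_cycles >= threshold, and both function names are hypothetical.

```python
from typing import Tuple


def run_cycle(clean_cycles: int, threshold: int, *, resolve_first: bool) -> Tuple[int, bool]:
    """Apply one absent cycle to one row. Returns (clean_cycles, resolved)."""
    if resolve_first:
        # Resolve against the state left by the previous cycle...
        if clean_cycles >= threshold:
            return clean_cycles, True
        # ...and only then record this cycle as clean.
        return clean_cycles + 1, False
    # Increment-first: the freshly bumped counter is resolved in the same cycle.
    clean_cycles += 1
    return clean_cycles, clean_cycles >= threshold


def cycles_until_resolved(threshold: int, *, resolve_first: bool) -> int:
    """Count consecutive absent cycles until the row resolves."""
    clean, cycles = 0, 0
    while True:
        cycles += 1
        clean, resolved = run_cycle(clean, threshold, resolve_first=resolve_first)
        if resolved:
            return cycles


print(cycles_until_resolved(2, resolve_first=True))   # prints 3
print(cycles_until_resolved(2, resolve_first=False))  # prints 2
```

In this model the increment-first order resolves a row one cycle earlier than the resolve-first order, which is the regression symptom described in this section.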

Mental model when a new alert misbehaves

When something looks wrong, walk this list in order:

Is the check_key stable across cycles? Inspect two consecutive results from the same check on the same cluster state. If the key differs, you have a flapping bug; see section 1.
Does the row's category match a check that runs? Compare system_messages.category to enabled_categories and the runner's per-cycle category list. Mismatch → structurally unreachable; see section 2.
Does preview() return the same thing twice in a row? If not, PreviewMismatch is your future; see section 5.
Is the recipe writing system_messages itself? If yes, it's wrong; route through RecipeOutcome.resolution_kind. See Doctor internals.
Does the apply path use ctx.apply_conn for adjacent writes? If not and the row has FK relations being touched, you'll deadlock; see section 4.

If you walk all five and nothing matches, the issue is novel and worth filing — every entry on this page started life as a confused operator saying "the bell is doing the thing again."