Edge cases & failure modes
Every flapping, stuck, orphan, and structurally-unreachable bug the health system has shipped a fix for — with root causes and how the design now handles them.
This page is a curated postmortem catalogue. Every entry below describes a real failure mode we hit in production or staging, the root cause, the fix shipped, and what to look for if you suspect a similar problem.
It is organised by failure shape because that is how you will encounter them in the wild — "the bell never stops re-pinging" is a flapping bug regardless of which check is involved.
1. Flapping (clean → dirty → clean → dirty)
Symptom. A finding repeatedly auto-resolves and then re-emits one or two cycles later. Email digest churn. Operators learn to ignore the bell.
Common root causes.
- check_key is computed from a non-stable input, e.g. an ephemeral pod UID, a port number that varies across restarts, a slot index in state that gets reordered. Each cycle generates a fresh fingerprint, the runner believes the prior one resolved, and a "new" finding is emitted. Look for check_key strings with timestamps, UUIDs, or positional indices.
- Threshold too aggressive. clean_cycles_to_resolve = 1 (or the per-cluster setting being lowered) defeats the dampener. Default is 2 for a reason; see Lifecycle.
- The check itself is non-deterministic. A check that calls a Kubernetes API with no retry on transient timeouts will report "missing" on a slow cycle and "present" on the next. Wrap external calls in a single-retry block at the check layer, not the runner layer.
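The single-retry advice in the last bullet could look roughly like the sketch below. The helper name call_with_single_retry, its parameters, and the choice of which exceptions count as transient are all assumptions made for illustration; they are not part of the health system's code.

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def call_with_single_retry(
    call: Callable[[], Awaitable[T]],
    *,
    retry_on: Tuple[Type[BaseException], ...] = (asyncio.TimeoutError, ConnectionError),
    delay_seconds: float = 0.5,
) -> T:
    """Run `call`; on a transient error, wait briefly and try exactly once more.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    try:
        return await call()
    except retry_on:
        await asyncio.sleep(delay_seconds)
        # A second failure propagates to the caller unchanged.
        return await call()
```

Keeping the retry inside the check means one slow API response does not turn into a "missing" observation for that cycle, while a persistent failure still surfaces as an exception.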
How we caught one. #176 and #208 revealed that the change-type detector in MR descriptions was matching
keywords across multiple H2 sections — release MRs with a
"## Features" heading were being flagged as "Feature" change-type. The
fix scoped the regex to the ## Change Type section only. The same
pattern shape — "we matched something that mentioned the keyword,
not the structural element" — appears in check_key derivation when a
fingerprint string contains substrings that vary across cycles.
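As an illustration of a fingerprint that does not vary across cycles, here is a minimal sketch. The function make_check_key, its arguments, and the resulting key layout are hypothetical; the real check_key format may differ. The point is only that the key is built from named, stable identifiers and never from UIDs, timestamps, or list positions.

```python
import hashlib


def make_check_key(check_name: str, cluster_id: str, **stable_fields: str) -> str:
    """Build a fingerprint from identifiers that survive restarts and reordering.

    Hypothetical helper: fields are sorted by name so that the order in which
    the caller supplies them cannot change the result.
    """
    parts = [f"{name}={stable_fields[name]}" for name in sorted(stable_fields)]
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:12]
    return f"{check_name}:{cluster_id}:{digest}"
```

Calling make_check_key("unmanaged_ingress", "c1", namespace="web", name="api") twice, or with the keyword arguments swapped, yields the same string, which is the property the runner needs in order to match a finding against its existing row.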
How to confirm in prod.
-- Find rows that have flapped within the last 24h
SELECT check_key, COUNT(*) FILTER (WHERE resolution_kind = 'auto') AS auto_resolved
FROM system_messages_history -- if you don't have history, see doctor_executions
WHERE updated_at > NOW() - INTERVAL '24 hours'
GROUP BY check_key
HAVING COUNT(*) FILTER (WHERE resolution_kind = 'auto') > 2;
A check_key that auto-resolved more than twice in 24h is a flap candidate.
2. Stuck active (issue gone, row never resolves)
Symptom. A finding's underlying condition has been fixed (verified manually with kubectl/SSH) but the row stays active for days.
Common root causes.
- Category mismatch. The row's category does not match any check the runner enumerates. The auto-resolve sweep operates per-category, so a row whose category never runs is structurally unreachable. (#915: admission-bypass audit messages with category='admission_bypass' and no health check emitting that category.)
- check_key drift on the fix side. The check still emits the finding, but with a slightly different check_key than what is in the table. The old row stays active because nothing matches it; the new row appears alongside it.
- A check was disabled at runtime. If enabled_categories excludes the row's category, the row is structurally unreachable until the category is re-enabled.
The fix pattern. Either make the row reachable by the sweep
(emit the same check_key consistently, or run the corresponding
check), or write the row already-resolved at insert time
(resolved_at = NOW(), resolution_kind = 'auto') so it lives in the
timeline but not the active list.
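The second option, writing the row already resolved, can be sketched as follows. This uses an in-memory SQLite table as a stand-in for the real database, so CURRENT_TIMESTAMP replaces NOW(), and the table has only the columns mentioned on this page. The function names are hypothetical.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Reduced stand-in for system_messages; the real table has more columns.
    conn.execute(
        """
        CREATE TABLE system_messages (
            category        TEXT NOT NULL,
            check_key       TEXT NOT NULL,
            created_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            resolved_at     TEXT,
            resolution_kind TEXT
        )
        """
    )


def insert_pre_resolved(conn: sqlite3.Connection, category: str, check_key: str) -> None:
    """Record an audit-style row that no check will ever sweep."""
    conn.execute(
        "INSERT INTO system_messages (category, check_key, resolved_at, resolution_kind) "
        "VALUES (?, ?, CURRENT_TIMESTAMP, 'auto')",
        (category, check_key),
    )


def active_rows(conn: sqlite3.Connection) -> list:
    """Rows that still appear in the active list."""
    return conn.execute(
        "SELECT category, check_key FROM system_messages WHERE resolved_at IS NULL"
    ).fetchall()
```

A row inserted this way is still visible to any query over the full table, but it never enters the set of rows waiting on a sweep that cannot reach them.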
How to confirm in prod.
-- Active rows older than a week — most are real, some are stuck
SELECT category, check_key, created_at, clean_cycles
FROM system_messages
WHERE resolved_at IS NULL
AND created_at < NOW() - INTERVAL '7 days'
ORDER BY created_at;
-- For each, verify the underlying condition still exists.
-- If it doesn't, you have a structurally-unreachable row.
If clean_cycles is stuck at 0 for a row whose category clearly ran
multiple times since created_at, the upsert is firing every cycle —
i.e. the check still emits — and the issue is real, not stuck.
If clean_cycles is stuck at 0 and the category does not appear in
the recent health_check_runs log, you have a category-scope bug. Open
an issue.
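The two rules above can be written down as a small triage function. This is a sketch only: ActiveRow, triage, and the returned labels are hypothetical, and the set of recently run categories is assumed to have been read from the health_check_runs log beforehand.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class ActiveRow:
    category: str
    check_key: str
    clean_cycles: int


def triage(row: ActiveRow, recently_run_categories: Set[str]) -> str:
    """Classify a long-lived active row using the two rules in the text."""
    if row.clean_cycles == 0 and row.category in recently_run_categories:
        # The upsert fires every cycle, so the check still emits.
        return "real"
    if row.clean_cycles == 0:
        # Nothing ever sweeps this category.
        return "category-scope bug"
    # Not covered by the rules above: clean cycles are accumulating.
    return "other"
```

For example, triage(ActiveRow("admission_bypass", "k", 0), {"ingress"}) returns "category-scope bug" because the category never ran.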
3. Orphans (live resource without a tracking row, or vice versa)
There are two flavours, and they are the inverse of each other:
3a. Live resource without a row.
A k8s resource exists that PodWarden does not know about. This is the
"unmanaged X" family of findings — Unmanaged IngressRoute,
Unmanaged Ingress, Orphaned Longhorn replica, Unknown K8s node.
These are correct alerts; the fix is either to import the resource
into PodWarden or delete it. The delete_unmanaged_workload and
orphaned_replica recipes provide one-click delete.
3b. PodWarden row without a live resource.
PodWarden has a tracking row but the cluster doesn't have the
corresponding resource — "Ghost ingress rule", "Stale assignment".
These are usually the residue of a manual kubectl delete or a
cleanup that took out the cluster but not the DB. The fix is to clear
the PodWarden record; recipes for the ghost cases auto-resolve when
the on-host scan no longer sees the resource.
The trickier one: orphans of orphans. Some resources have transitive owners — a Longhorn replica owns no volume, a PVC has no consumer pod. Detecting these requires walking the graph; getting the walk wrong creates false orphans (a short-lived race window where PodWarden hasn't seen the new owner yet). Mitigation:
- Two-cycle confirmation. A finding must persist across two scan cycles before it is emitted. The pvc_attached_no_consumer check in infra_drift.py uses this: it tracks "candidate" findings in a per-cluster cache and only writes a system message after the second consecutive observation.
- Reverse-lookup before delete. Recipes that delete an "orphan" re-verify ownership at apply time inside preview(), so the re-preview-under-lock check catches the case where the resource acquired an owner between observation and apply.
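A minimal sketch of two-cycle confirmation follows. It is not the code in infra_drift.py; the class CandidateCache and its confirm method are hypothetical names, and the cache here is a plain in-process dictionary keyed by cluster.

```python
from typing import Dict, Set


class CandidateCache:
    """Per-cluster memory of what the previous scan cycle observed."""

    def __init__(self) -> None:
        self._seen_last_cycle: Dict[str, Set[str]] = {}

    def confirm(self, cluster_id: str, observed: Set[str]) -> Set[str]:
        """Return only the keys seen in both this cycle and the previous one."""
        previous = self._seen_last_cycle.get(cluster_id, set())
        confirmed = observed & previous
        # Remember this cycle's observations for the next call.
        self._seen_last_cycle[cluster_id] = set(observed)
        return confirmed
```

A key observed once returns nothing; observed again on the next cycle, it is confirmed. If it disappears for a cycle in between, the count starts over, which is what absorbs the short race window where the new owner has not been seen yet.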
How we caught one.
The Cloudflare-proxied-domain false orphan
(#916):
DNS resolved to a Cloudflare edge IP, drift detector concluded "this
domain isn't pointed at the gateway, that's drift", emitted a
proxied:domain:<id> row. But Cloudflare proxying is a deliberate
operator config, not drift. The fix replaced the system_messages
emission with a logger.debug() and let clean_cycles >= 2
auto-resolve the existing rows.
4. Doctor: stuck in "Applying…"
Symptom. The operator clicks Apply and the modal shows "Applying…" indefinitely. Refreshing the page shows the row still active.
Common root causes.
- Recipe deadlock against its own outer lock. The recipe issued UPDATE system_messages (or any write to a row covered by the outer FOR UPDATE) on a connection that was not ctx.apply_conn. The new connection waits for the outer lock, the outer transaction waits for the inner write to commit, deadlock. PostgreSQL's deadlock detector eventually cancels one side, but the canceled side may be the apply path if the inner write was on a different transaction.
- A subprocess (kubectl / ssh) is hanging. The recipe shelled out to a command that doesn't exit. There is no per-recipe timeout enforced by the framework; recipes are expected to use asyncio.wait_for() or set explicit subprocess timeouts.
- The asyncpg connection pool is saturated. If every connection is already in an apply(), new applies queue. With max_size=20, twenty simultaneous applies can starve the pool. Most installations are far below that and the issue is academic, but on pw_prod under load we have seen it.
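For the hanging-subprocess case, a recipe-side timeout built on asyncio.wait_for() might look like the sketch below. The helper name run_with_timeout is an assumption, not a framework function; a real recipe would pass its own kubectl or ssh argument list.

```python
import asyncio
from typing import List, Optional, Tuple


async def run_with_timeout(
    argv: List[str], timeout_seconds: float
) -> Tuple[Optional[int], bytes]:
    """Run a command and kill it if it does not exit within the timeout.

    Hypothetical helper for recipes that shell out.
    """
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # Do not leave the child running after giving up on it.
        proc.kill()
        await proc.wait()
        raise
    return proc.returncode, stdout
```

Re-raising the timeout lets the apply fail visibly instead of leaving the modal on "Applying…"; how the recipe maps that exception to an outcome is left to the recipe.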
Mitigations shipped.
The writer-owns-resolve contract (Doctor internals) makes
deadlock-against-self impossible-by-protocol if the recipe respects
ctx.apply_conn. The _single_action.py mixin packages the common
path so most recipes get this for free. Long-running on-host operations
(k3s restart, ansible run) are expected to take 30+ seconds — the UI
shows a progress indicator, not a hang, but operators should be told
in the side-effects panel that a long wait is expected.
Recovery. If you suspect a stuck apply on prod, query running transactions:
SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE state = 'active' AND xact_start < NOW() - INTERVAL '1 minute'
ORDER BY xact_start;
A long-running transaction with query LIKE '%FOR UPDATE NOWAIT%' is
either a stuck apply or a test harness that forgot to commit. Cancel
with SELECT pg_cancel_backend(<pid>) (graceful) or
pg_terminate_backend(<pid>) (forced).
5. Doctor: PreviewMismatch on every Apply
Symptom. Every operator click on Apply returns 409 PreviewMismatch. The diff in the modal looks correct but the executor refuses to proceed.
Common root causes.
- preview() is non-deterministic. It returns different DiffLines on identical cluster state because of dict ordering, time-sensitive formatting, or a non-stable iteration. Compare two calls to preview() against the same fixture and diff the outputs.
- A background controller is racing the recipe. The Longhorn manager, Cloudflare DNS, or k3s itself is updating the same resource between the modal-open preview and the apply-time preview. This is correctly caught; the operator should refresh the modal.
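To illustrate the first cause, the sketch below renders a diff in a way that cannot depend on dict insertion order. Plain strings stand in for DiffLine here, and render_diff_lines is a hypothetical function, not part of the recipe framework.

```python
from typing import Dict, List


def render_diff_lines(current: Dict[str, str], desired: Dict[str, str]) -> List[str]:
    """Produce the same lines for the same state, whatever the dict ordering."""
    lines: List[str] = []
    # Sorting the union of keys removes any dependence on insertion order.
    for key in sorted(current.keys() | desired.keys()):
        old, new = current.get(key), desired.get(key)
        if old == new:
            continue
        if old is None:
            lines.append(f"+ {key}: {new}")
        elif new is None:
            lines.append(f"- {key}: {old}")
        else:
            lines.append(f"~ {key}: {old} -> {new}")
    return lines
```

The same discipline applies to anything else that feeds the preview: avoid embedding the current time in a line, and sort any collection before iterating over it.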
Mitigations shipped.
RecipePreview includes safe_to_apply: bool and blocking_reason: str | None. If preview() itself can detect "the cluster is moving",
return safe_to_apply=False with a reason; the UI hides the Apply
button entirely rather than letting the operator click into a
guaranteed-409.
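A rough sketch of how a recipe might use those two fields is shown below. Only safe_to_apply and blocking_reason come from the text; the class name RecipePreviewSketch, the diff_lines field, the build_preview helper, and the resource_is_updating flag are assumptions for illustration, and the real RecipePreview type may have a different shape.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class RecipePreviewSketch:
    # Stand-in for RecipePreview; only the last two fields are from the text.
    diff_lines: List[str] = field(default_factory=list)
    safe_to_apply: bool = True
    blocking_reason: Optional[str] = None


def build_preview(diff_lines: List[str], resource_is_updating: bool) -> RecipePreviewSketch:
    """Refuse up front when the target is known to be changing."""
    if resource_is_updating:
        return RecipePreviewSketch(
            diff_lines=diff_lines,
            safe_to_apply=False,
            blocking_reason="resource is still being updated by a controller",
        )
    return RecipePreviewSketch(diff_lines=diff_lines)
```

How a recipe decides that the resource is updating is specific to the resource and is not covered here.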
6. Email digest: silent / over-loud
Symptom. The digest mails are either empty when known issues exist ("silent") or send the same set of issues every cycle ("over-loud").
Common root causes.
- Silent: severity threshold too high. notification_min_severity is set to error and the issues are warning-grade. Lower to warning for noisier inboxes, leave at error for quiet ones.
- Silent: SMTP failure. Check the worker container logs for connection errors. The runner logs every send attempt and outcome.
- Over-loud: re-appearance counts as new. A flapping finding (see #1) re-emails on every flip. Fix the flapping at the source; raising the digest threshold is a workaround, not a fix.
- Over-loud: the dampener was disabled. clean_cycles_to_resolve = 0 (immediate resolve) makes every absent-then-present cycle produce a fresh email. This setting is intended for ad-hoc admin triggers via the UI, not as a default.
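The "silent" threshold case is easy to reproduce on paper. The sketch below assumes a three-level ordering (info, warning, error); only warning and error are named on this page, so the info level, the SEVERITY_RANK table, and the digest_findings function are all hypothetical.

```python
from typing import Dict, List

# Assumed ordering; only "warning" and "error" appear in the text.
SEVERITY_RANK: Dict[str, int] = {"info": 0, "warning": 1, "error": 2}


def digest_findings(findings: List[Dict[str, str]], min_severity: str) -> List[Dict[str, str]]:
    """Keep only findings at or above the configured minimum severity."""
    floor = SEVERITY_RANK[min_severity]
    return [f for f in findings if SEVERITY_RANK[f["severity"]] >= floor]
```

With the threshold at error, a list containing only warning-grade findings filters down to nothing, which is the empty digest described in the first bullet.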
7. Auto-resolve race (resolved-and-incremented in same cycle)
Symptom. A row whose clean_cycles was at threshold - 1 is
resolved in the same cycle it became absent, instead of after one more
clean cycle.
Status. This was the original sin and is fixed in
runner.py:401–464. The two SQL statements run in the order
resolve first, then increment. If you ever observe this symptom,
the fix has regressed and you have a real bug — open an issue and
attach the affected check_key.
The reason the order matters: if we incremented first, a row at
clean_cycles = threshold - 1 would be pushed to threshold and then
caught by the resolve branch in the same cycle. By resolving first
against the prior state, then incrementing, we guarantee the
dampener requires threshold distinct absent cycles minimum.
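The effect of the ordering can be checked with a small in-memory simulation. This is a sketch, not the SQL in runner.py: the Row class and both functions are hypothetical, and the resolve condition is assumed to be clean_cycles >= threshold.

```python
from dataclasses import dataclass


@dataclass
class Row:
    clean_cycles: int = 0
    resolved: bool = False


def absent_cycle_resolve_first(row: Row, threshold: int) -> None:
    """Shipped order: resolve against the prior state, then increment."""
    if not row.resolved and row.clean_cycles >= threshold:
        row.resolved = True
    if not row.resolved:
        row.clean_cycles += 1


def absent_cycle_increment_first(row: Row, threshold: int) -> None:
    """Buggy order: the increment is visible to the resolve step."""
    if not row.resolved:
        row.clean_cycles += 1
    if not row.resolved and row.clean_cycles >= threshold:
        row.resolved = True
```

Starting from a row at clean_cycles = threshold - 1, the first function leaves the row active after one absent cycle and resolves it on the next, while the second resolves it immediately, which is the symptom described at the top of this section.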
Mental model when a new alert misbehaves
When something looks wrong, walk this list in order:
- Is the check_key stable across cycles? Inspect two consecutive results from the same check on the same cluster state. If the key differs, you have a flapping bug; see #1.
- Does the row's category match a check that runs? Compare system_messages.category to enabled_categories and the runner's per-cycle category list. Mismatch → structurally unreachable; see #2.
- Does preview() return the same thing twice in a row? If not, PreviewMismatch is your future; see #5.
- Is the recipe writing system_messages itself? If yes, it's wrong; route through RecipeOutcome.resolution_kind. See Doctor internals.
- Does the apply path use ctx.apply_conn for adjacent writes? If not and the row has FK relations being touched, you'll deadlock; see #4.
If you walk all five and nothing matches, the issue is novel and worth filing — every entry on this page started life as a confused operator saying "the bell is doing the thing again."