Doctor framework internals
Recipe protocol, the FOR UPDATE NOWAIT lock chain, multi-action recipes, RBAC, and the writer-owns-resolve contract.
The Doctor framework is the answer to "how do we let operators apply remediations from a button without two of them stepping on each other or quietly corrupting cluster state?" This page covers what the framework actually guarantees and how a recipe author plugs into it.
For the operator-facing UI walkthrough see the Doctor user-manual page.
The propose-only contract
Every recipe implements three async methods:

    class Recipe(Protocol):
        check_type: ClassVar[str]
        name: ClassVar[str]
        min_role: ClassVar[RecipeRole] = "admin"

        async def list_actions(self, message, ctx) -> list[RecipeAction]: ...
        async def preview(self, message, ctx, action_key) -> RecipePreview: ...
        async def apply(self, message, ctx, preview) -> RecipeOutcome: ...

backend/app/services/doctor/recipe.py defines all the supporting types.
The contract is:
- list_actions() and preview() are read-only. They observe cluster state via ctx.pool and report what apply() would do. They never mutate.
- apply() is the only mutating method. It is called by the executor with the same preview object that was previously returned, so the recipe can emit a diff once and trust the executor to feed it back.
The "propose-only" property is enforced socially in the protocol and mechanically in the executor — see the lock chain below.
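To make the contract concrete, here is a minimal sketch of a single-action recipe. The dataclasses are stand-ins for the real types in recipe.py, which may carry more fields, and the recipe itself (RestartAgentRecipe, its check_type, and the mutations counter) is hypothetical and exists only to make the read-only/mutating split observable.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Optional


# Stand-in types: the real ones live in recipe.py and may have more fields.
@dataclass(frozen=True)
class RecipeAction:
    key: str
    label: str


@dataclass(frozen=True)
class RecipePreview:
    action_key: str
    diff: dict[str, Any]


@dataclass(frozen=True)
class RecipeOutcome:
    succeeded: bool
    resolution_kind: Optional[str] = None


class RestartAgentRecipe:
    """Hypothetical single-action recipe, for illustration only."""

    check_type = "agent_not_running"  # hypothetical check type
    name = "Restart agent"
    min_role = "admin"

    def __init__(self) -> None:
        self.mutations = 0  # counts side effects so the contract is visible

    async def list_actions(self, message, ctx) -> list[RecipeAction]:
        return [RecipeAction(key="apply", label="Restart agent")]

    async def preview(self, message, ctx, action_key) -> RecipePreview:
        # Read-only: describe the change, do not perform it.
        diff = {"agent": {"from": "stopped", "to": "running"}}
        return RecipePreview(action_key=action_key, diff=diff)

    async def apply(self, message, ctx, preview) -> RecipeOutcome:
        # The only method that is allowed to mutate anything.
        self.mutations += 1
        return RecipeOutcome(succeeded=True, resolution_kind="fix")


async def demo() -> tuple[int, int]:
    recipe = RestartAgentRecipe()
    preview = await recipe.preview({"id": 1}, None, "apply")
    before = recipe.mutations  # still 0: preview did not mutate
    await recipe.apply({"id": 1}, None, preview)
    return before, recipe.mutations


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The counter only changes inside apply(), which is the property the executor relies on when it calls preview() more than once.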
The full apply lock chain
Two operators clicking Apply on the same row at roughly the same time is
the central concurrency case the framework handles. run_apply()
(executor.py:270) runs the same ordered chain of checks for every apply.
The chain has six places it can refuse to proceed, each returning a distinct error type so the UI and the audit log can tell them apart:
| Error | HTTP | Meaning |
|---|---|---|
| NotFound | 404 | system_messages.id does not exist |
| NoRecipe | 400 | No registered recipe for this check_type |
| UnknownAction | 400 | Caller passed an action_key the recipe didn't declare |
| InsufficientRole | 403 | Caller's role is below recipe (or per-action) min_role |
| ConcurrentApply | 409 | Another operator holds the row lock right now |
| PreviewMismatch | 409 | Cluster state changed since the operator's preview |
api/doctor.py does the mapping. Every other failure (a kubectl call
returning non-zero, an SSH timeout, a RecipeOutcome.succeeded == False)
flows through as a 200 with succeeded=false in the body: the apply
itself executed, and whether it worked is a recipe-level concern.
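The split between refusals and recipe-level failures can be sketched as follows. The exception names and status codes come from the table above; the DoctorError base class and the respond() helper are assumptions made for this example and are not taken from api/doctor.py.

```python
from typing import Callable


class DoctorError(Exception):
    """Hypothetical common base class for executor refusals."""


class NotFound(DoctorError): ...
class NoRecipe(DoctorError): ...
class UnknownAction(DoctorError): ...
class InsufficientRole(DoctorError): ...
class ConcurrentApply(DoctorError): ...
class PreviewMismatch(DoctorError): ...


# Status codes copied from the error table.
HTTP_STATUS = {
    NotFound: 404,
    NoRecipe: 400,
    UnknownAction: 400,
    InsufficientRole: 403,
    ConcurrentApply: 409,
    PreviewMismatch: 409,
}


def respond(run_apply: Callable[[], bool]) -> tuple[int, dict]:
    """Translate one apply attempt into (status, body)."""
    try:
        succeeded = run_apply()
    except DoctorError as exc:
        # The executor refused before the recipe's apply() ran.
        return HTTP_STATUS[type(exc)], {"error": type(exc).__name__}
    # The apply ran; success or failure is reported in the body, not the status.
    return 200, {"succeeded": succeeded}
```

A caller that raises ConcurrentApply gets a 409, while a recipe whose apply ran and failed still produces a 200 with succeeded set to False.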
Why FOR UPDATE NOWAIT and not FOR UPDATE
NOWAIT turns a blocking wait into an immediate lock_not_available so
the executor can fail fast and return a friendly 409. With plain
FOR UPDATE two operators clicking simultaneously would have the second
hang for the duration of an SSH-driven apply (sometimes 30+ seconds for
k3s restart recipes), and from the UI it would look like the request
hung — far worse UX than "another operator is applying this, refresh and
try again."
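A minimal sketch of the fail-fast lock, assuming an asyncpg-style connection whose exceptions expose the SQLSTATE as a sqlstate attribute. The helper name lock_message_row and the exact SELECT are assumptions for this example; 55P03 is PostgreSQL's code for lock_not_available.

```python
LOCK_SQL = "SELECT id FROM system_messages WHERE id = $1 FOR UPDATE NOWAIT"
LOCK_NOT_AVAILABLE = "55P03"  # PostgreSQL SQLSTATE for lock_not_available


class ConcurrentApply(Exception):
    """Another transaction already holds the row lock."""


async def lock_message_row(conn, message_id: int):
    """Lock the row or fail immediately.

    Must be called inside an open transaction, because the lock is only
    held until that transaction commits or rolls back.
    """
    try:
        return await conn.fetchrow(LOCK_SQL, message_id)
    except Exception as exc:
        # Match on SQLSTATE so this sketch does not need to import the driver.
        if getattr(exc, "sqlstate", None) == LOCK_NOT_AVAILABLE:
            raise ConcurrentApply(message_id) from exc
        raise
```

Without NOWAIT the fetchrow call would simply block until the first operator's transaction finished, which is the hang described above.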
Why re-preview under the lock
Between the operator opening the modal and clicking Apply, anything
could have changed in the cluster — someone could have manually fixed
the underlying drift via kubectl, a CP node could have been promoted,
the host could have come back online. If the diff the operator confirmed
no longer matches what preview() would compute right now, applying it
would produce a different change than the operator authorized. That is a
silent integrity hole; we close it with a re-preview compare and a
PreviewMismatch error.
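The re-preview compare might look like the sketch below. It assumes the row lock is already held and that a preview exposes comparable action_key and diff attributes; the real executor may compare more than the diff, and the helper name repreview_under_lock is hypothetical.

```python
class PreviewMismatch(Exception):
    """The diff the operator confirmed no longer matches current state."""


async def repreview_under_lock(recipe, message, ctx, confirmed):
    """Recompute the preview and refuse to proceed if it has drifted.

    `confirmed` is the preview the operator saw and approved.
    """
    fresh = await recipe.preview(message, ctx, confirmed.action_key)
    if fresh.diff != confirmed.diff:
        raise PreviewMismatch(
            f"diff changed for action {confirmed.action_key!r}"
        )
    # Safe to hand to recipe.apply(): it matches what was authorized.
    return fresh
```

Doing the compare after the lock is taken matters: a compare before the lock could still race with another apply that changes the cluster in between.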
The writer-owns-resolve contract
This is the most counter-intuitive rule in the framework, and the one recipe authors most commonly violate on the first attempt:
Recipes MUST NOT issue UPDATE system_messages directly. If a recipe needs to mark the message resolved, it returns resolution_kind="fix" or "acceptance" from apply(). The executor writes the row inline within the transaction in which it already holds the FOR UPDATE NOWAIT lock.
executor.py:479–498 is the only place in the codebase that writes
system_messages.resolved_at from a Doctor flow. The reason is purely
mechanical: a recipe that opens a second connection to update the row
would deadlock against the outer FOR UPDATE lock, because that lock is
held until commit. We caught this exact pattern in #789 Phase A, when
QA observed "Abandon" hanging until pool exhaustion.
The same rule covers any DB cleanup the recipe does on adjacent tables:
use ctx.apply_conn for any write that interacts with the locked
row's foreign keys, including cluster_baseline_drift_resolved,
hosts.status flips during CP demotion, etc. The executor sets
apply_conn only during apply() and leaves it None during
preview() and list_actions().
The corollary is that recipes can use ctx.pool freely for read
queries during preview — there is no lock yet, and reads against the
target row are not blocked.
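A sketch of an apply() that follows the rule, loosely modelled on a DB-only action. The SQL, the host_id message field, and the choice of "fix" over "acceptance" are assumptions for illustration; only ctx.apply_conn, hosts.status, and resolution_kind come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecipeOutcome:  # stand-in; the real type lives in recipe.py
    succeeded: bool
    resolution_kind: Optional[str] = None


async def apply_abandon(message, ctx, preview) -> RecipeOutcome:
    """DB-only apply that never writes system_messages itself."""
    # Adjacent-table write: use the connection that already holds the lock,
    # not a second connection from ctx.pool.
    await ctx.apply_conn.execute(
        "UPDATE hosts SET status = 'free' WHERE id = $1",  # assumed columns
        message["host_id"],  # assumed message field
    )
    # No UPDATE system_messages here. Returning resolution_kind asks the
    # executor to resolve the row inside its own transaction.
    return RecipeOutcome(succeeded=True, resolution_kind="fix")
```

The recipe states the outcome it wants and leaves the write to the one component that owns the lock.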
Multi-action recipes (#789)
A recipe can declare more than one action. cp_promotion_failed ships
three: Retry (re-run the bootstrap playbook), Revert (uninstall
CP role, reinstall as worker), Abandon (DB-only: return host to
free pool). Each is a RecipeAction:

    RecipeAction(
        key="retry",
        label="Retry promotion",
        description="Re-run the CP bootstrap playbook on this host",
        severity="primary",
        min_role=None,  # inherit recipe-level
    )
    RecipeAction(
        key="revert",
        label="Revert to worker",
        description="Uninstall k3s server, reinstall as agent, drain etcd",
        severity="destructive",
        min_role="admin",
    )
    RecipeAction(
        key="abandon",
        label="Abandon (DB only)",
        description="Mark host free; does not touch the host",
        severity="secondary",
        min_role=None,
    )

The executor plumbs the operator's chosen action_key through
preview(action_key) and apply(preview) (the preview's action_key
field carries it). If the operator changes their mind between preview
and apply, the executor's preview-mismatch check catches it because the
diff itself will differ.
severity controls UI styling — "destructive" flips the Apply button
to a red variant and forces the operator to read the side-effects panel
before the button enables. It is purely cosmetic; the executor does not
gate on it.
RBAC: min_role and per-action overrides
PodWarden has three roles in the local user model: viewer, operator,
admin. The hierarchy lives in app.dependencies.auth.ROLE_HIERARCHY.
Every recipe declares a min_role ClassVar. The default is "admin" —
fail-closed. A recipe author who forgets the field gets the strictest
gate, not the loosest. To delegate to operators, set min_role = "operator"
explicitly. (#858)
Per-action min_role overrides (#907) layer on top:

    class AppPodsNotReadyRecipe:
        check_type = "app_pods_not_ready"
        min_role: ClassVar[RecipeRole] = "operator"

        async def list_actions(self, message, ctx) -> list[RecipeAction]:
            return [
                RecipeAction(key="diagnose", ..., min_role=None),      # operator (inherited)
                RecipeAction(key="restart", ..., min_role=None),       # operator (inherited)
                RecipeAction(key="force_delete", ..., min_role=None),  # operator (inherited)
                RecipeAction(key="reset_node", ..., min_role="admin"), # admin only
            ]

The executor (_action_min_role_for(), executor.py:120–144) consults
the per-action override first, falls back to recipe.min_role, and
finally to "admin" if neither is set. This lets one recipe bundle
diagnose-grade actions next to genuinely destructive ones without forcing
the whole recipe to admin-only.
preview() and list_actions() are always allowed regardless of
role — operators and viewers can inspect what would happen even when they
can't pull the trigger.
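The fallback order can be sketched in a few lines. The numeric ranks in ROLE_HIERARCHY are an assumed shape for the real mapping in app.dependencies.auth, and action_min_role() and can_apply() are illustrative names, not the executor's own.

```python
from types import SimpleNamespace

# Assumed shape; the real mapping is app.dependencies.auth.ROLE_HIERARCHY.
ROLE_HIERARCHY = {"viewer": 0, "operator": 1, "admin": 2}


def action_min_role(recipe, action) -> str:
    """Per-action override, then recipe-level min_role, then fail-closed admin."""
    return action.min_role or getattr(recipe, "min_role", None) or "admin"


def can_apply(caller_role: str, recipe, action) -> bool:
    required = action_min_role(recipe, action)
    return ROLE_HIERARCHY[caller_role] >= ROLE_HIERARCHY[required]


recipe = SimpleNamespace(min_role="operator")
restart = SimpleNamespace(key="restart", min_role=None)
reset_node = SimpleNamespace(key="reset_node", min_role="admin")

print(can_apply("operator", recipe, restart))     # True
print(can_apply("operator", recipe, reset_node))  # False
```

A recipe object with no min_role at all resolves to "admin", which is the fail-closed behaviour described above.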
The audit trail: doctor_executions
Every successful and failed apply writes one row to doctor_executions
inside the same transaction:

    CREATE TABLE doctor_executions (
        id               BIGSERIAL PRIMARY KEY,
        message_id       BIGINT REFERENCES system_messages(id) ON DELETE SET NULL,
        recipe_name      TEXT NOT NULL,
        action_key       TEXT NOT NULL DEFAULT 'apply',
        operator_user_id TEXT NOT NULL,
        succeeded        BOOLEAN NOT NULL,
        diff             JSONB,  -- the preview diff that was actually applied
        side_effects     JSONB,
        error            TEXT,
        started_at       TIMESTAMPTZ NOT NULL,
        finished_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
    );

The audit row outlives the message: ON DELETE SET NULL on
message_id means the 7-day system_messages cleanup leaves the
audit row intact for forensic purposes.
If you are debugging "did anyone fix this last week?", join
doctor_executions to system_messages.check_key via message_id and
filter started_at. The diff column reproduces exactly what was
applied, which means a postmortem can replay an incident even after
the originating row is gone.
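The "did anyone fix this last week?" lookup could be wrapped as below. This is a sketch assuming an asyncpg-style connection with a fetch() method; the selected columns come from the schema above, while the helper name recent_applies and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RECENT_APPLIES_SQL = """
SELECT e.started_at, e.recipe_name, e.action_key,
       e.operator_user_id, e.succeeded, e.error
FROM doctor_executions AS e
JOIN system_messages AS m ON m.id = e.message_id
WHERE m.check_key = $1
  AND e.started_at >= $2
ORDER BY e.started_at DESC
"""


async def recent_applies(conn, check_key: str, days: int = 7):
    """Return audit rows for one check_key from the last `days` days."""
    since = datetime.now(timezone.utc) - timedelta(days=days)
    return await conn.fetch(RECENT_APPLIES_SQL, check_key, since)
```

Because of ON DELETE SET NULL, audit rows whose message has already been cleaned up have a NULL message_id and drop out of this inner join; for those, query doctor_executions directly by recipe_name and started_at.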
Adding a new recipe
The actual cookbook for a recipe author is short:
- Create backend/app/services/doctor/recipes/<your_recipe>.py.
- Implement the three protocol methods. Use _single_action.py as a mixin if you only have one action.
- Set check_type to the value the runner emits.
- Set min_role. Default to admin; opt down only when the apply path is genuinely safe to delegate.
- Register in backend/app/services/doctor/registry.py.
- Add an integration test that exercises preview → apply against a fixture cluster (or asyncpg-fakes for DB-only recipes). Don't mock system_messages; see the user feedback memory on integration tests.
- If apply() does any DB write that touches a row reachable by the outer FOR UPDATE, route it through ctx.apply_conn. Anything else deadlocks (#789).
The recipe registry assumes one recipe per check_type. If you need two
behaviours for the same check type, that is a multi-action recipe, not
two recipes. Two registrations for the same check_type is a startup
error.
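The one-recipe-per-check_type rule can be illustrated with a small registry sketch. The function names, the module-level dict, and the choice of RuntimeError are assumptions for this example; the real registry.py may be structured differently.

```python
_REGISTRY: dict[str, type] = {}


def register(recipe_cls: type) -> type:
    """Register a recipe class, refusing duplicate check_type values."""
    check_type = recipe_cls.check_type
    existing = _REGISTRY.get(check_type)
    if existing is not None:
        # Fail at import/startup time rather than picking one silently.
        raise RuntimeError(
            f"check_type {check_type!r} already registered by "
            f"{existing.__name__}; use a multi-action recipe instead"
        )
    _REGISTRY[check_type] = recipe_cls
    return recipe_cls


def recipe_for(check_type: str):
    """Return the registered recipe class, or None if there is none."""
    return _REGISTRY.get(check_type)
```

Returning the class from register() lets it double as a decorator, and a missing entry from recipe_for() corresponds to the NoRecipe case in the error table.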