Doctor framework internals

Recipe protocol, the FOR UPDATE NOWAIT lock chain, multi-action recipes, RBAC, and the writer-owns-resolve contract.

The Doctor framework is the answer to "how do we let operators apply remediations from a button without two of them stepping on each other or quietly corrupting cluster state?" This page covers what the framework actually guarantees and how a recipe author plugs into it.

For the operator-facing UI walkthrough see the Doctor user-manual page.

The propose-only contract

Every recipe implements three async methods:

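from typing import ClassVar, Protocol

# RecipeRole, RecipeAction, RecipePreview and RecipeOutcome are defined
# alongside this protocol in recipe.py.
class Recipe(Protocol):
    check_type: ClassVar[str]
    name: ClassVar[str]
    min_role: ClassVar[RecipeRole] = "admin"  # fail-closed default

    async def list_actions(self, message, ctx) -> list[RecipeAction]: ...  # read-only
    async def preview(self, message, ctx, action_key) -> RecipePreview: ...  # read-only
    async def apply(self, message, ctx, preview) -> RecipeOutcome: ...  # the only mutating method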

backend/app/services/doctor/recipe.py defines all the supporting types. The contract is:

list_actions() and preview() are read-only. They observe cluster state via ctx.pool and report what apply() would do. They never mutate.
apply() is the only mutating method. It is called by the executor with the same preview object that was previously returned, so the recipe can emit a diff once and trust the executor to feed it back.

The "propose-only" property is enforced socially in the protocol and mechanically in the executor — see the lock chain below.
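As a minimal sketch of the contract above, the following standalone recipe keeps list_actions() and preview() read-only and mutates only in apply(). The dataclasses are simplified stand-ins for the real types in recipe.py, and the recipe itself, the "host" message field, and FakeCtx are hypothetical names chosen for illustration.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, ClassVar, Optional


# Simplified stand-ins for the types in recipe.py; the real definitions
# may carry more fields.
@dataclass
class RecipeAction:
    key: str
    label: str


@dataclass
class RecipePreview:
    action_key: str
    diff: dict[str, Any]


@dataclass
class RecipeOutcome:
    succeeded: bool
    resolution_kind: Optional[str] = None


@dataclass
class FakeCtx:
    """Hypothetical context: a dict plays the role of cluster state."""
    state: dict[str, str] = field(default_factory=dict)


class RestartAgentRecipe:  # hypothetical recipe, for illustration only
    check_type: ClassVar[str] = "agent_down"
    name: ClassVar[str] = "restart_agent"
    min_role: ClassVar[str] = "admin"

    async def list_actions(self, message, ctx) -> list[RecipeAction]:
        return [RecipeAction(key="apply", label="Restart agent")]

    async def preview(self, message, ctx, action_key) -> RecipePreview:
        # Read-only: observe state and describe the change, never mutate.
        host = message["host"]
        current = ctx.state.get(host, "unknown")
        return RecipePreview(action_key, {host: {"from": current, "to": "running"}})

    async def apply(self, message, ctx, preview) -> RecipeOutcome:
        # The only mutating method; it acts on the preview it is handed.
        for host, change in preview.diff.items():
            ctx.state[host] = change["to"]
        return RecipeOutcome(succeeded=True, resolution_kind="fix")


async def demo() -> tuple[dict, dict]:
    ctx = FakeCtx(state={"node-1": "stopped"})
    recipe = RestartAgentRecipe()
    message = {"host": "node-1"}
    preview = await recipe.preview(message, ctx, "apply")
    before = dict(ctx.state)  # snapshot taken after preview, before apply
    await recipe.apply(message, ctx, preview)
    return before, ctx.state
```

Calling asyncio.run(demo()) returns the state as it stood after preview() and the state after apply(); only the second should differ from the starting state.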

The full apply lock chain

Two operators clicking Apply on the same row at roughly the same time is the central concurrency case the framework handles. Here is what run_apply() (executor.py:270) does, in order, for every apply:

The lock chain has six places it can refuse to proceed, each returning a distinct error type so the UI and the audit log can tell them apart:

Error	HTTP	Meaning
`NotFound`	404	`system_messages.id` does not exist
`NoRecipe`	400	No registered recipe for this `check_type`
`UnknownAction`	400	Caller passed an `action_key` the recipe didn't declare
`InsufficientRole`	403	Caller's role is below recipe (or per-action) `min_role`
`ConcurrentApply`	409	Another operator holds the row lock right now
`PreviewMismatch`	409	Cluster state changed since the operator's preview

api/doctor.py does the mapping. Every other failure (a kubectl call returning non-zero, an SSH timeout, a RecipeOutcome.succeeded == False) flows through as a 200 with succeeded=false in the body: the apply itself executed, and whether it worked is a recipe-level concern.
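The mapping in the table can be sketched as a small lookup. The exception class names mirror the table; the common base class DoctorError and the helper status_for() are hypothetical and are not taken from api/doctor.py.

```python
class DoctorError(Exception):
    """Hypothetical common base for executor refusals."""


class NotFound(DoctorError): ...
class NoRecipe(DoctorError): ...
class UnknownAction(DoctorError): ...
class InsufficientRole(DoctorError): ...
class ConcurrentApply(DoctorError): ...
class PreviewMismatch(DoctorError): ...


# Status codes copied from the table above.
HTTP_STATUS: dict[type[DoctorError], int] = {
    NotFound: 404,
    NoRecipe: 400,
    UnknownAction: 400,
    InsufficientRole: 403,
    ConcurrentApply: 409,
    PreviewMismatch: 409,
}


def status_for(exc: Exception) -> int:
    """Map an executor refusal to its HTTP status.

    Anything that is not a refusal is re-raised. Recipe-level failures are
    reported as a 200 with succeeded=false and never reach this function.
    """
    for exc_type, status in HTTP_STATUS.items():
        if isinstance(exc, exc_type):
            return status
    raise exc
```

Keeping one class per row is what lets the UI distinguish "refresh and try again" (409) from "you may not do this" (403) without parsing error strings.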

Why `FOR UPDATE NOWAIT` and not `FOR UPDATE`

NOWAIT turns a blocking wait into an immediate lock_not_available so the executor can fail fast and return a friendly 409. With plain FOR UPDATE two operators clicking simultaneously would have the second hang for the duration of an SSH-driven apply (sometimes 30+ seconds for k3s restart recipes), and from the UI it would look like the request hung — far worse UX than "another operator is applying this, refresh and try again."
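A minimal sketch of the fail-fast lock, assuming an asyncpg-style connection with a fetchrow() method. LockNotAvailable is a local stand-in for the driver's lock_not_available error (SQLSTATE 55P03; in asyncpg that is asyncpg.exceptions.LockNotAvailableError). The function lock_message_row() and the two fake connections are hypothetical and exist only so the example runs without a database.

```python
import asyncio

LOCK_SQL = "SELECT id FROM system_messages WHERE id = $1 FOR UPDATE NOWAIT"


class LockNotAvailable(Exception):
    """Stand-in for the driver's lock_not_available error (SQLSTATE 55P03)."""


class ConcurrentApply(Exception):
    """Another operator holds the row lock."""


async def lock_message_row(conn, message_id: int) -> bool:
    """Take the row lock or fail immediately; must run inside a transaction.

    Returns True if the row exists and is now locked, False if there is no
    such row.
    """
    try:
        row = await conn.fetchrow(LOCK_SQL, message_id)
    except LockNotAvailable as exc:
        # NOWAIT raised instead of queueing behind the other transaction.
        raise ConcurrentApply(message_id) from exc
    return row is not None


class BusyConn:
    """Fake connection that behaves as if the row is already locked."""

    async def fetchrow(self, sql, *args):
        raise LockNotAvailable()


class FreeConn:
    """Fake connection on which the lock is granted."""

    async def fetchrow(self, sql, *args):
        return {"id": args[0]}


async def try_lock(conn) -> str:
    try:
        await lock_message_row(conn, 42)
    except ConcurrentApply:
        return "409"
    return "locked"
```

With a real connection the lock lives until the surrounding transaction commits or rolls back, so the call has to sit inside the same transaction that later runs apply().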

Why re-preview under the lock

Between the operator opening the modal and clicking Apply, anything could have changed in the cluster — someone could have manually fixed the underlying drift via kubectl, a CP node could have been promoted, the host could have come back online. If the diff the operator confirmed no longer matches what preview() would compute right now, applying it would produce a different change than the operator authorized. That is a silent integrity hole; we close it with a re-preview compare and a PreviewMismatch error.
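The compare step can be sketched as follows. The source does not say how the executor compares the two previews; comparing a canonical JSON serialization of the diff is an assumption made here, and canonical() and ensure_preview_still_valid() are hypothetical names.

```python
import json
from typing import Any


class PreviewMismatch(Exception):
    """Cluster state changed since the operator's preview."""


def canonical(diff: Any) -> str:
    """Serialize a diff deterministically so key order cannot cause a false mismatch."""
    return json.dumps(diff, sort_keys=True, separators=(",", ":"))


def ensure_preview_still_valid(confirmed_diff: Any, fresh_diff: Any) -> None:
    """Raise PreviewMismatch unless the fresh preview equals the confirmed one.

    confirmed_diff is what the operator saw and approved; fresh_diff is what
    preview() computes now, while the row lock is held.
    """
    if canonical(confirmed_diff) != canonical(fresh_diff):
        raise PreviewMismatch("cluster state changed since preview")
```

Running the fresh preview after the lock is taken matters: a compare done before locking would leave a window in which another apply could still change the state.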

The writer-owns-resolve contract

This is the most counter-intuitive rule in the framework, and the one recipe authors most commonly violate on the first attempt:

Recipes MUST NOT issue UPDATE system_messages directly. If a recipe needs to mark the message resolved, it returns resolution_kind="fix" or resolution_kind="acceptance" from apply(). The executor then writes the row inline, inside the same transaction that already holds the FOR UPDATE NOWAIT lock.

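executor.py:479–498 is the only place in the codebase that writes system_messages.resolved_at from a Doctor flow. The reason is purely mechanical: the outer FOR UPDATE lock is held until commit, and the executor cannot commit until apply() returns. A recipe that opens a second connection to update the row therefore waits on a lock that is released only after the recipe itself finishes. PostgreSQL's deadlock detector cannot break the cycle because one side of it is waiting in application code rather than on a database lock, so the request simply hangs. We caught this exact pattern in #789 Phase A when QA observed "Abandon" hanging until pool exhaustion.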

The same rule covers any DB cleanup the recipe does on adjacent tables: use ctx.apply_conn for any write that interacts with the locked row's foreign keys, including cluster_baseline_drift_resolved, hosts.status flips during CP demotion, etc. The executor sets apply_conn only during apply() and leaves it None during preview() and list_actions().

The corollary is that recipes can use ctx.pool freely for read queries during preview — there is no lock yet, and reads against the target row are not blocked.
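A minimal sketch of the rule from the recipe author's side. RecordingConn is a fake connection that records SQL instead of running it, Ctx and RecipeOutcome are simplified stand-ins, and run_apply_sketch() is a hypothetical reduction of the executor's tail, not the code in executor.py. The hosts.id column, the 'free' status value, and the choice to resolve only on success are assumptions.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RecipeOutcome:  # simplified stand-in for the real type
    succeeded: bool
    resolution_kind: Optional[str] = None  # "fix" or "acceptance"


class RecordingConn:
    """Fake connection that records statements instead of running them."""

    def __init__(self) -> None:
        self.statements: list[str] = []

    async def execute(self, sql: str, *args: Any) -> None:
        self.statements.append(sql)


@dataclass
class Ctx:  # simplified stand-in for the recipe context
    pool: Any = None
    apply_conn: Optional[RecordingConn] = None  # set only during apply()


class AbandonSketch:  # hypothetical DB-only recipe
    async def apply(self, message, ctx, preview) -> RecipeOutcome:
        # Adjacent-table write goes through the executor's own connection,
        # so it runs inside the transaction that holds the row lock.
        await ctx.apply_conn.execute(
            "UPDATE hosts SET status = 'free' WHERE id = $1", message["host_id"]
        )
        # No UPDATE system_messages here: ask the executor to resolve instead.
        return RecipeOutcome(succeeded=True, resolution_kind="fix")


async def run_apply_sketch(recipe, message) -> list[str]:
    """Call apply(), then resolve the message on the same connection."""
    conn = RecordingConn()  # stands in for the connection holding the lock
    ctx = Ctx(apply_conn=conn)
    outcome = await recipe.apply(message, ctx, preview=None)
    if outcome.succeeded and outcome.resolution_kind is not None:
        await conn.execute(
            "UPDATE system_messages SET resolved_at = NOW() WHERE id = $1",
            message["id"],
        )
    return conn.statements
```

Both statements land on one connection, which is the point: nothing in the recipe ever waits on the lock the executor is holding.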

Multi-action recipes (#789)

A recipe can declare more than one action. cp_promotion_failed ships three: Retry (re-run the bootstrap playbook), Revert (uninstall CP role, reinstall as worker), Abandon (DB-only: return host to free pool). Each is a RecipeAction:

RecipeAction(
    key="retry",
    label="Retry promotion",
    description="Re-run the CP bootstrap playbook on this host",
    severity="primary",
    min_role=None,  # inherit recipe-level
)
RecipeAction(
    key="revert",
    label="Revert to worker",
    description="Uninstall k3s server, reinstall as agent, drain etcd",
    severity="destructive",
    min_role="admin",
)
RecipeAction(
    key="abandon",
    label="Abandon (DB only)",
    description="Mark host free; does not touch the host",
    severity="secondary",
    min_role=None,
)

The executor plumbs the operator's chosen action_key through preview(action_key) and apply(preview) (the preview's action_key field carries it). If the operator changes their mind between preview and apply, the executor's preview-mismatch check catches it because the diff itself will differ.
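One way a recipe might route the chosen action is sketched below. The class is hypothetical and is not the shipped cp_promotion_failed recipe; the handler names, their return strings, and the placeholder diff are illustrative only.

```python
import asyncio
from dataclasses import dataclass
from typing import Any


@dataclass
class RecipePreview:  # simplified stand-in for the real type
    action_key: str
    diff: dict[str, Any]


class CpPromotionFailedSketch:  # hypothetical, not the shipped recipe
    async def preview(self, message, ctx, action_key) -> RecipePreview:
        # Placeholder diff; a real recipe computes it from cluster state.
        return RecipePreview(action_key=action_key, diff={"action": action_key})

    async def apply(self, message, ctx, preview) -> str:
        # apply() has no action_key parameter: it reads the key off the preview.
        handlers = {
            "retry": self._retry,
            "revert": self._revert,
            "abandon": self._abandon,
        }
        return await handlers[preview.action_key](message, ctx)

    async def _retry(self, message, ctx) -> str:
        return "retry: re-run bootstrap playbook"

    async def _revert(self, message, ctx) -> str:
        return "revert: reinstall as worker"

    async def _abandon(self, message, ctx) -> str:
        return "abandon: return host to free pool"


async def choose(action_key: str) -> str:
    """Run preview then apply for one action, as the executor would."""
    recipe = CpPromotionFailedSketch()
    preview = await recipe.preview({}, None, action_key)
    return await recipe.apply({}, None, preview)
```

Because the key travels inside the preview, a preview produced for one action can never be applied as another.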

severity controls UI styling: "destructive" flips the Apply button to a red variant and forces the operator to read the side-effects panel before the button enables. The effect is confined to the UI; the executor does not gate on it.

RBAC: `min_role` and per-action overrides

PodWarden has three roles in the local user model: viewer, operator, admin. The hierarchy lives in app.dependencies.auth.ROLE_HIERARCHY.

Every recipe declares a min_role ClassVar. The default is "admin" — fail-closed. A recipe author who forgets the field gets the strictest gate, not the loosest. To delegate to operators, set min_role = "operator" explicitly. (#858)

Per-action min_role overrides (#907) layer on top:

class AppPodsNotReadyRecipe:
    check_type = "app_pods_not_ready"
    min_role: ClassVar[RecipeRole] = "operator"

    async def list_actions(self, message, ctx) -> list[RecipeAction]:
        return [
            RecipeAction(key="diagnose",     ..., min_role=None),       # operator (inherited)
            RecipeAction(key="restart",      ..., min_role=None),       # operator (inherited)
            RecipeAction(key="force_delete", ..., min_role=None),       # operator (inherited)
            RecipeAction(key="reset_node",   ..., min_role="admin"),    # admin only
        ]

The executor (_action_min_role_for(), executor.py:120–144) consults the per-action override first, falls back to recipe.min_role, and finally to "admin" if neither is set. This lets one recipe bundle diagnose-grade actions next to genuinely destructive ones without forcing the whole recipe to admin-only.
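The fallback order described above can be sketched in a few lines. ROLE_RANK is an assumed shape for the hierarchy (the real ROLE_HIERARCHY in app.dependencies.auth may be structured differently), and effective_min_role() and is_allowed() are hypothetical names, not the executor's own.

```python
from typing import Optional

# Assumed shape: a higher number means more privilege.
ROLE_RANK = {"viewer": 0, "operator": 1, "admin": 2}


def effective_min_role(action_override: Optional[str], recipe_default: Optional[str]) -> str:
    """Per-action override first, then the recipe level, then fail closed."""
    return action_override or recipe_default or "admin"


def is_allowed(caller_role: str, required_role: str) -> bool:
    """True if the caller's role is at or above the required role."""
    return ROLE_RANK[caller_role] >= ROLE_RANK[required_role]
```

For the AppPodsNotReadyRecipe example, effective_min_role(None, "operator") gives "operator" for restart, while effective_min_role("admin", "operator") gives "admin" for reset_node, so an operator passes the first check and fails the second.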

preview() and list_actions() are always allowed regardless of role — operators and viewers can inspect what would happen even when they can't pull the trigger.

The audit trail: `doctor_executions`

Every apply, whether it succeeded or failed, writes one row to doctor_executions inside the same transaction:

CREATE TABLE doctor_executions (
    id BIGSERIAL PRIMARY KEY,
    message_id BIGINT REFERENCES system_messages(id) ON DELETE SET NULL,
    recipe_name TEXT NOT NULL,
    action_key TEXT NOT NULL DEFAULT 'apply',
    operator_user_id TEXT NOT NULL,
    succeeded BOOLEAN NOT NULL,
    diff JSONB,           -- the preview diff that was actually applied
    side_effects JSONB,
    error TEXT,
    started_at TIMESTAMPTZ NOT NULL,
    finished_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

The audit row outlives the message — ON DELETE SET NULL on message_id means the 7-day system_messages cleanup leaves the audit row intact for forensic purposes.

If you are debugging "did anyone fix this last week?", join doctor_executions to system_messages on message_id to pick up check_key, and filter on started_at. The diff column records exactly what was applied, which means a postmortem can replay an incident even after the originating row is gone.
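The ON DELETE SET NULL behaviour can be demonstrated without a PostgreSQL instance. The sketch below swaps in SQLite through Python's sqlite3 module and trims the tables to a few columns, so it illustrates the foreign-key semantics only, not the real schema; the inserted values are made up.

```python
import sqlite3


def audit_survives_cleanup() -> list[tuple]:
    """Delete a message and return what is left in the audit table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.execute("CREATE TABLE system_messages (id INTEGER PRIMARY KEY, check_key TEXT)")
    conn.execute(
        """
        CREATE TABLE doctor_executions (
            id INTEGER PRIMARY KEY,
            message_id INTEGER REFERENCES system_messages(id) ON DELETE SET NULL,
            recipe_name TEXT NOT NULL,
            succeeded INTEGER NOT NULL
        )
        """
    )
    conn.execute("INSERT INTO system_messages (id, check_key) VALUES (1, 'example-key')")
    conn.execute(
        "INSERT INTO doctor_executions (message_id, recipe_name, succeeded) "
        "VALUES (1, 'example_recipe', 1)"
    )
    # Stands in for the 7-day system_messages cleanup.
    conn.execute("DELETE FROM system_messages WHERE id = 1")
    rows = conn.execute(
        "SELECT message_id, recipe_name, succeeded FROM doctor_executions"
    ).fetchall()
    conn.close()
    return rows
```

The audit row survives with message_id set to NULL. One consequence for forensic queries: once the message has been cleaned up, a join on message_id no longer matches that audit row, so older incidents have to be found by recipe_name and started_at instead.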

Adding a new recipe

The actual cookbook for a recipe author is short:

Create backend/app/services/doctor/recipes/<your_recipe>.py.
Implement the three protocol methods. If you only have one action, use the mixin in _single_action.py.
Set check_type to the value the runner emits.
Set min_role. Default to admin; opt down only when the apply path is genuinely safe to delegate.
Register in backend/app/services/doctor/registry.py.
Add an integration test that exercises preview → apply against a fixture cluster (or asyncpg-fakes for DB-only recipes). Don't mock system_messages — see the user feedback memory on integration tests.
If apply() does any DB write that touches a row reachable by the outer FOR UPDATE, route it through ctx.apply_conn. Anything else deadlocks (#789).

The recipe registry assumes one recipe per check_type. If you need two behaviours for the same check type, that is a multi-action recipe, not two recipes. Registering two recipes for the same check_type is a startup error.
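A registry with that one-recipe-per-check_type rule could look like the sketch below. It is not the contents of registry.py: register(), recipe_for(), DuplicateRecipeError, and the two example classes are all hypothetical.

```python
from typing import Optional


class DuplicateRecipeError(RuntimeError):
    """Two recipes were registered for the same check_type."""


_REGISTRY: dict[str, type] = {}


def register(recipe_cls: type) -> type:
    """Register a recipe class under its check_type, refusing duplicates."""
    check_type = recipe_cls.check_type
    existing = _REGISTRY.get(check_type)
    if existing is not None:
        raise DuplicateRecipeError(
            f"{check_type!r} is already handled by {existing.__name__}"
        )
    _REGISTRY[check_type] = recipe_cls
    return recipe_cls


def recipe_for(check_type: str) -> Optional[type]:
    """Look up the recipe for a check_type, or None if nothing is registered."""
    return _REGISTRY.get(check_type)


class FirstRecipe:  # hypothetical
    check_type = "app_pods_not_ready"


class SecondRecipe:  # hypothetical; clashes with FirstRecipe
    check_type = "app_pods_not_ready"
```

Registering FirstRecipe succeeds and registering SecondRecipe afterwards raises, which is how a clash would surface at startup rather than at the first apply.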