PodWarden User Manual

Doctor

Operator-led drift remediation — preview a fix, apply it, audit it

What Doctor is

Doctor is PodWarden's operator-led drift remediation tool, shipped in the 2026-05-01 release. When a system message describes a drift condition for which PodWarden has a registered fix (a recipe), a Doctor button appears on that message row. Clicking it opens a modal that:

  1. Computes a preview of exactly what change will be made (read-only — nothing touches the cluster yet)
  2. Shows a diff table listing each key, its current value, and the value it will be set to
  3. Lists any side effects (e.g. "restarts k3s on this node")
  4. Lets you apply the change after reviewing the preview

Doctor is not auto-healing. No change is applied automatically. The operator explicitly confirms each remediation after reading the preview.
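The preview described above can be pictured as a small data structure. The following is a minimal sketch only: the class and field names (`Change`, `Preview`, `blocked_reason`, `can_apply`) and the example values are illustrative assumptions, not PodWarden's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Change:
    """One row of the diff table (hypothetical field names)."""
    key: str      # the setting or field that would change
    before: str   # current value read from the cluster
    after: str    # value the recipe would set


@dataclass
class Preview:
    """Read-only description of a proposed remediation (hypothetical)."""
    summary: str
    changes: List[Change] = field(default_factory=list)
    side_effects: List[str] = field(default_factory=list)
    blocked_reason: Optional[str] = None  # set when the change is unsafe

    @property
    def can_apply(self) -> bool:
        # Nothing is applied automatically; this only says whether the
        # operator is allowed to confirm the change.
        return self.blocked_reason is None


preview = Preview(
    summary="Reset one setting to its baseline value",
    changes=[Change(key="example-key", before="old", after="new")],
    side_effects=["restarts k3s on this node"],
)
blocked = Preview(summary="Rebuild replicas", blocked_reason="replica count cannot be satisfied")
```

In this sketch, building a `Preview` never touches the cluster; the operator's explicit confirmation is a separate step.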

How to use Doctor

Entry point

Doctor buttons appear in System → Messages. The button is shown only on messages whose check type has a registered recipe; all other messages have no button.

You need the operator role (or higher) to see and use Doctor. Viewers see the messages but not the button.

The modal flow

  1. Click "Doctor" on a message row — the modal opens and automatically computes the preview. You see a loading spinner labeled "Computing preview…" while PodWarden reads the live cluster state.

  2. Review the preview:

    • The summary line describes what will happen in plain language.
    • The diff table shows each change: key, current value ("before"), and target value ("after").
    • Side effects listed below the table warn of secondary impacts (e.g. a service restart).
    • If the recipe determines the change is unsafe to apply (e.g. Longhorn cannot satisfy the requested replica count), the Apply button is disabled and a blocking reason explains why.
  3. Click "Apply" to send the change to the cluster. While the apply is running, the modal shows "Applying changes…" and the Escape key and backdrop click are disabled — the backend is executing and you cannot safely cancel mid-flight.

  4. Review the outcome:

    • On success: a green banner lists the changes that landed and provides a "View audit entry" link to the doctor_executions table row for this apply.
    • On partial success: an amber banner lists which changes landed and which did not, with an error message.
    • On full failure: a red banner shows the error.
    • On conflict (cluster state changed between preview and apply): a yellow banner prompts you to re-preview.
  5. Close the modal. If the fix succeeded, the system message auto-resolves on the next health check cycle (the cycle runs every 15 minutes). You do not need to mark it resolved manually.
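The four outcomes in step 4 map one-to-one onto banner colors. As a rough illustration, the mapping might look like the sketch below. The function name `render_banner`, its parameters, and the output format are assumptions made for this example; only the outcome-to-color pairing comes from the list above.

```python
from typing import Optional, Sequence

# Outcome-to-color pairing as described in step 4 of the modal flow.
BANNER_COLORS = {
    "succeeded": "green",
    "partial": "amber",
    "failed": "red",
    "conflict": "yellow",
}


def render_banner(
    outcome: str,
    landed: Sequence[str] = (),
    not_landed: Sequence[str] = (),
    error: Optional[str] = None,
) -> str:
    """Return a one-line text rendering of the result banner (hypothetical)."""
    color = BANNER_COLORS[outcome]
    if outcome == "succeeded":
        detail = "landed: " + ", ".join(landed)
    elif outcome == "partial":
        detail = (
            "landed: " + ", ".join(landed)
            + "; not landed: " + ", ".join(not_landed)
            + "; error: " + str(error)
        )
    elif outcome == "failed":
        detail = "error: " + str(error)
    else:
        # Conflict: the cluster changed between preview and apply.
        detail = "cluster state changed, re-preview required"
    return f"[{color}] {detail}"
```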

Conflict handling

PodWarden re-runs the preview silently during apply to detect drift between the time you approved the preview and the time the apply executes. If the cluster state changed (e.g. another operator acted on the same resource, or another process changed the setting), the apply is rejected with a "Cluster state changed — re-preview required" message. Click "Re-preview" to start over with a fresh view of the current state.

If another operator is already mid-apply on the same message, you will see "A doctor execution is already running" and the button is disabled until it completes.
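One way to picture the two guards above (a re-run of the preview at apply time, plus a per-message "already running" check) is the sketch below. It is an assumption-laden illustration, not PodWarden's implementation: the fingerprinting approach, the in-process lock table, and every name here (`fingerprint`, `apply_with_conflict_check`, `ConflictError`, `AlreadyRunningError`) are hypothetical.

```python
import hashlib
import json
import threading
from typing import Any, Callable, Dict


class ConflictError(Exception):
    """The fresh preview no longer matches the approved one."""


class AlreadyRunningError(Exception):
    """Another apply is in flight for the same message."""


_locks: Dict[str, threading.Lock] = {}
_locks_guard = threading.Lock()


def fingerprint(preview: Dict[str, Any]) -> str:
    # Canonical JSON so that key order does not affect the hash.
    canonical = json.dumps(preview, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def apply_with_conflict_check(
    message_id: str,
    approved_preview: Dict[str, Any],
    compute_preview: Callable[[], Dict[str, Any]],
    execute: Callable[[Dict[str, Any]], Any],
) -> Any:
    with _locks_guard:
        lock = _locks.setdefault(message_id, threading.Lock())
    # Non-blocking acquire: a second operator is rejected, not queued.
    if not lock.acquire(blocking=False):
        raise AlreadyRunningError("A doctor execution is already running")
    try:
        fresh = compute_preview()  # silent re-preview against live state
        if fingerprint(fresh) != fingerprint(approved_preview):
            raise ConflictError("cluster state changed, re-preview required")
        return execute(fresh)
    finally:
        lock.release()
```

Comparing fingerprints of the whole preview means any change to the "before" values is enough to reject the apply, which matches the conservative behavior described above.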

Permission model

| Action | Minimum role |
|---|---|
| View system messages | viewer |
| See Doctor button | operator |
| Run preview | operator |
| Apply a recipe | operator |
| View audit history | operator |

Auth-disabled deployments (development mode) treat all requests as operator.
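The permission rules above amount to a minimum-role lookup. The sketch below shows one way to express it. The action identifiers, the `is_allowed` function, and the `admin` role (standing in for "operator or higher") are assumptions for illustration; the manual names only the viewer and operator roles.

```python
from typing import Optional

# Higher rank means more privilege. "admin" is a hypothetical stand-in
# for any role above operator.
ROLE_RANK = {"viewer": 0, "operator": 1, "admin": 2}

# Minimum role per action, taken from the permission table.
MINIMUM_ROLE = {
    "view_system_messages": "viewer",
    "see_doctor_button": "operator",
    "run_preview": "operator",
    "apply_recipe": "operator",
    "view_audit_history": "operator",
}


def is_allowed(action: str, role: Optional[str], auth_enabled: bool = True) -> bool:
    """Return True if the role meets the action's minimum role."""
    # Auth-disabled (development) deployments treat every request as operator.
    effective = role if auth_enabled else "operator"
    rank = ROLE_RANK.get(effective, -1)  # unknown or missing role: deny
    return rank >= ROLE_RANK[MINIMUM_ROLE[action]]
```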

Available recipes

Recipes are registered against specific check_type values. The recipes shipped as of 2026-05-01:

| Recipe | check_type | What it fixes |
|---|---|---|
| Longhorn setting drift | longhorn_setting_drift | Patches a single Longhorn setting back to the effective baseline value via kubectl patch settings.longhorn.io |
| Degraded Longhorn volume | longhorn_volume_degraded | Initiates replica rebuild for a degraded volume. Apply is blocked if Longhorn cannot satisfy the replica count (e.g. single-node cluster with replica-soft-anti-affinity=false) |
| k3s arg drift | k3s_arg_drift | Writes the correct value to /etc/rancher/k3s/config.yaml on the control-plane node via SSH and restarts k3s (side effect: brief k3s restart on that node) |
| Missing baseline namespace | baseline_namespace_missing | Creates the missing namespace with the required labels via kubectl apply |
| Orphaned Longhorn replica | orphaned_longhorn_replica | Deletes the orphaned replica.longhorn.io resource |
| Stuck PVC | stuck_pvc | Deletes the stuck PVC to allow re-provisioning (Pending) or releases the retained PV (Released) |

Messages whose check type has no registered recipe show no Doctor button; you must remediate those conditions manually.
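Because recipes are keyed by check_type, deciding whether a message gets a Doctor button reduces to a dictionary lookup. The sketch below illustrates that idea with the six check types listed above. The registry shape and the function names (`recipe_for`, `has_recipe`) are assumptions, not PodWarden's code.

```python
from typing import Optional

# check_type -> recipe name, for the recipes shipped as of 2026-05-01.
RECIPES = {
    "longhorn_setting_drift": "Longhorn setting drift",
    "longhorn_volume_degraded": "Degraded Longhorn volume",
    "k3s_arg_drift": "k3s arg drift",
    "baseline_namespace_missing": "Missing baseline namespace",
    "orphaned_longhorn_replica": "Orphaned Longhorn replica",
    "stuck_pvc": "Stuck PVC",
}


def recipe_for(check_type: str) -> Optional[str]:
    """Return the recipe name for a check type, or None if unregistered."""
    return RECIPES.get(check_type)


def has_recipe(check_type: str) -> bool:
    # No recipe means no Doctor button: the condition is fixed manually.
    return recipe_for(check_type) is not None
```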

Audit trail

Every apply attempt that reaches the execution phase creates a row in the doctor_executions table with:

  • Which message was acted on
  • Which cluster and check type
  • The operator who applied it (OIDC email or sub)
  • The full preview payload that was confirmed
  • The outcome (succeeded, failed, partial, pending)
  • The changes that actually landed (on success/partial)
  • Any error message (on failure/partial)

The "View audit entry" link in the modal's success banner deep-links to the audit row under the cluster's baseline audit section at /clusters/{clusterId}/baseline#audit-{execution_id}.
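To make the audit fields and the deep link concrete, here is a minimal sketch. The `DoctorExecution` class and its field names are hypothetical stand-ins for the doctor_executions columns, which this manual does not spell out; only the URL pattern and the four outcome values come from the text above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional
from urllib.parse import quote

VALID_OUTCOMES = {"succeeded", "failed", "partial", "pending"}


@dataclass
class DoctorExecution:
    """One audit row (field names are assumptions)."""
    execution_id: str
    message_id: str
    cluster_id: str
    check_type: str
    operator: str                      # OIDC email or sub
    preview: Dict[str, Any]            # full preview payload that was confirmed
    outcome: str = "pending"
    landed_changes: List[str] = field(default_factory=list)
    error: Optional[str] = None

    def __post_init__(self) -> None:
        if self.outcome not in VALID_OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome}")


def audit_link(cluster_id: str, execution_id: str) -> str:
    """Build the deep link to an audit row, percent-encoding both ids."""
    return (
        f"/clusters/{quote(cluster_id, safe='')}"
        f"/baseline#audit-{quote(execution_id, safe='')}"
    )


row = DoctorExecution(
    execution_id="42",
    message_id="m1",
    cluster_id="c1",
    check_type="stuck_pvc",
    operator="operator@example.com",
    preview={"changes": []},
)
link = audit_link(row.cluster_id, row.execution_id)
```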

See also

Doctor | PodWarden Hub