Automatic image cleanup on all Hub nodes

Igor

A weekly systemd timer now prunes the containerd image store on every PodWarden Hub production node. First run: Sunday 2026-05-24 04:00 UTC.

A weekly docker-image-gc.timer systemd unit is now live on all three PodWarden Hub production nodes. Every Sunday at 04:00 UTC it runs docker image prune -a --filter until=168h, which removes unused images created more than seven days (168 hours) ago. Images still referenced by a container are left alone. First scheduled run: 2026-05-24 04:00 UTC.
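For reference, a unit pair along these lines would produce that schedule. This is a minimal sketch, not the exact files deployed on the Hub nodes: the docker-image-gc.service name, the /usr/bin/docker path, the --force flag (added so the prune does not wait for a confirmation prompt), Persistent=true, and the local staging directory are all assumptions.

```shell
#!/bin/sh
set -eu

# Assumption: units are staged in a local directory first, then copied
# to /etc/systemd/system by hand. Override UNIT_DIR to write elsewhere.
UNIT_DIR="${UNIT_DIR:-./systemd-units}"
mkdir -p "$UNIT_DIR"

# One-shot service that performs the prune (name and docker path assumed).
cat > "$UNIT_DIR/docker-image-gc.service" <<'EOF'
[Unit]
Description=Prune unused Docker images older than seven days

[Service]
Type=oneshot
# --force is an assumption: it skips the interactive confirmation prompt.
ExecStart=/usr/bin/docker image prune -a --force --filter until=168h
EOF

# Timer that fires every Sunday at 04:00 UTC and starts the service above.
cat > "$UNIT_DIR/docker-image-gc.timer" <<'EOF'
[Unit]
Description=Weekly Docker image prune

[Timer]
OnCalendar=Sun *-*-* 04:00:00 UTC
# Assumption: catch up on the next boot if the node was down at 04:00.
Persistent=true

[Install]
WantedBy=timers.target
EOF

echo "wrote units to $UNIT_DIR"
```

After copying both files to /etc/systemd/system, systemctl daemon-reload followed by systemctl enable --now docker-image-gc.timer activates the schedule, and systemctl list-timers docker-image-gc.timer shows the next run time.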

The fix came out of an incident whose root cause took longer to find than it should have. Docker 29.x switched to a containerd-snapshotter image store backend, so layers are now stored in /var/lib/containerd instead of /var/lib/docker, and unlike the old daemon, the new store never auto-prunes. At normal Hub deploy cadence that accumulates fast: one of the nodes quietly grew its image store to ~78 GB over about five weeks, eventually filling the root partition.

When the disk filled, the Patroni replica on that node ran out of space and went read-only. The symptom that surfaced to users wasn't a disk error. It was "401 token missing", emitted by an auth code path that swallowed the underlying storage failure and reported it as a missing token, so MCP clients started seeing auth failures. The real cause sat unmonitored for roughly three weeks because nothing in the alert stack pointed at disk pressure.
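The gap described above, with no alert pointing at disk pressure, is the kind of thing a small threshold check can cover. The sketch below is illustrative only and is not the Hub's alerting setup: the function names used_pct and check_disk, and the 90% limit mentioned afterwards, are hypothetical.

```shell
#!/bin/sh

# Print the used-space percentage (digits only) of the filesystem holding $1.
# df -P gives the POSIX one-line-per-filesystem layout; column 5 is capacity.
used_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# check_disk PATH LIMIT
# Returns 1 and prints an alert on stderr when usage is at or above LIMIT percent.
check_disk() {
  path=$1
  limit=$2
  pct=$(used_pct "$path")
  if [ "$pct" -ge "$limit" ]; then
    echo "ALERT: $path is ${pct}% full (limit ${limit}%)" >&2
    return 1
  fi
  echo "OK: $path is ${pct}% full"
}
```

Called as check_disk / 90 from a cron job or timer, the non-zero exit status can drive whatever notifier is already in place. Running du -sh /var/lib/containerd alongside it shows how much of that usage is the image store.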

The disk is clear now (49 GB free), the gc timer is deployed to all three nodes, and image-store growth is bounded going forward. No user action needed; the weekly prune runs quietly with no downtime.

We're also fixing the auth code path that turned a disk-full error into a 401. That ships in a follow-up so future disk-pressure events generate a real alert instead of looking like a token problem.