Health monitor goes HA

Igor

The Telegram health-monitor bot now runs on every node with advisory-lock leader election, Patroni-aware checks, and an interactive /status command.

The Telegram health-monitor bot — the one that pages us when a Hub service goes sideways — now runs on every node in the cluster. Leadership is arbitrated through an advisory lock, so exactly one node sends alerts at a time, and a node going dark no longer means a silent pager.
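For the curious, here is a minimal sketch of how lock-based leadership can work. It assumes the advisory lock is a PostgreSQL session-level advisory lock taken with pg_try_advisory_lock. The lock key, the function names, and the fetch_scalar callable (a stand-in for "run this query and return the first column") are hypothetical and not taken from the monitor's code.

```python
from typing import Callable

# Hypothetical lock key: any 64-bit integer that every node agrees on.
MONITOR_LOCK_KEY = 742001

# Assumed helper shape: takes (sql, params) and returns the first column
# of the first row, e.g. a thin wrapper around a database cursor.
FetchScalar = Callable[[str, tuple], object]


def try_acquire_leadership(fetch_scalar: FetchScalar) -> bool:
    """Return True if this database session holds the advisory lock.

    pg_try_advisory_lock does not block: it returns true when the lock
    is granted and false when another session already holds it. The lock
    is released when the holding session ends, so a node that goes dark
    frees leadership for the others.
    """
    granted = fetch_scalar(
        "SELECT pg_try_advisory_lock(%s)", (MONITOR_LOCK_KEY,)
    )
    return bool(granted)


def should_send_alerts(fetch_scalar: FetchScalar) -> bool:
    """Only the lock holder sends alerts; everyone else stays quiet."""
    try:
        return try_acquire_leadership(fetch_scalar)
    except Exception:
        # If the database is unreachable, this node cannot prove it is
        # the leader, so it does not alert.
        return False
```

Each node would call should_send_alerts on the same long-lived connection before every alert round. The session-scoped lock is what makes failover automatic in this sketch: nobody has to notice a dead leader and clean up after it.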

The Patroni-aware checks also got smarter. Standbys legitimately return 503 on a few endpoints during their normal lifecycle; the old monitor treated that as an outage and woke us up. The new checks read Patroni's /cluster endpoint to learn each node's role and apply the right expectations per role. That should put an end to the false-positive nights.
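A rough sketch of the role-aware idea, assuming the /cluster response has the usual Patroni shape (a "members" list whose entries carry "name" and "role"). The expectation table uses Patroni's /primary and /replica checks purely as an illustration; it is not the monitor's actual rule set, and the function names are made up for this example.

```python
from typing import Dict

# Role string that marks the write leader in a /cluster response.
LEADER_ROLE = "leader"


def roles_by_node(cluster: dict) -> Dict[str, str]:
    """Map each member name to its reported role."""
    return {
        member["name"]: member.get("role", "unknown")
        for member in cluster.get("members", [])
    }


def expected_status(role: str, endpoint: str) -> int:
    """Return the HTTP status a healthy node of this role should give."""
    is_leader = role == LEADER_ROLE
    if endpoint == "/primary":
        # Only the leader answers 200 here; a standby's 503 is normal.
        return 200 if is_leader else 503
    if endpoint == "/replica":
        # The mirror case: the leader is expected to answer 503.
        return 503 if is_leader else 200
    # Any other endpoint is expected to be healthy on every node.
    return 200


def is_outage(role: str, endpoint: str, observed_status: int) -> bool:
    """Flag a probe only when it differs from the role's expectation."""
    return observed_status != expected_status(role, endpoint)
```

With this in place, a standby answering 503 on /primary is simply what the table predicts, so is_outage("replica", "/primary", 503) is False and nobody gets woken up.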

Interactive commands

The bot also picked up an interactive long-poll sidecar. You can send it /status in the operator chat and get a per-node breakdown of the last successful probe times, the active write leader, and the last deploy SHA, with no SSH required.
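To illustrate the long-poll side, here is a small sketch of the bookkeeping such a sidecar needs. It assumes updates arrive in the shape returned by the Telegram Bot API's getUpdates method (each with an "update_id" and an optional "message"). The network calls are left out on purpose, and format_status and handle_updates are hypothetical names, not the sidecar's real functions.

```python
from typing import Dict, Iterable, List, Tuple


def format_status(
    last_ok: Dict[str, str], write_leader: str, deploy_sha: str
) -> str:
    """Build the /status reply: leader, deploy SHA, then one line per node."""
    lines = [
        f"write leader: {write_leader}",
        f"last deploy: {deploy_sha}",
    ]
    for node in sorted(last_ok):
        lines.append(f"{node}: last ok {last_ok[node]}")
    return "\n".join(lines)


def handle_updates(
    updates: Iterable[dict], offset: int, reply_text: str
) -> Tuple[int, List[Tuple[int, str]]]:
    """Return the next getUpdates offset and the replies to send.

    Passing (highest update_id + 1) as the next offset tells Telegram
    that the earlier updates were handled and should not be redelivered.
    """
    replies: List[Tuple[int, str]] = []
    for update in updates:
        offset = max(offset, update["update_id"] + 1)
        message = update.get("message") or {}
        text = (message.get("text") or "").strip()
        # In group chats the command may arrive as "/status@botname".
        command = text.split()[0].split("@")[0] if text else ""
        if command == "/status":
            replies.append((message["chat"]["id"], reply_text))
    return offset, replies
```

A real loop would call getUpdates with the returned offset and a non-zero timeout (that is the long-poll part), then send each reply with sendMessage. Keeping the handler free of network code makes it easy to test without a bot token.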

All of this is powered by the same /health route that the deploy pipeline already hits to gate a release. If you're running PodWarden Hub yourself, the monitor module is opt-in from the admin dashboard's Telegram tab.