The Telegram health-monitor bot — the one that pages us when a Hub service goes sideways — now runs on every node in the cluster. Leadership is arbitrated through an advisory lock, so exactly one node sends alerts at a time, and a node going dark no longer means a silent pager.
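
For readers who want a picture of the leadership rule, here is a minimal sketch. It assumes the lock is a PostgreSQL session-level advisory lock, taken through a DB-API cursor that uses the %s parameter style (psycopg, for example). The lock key and the function names are made up for illustration; the post does not say how the bot actually implements this.

```python
from typing import Any, Callable

# Hypothetical lock key: any 64-bit integer that every node agrees on.
MONITOR_LOCK_KEY = 724001


def try_become_leader(cursor: Any, lock_key: int = MONITOR_LOCK_KEY) -> bool:
    """Return True if this database session holds the advisory lock."""
    # pg_try_advisory_lock does not block; it answers true or false at once.
    # Session-level locks are released when the session ends, so a node that
    # goes dark frees the lock for another node to pick up.
    cursor.execute("SELECT pg_try_advisory_lock(%s)", (lock_key,))
    row = cursor.fetchone()
    return bool(row and row[0])


def alert_if_leader(cursor: Any, send_alert: Callable[[str], None], message: str) -> bool:
    """Send the alert only when this node is the current leader."""
    if not try_become_leader(cursor):
        return False  # another node holds the lock and will do the paging
    send_alert(message)
    return True
```

In this sketch the leader has to keep its database connection open for as long as it wants to stay leader, since closing the session gives up the lock.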
The Patroni-aware checks also got smarter. Standbys legitimately return
503 on a few endpoints during their normal lifecycle; the old monitor
treated that as an outage and woke us up. The new checks read Patroni's
/cluster endpoint to learn each node's role, then apply the expectations
that fit that role. False-positive nights are over.
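
A rough sketch of what per-role expectations can look like follows. It assumes the /cluster response is JSON with a "members" list whose entries carry "name" and "role", which is how Patroni's REST API reports it. The expectation table is hypothetical: it uses Patroni's own /primary and /replica endpoints as stand-ins, because the post does not list which endpoints the monitor probes.

```python
from __future__ import annotations

import json

# Hypothetical table: the status code each role is expected to return.
EXPECTED_STATUS = {
    "leader": {"/primary": 200, "/replica": 503},
    "replica": {"/primary": 503, "/replica": 200},
}

# Assumption: a synchronous standby is judged by the same rules as a replica.
ROLE_ALIASES = {"sync_standby": "replica"}


def roles_from_cluster(cluster_json: str) -> dict[str, str]:
    """Map each member name to its role from a /cluster response body."""
    members = json.loads(cluster_json).get("members", [])
    return {member["name"]: member["role"] for member in members}


def is_healthy(role: str, endpoint: str, status: int) -> bool:
    """Judge a probe result against what this role should return."""
    role = ROLE_ALIASES.get(role, role)
    expected = EXPECTED_STATUS.get(role, {}).get(endpoint)
    if expected is None:
        # Unknown role or endpoint: fall back to the old rule and require 200.
        return status == 200
    return status == expected
```

With this table, a 503 from /primary on a replica counts as healthy, while the same 503 on the leader still raises an alert.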
Interactive commands
The bot also picked up an interactive long-poll sidecar. You can ask
it /status in the operator chat and get a per-node breakdown of the
last successful probe times, the active write leader, and the last
deploy SHA — no SSH required.
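
To make the long-poll idea concrete, here is a small sketch of the update-handling step. It assumes the sidecar uses the Telegram Bot API's getUpdates method, where each update carries an update_id and the next request must pass an offset one higher than the last update seen. The function name and the render_status callback are hypothetical, and the network call itself is left out.

```python
from typing import Callable, Iterable, List, Tuple


def handle_updates(
    updates: Iterable[dict],
    offset: int,
    render_status: Callable[[], str],
) -> Tuple[int, List[Tuple[int, str]]]:
    """Return the next getUpdates offset and the (chat_id, text) replies to send."""
    replies: List[Tuple[int, str]] = []
    for update in updates:
        # Acknowledge every update, even ones we ignore, by moving past its id.
        offset = max(offset, update["update_id"] + 1)
        message = update.get("message") or {}
        text = (message.get("text") or "").strip()
        if not text:
            continue
        # In group chats a command may arrive as "/status@botname".
        command = text.split()[0].split("@")[0]
        if command == "/status":
            replies.append((message["chat"]["id"], render_status()))
    return offset, replies
```

A polling loop would call getUpdates with the returned offset and a timeout, pass the result list to this function, and send each reply with sendMessage.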
All of this is powered by the same /health route the deploy pipeline
already hits to gate a release. If you're running PodWarden Hub
yourself, the monitor module is opt-in: enable it from the Telegram tab
of the admin dashboard.