NVIDIA DCGM Exporter exposes GPU telemetry as Prometheus metrics, providing per-GPU visibility into utilization, memory usage, temperature, power draw, and encoder/decoder engine activity. Deploy it as a DaemonSet to monitor every GPU node in your cluster.
About
DCGM Exporter is a Prometheus-compatible metrics exporter for NVIDIA GPUs, built on NVIDIA's Data Center GPU Manager (DCGM). It exposes over 40 GPU telemetry metrics (utilization, memory, temperature, power draw, ECC errors, and more) at an HTTP endpoint that Prometheus…
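To illustrate what a consumer of that HTTP endpoint sees, here is a minimal Python sketch that parses a Prometheus text-format payload and groups one metric by GPU. It is an illustration under stated assumptions, not the exporter's own code: the sample payload is hand-written with made-up values, the metric names `DCGM_FI_DEV_GPU_UTIL` and `DCGM_FI_DEV_GPU_TEMP` and the `gpu` and `Hostname` labels are assumed to match your exporter configuration, and `parse_metrics` and `per_gpu` are hypothetical helper names. The parser is deliberately simplified and ignores optional timestamps.

```python
import re

# Hand-written sample in Prometheus text exposition format.
# ASSUMPTION: metric names and labels mirror a typical DCGM Exporter setup;
# the values and the hostname are invented for illustration.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 12
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 64
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="node-a"} 41
"""

# One sample line: metric name, optional {labels}, then a numeric value.
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
# One label pair inside the braces: key="value".
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # simplified parser: ignore lines it does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def per_gpu(samples, metric):
    """Map the 'gpu' label to the value of the given metric."""
    return {
        labels.get("gpu"): value
        for name, labels, value in samples
        if name == metric
    }


if __name__ == "__main__":
    parsed = parse_metrics(SAMPLE)
    for gpu, util in sorted(per_gpu(parsed, "DCGM_FI_DEV_GPU_UTIL").items()):
        print(f"GPU {gpu}: {util:.0f}% utilization")
```

In a real deployment Prometheus performs this scraping and parsing itself; a sketch like this is mainly useful for spot-checking an exporter's output by hand.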