DCGM Exporter

NVIDIA DCGM Exporter exposes GPU telemetry as Prometheus metrics, giving per-GPU visibility into utilization, memory usage, temperature, power draw, and encoder/decoder engine activity. Deploy it as a Kubernetes DaemonSet so that an exporter pod runs on every GPU node in your cluster.
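The exporter runs once per node and labels each sample with the GPU it came from, so a consumer typically groups samples by node and GPU. The following is a minimal Python sketch under stated assumptions. It parses a small, made-up excerpt of Prometheus text-format output and groups the `DCGM_FI_*` samples by `(Hostname, gpu)`. The metric names (`DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_GPU_TEMP`, `DCGM_FI_DEV_POWER_USAGE`) follow DCGM's field naming. The label names and all values are assumptions for illustration. The `per_gpu` helper is hypothetical, and the simplified parser ignores escaping rules that a full Prometheus parser would handle.

```python
import re
from collections import defaultdict

# One Prometheus text-format sample with labels, e.g.
#   DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-1"} 87
# Simplified: assumes label values contain no '}' and no escaped quotes.
SAMPLE_RE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def per_gpu(text: str) -> dict[tuple[str, str], dict[str, float]]:
    """Group DCGM_FI_* samples by (node, GPU index).

    The "Hostname" and "gpu" label names are assumptions about the exporter's output.
    """
    grouped: dict[tuple[str, str], dict[str, float]] = defaultdict(dict)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        m = SAMPLE_RE.match(line)
        if m is None or not m["name"].startswith("DCGM_FI_"):
            continue  # not a labeled DCGM sample
        labels = dict(LABEL_RE.findall(m["labels"]))
        key = (labels.get("Hostname", "?"), labels.get("gpu", "?"))
        grouped[key][m["name"]] = float(m["value"])
    return dict(grouped)


# Illustrative input only; the values are made up.
SAMPLE = """\
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa",Hostname="node-1"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb",Hostname="node-1"} 12
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-cccc",Hostname="node-2"} 55
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-aaaa",Hostname="node-1"} 71
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-aaaa",Hostname="node-1"} 245.5
"""

for (node, gpu), metrics in sorted(per_gpu(SAMPLE).items()):
    print(node, gpu, metrics)
```

The sketch keys on node and index rather than on index alone because of the DaemonSet layout. Every node has a `gpu="0"`, so grouping by index only would merge different physical GPUs. The `UUID` label would be another cluster-unique key.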


About

DCGM Exporter is a Prometheus-compatible metrics exporter for NVIDIA GPUs, built on top of NVIDIA's Data Center GPU Manager (DCGM). It exposes over 40 GPU telemetry metrics — utilization, memory, temperature, power draw, ECC errors, and more — at an HTTP endpoint that Prometheus…
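This Python sketch reads that HTTP endpoint from a script using only the standard library. It fetches the text output and keeps only the `DCGM_FI_*` sample lines. It assumes the exporter is reachable at `http://localhost:9400/metrics`. Port 9400 and the `/metrics` path are the commonly used defaults, but treat both as assumptions and check your deployment. The helper names `filter_dcgm_samples` and `scrape_dcgm` are hypothetical.

```python
import urllib.request

# Assumed default address; adjust to match your deployment.
DEFAULT_URL = "http://localhost:9400/metrics"


def filter_dcgm_samples(body: str) -> list[str]:
    """Keep only DCGM field sample lines, dropping # HELP / # TYPE comments and other metrics."""
    return [line for line in body.splitlines() if line.startswith("DCGM_FI_")]


def scrape_dcgm(url: str = DEFAULT_URL, timeout: float = 5.0) -> list[str]:
    """Fetch the exporter's Prometheus text output over HTTP and filter it."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    return filter_dcgm_samples(body)
```

In a cluster, Prometheus itself scrapes the endpoint on a schedule. A script like this is mainly useful for spot-checking a single node, for example after running `kubectl port-forward` to an exporter pod.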

Deployment Options

One deployment stack is available.
