Distributed AI Model Training
Run multi-node GPU training jobs across rented cloud instances, DGX workstations, and home servers — all from a single control plane
Training a serious model requires more GPU memory and compute than any single machine provides. The usual answer is to rent multiple GPU nodes and connect them for distributed training. The hard part isn't the training code — it's the infrastructure: provisioning nodes consistently, wiring them together, staging datasets, saving checkpoints to shared storage, and tearing everything down cleanly when the run finishes.
PodWarden turns a mixed fleet of local workstations, on-site GPU servers, and rented cloud nodes into a single managed training cluster — and its Hub catalog covers every supporting service the stack needs, so you can build the full thing without reaching outside the platform.
What You Need
A distributed training stack has six components. For each one: use what you already have, or deploy it from the PodWarden Hub catalog in minutes.
| Component | Bring your own | Or deploy from Hub |
|---|---|---|
| Object storage | AWS S3, GCS, Wasabi, any S3-compatible endpoint | MinIO or RustFS — deploy to any node in your cluster |
| Shared filesystem | Existing NFS server, SMB share | NFS-Ganesha — lightweight NFS server deployable as a workload |
| Secrets management | HashiCorp Vault, AWS Secrets Manager, Infisical | Vault — from the Hub catalog, stores tokens and credentials encrypted |
| Experiment tracking | Weights & Biases (cloud), existing MLflow | MLflow — self-hosted experiment tracking, runs on any cluster node |
| Container registry | Docker Hub, GHCR, GitLab registry | Harbor or Gitea (with built-in registry) |
| Monitoring | Existing Prometheus + Grafana | Prometheus + Grafana + DCGM Exporter — GPU metrics stack from the catalog |
None of these are mandatory to start. But a production training setup uses all of them.
Stack Architecture
Building the Foundation
Deploy supporting services before the training workloads. They're stacks like any other — create, assign to a cluster, deploy.
1. Storage
If you have an S3 endpoint or NFS share, register it as a storage connection in PodWarden (Settings → Storage). PodWarden validates connectivity from all cluster nodes before you deploy anything that depends on it.
If you don't:
- MinIO or RustFS — Import from Hub. Deploy to a node with fast local disk. Both expose a standard S3 API, so every other component treats them identically to AWS S3. Your datasets go in one bucket, checkpoints in another.
- NFS-Ganesha — For workloads that need a POSIX filesystem rather than S3. Import from Hub, deploy to your storage node, mount from any training job via NFS volume.
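Because MinIO and RustFS expose the standard S3 API, the only operational difference from AWS S3 is the endpoint the client points at. A minimal sketch of that idea, using boto3's endpoint_url argument — the minio.mesh hostname and the checkpoint key layout below are illustrative, not a PodWarden convention:

```python
# Sketch: AWS S3 vs. a self-hosted MinIO/RustFS bucket differs only in the
# endpoint the client targets. With boto3 that is the endpoint_url argument;
# buckets, keys, and get/put calls are otherwise identical.

def s3_client_kwargs(endpoint: str = "") -> dict:
    """Build boto3.client(**kwargs) arguments; empty endpoint means AWS S3."""
    kwargs = {"service_name": "s3"}
    if endpoint:  # self-hosted MinIO or RustFS
        kwargs["endpoint_url"] = endpoint
    return kwargs

def checkpoint_key(run: str, step: int) -> str:
    """One possible key scheme for checkpoints: <run>/step-<padded step>.pt."""
    return f"{run}/step-{step:08d}.pt"

# client = boto3.client(**s3_client_kwargs("http://minio.mesh:9000"))
# client.upload_file(local_path, "checkpoints", checkpoint_key("run-42", 1000))
print(checkpoint_key("run-42", 1000))
```

Keeping datasets and checkpoints in separate buckets, as described above, lets you grant training jobs read-only access to one and write access to the other.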
2. Secrets
Training jobs need secrets: HuggingFace tokens, W&B API keys, private registry credentials. Store them once, inject anywhere.
If you have Vault or a cloud secrets manager, PodWarden can reference those via environment variables.
If you don't: import Vault from Hub, deploy to your control plane node, and store everything there. PodWarden's secret_refs field injects secrets from Vault into containers at deploy time — they never appear in template configuration or deployment logs.
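Inside the container, secrets injected via secret_refs appear as ordinary environment variables. A small fail-fast sketch of how a training script might read them — the helper is illustrative, but HF_TOKEN matches the variable name used in this article's template:

```python
import os

# Sketch: read an injected secret from the environment and fail early if the
# injection is missing, so the error surfaces before training starts rather
# than mid-run at the first authenticated download.

def require_secret(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"secret {name} was not injected; check secret_refs")
    return value

# Stand-in for the value the platform would inject at deploy time.
os.environ.setdefault("HF_TOKEN", "hf_example")
token = require_secret("HF_TOKEN")
```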
3. Monitoring
Import Prometheus, Grafana, and DCGM Exporter from the Hub catalog as three separate stacks.
- DCGM Exporter — Deploy as a DaemonSet. It runs on every GPU node automatically, exposing per-GPU metrics: utilization, memory usage, temperature, NVLink bandwidth.
- Prometheus — Scrapes DCGM Exporter on all nodes and stores the time series.
- Grafana — Pre-configured dashboards for GPU utilization, training throughput, and cluster health.
If you already have Prometheus and Grafana, just add DCGM Exporter as a DaemonSet to your cluster and point your existing Prometheus at it.
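DCGM Exporter serves its metrics in the Prometheus text format, which is why an existing Prometheus can scrape it without adapters. A minimal stdlib parser for per-GPU utilization lines — the metric name DCGM_FI_DEV_GPU_UTIL and the gpu="N" label follow DCGM Exporter's usual output, reproduced here from memory:

```python
import re

# Sketch: extract per-GPU utilization from a Prometheus-format scrape body
# as produced by DCGM Exporter. Returns {gpu_index: utilization_percent}.

LINE = re.compile(r'^DCGM_FI_DEV_GPU_UTIL\{([^}]*)\}\s+([0-9.]+)$')
GPU = re.compile(r'gpu="(\d+)"')

def gpu_utilization(scrape: str) -> dict:
    util = {}
    for line in scrape.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue
        g = GPU.search(m.group(1))  # labels, e.g. gpu="0",UUID="GPU-abc"
        if g:
            util[int(g.group(1))] = float(m.group(2))
    return util

sample = ('DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-abc"} 97\n'
          'DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-def"} 88')
print(gpu_utilization(sample))  # {0: 97.0, 1: 88.0}
```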
4. Experiment Tracking (optional)
Import MLflow from Hub. Deploy to any node with access to your object storage. Your training scripts call mlflow.log_metric() as usual — MLflow stores runs, parameters, and artifacts to the same S3/MinIO bucket your checkpoints use.
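One way to keep a training script portable between runs with and without a reachable tracker is to route logging through a thin wrapper. A sketch under that assumption — mlflow.log_metric's (key, value, step=...) signature is MLflow's standard API, while the fallback history dict is this sketch's own addition:

```python
import os

# Sketch: forward metrics to MLflow when a tracking server is configured
# (via MLFLOW_TRACKING_URI, as in the template), and always keep a local
# in-process copy so the script also runs standalone.

try:
    import mlflow
except ImportError:
    mlflow = None

def log_metric(history: dict, key: str, value: float, step: int) -> None:
    if mlflow is not None and os.environ.get("MLFLOW_TRACKING_URI"):
        mlflow.log_metric(key, value, step=step)
    history.setdefault(key, []).append((step, value))

history = {}
log_metric(history, "loss", 2.31, step=100)
log_metric(history, "loss", 1.97, step=200)
print(history["loss"])
```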
Adding Your Nodes
Every node in your training fleet starts as a host in PodWarden. Add by IP address or hostname. If your provider supports Tailscale — or you install it yourself — PodWarden discovers them automatically and detects hardware specs: CPU cores, RAM, GPU model, VRAM.
On-site and home lab nodes
If you developed your model on a DGX Spark, a desktop workstation, or any machine on your home network, that hardware can join the same cluster as your rented cloud nodes. Being behind NAT is not a problem. With a Tailscale mesh, the node is reachable regardless of whether it has a public IP. Your local DGX Spark and a rented A100 node on the other side of the world are both just cluster members.
This lets you start training experiments on hardware you already own, then scale out to rented nodes for the full run — without rebuilding your setup or switching tools. The DGX Spark stays in the cluster as extra capacity or as the control plane node, and the rented nodes join and leave as needed.
Rented nodes
Nodes from Lambda Cloud, RunPod, CoreWeave, Vast.ai, or any bare-metal provider are added the same way. Provision them into the cluster over SSH — PodWarden installs K3s and the GPU runtime. A cluster of eight A100 nodes is ready in minutes.
Training Templates
With the foundation deployed, define your training job as a stack.
Example: PyTorch DDP Training Job
Kind: Job
Image: nvcr.io/nvidia/pytorch:24.04-py3
GPU count: 8
VRAM: 80Gi
CPU: 64
Memory: 512Gi

| Variable | Example | Description |
|---|---|---|
| MASTER_ADDR | (injected via mesh) | Rank-0 node address for NCCL |
| NNODES | 4 | Total node count |
| NPROC_PER_NODE | 8 | GPUs per node |
| MODEL_NAME | llama-3-70b | Model identifier |
| DATASET_PATH | /data/pile-v2 | Mounted dataset path |
| OUTPUT_DIR | /checkpoints/run-42 | Checkpoint output |
| BATCH_SIZE | 128 | Global batch size |
| MLFLOW_TRACKING_URI | http://mlflow.mesh:5000 | Experiment tracker |
| HF_TOKEN | (from secret ref) | HuggingFace API token |
| WANDB_API_KEY | (from secret ref) | W&B API key (if using cloud tracking) |
Volume mounts attach your MinIO/S3 dataset bucket and checkpoint bucket. Sensitive values (HF_TOKEN, WANDB_API_KEY) come from Vault via secret_refs — not hardcoded in the template.
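The table's variables translate directly into launch arithmetic. A sketch of the usual DDP convention — splitting the global batch evenly across all ranks — using the example values above; the arithmetic is standard distributed-training practice, not PodWarden-specific behavior:

```python
# Sketch: derive world size and per-device batch from the template's
# environment variables, as a torchrun-style launcher would.

env = {"NNODES": "4", "NPROC_PER_NODE": "8", "BATCH_SIZE": "128"}

nnodes = int(env["NNODES"])
nproc = int(env["NPROC_PER_NODE"])
world_size = nnodes * nproc                       # 4 nodes x 8 GPUs = 32 ranks
per_device_batch = int(env["BATCH_SIZE"]) // world_size  # 128 / 32 = 4

print(world_size, per_device_batch)  # 32 4
```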
Training Data Flow
Job Kind
Training jobs run once and stop. Use kind: Job. PodWarden tracks completion status, logs duration and exit code, and records the run in your deployment history. When the job finishes, the cluster is idle — rented nodes stop consuming billable GPU hours.
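The run-once semantics are what make automated teardown safe to reason about. A sketch of that decision — the completion-record fields here are illustrative, not PodWarden's actual schema:

```python
# Sketch: rented capacity can be released once a job has finished, whether it
# succeeded or failed, because checkpoints live on shared storage rather than
# on the rented nodes themselves.

def can_release_nodes(job: dict) -> bool:
    return job.get("status") in {"succeeded", "failed"}

run = {"status": "succeeded", "exit_code": 0, "duration_s": 86400}
print(can_release_nodes(run))                    # True
print(can_release_nodes({"status": "running"}))  # False
```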
Multi-Node Communication
PyTorch DDP, FSDP, and DeepSpeed use NCCL for collective operations. NCCL needs nodes to reach each other over a fast path.
- Tailscale mesh — Nodes connected to the same mesh are tagged mesh in PodWarden. Set required_network_types: ["mesh"] on the training template; PodWarden warns before deploying to a cluster where nodes can't communicate.
- Provider LAN — Most large GPU cloud providers offer private networking between nodes in the same order. Tag those nodes lan.
- Public IP — Slower and generally not recommended for NCCL traffic, but works for experimentation.
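NCCL-backed launchers rendezvous through MASTER_ADDR, with every process deriving a unique global rank from its node index and local GPU index. A sketch of that convention, mirroring torchrun's rank layout — the helper itself is illustrative:

```python
# Sketch: each training process computes its global rank from its node's
# index and its local GPU index; rank 0 on node 0 hosts the rendezvous
# that MASTER_ADDR points every other rank at.

def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    return node_rank * nproc_per_node + local_rank

# Node 2 of 4, GPU 5 of 8 -> rank 21 in a 32-process world.
print(global_rank(node_rank=2, local_rank=5, nproc_per_node=8))  # 21
```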
Scale Up and Down
Your MinIO, Vault, MLflow, and monitoring stack keep running on your permanent nodes. Rented nodes are stateless compute — nothing important lives on them.
Hub Templates
PodWarden Hub catalog for this stack:
| Template | Purpose |
|---|---|
| PyTorch DDP | Multi-node training with NCCL |
| DeepSpeed ZeRO-3 | Large model training |
| HuggingFace Accelerate | Fine-tuning foundation models |
| MLflow | Experiment tracking and model registry |
| MinIO | S3-compatible object storage |
| RustFS | High-performance S3-compatible object storage |
| Vault | Secrets management |
| Prometheus | Metrics collection |
| Grafana | Dashboards |
| DCGM Exporter | Per-GPU metrics (DaemonSet) |
| Harbor | Private container registry |
Import any template, customize GPU count and environment variables for your run, deploy. For proprietary training stacks or licensed model weights, publish private templates to your organization's Hub catalog.
What This Looks Like in Practice
A full stack from scratch on a mixed cluster:
- Add permanent nodes — DGX Spark or home workstation behind NAT joins via Tailscale. Provision as control plane.
- Deploy supporting services — Import MinIO, Vault, MLflow, Prometheus, Grafana, DCGM Exporter from Hub. Deploy to the permanent node. Takes ~15 minutes including configuration.
- Rent burst nodes — Add 4× 8×A100 cloud nodes to the cluster over Tailscale mesh.
- Configure training job — Import PyTorch DDP template. Set NNODES=5, mount dataset and checkpoint buckets, reference HF_TOKEN from Vault.
- Deploy — Distributed job runs across all GPUs. Grafana shows per-node GPU utilization in real time. MLflow tracks the run.
- Teardown — Remove rented nodes. MinIO, Vault, MLflow, and Grafana keep running on the permanent node for the next experiment.