Distributed AI Model Training
Run multi-node GPU training jobs across rented cloud instances, DGX workstations, and home servers — all from a single control plane
Training a serious model requires more GPU memory and compute than any single machine provides. The usual answer is to rent multiple GPU nodes and connect them for distributed training. The hard part isn't the training code — it's the infrastructure: provisioning nodes consistently, wiring them together, staging datasets, saving checkpoints to shared storage, and tearing everything down cleanly when the run finishes.
PodWarden turns a mixed fleet of local workstations, on-site GPU servers, and rented cloud nodes into a single managed training cluster — and its Hub catalog covers every supporting service the stack needs, so you can build the full thing without reaching outside the platform.
What You Need
A distributed training stack has six components. For each one: use what you already have, or deploy it from the PodWarden Hub catalog in minutes.
| Component | Bring your own | Or deploy from Hub |
|---|---|---|
| Object storage | AWS S3, GCS, Wasabi, any S3-compatible endpoint | MinIO or RustFS — deploy to any node in your cluster |
| Shared filesystem | Existing NFS server, SMB share | NFS-Ganesha — lightweight NFS server deployable as a workload |
| Secrets management | HashiCorp Vault, AWS Secrets Manager, Infisical | Vault — from the Hub catalog, stores tokens and credentials encrypted |
| Experiment tracking | Weights & Biases (cloud), existing MLflow | MLflow — self-hosted experiment tracking, runs on any cluster node |
| Container registry | Docker Hub, GHCR, GitLab registry | Harbor or Gitea (with built-in registry) |
| Monitoring | Existing Prometheus + Grafana | Prometheus + Grafana + DCGM Exporter — GPU metrics stack from the catalog |
None of these are mandatory to start. But a production training setup uses all of them.
Stack Architecture
Building the Foundation
Deploy supporting services before the training workloads. They're stacks like any other — create, assign to a cluster, deploy.
1. Storage
If you have an S3 endpoint or NFS share, register it as a storage connection in PodWarden (Settings → Storage). PodWarden validates connectivity from all cluster nodes before you deploy anything that depends on it.
If you don't:
- MinIO or RustFS — Import from Hub. Deploy to a node with fast local disk. Both expose a standard S3 API, so every other component treats them identically to AWS S3. Your datasets go in one bucket, checkpoints in another.
- NFS-Ganesha — For workloads that need a POSIX filesystem rather than S3. Import from Hub, deploy to your storage node, mount from any training job via NFS volume.
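Because MinIO and RustFS expose the standard S3 API, the only operational difference from AWS S3 is the endpoint the client points at. A minimal sketch of that idea, using boto3's endpoint_url argument — the minio.mesh hostname and the checkpoint key layout below are illustrative, not a PodWarden convention:

```python
# Sketch: AWS S3 vs. a self-hosted MinIO/RustFS bucket differs only in the
# endpoint the client targets. With boto3 that is the endpoint_url argument;
# buckets, keys, and get/put calls are otherwise identical.

def s3_client_kwargs(endpoint: str = "") -> dict:
    """Build boto3.client(**kwargs) arguments; empty endpoint means AWS S3."""
    kwargs = {"service_name": "s3"}
    if endpoint:  # self-hosted MinIO or RustFS
        kwargs["endpoint_url"] = endpoint
    return kwargs

def checkpoint_key(run: str, step: int) -> str:
    """One possible key scheme for checkpoints: <run>/step-<padded step>.pt."""
    return f"{run}/step-{step:08d}.pt"

# client = boto3.client(**s3_client_kwargs("http://minio.mesh:9000"))
# client.upload_file(local_path, "checkpoints", checkpoint_key("run-42", 1000))
print(checkpoint_key("run-42", 1000))
```

Keeping datasets and checkpoints in separate buckets, as described above, lets you grant training jobs read-only access to one and write access to the other.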
2. Secrets
Training jobs need secrets: HuggingFace tokens, W&B API keys, private registry credentials. Store them once, inject anywhere.
If you have Vault or a cloud secrets manager, PodWarden can reference those via environment variables.
If you don't: import Vault from Hub, deploy to your control plane node, and store everything there. PodWarden's secret_refs field injects secrets from Vault into containers at deploy time — they never appear in template configuration or deployment logs.
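Inside the container, secrets injected via secret_refs appear as ordinary environment variables. A small fail-fast sketch of how a training script might read them — the helper is illustrative, but HF_TOKEN matches the variable name used in this article's template:

```python
import os

# Sketch: read an injected secret from the environment and fail early if the
# injection is missing, so the error surfaces before training starts rather
# than mid-run at the first authenticated download.

def require_secret(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"secret {name} was not injected; check secret_refs")
    return value

# Stand-in for the value the platform would inject at deploy time.
os.environ.setdefault("HF_TOKEN", "hf_example")
token = require_secret("HF_TOKEN")
```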
3. Monitoring
Import Prometheus, Grafana, and DCGM Exporter from the Hub catalog as three separate stacks.
- DCGM Exporter — Deploy as a DaemonSet. It runs on every GPU node automatically, exposing per-GPU metrics: utilization, memory usage, temperature, NVLink bandwidth.
- Prometheus — Scrapes DCGM Exporter on all nodes and stores the time series.
- Grafana — Pre-configured dashboards for GPU utilization, training throughput, and cluster health.
If you already have Prometheus and Grafana, just add DCGM Exporter as a DaemonSet to your cluster and point your existing Prometheus at it.
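DCGM Exporter serves its metrics in the Prometheus text format, which is why an existing Prometheus can scrape it without adapters. A minimal stdlib parser for per-GPU utilization lines — the metric name DCGM_FI_DEV_GPU_UTIL and the gpu="N" label follow DCGM Exporter's usual output, reproduced here from memory:

```python
import re

# Sketch: extract per-GPU utilization from a Prometheus-format scrape body
# as produced by DCGM Exporter. Returns {gpu_index: utilization_percent}.

LINE = re.compile(r'^DCGM_FI_DEV_GPU_UTIL\{([^}]*)\}\s+([0-9.]+)$')
GPU = re.compile(r'gpu="(\d+)"')

def gpu_utilization(scrape: str) -> dict:
    util = {}
    for line in scrape.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue
        g = GPU.search(m.group(1))  # labels, e.g. gpu="0",UUID="GPU-abc"
        if g:
            util[int(g.group(1))] = float(m.group(2))
    return util

sample = ('DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-abc"} 97\n'
          'DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-def"} 88')
print(gpu_utilization(sample))  # {0: 97.0, 1: 88.0}
```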
4. Experiment Tracking (optional)
Import MLflow from Hub. Deploy to any node with access to your object storage. Your training scripts call mlflow.log_metric() as usual — MLflow stores runs, parameters, and artifacts to the same S3/MinIO bucket your checkpoints use.
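One way to keep a training script portable between runs with and without a reachable tracker is to route logging through a thin wrapper. A sketch under that assumption — mlflow.log_metric's (key, value, step=...) signature is MLflow's standard API, while the fallback history dict is this sketch's own addition:

```python
import os

# Sketch: forward metrics to MLflow when a tracking server is configured
# (via MLFLOW_TRACKING_URI, as in the template), and always keep a local
# in-process copy so the script also runs standalone.

try:
    import mlflow
except ImportError:
    mlflow = None

def log_metric(history: dict, key: str, value: float, step: int) -> None:
    if mlflow is not None and os.environ.get("MLFLOW_TRACKING_URI"):
        mlflow.log_metric(key, value, step=step)
    history.setdefault(key, []).append((step, value))

history = {}
log_metric(history, "loss", 2.31, step=100)
log_metric(history, "loss", 1.97, step=200)
print(history["loss"])
```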
Adding Your Nodes
Every node in your training fleet starts as a host in PodWarden. Add by IP address or hostname. If your provider supports Tailscale — or you install it yourself — PodWarden discovers them automatically and detects hardware specs: CPU cores, RAM, GPU model, VRAM.
On-site and home lab nodes
If you developed your model on a DGX Spark, a desktop workstation, or any machine on your home network, that hardware can join the same cluster as your rented cloud nodes. Being behind NAT is not a problem. With a Tailscale mesh, the node is reachable regardless of whether it has a public IP. Your local DGX Spark and a rented A100 node on the other side of the world are both just cluster members.
This lets you start training experiments on hardware you already own, then scale out to rented nodes for the full run — without rebuilding your setup or switching tools. The DGX Spark stays in the cluster as extra capacity or as the control plane node, and the rented nodes join and leave as needed.
Rented nodes
Nodes from Lambda Cloud, RunPod, CoreWeave, Vast.ai, or any bare-metal provider are added the same way. Provision them into the cluster over SSH — PodWarden installs K3s and the GPU runtime. A cluster of eight A100 nodes is ready in minutes.
Training Templates
With the foundation deployed, define your training job as a stack.
Example: PyTorch DDP Training Job
Kind: Job
Image: nvcr.io/nvidia/pytorch:24.04-py3
GPU count: 8
VRAM: 80Gi
CPU: 64
Memory: 512Gi

| Variable | Example | Description |
|---|---|---|
| MASTER_ADDR | (injected via mesh) | Rank-0 node address for NCCL |
| NNODES | 4 | Total node count |
| NPROC_PER_NODE | 8 | GPUs per node |
| MODEL_NAME | llama-3-70b | Model identifier |
| DATASET_PATH | /data/pile-v2 | Mounted dataset path |
| OUTPUT_DIR | /checkpoints/run-42 | Checkpoint output |
| BATCH_SIZE | 128 | Global batch size |
| MLFLOW_TRACKING_URI | http://mlflow.mesh:5000 | Experiment tracker |
| HF_TOKEN | (from secret ref) | HuggingFace API token |
| WANDB_API_KEY | (from secret ref) | W&B API key (if using cloud tracking) |
Volume mounts attach your MinIO/S3 dataset bucket and checkpoint bucket. Sensitive values (HF_TOKEN, WANDB_API_KEY) come from Vault via secret_refs — not hardcoded in the template.
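The table's variables translate directly into launch arithmetic. A sketch of the usual DDP convention — splitting the global batch evenly across all ranks — using the example values above; the arithmetic is standard distributed-training practice, not PodWarden-specific behavior:

```python
# Sketch: derive world size and per-device batch from the template's
# environment variables, as a torchrun-style launcher would.

env = {"NNODES": "4", "NPROC_PER_NODE": "8", "BATCH_SIZE": "128"}

nnodes = int(env["NNODES"])
nproc = int(env["NPROC_PER_NODE"])
world_size = nnodes * nproc                       # 4 nodes x 8 GPUs = 32 ranks
per_device_batch = int(env["BATCH_SIZE"]) // world_size  # 128 / 32 = 4

print(world_size, per_device_batch)  # 32 4
```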
Training Data Flow
Job Kind
Training jobs run once and stop. Use kind: Job. PodWarden tracks completion status, logs duration and exit code, and records the run in your deployment history. When the job finishes, the cluster is idle — rented nodes stop consuming billable GPU hours.
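The run-once semantics are what make automated teardown safe to reason about. A sketch of that decision — the completion-record fields here are illustrative, not PodWarden's actual schema:

```python
# Sketch: rented capacity can be released once a job has finished, whether it
# succeeded or failed, because checkpoints live on shared storage rather than
# on the rented nodes themselves.

def can_release_nodes(job: dict) -> bool:
    return job.get("status") in {"succeeded", "failed"}

run = {"status": "succeeded", "exit_code": 0, "duration_s": 86400}
print(can_release_nodes(run))                    # True
print(can_release_nodes({"status": "running"}))  # False
```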
Multi-Node Communication
PyTorch DDP, FSDP, and DeepSpeed use NCCL for collective operations. NCCL needs nodes to reach each other over a fast path.
- Tailscale mesh — Nodes connected to the same mesh are tagged mesh in PodWarden. Set required_network_types: ["mesh"] on the training template; PodWarden warns before deploying to a cluster where nodes can't communicate.
- Provider LAN — Most large GPU cloud providers offer private networking between nodes in the same order. Tag those nodes lan.
- Public IP — Slower and generally not recommended for NCCL traffic, but works for experimentation.
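NCCL-backed launchers rendezvous through MASTER_ADDR, with every process deriving a unique global rank from its node index and local GPU index. A sketch of that convention, mirroring torchrun's rank layout — the helper itself is illustrative:

```python
# Sketch: each training process computes its global rank from its node's
# index and its local GPU index; rank 0 on node 0 hosts the rendezvous
# that MASTER_ADDR points every other rank at.

def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    return node_rank * nproc_per_node + local_rank

# Node 2 of 4, GPU 5 of 8 -> rank 21 in a 32-process world.
print(global_rank(node_rank=2, local_rank=5, nproc_per_node=8))  # 21
```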
Scale Up and Down
Your MinIO, Vault, MLflow, and monitoring stack keep running on your permanent nodes. Rented nodes are stateless compute — nothing important lives on them.
Hub Templates
PodWarden Hub catalog for this stack:
| Template | Purpose |
|---|---|
| PyTorch DDP | Multi-node training with NCCL |
| DeepSpeed ZeRO-3 | Large model training |
| HuggingFace Accelerate | Fine-tuning foundation models |
| MLflow | Experiment tracking and model registry |
| MinIO | S3-compatible object storage |
| RustFS | High-performance S3-compatible object storage |
| Vault | Secrets management |
| Prometheus | Metrics collection |
| Grafana | Dashboards |
| DCGM Exporter | Per-GPU metrics (DaemonSet) |
| Harbor | Private container registry |
Import any template, customize GPU count and environment variables for your run, deploy. For proprietary training stacks or licensed model weights, publish private templates to your organization's Hub catalog.
What This Looks Like in Practice
A full stack from scratch on a mixed cluster:
- Add permanent nodes — DGX Spark or home workstation behind NAT joins via Tailscale. Provision as control plane.
- Deploy supporting services — Import MinIO, Vault, MLflow, Prometheus, Grafana, DCGM Exporter from Hub. Deploy to the permanent node. Takes ~15 minutes including configuration.
- Rent burst nodes — Add 4× 8×A100 cloud nodes to the cluster over Tailscale mesh.
- Configure training job — Import PyTorch DDP template. Set NNODES=5, mount dataset and checkpoint buckets, reference HF_TOKEN from Vault.
- Deploy — Distributed job runs across all GPUs. Grafana shows per-node GPU utilization in real time. MLflow tracks the run.
- Teardown — Remove rented nodes. MinIO, Vault, MLflow, and Grafana keep running on the permanent node for the next experiment.