Decoupled DiLoCo: Distributed LLM Training Over 2-5 Gbps WAN

Google DeepMind published a paper on Decoupled DiLoCo (Distributed Low-Communication), a distributed training architecture that splits compute into separate "learner units" that communicate asynchronously. This allows large models to be trained across geographically distributed data centers with far less bandwidth than traditional synchronized approaches require.

Key Details

Decoupled DiLoCo builds on two prior advances: Pathways, an asynchronous dataflow system, and DiLoCo, which reduced the bandwidth needed between data centers.
Training is split across decoupled learner units, which act as independent compute islands. A chip failure in one unit doesn't interrupt the others. The system is self-healing: after losing an entire learner unit to hardware failure, training continues and the unit is seamlessly reintegrated once it recovers.
The approach was validated with chaos engineering: artificial hardware failures were injected during training runs. Decoupled DiLoCo maintained high "goodput" (useful training time) while conventional methods nosedived under failure.
The team trained a 12-billion-parameter model across four separate U.S. regions using 2-5 Gbps wide-area networking, a level achievable with existing internet connectivity between data centers.
The result matched the benchmarked ML performance of conventional training approaches (tested with Gemma 4 models).
The system is reported to be more than 20× faster than conventional synchronization methods because communication is overlapped with computation, avoiding blocking bottlenecks.
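
To put the 2-5 Gbps figure in perspective, here is a rough back-of-envelope calculation of how long it would take to ship one full copy of a 12-billion-parameter model over such a link. The 2 bytes per parameter (bf16), the absence of compression, and the link running at its full nominal rate are assumptions for illustration only; the source does not say what is actually exchanged or in what format.

```python
# Back-of-envelope: time to send one full copy of the parameters over the WAN.
# Assumptions (not from the source): 2 bytes per parameter (bf16), no compression,
# and the link sustaining its full nominal rate.

def transfer_seconds(num_params: float, bytes_per_param: float, link_gbps: float) -> float:
    """Seconds to send num_params values over a link of link_gbps gigabits per second."""
    bits = num_params * bytes_per_param * 8
    return bits / (link_gbps * 1e9)

PARAMS = 12e9  # 12 billion parameters, as in the reported run

for gbps in (2, 5):
    secs = transfer_seconds(PARAMS, 2, gbps)
    print(f"{gbps} Gbps: {secs:.1f} s per full parameter exchange")
```

Under these assumptions a single exchange takes roughly 38 to 96 seconds, which is only affordable if exchanges are infrequent and overlap with computation rather than blocking it.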

Architecture Overview

The system incorporates communication into longer computation periods instead of requiring synchronous all-reduce across all chips. This avoids "blocking" where one part of the system must wait for another. The result is resilient training that can tap unused compute anywhere, turning stranded resources into useful capacity.
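
The general pattern of long local computation with infrequent synchronization can be sketched with a toy example. This is a minimal illustration under stated assumptions, not the paper's implementation: the scalar quadratic loss, the `train` and `local_steps` helpers, the plain averaging outer step, and the `down` failure map are all hypothetical. The loop also synchronizes in lockstep once per round, so the asynchrony and the overlap of communication with computation described above are not modeled.

```python
def local_steps(theta, target, steps, lr):
    """Run `steps` SGD steps on the toy loss 0.5 * (theta - target)**2."""
    for _ in range(steps):
        grad = theta - target
        theta -= lr * grad
    return theta

def train(targets, rounds=20, inner_steps=50, inner_lr=0.1, outer_lr=1.0, down=None):
    """Toy low-communication training loop.

    targets: one toy "data shard" (a scalar optimum) per learner unit.
    down:    hypothetical map {round_index: set of learner indices that are offline}.
    """
    down = down or {}
    theta = 0.0  # shared (global) parameter
    for r in range(rounds):
        offline = down.get(r, set())
        deltas = []
        for i, target in enumerate(targets):
            if i in offline:
                continue  # a failed unit simply contributes nothing this round
            local = local_steps(theta, target, inner_steps, inner_lr)
            deltas.append(theta - local)  # this learner's accumulated update
        if deltas:
            # The only cross-learner communication: one averaged delta per round.
            avg = sum(deltas) / len(deltas)
            theta -= outer_lr * avg
    return theta

targets = [1.0, 2.0, 3.0, 4.0]  # four learner units, e.g. four regions
healthy = train(targets)
with_failure = train(targets, down={r: {3} for r in range(5, 10)})
print(f"no failures: {healthy:.3f}")
print(f"unit 3 offline for rounds 5-9: {with_failure:.3f}")
```

In this toy, each learner communicates once per 50 local steps instead of once per step, and the run in which one unit drops out for five rounds still ends at the same value (the mean of the targets, 2.5) once that unit rejoins.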

Who It's For

Teams training large language models or other frontier models across multiple data centers who need fault tolerance without sacrificing performance or requiring custom network infrastructure.

📖 Read the full source: HN AI Agents
