Decoupled DiLoCo: Resilient Distributed Training Across Data Centers with Low Bandwidth

Google DeepMind published a paper on Decoupled DiLoCo (Distributed Low-Communication), a distributed training architecture that decouples compute into separate "learner units" that communicate asynchronously. This allows training large models across geographically distributed data centers with much lower bandwidth requirements than traditional synchronized approaches.
Key Details
- Builds on two prior advances: Pathways (asynchronous data flow system) and DiLoCo (reduced bandwidth between data centers).
- Training is split across decoupled learner units — independent compute islands. A chip failure in one unit doesn't interrupt the others. The system is self-healing: after losing an entire learner unit to hardware failure, training continues and the unit is seamlessly reintegrated once it recovers.
- Validated with chaos engineering: artificial hardware failures were injected during training runs. Decoupled DiLoCo maintained high "goodput" (useful training time), while goodput for conventional methods dropped sharply under failure.
- Trained a 12-billion-parameter model across four separate U.S. regions using 2-5 Gbps wide-area networking, a level achievable with existing internet connectivity between data centers.
- Achieved the same benchmarked ML performance (tested with Gemma 4 models) as conventional training approaches.
- Reported to be more than 20× faster than conventional synchronization methods, because communication overlaps with computation and avoids blocking bottlenecks.
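To give a feel for the bandwidth figures above, the following back-of-the-envelope calculation estimates how long one full exchange of a 12-billion-parameter model would take at 2 and 5 Gbps. The 16-bit parameter width and the idea that a full parameter copy is exchanged are assumptions made for illustration; the summary does not say what is actually sent or at what precision. `transfer_seconds` is a hypothetical helper, not something from the paper.

```python
def transfer_seconds(num_params: float, bytes_per_param: int, link_gbps: float) -> float:
    """Seconds needed to send one full copy of the parameters over a link."""
    bits = num_params * bytes_per_param * 8
    return bits / (link_gbps * 1e9)


NUM_PARAMS = 12e9      # from the article: 12 billion parameters
BYTES_PER_PARAM = 2    # assumption: 16-bit values (for example bfloat16)

for gbps in (2.0, 5.0):
    secs = transfer_seconds(NUM_PARAMS, BYTES_PER_PARAM, gbps)
    print(f"{gbps:.0f} Gbps: {secs:.1f} s per full parameter exchange")
```

Under these assumptions a single exchange takes roughly 38 to 96 seconds, which suggests why synchronizing on every training step over such links would be impractical and why exchanging rarely, and overlapping the exchange with computation, matters.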
Architecture Overview
Instead of requiring a synchronous all-reduce across all chips, the system folds communication into longer computation periods. This avoids "blocking", where one part of the system must wait for another. The result is resilient training that can tap unused compute anywhere, turning stranded resources into useful capacity.
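The general pattern described here can be sketched with a toy simulation: each learner unit runs many local steps on its own data, and only the resulting parameter change is combined across units once per round. This is a simplified illustration, not the paper's algorithm. Plain gradient steps and simple averaging are stand-ins, the loop runs sequentially rather than overlapping communication with computation, and all names and parameters (`train`, `local_steps`, `fail_prob`) are hypothetical.

```python
import random


def local_steps(theta: float, target: float, steps: int, lr: float) -> float:
    """Run `steps` gradient steps on the toy loss 0.5 * (theta - target) ** 2."""
    for _ in range(steps):
        theta -= lr * (theta - target)
    return theta


def train(rounds: int = 20, inner_steps: int = 50, inner_lr: float = 0.05,
          outer_lr: float = 1.0, fail_prob: float = 0.2, seed: int = 0) -> float:
    rng = random.Random(seed)
    targets = [1.0, 2.0, 3.0, 4.0]  # assumption: four units, each with different data
    theta_global = 0.0
    for _ in range(rounds):
        deltas = []
        for target in targets:
            if rng.random() < fail_prob:
                continue  # this unit is down for the round; the others carry on
            # A unit (including one that just recovered) starts from the global state.
            theta_local = local_steps(theta_global, target, inner_steps, inner_lr)
            deltas.append(theta_local - theta_global)
        if deltas:
            # One small exchange per round instead of one per training step.
            theta_global += outer_lr * sum(deltas) / len(deltas)
    return theta_global


print(train(fail_prob=0.0))  # no failures
print(train(fail_prob=0.3))  # units randomly drop out of some rounds
```

In this toy setup a failed unit is simply skipped when the round's updates are combined, and it rejoins by starting its next round from the shared parameters. That mirrors the self-healing behavior described above only in spirit; the real system's recovery mechanism is not detailed in this summary.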
Who It's For
Teams training large language models or other frontier models across multiple data centers who need fault tolerance without sacrificing performance or requiring custom network infrastructure.
📖 Read the full source: HN AI Agents