Distributed Training under Packet Loss
Erez Weintraub, Ron Banner, Ariel Orda

TL;DR
This paper introduces a distributed training framework that maintains model accuracy and convergence guarantees over unreliable network connections with packet loss, enabling scalable training across commodity networks.
Contribution
A novel end-to-end distributed training method that ensures unbiased gradient aggregation and bounded parameter drift under packet loss without changing model code or optimizers.
Findings
Tolerates 10% packet loss with less than 1% perplexity increase on LLAMA2 7B.
Provides theoretical guarantees for unbiased gradients and bounded parameter divergence.
Demonstrates robustness and scalability on large models and multi-GPU setups.
Abstract
State-of-the-art language and vision models are routinely trained across thousands of GPUs, often spanning multiple data-centers, yet today's distributed frameworks still assume reliable connections (e.g., InfiniBand or RoCE). The resulting acknowledgment traffic and retransmissions inflate tail latencies and limit scalability. Leveraging unreliable connections will reduce latency but may sacrifice model accuracy and convergence once packets are dropped. A principled, end-to-end solution that preserves accuracy and convergence guarantees under genuine packet loss has previously been missing. We address this critical gap by introducing a novel distributed training framework capable of operating over unreliable connections, offering unbiased gradient aggregation and bounded parameter drift without modifying model code or optimizers. The key insight is a two-stage defense against missing…
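To make "unbiased gradient aggregation" under packet loss concrete, here is a minimal sketch. It is not the paper's method: the abstract is truncated before it describes the mechanism, so this is only a generic illustration. It assumes each (worker, coordinate) entry is dropped independently with a fixed probability, independent of the gradient values. The names `aggregate` and `mean_estimate` are hypothetical. Dividing the received sum by the number of entries that actually arrived keeps the estimate centered on the true mean (conditional on at least one entry arriving), while dividing by the full worker count shrinks it by roughly the delivery rate.

```python
import random

def aggregate(worker_grads, arrived, renormalize):
    """Combine per-worker gradients when some entries were lost.

    arrived[w][j] is True if worker w's coordinate j was delivered.
    renormalize=True divides by the number of delivered entries;
    renormalize=False divides by the total number of workers.
    """
    n_workers, dim = len(worker_grads), len(worker_grads[0])
    out = []
    for j in range(dim):
        received = [worker_grads[w][j] for w in range(n_workers) if arrived[w][j]]
        if not received:
            # Nothing arrived for this coordinate: skip the update (assumption).
            out.append(0.0)
            continue
        denom = len(received) if renormalize else n_workers
        out.append(sum(received) / denom)
    return out

def mean_estimate(worker_grads, drop_prob, trials, renormalize, seed=0):
    """Average the aggregated gradient over many simulated loss patterns."""
    rng = random.Random(seed)
    n_workers, dim = len(worker_grads), len(worker_grads[0])
    totals = [0.0] * dim
    for _ in range(trials):
        # Draw an independent delivery mask for every (worker, coordinate).
        arrived = [[rng.random() >= drop_prob for _ in range(dim)]
                   for _ in range(n_workers)]
        est = aggregate(worker_grads, arrived, renormalize)
        totals = [t + e for t, e in zip(totals, est)]
    return [t / trials for t in totals]

# Toy setup (assumed values): 8 workers, 4 coordinates, 10% loss.
worker_grads = [[float(w + j) for j in range(4)] for w in range(8)]
true_mean = [sum(g[j] for g in worker_grads) / 8 for j in range(4)]
renormalized = mean_estimate(worker_grads, 0.1, 20000, renormalize=True)
naive = mean_estimate(worker_grads, 0.1, 20000, renormalize=False)
```

Comparing `renormalized` and `naive` against `true_mean` shows the difference: the renormalized average stays close to the true mean, while the naive one lands near 0.9 times it. Real packet loss is often bursty or correlated, which this toy model does not capture.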
Taxonomy
Topics: Advanced Neural Network Applications · IoT and Edge/Fog Computing · Cloud Computing and Resource Management
