Taming Latency and Bandwidth: A Theoretical Framework and Adaptive Algorithm for Communication-Constrained Training
Rongwei Lu, Jingyan Jiang, Chunyang Li, Xingguang Wei, Zhi Wang

TL;DR
This paper introduces a theoretical framework and adaptive algorithm for distributed training over wide-area networks with high latency and variable bandwidth, optimizing communication efficiency and convergence.
Contribution
It provides the first convergence analysis for communication-constrained distributed training and proposes DeCo-SGD, an adaptive method that dynamically adjusts compression and staleness.
Findings
DeCo-SGD achieves up to 5.07x speed-up over distributed SGD.
Theoretical analysis reveals exponential amplification of compression effects by staleness.
The adaptive strategy outperforms fixed strategies in high-latency, low-bandwidth networks.
Abstract
Regional energy caps limit the growth of any single data center used for large-scale model training. This single-center training paradigm works when model size remains manageable, but exponential growth in model size and computational demand challenges it. A natural alternative is to distribute training across multiple data centers over wide-area networks. This pools distributed resources, but suffers from high latency and low, time-varying bandwidth, sharply reducing throughput. Jointly employing gradient compression and delayed aggregation can alleviate communication problems, but introduces a complex three-way trade-off among compression ratio, staleness (delayed synchronization steps), and convergence rate. Existing work lacks theoretical guidance and can only propose fixed strategies, insensitive to computation and communication conditions. We address this with a new…
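The two mechanisms the abstract combines, gradient compression and delayed aggregation, can be illustrated with a minimal sketch. This is not DeCo-SGD and not the authors' code: the compressor is assumed to be top-k sparsification (the summary does not say which compressor the paper uses), the compression level `k` and the staleness are fixed rather than adapted, and the function names `top_k_sparsify` and `delayed_compressed_sgd` are hypothetical.

```python
from collections import deque


def top_k_sparsify(grad, k):
    """Assumed compressor: keep the k largest-magnitude entries, zero the rest."""
    if k >= len(grad):
        return list(grad)
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def delayed_compressed_sgd(grad_fn, x0, lr, k, staleness, steps):
    """SGD in which every gradient is compressed and applied `staleness` steps late.

    The queue models gradients still in transit over a high-latency link.
    Both `k` and `staleness` are fixed here; an adaptive method would tune them.
    """
    x = list(x0)
    in_flight = deque()
    for _ in range(steps):
        # Compute a gradient at the current iterate and "send" it compressed.
        in_flight.append(top_k_sparsify(grad_fn(x), k))
        # Only a gradient that has waited `staleness` steps has "arrived".
        if len(in_flight) > staleness:
            stale_grad = in_flight.popleft()
            x = [xi - lr * gi for xi, gi in zip(x, stale_grad)]
    return x


if __name__ == "__main__":
    # Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
    final = delayed_compressed_sgd(lambda x: list(x), [4.0, -2.0, 1.0],
                                   lr=0.1, k=2, staleness=2, steps=50)
    print(final)
```

A smaller `k` sends fewer values per step, and a larger `staleness` hides more latency, but each update then uses a sparser and older gradient. That tension is the three-way trade-off with convergence rate that the abstract describes.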
Taxonomy
Topics: Distributed and Parallel Computing Systems · Computer Graphics and Visualization Techniques · Parallel Computing and Optimization Techniques
