An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning
Chuyan Chen, Chenyang Ma, Zhangxin Li, Yutong He, Yanjie Dong, Kun Yuan

TL;DR
The paper introduces ARC-Top-K, a communication-efficient gradient compressor for distributed learning that keeps the informative entries selected by Top-K sparsification while remaining compatible with All-Reduce, matching Top-K accuracy with shorter training time.
Contribution
ARC-Top-K is a novel gradient compressor that aligns sparsity patterns across nodes using a lightweight sketch, enabling index-free All-Reduce and maintaining contraction properties.
Findings
Achieves linear speedup when combined with momentum error feedback (EF21M).
Matches Top-K accuracy while reducing training time by up to 60.7%.
Provably contractive and scalable in distributed settings.
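For readers unfamiliar with the term, "contractive" is usually understood in the following sense. The summary does not reproduce the paper's exact definition, so the statement below is the standard one from the compression literature, with the symbols C (compressor), d (dimension), and delta (contraction factor) chosen here for illustration.

```latex
% Standard contraction property of a (possibly randomized) compressor C:
\mathbb{E}\,\bigl\|\mathcal{C}(x) - x\bigr\|^2 \;\le\; (1-\delta)\,\|x\|^2
\quad \text{for all } x \in \mathbb{R}^d, \qquad \delta \in (0, 1].

% Example: Top-K applied to a single vector satisfies it with \delta = K/d,
% because the d-K discarded entries are the smallest in magnitude:
\bigl\|\mathrm{Top}_K(x) - x\bigr\|^2 \;\le\; \Bigl(1 - \tfrac{K}{d}\Bigr)\,\|x\|^2 .
```

The single-vector bound covers one node compressing its own gradient. The abstract's remark that Top-K loses the contraction property concerns the distributed setting, where each node selects different indices.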
Abstract
Communication remains a central bottleneck in large-scale distributed machine learning, and gradient sparsification has emerged as a promising strategy to alleviate this challenge. However, existing gradient compressors face notable limitations: Rand-K discards structural information and performs poorly in practice, while Top-K preserves informative entries but loses the contraction property and requires costly All-Gather operations. In this paper, we propose ARC-Top-K, an All-Reduce-Compatible Top-K compressor that aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. ARC-Top-K is provably contractive and, when combined with momentum error feedback (EF21M), achieves linear speedup and sharper convergence rates than the original EF21M under standard assumptions.…
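The idea of aligning sparsity patterns through a shared sketch can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the authors' algorithm: the abstract does not specify how the sketch is built, so the code assumes matrix-shaped gradients, a Gaussian random projection shared through a common seed, and row-wise selection. The names row_sketch and aligned_topk_allreduce are hypothetical, and np.mean over a Python list stands in for a real All-Reduce.

```python
import numpy as np

def row_sketch(grad, proj):
    """Compress an (m, n) gradient to an (m, r) sketch (assumed sketch form)."""
    return grad @ proj

def aligned_topk_allreduce(grads, k, r, seed=0):
    """Simulate index-free All-Reduce with one sparsity pattern shared by all nodes.

    grads: list of (m, n) arrays, one per simulated node.
    Returns the selected row indices and the averaged row-sparse gradient.
    """
    m, n = grads[0].shape
    # Same seed on every node, so every node builds the same projection.
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((n, r)) / np.sqrt(r)

    # Step 1: All-Reduce (simulated by a mean) over the small (m, r) sketches.
    mean_sketch = np.mean([row_sketch(g, proj) for g in grads], axis=0)

    # Step 2: each node ranks rows by the norm of the same aggregated sketch,
    # so all nodes arrive at identical indices without exchanging them.
    scores = np.linalg.norm(mean_sketch, axis=1)
    idx = np.sort(np.argpartition(scores, -k)[-k:])

    # Step 3: All-Reduce only the selected rows as a dense (k, n) buffer.
    mean_rows = np.mean([g[idx] for g in grads], axis=0)

    out = np.zeros((m, n))
    out[idx] = mean_rows
    return idx, out

# Usage: three simulated nodes, keep 2 of 8 rows, sketch width 2.
data_rng = np.random.default_rng(1)
grads = [data_rng.standard_normal((8, 4)) for _ in range(3)]
idx, avg = aligned_topk_allreduce(grads, k=2, r=2)
```

Because every node selects the same rows, the values can be summed position by position, which is what makes an All-Reduce possible. With per-node Top-K the index sets differ, so indices and values must be exchanged through All-Gather, as the abstract notes.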
