Near-Optimal Sparse Allreduce for Distributed Deep Learning

Shigang Li; Torsten Hoefler

arXiv:2201.07598 · cs.DC · August 22, 2025

TL;DR

This paper introduces O$k$-Top$k$, a scalable sparse allreduce algorithm for distributed deep learning that reduces communication overhead and maintains model accuracy, significantly improving training throughput on large GPU clusters.

Contribution

It presents a novel sparse allreduce algorithm with asymptotically optimal communication volume integrated with decentralized SGD, and demonstrates its effectiveness in large-scale training.

Findings

01

Achieves similar accuracy to dense allreduce.

02

Significantly improves training throughput (up to 12.95x on BERT).

03

More scalable than existing methods.

Abstract

Communication overhead is one of the major obstacles to training large deep learning models at scale. Gradient sparsification is a promising technique to reduce the communication volume. However, it is very challenging to obtain real performance improvement because of (1) the difficulty of achieving a scalable and efficient sparse allreduce algorithm and (2) the sparsification overhead. This paper proposes O$k$-Top$k$, a scheme for distributed training with sparse gradients. O$k$-Top$k$ integrates a novel sparse allreduce algorithm (less than $6k$ communication volume, which is asymptotically optimal) with the decentralized parallel Stochastic Gradient Descent (SGD) optimizer, and its convergence is proved. To reduce the sparsification overhead, O$k$-Top$k$ efficiently selects the top-$k$ gradient values according to an estimated threshold. Evaluations are conducted on the Piz Daint…
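The threshold-based selection mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the names `estimate_threshold`, `select_by_threshold`, and `ThresholdTopK`, and the `reestimate_every` interval, are hypothetical and chosen only for illustration. The sketch assumes the threshold is computed exactly once in a while and reused in between, so the number of selected values is only approximately $k$ on the intermediate steps.

```python
import numpy as np

def estimate_threshold(grad, k):
    """Return the k-th largest gradient magnitude (requires 1 <= k <= grad.size)."""
    mags = np.abs(grad).ravel()
    pos = mags.size - k
    # np.partition puts the element with sorted rank `pos` at index `pos`
    # without fully sorting the array.
    return np.partition(mags, pos)[pos]

def select_by_threshold(grad, threshold):
    """Return (indices, values) of entries whose magnitude is >= threshold."""
    flat = grad.ravel()
    idx = np.flatnonzero(np.abs(flat) >= threshold)
    return idx, flat[idx]

class ThresholdTopK:
    """Hypothetical helper: re-estimates the threshold every few steps
    and reuses the cached value in between (an assumption of this sketch)."""

    def __init__(self, k, reestimate_every=32):
        self.k = k
        self.reestimate_every = reestimate_every
        self.step = 0
        self.threshold = None

    def __call__(self, grad):
        if self.step % self.reestimate_every == 0:
            self.threshold = estimate_threshold(grad, self.k)
        self.step += 1
        return select_by_threshold(grad, self.threshold)

selector = ThresholdTopK(k=2, reestimate_every=32)
idx0, vals0 = selector(np.array([0.1, -5.0, 3.0, 0.2, -0.3, 4.0]))
# Second step reuses the cached threshold, so the count may differ from k.
idx1, vals1 = selector(np.array([4.5, 0.1, -4.0, 0.0, 6.0, 1.0]))
```

The point of the design is that a cheap element-wise comparison replaces a full top-$k$ search on most steps, at the cost of selecting slightly more or fewer than $k$ values when the gradient distribution drifts.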

Code & Models

Repositories

shigangli/ok-topk (PyTorch, official)

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · WordPiece · Dense Connections · Linear Warmup With Linear Decay · Softmax · Attention Dropout · Layer Normalization · Residual Connection