TL;DR
This paper introduces O$k$-Top$k$, a scalable sparse allreduce algorithm for distributed deep learning that reduces communication overhead and maintains model accuracy, significantly improving training throughput on large GPU clusters.
Contribution
It presents a novel sparse allreduce algorithm with asymptotically optimal communication volume integrated with decentralized SGD, and demonstrates its effectiveness in large-scale training.
Findings
Achieves similar accuracy to dense allreduce.
Significantly improves training throughput (up to 12.95x on BERT).
More scalable than existing methods.
Abstract
Communication overhead is one of the major obstacles to training large deep learning models at scale. Gradient sparsification is a promising technique to reduce the communication volume. However, it is very challenging to obtain real performance improvement because of (1) the difficulty of achieving a scalable and efficient sparse allreduce algorithm and (2) the sparsification overhead. This paper proposes O$k$-Top$k$, a scheme for distributed training with sparse gradients. O$k$-Top$k$ integrates a novel sparse allreduce algorithm (less than 6$k$ communication volume, which is asymptotically optimal) with the decentralized parallel Stochastic Gradient Descent (SGD) optimizer, and its convergence is proved. To reduce the sparsification overhead, O$k$-Top$k$ efficiently selects the top-$k$ gradient values according to an estimated threshold. Evaluations are conducted on the Piz Daint…
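The threshold-based selection mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the exact threshold computation, and the `reuse_interval` parameter (how many steps an old threshold is reused before it is re-estimated) are assumptions made for illustration. The idea shown is that a threshold computed once can replace a full top-$k$ search with a cheap linear scan on later steps, at the cost of selecting only approximately $k$ values.

```python
import heapq


def estimate_threshold(grad, k):
    """Return the k-th largest absolute value in grad (hypothetical helper)."""
    return heapq.nlargest(k, (abs(g) for g in grad))[-1]


def select_by_threshold(grad, threshold):
    """One linear pass: keep (index, value) pairs with |value| >= threshold."""
    return [(i, g) for i, g in enumerate(grad) if abs(g) >= threshold]


def sparsify_steps(grads, k, reuse_interval):
    """Sparsify a sequence of gradient vectors.

    The threshold is recomputed only every `reuse_interval` steps (an
    assumption of this sketch) and reused in between, so the number of
    selected values is k on re-estimation steps and roughly k otherwise.
    """
    threshold = None
    selected = []
    for step, grad in enumerate(grads):
        if step % reuse_interval == 0:
            threshold = estimate_threshold(grad, k)
        selected.append(select_by_threshold(grad, threshold))
    return selected


if __name__ == "__main__":
    grads = [
        [0.1, -2.0, 0.5, 3.0, -0.2],   # step 0: threshold estimated here
        [1.0, -2.5, 0.1, 2.2, -3.0],   # step 1: threshold from step 0 reused
    ]
    for step, sparse in enumerate(sparsify_steps(grads, k=2, reuse_interval=2)):
        print(step, sparse)
```

In this toy run the second step keeps three values rather than two, because the reused threshold no longer matches the new gradient distribution exactly. Each worker's (index, value) pairs would then be the input to the sparse allreduce.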
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · WordPiece · Dense Connections · Linear Warmup With Linear Decay · Softmax · Attention Dropout · Layer Normalization · Residual Connection
