A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks
Mingrui Liu, Zhenxun Zhuang, Yunwei Lei, Chunyang Liao

TL;DR
This paper introduces a communication-efficient distributed gradient clipping algorithm for training deep neural networks, addressing exploding gradients and achieving linear speedup with reduced communication rounds.
Contribution
It proposes a novel distributed gradient clipping method under relaxed smoothness assumptions, with proven convergence and practical validation on benchmark datasets.
Findings
Achieves $O(1/(N\epsilon^4))$ iteration complexity for finding an $\epsilon$-stationary point, where $N$ is the number of machines
Reduces communication complexity to $O(1/\epsilon^3)$ rounds
Demonstrates fast convergence in experiments
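As a rough reading of the two complexity bounds (a back-of-the-envelope ratio that ignores constants, not a statement taken from the paper), dividing the iteration count by the number of communication rounds suggests how many local steps each machine can take between synchronizations:

```latex
\[
\frac{\text{iterations}}{\text{communication rounds}}
\;\approx\; \frac{1/(N\epsilon^{4})}{1/\epsilon^{3}}
\;=\; \frac{1}{N\epsilon}
\]
```

Under this reading, a smaller target accuracy $\epsilon$ allows more local steps per round, while adding machines shortens the interval between rounds.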
Abstract
In distributed training of deep neural networks, people usually run Stochastic Gradient Descent (SGD) or its variants on each machine and communicate with other machines periodically. However, SGD might converge slowly in training some deep neural networks (e.g., RNN, LSTM) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single machine setting, but exploring this technique in the distributed setting is still in its infancy: it remains mysterious whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup. The main technical difficulty lies in dealing with nonconvex loss function, non-Lipschitz continuous gradient, and skipping communication rounds simultaneously. In this paper, we explore a relaxed-smoothness assumption of the loss landscape which LSTM was shown to satisfy in previous…
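The setting described in the abstract (each machine takes clipped stochastic gradient steps locally and communicates only periodically) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' algorithm: the function names `clip` and `local_clipped_sgd`, the norm-based clipping rule, the model-averaging step, and all hyperparameter values are hypothetical choices made for the example.

```python
import numpy as np


def clip(g, gamma):
    """Rescale g so that its Euclidean norm is at most gamma (assumed clipping rule)."""
    norm = np.linalg.norm(g)
    return g if norm <= gamma else g * (gamma / norm)


def local_clipped_sgd(grad_fn, x0, n_machines=4, n_steps=200, sync_every=8,
                      lr=0.1, gamma=1.0, noise=0.05, seed=0):
    """Simulate n_machines workers running clipped SGD with periodic averaging.

    grad_fn    -- returns the exact gradient at a point; Gaussian noise is added
                  here to mimic stochastic gradients (an assumption of this sketch).
    sync_every -- number of local steps between communication rounds.
    Returns the averaged iterate and the number of communication rounds used.
    """
    rng = np.random.default_rng(seed)
    # Every worker starts from the same point; one row per worker.
    xs = np.tile(np.asarray(x0, dtype=float), (n_machines, 1))
    rounds = 0
    for t in range(1, n_steps + 1):
        for i in range(n_machines):
            g = grad_fn(xs[i]) + noise * rng.standard_normal(xs[i].shape)
            xs[i] -= lr * clip(g, gamma)  # local clipped step, no communication
        if t % sync_every == 0:
            xs[:] = xs.mean(axis=0)  # communication round: average the models
            rounds += 1
    return xs.mean(axis=0), rounds
```

For example, `local_clipped_sgd(lambda x: 4 * x**3, [3.0, -2.0])` runs the sketch on $f(x) = \sum_j x_j^4$, whose gradient is not Lipschitz continuous; clipping keeps the early steps bounded even though the initial gradient norm is large, and only 25 of the 200 iterations involve communication.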
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Privacy-Preserving Technologies in Data
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Stochastic Gradient Descent · Gradient Clipping
