A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks

Mingrui Liu; Zhenxun Zhuang; Yunwei Lei; Chunyang Liao

arXiv:2205.05040·cs.LG·October 14, 2022·6 cites

TL;DR

This paper introduces a communication-efficient distributed gradient clipping algorithm for training deep neural networks, addressing exploding gradients and achieving linear speedup with reduced communication rounds.

Contribution

It proposes a novel distributed gradient clipping method under relaxed smoothness assumptions, with proven convergence and practical validation on benchmark datasets.

Findings

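01 Achieves $O(1/(N\epsilon^4))$ iteration complexity for finding an $\epsilon$-stationary point, where $N$ is the number of machines

02 Reduces communication complexity to $O(1/\epsilon^3)$

03 Demonstrates fast convergence in experiments on benchmark datasets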
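
As a rough reading of how the two bounds above relate, one can divide the iteration count by the number of communication rounds to estimate how many local steps fit between synchronizations. This is only a back-of-the-envelope sketch: $T$ and $R$ are illustrative names for the iteration and communication counts, and $c_1$, $c_2$ are hypothetical placeholders for the constants hidden by the $O(\cdot)$ notation.

$$
T \approx \frac{c_1}{N\epsilon^4}, \qquad
R \approx \frac{c_2}{\epsilon^3}, \qquad
\frac{T}{R} \approx \frac{c_1}{c_2} \cdot \frac{1}{N\epsilon}
$$

Under these assumptions, each machine would take on the order of $1/(N\epsilon)$ local steps per communication round, so the savings from skipping rounds grow as the target accuracy $\epsilon$ shrinks.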

Abstract

In distributed training of deep neural networks, people usually run Stochastic Gradient Descent (SGD) or its variants on each machine and communicate with other machines periodically. However, SGD might converge slowly in training some deep neural networks (e.g., RNN, LSTM) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single-machine setting, but exploring this technique in the distributed setting is still in its infancy: it remains unclear whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup. The main technical difficulty lies in dealing with a nonconvex loss function, a non-Lipschitz continuous gradient, and skipped communication rounds simultaneously. In this paper, we explore a relaxed-smoothness assumption of the loss landscape which LSTM was shown to satisfy in previous…
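
The setting the abstract describes (each machine runs clipped SGD locally and communicates only periodically) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the step-size rule `min(lr, gamma / ||g||)`, plain model averaging at each synchronization, the quartic test objective, and every name and default value (`clipped_step_size`, `local_clipped_sgd`, `sync_every`, `gamma`, and so on) are hypothetical choices made for illustration. The workers are simulated in a single process in pure Python.

```python
import math
import random


def clipped_step_size(grad, lr, gamma):
    """Return min(lr, gamma / ||grad||), so a step never moves farther than gamma."""
    norm = math.sqrt(sum(g * g for g in grad))
    return lr if norm == 0.0 else min(lr, gamma / norm)


def local_clipped_sgd(grad_fn, x0, num_workers=4, steps=200, sync_every=8,
                      lr=0.1, gamma=0.5, noise=0.1, seed=0):
    """Simulate workers running clipped SGD locally and averaging periodically.

    Assumption: synchronization is a plain average of the worker models
    every `sync_every` iterations; all defaults are illustrative.
    """
    rng = random.Random(seed)
    workers = [list(x0) for _ in range(num_workers)]
    rounds = 0
    for t in range(1, steps + 1):
        for i, x in enumerate(workers):
            # Stochastic gradient: true gradient plus Gaussian noise.
            g = [gj + noise * rng.gauss(0.0, 1.0) for gj in grad_fn(x)]
            h = clipped_step_size(g, lr, gamma)
            workers[i] = [xj - h * gj for xj, gj in zip(x, g)]
        if t % sync_every == 0:
            # Communication round: replace every local model with the average.
            avg = [sum(col) / num_workers for col in zip(*workers)]
            workers = [list(avg) for _ in range(num_workers)]
            rounds += 1
    final = [sum(col) / num_workers for col in zip(*workers)]
    return final, rounds


def quartic_grad(x):
    """Gradient of f(x) = sum(x_j**4) / 4, which is not Lipschitz continuous."""
    return [xj ** 3 for xj in x]


x_final, rounds = local_clipped_sgd(quartic_grad, [3.0, -2.0])
print("final iterate:", x_final)
print("communication rounds:", rounds)
```

The quartic objective is chosen only because its gradient grows without bound away from the origin, which mimics the exploding-gradient regime on a toy scale; raising `sync_every` reduces the number of communication rounds for the same number of iterations.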

Code & Models

Repositories

mingruiliu-ml-lab/communication-efficient-local-gradient-clipping (PyTorch, official)

Videos

A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks · SlidesLive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Privacy-Preserving Technologies in Data

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Stochastic Gradient Descent · Gradient Clipping