Loss Landscape Dependent Self-Adjusting Learning Rates in Decentralized Stochastic Gradient Descent

Wei Zhang; Mingrui Liu; Yu Feng; Xiaodong Cui; Brian Kingsbury; Yuhai Tu

arXiv:2112.01433·cs.LG·December 3, 2021·1 cites

TL;DR

This paper shows that Decentralized Parallel SGD (DPSGD) introduces landscape-dependent noise that automatically adjusts the effective learning rate, which improves convergence and stability in large-batch distributed deep learning training.

Contribution

The paper analyzes, both theoretically and empirically, how the landscape-dependent noise in DPSGD improves convergence by smoothing the loss landscape and self-adjusting the effective learning rate.

Findings

01. DPSGD often converges where SSGD diverges at large learning rates.

02. DPSGD's noise smooths the loss landscape, enabling larger learning rates.

03. Results are consistent across computer vision and speech recognition tasks.

Abstract

Distributed Deep Learning (DDL) is essential for large-scale Deep Learning (DL) training. Synchronous Stochastic Gradient Descent (SSGD) is the de facto DDL optimization method. Using a sufficiently large batch size is critical to achieving DDL runtime speedup. In a large batch setting, the learning rate must be increased to compensate for the reduced number of parameter updates. However, a large learning rate may harm convergence in SSGD and training could easily diverge. Recently, Decentralized Parallel SGD (DPSGD) has been proposed to improve distributed training speed. In this paper, we find that DPSGD not only has a system-wise run-time benefit but also a significant convergence benefit over SSGD in the large batch setting. Based on a detailed analysis of the DPSGD learning dynamics, we find that DPSGD introduces additional landscape-dependent noise that automatically adjusts the…
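The abstract contrasts SSGD and DPSGD without stating their update rules. As a rough illustration only, the sketch below uses the commonly stated forms: in SSGD all learners share one weight vector and average their gradients, while in DPSGD each learner computes a gradient at its own local weights and then averages weights with its neighbors through a mixing matrix. The toy quadratic loss, the ring topology, and all function names (`ssgd_step`, `dpsgd_step`, `ring_mixing`, `noisy_quadratic_grad`) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def noisy_quadratic_grad(w, rng):
    """Hypothetical stochastic gradient of the toy loss 0.5 * ||w||^2."""
    return w + 0.1 * rng.standard_normal(w.shape)


def ssgd_step(w, grad_fn, lr, n_learners, rng):
    """One SSGD step: every learner evaluates a gradient at the same
    shared weights w, and the averaged gradient updates w."""
    grads = [grad_fn(w, rng) for _ in range(n_learners)]
    return w - lr * np.mean(grads, axis=0)


def dpsgd_step(w_local, grad_fn, lr, mixing, rng):
    """One DPSGD step: learner i evaluates a gradient at its own weights
    w_local[i]; weights are then averaged with neighbors (mixing @ w_local)
    and each learner applies its own gradient."""
    grads = np.stack([grad_fn(w, rng) for w in w_local])
    return mixing @ w_local - lr * grads


def ring_mixing(n):
    """Doubly stochastic mixing matrix for a ring (assumes n >= 3): each
    learner averages itself and its two neighbors with weight 1/3."""
    m = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            m[i, j % n] += 1.0 / 3.0
    return m


# Usage: run a few steps of each method from the same starting point.
rng = np.random.default_rng(0)
n_learners, dim, lr = 4, 2, 0.5
w_shared = np.ones(dim)
w_local = np.tile(w_shared, (n_learners, 1))
mixing = ring_mixing(n_learners)
for _ in range(10):
    w_shared = ssgd_step(w_shared, noisy_quadratic_grad, lr, n_learners, rng)
    w_local = dpsgd_step(w_local, noisy_quadratic_grad, lr, mixing, rng)
```

In this sketch the two methods coincide whenever all local weights are identical. They differ once the local weights spread apart, because each DPSGD gradient is then evaluated at a slightly different point on the loss surface. That spread is one plausible reading of the "additional landscape-dependent noise" the abstract refers to, though the paper's own analysis should be consulted for the precise statement.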

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning

Methods: Stochastic Gradient Descent