Accelerate Distributed Stochastic Descent for Nonconvex Optimization with Momentum
Guojing Cong, Tianyi Liu

TL;DR
This paper introduces a block momentum method for distributed nonconvex optimization that accelerates training and improves results by combining local stochastic gradients with global momentum at the model averaging level.
Contribution
It proposes a novel block momentum technique for distributed stochastic descent, analyzing its convergence and demonstrating its effectiveness in deep learning training.
Findings
Block momentum accelerates training.
In the reported experiments, it also achieves better results than training without it.
The paper analyzes the convergence and scaling properties of the method.
Abstract
The momentum method has been used extensively in optimizers for deep learning. Recent studies show that distributed training through K-step averaging has many nice properties. We propose a momentum method for such model averaging approaches. At the individual learner level, traditional stochastic gradient descent is applied. At the meta-level (global learner level), one momentum term is applied, which we call block momentum. We analyze the convergence and scaling properties of such momentum methods. Our experimental results show that block momentum not only accelerates training but also achieves better results.
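The two-level scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact update rule, so everything below is an assumption made for illustration: each learner runs `k_steps` of plain SGD starting from the shared global model, the local models are averaged, and a heavy-ball style momentum term is applied to the resulting block displacement. The function name, the parameter names, and the toy quadratic objective are all hypothetical; the paper targets nonconvex problems, and a quadratic is used here only to keep the example small and checkable.

```python
import numpy as np

def block_momentum_kavg(grad, w0, n_learners=4, k_steps=8, n_blocks=50,
                        lr=0.05, mu=0.5, noise=0.1, seed=0):
    """Illustrative K-step averaging with a global (block) momentum term.

    Assumed update, not taken from the paper:
        d = average(local models) - w
        v = mu * v + d
        w = w + v
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)  # block momentum buffer, kept at the global level
    for _ in range(n_blocks):
        local_models = []
        for _ in range(n_learners):
            x = w.copy()  # every learner starts from the global model
            for _ in range(k_steps):
                # Plain stochastic gradient step, no local momentum.
                g = grad(x) + noise * rng.standard_normal(x.shape)
                x -= lr * g
            local_models.append(x)
        avg = np.mean(local_models, axis=0)  # model averaging after K steps
        v = mu * v + (avg - w)               # momentum on the block displacement
        w = w + v
    return w

def quad_grad(x):
    # Gradient of the toy objective 0.5 * ||x||^2 (hypothetical test problem).
    return x

w_final = block_momentum_kavg(quad_grad, w0=np.ones(5) * 3.0)
print(np.linalg.norm(w_final))
```

Setting `mu=0` in this sketch reduces it to plain K-step averaging, which makes it easy to compare the two variants on the same toy problem.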
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and ELM
