mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural Network Optimization

Yue Niu; Zalan Fabian; Sunwoo Lee; Mahdi Soltanolkotabi; Salman Avestimehr

arXiv:2307.13744·cs.LG·July 27, 2023

Open Access

TL;DR

The paper introduces mL-BFGS, a momentum-enhanced L-BFGS algorithm designed for stable, efficient large-scale distributed neural network training, outperforming traditional optimizers in convergence speed and computational efficiency.

Contribution

mL-BFGS is a novel, lightweight quasi-Newton method that incorporates momentum to stabilize stochastic training and enables distributed large-scale neural network optimization.

Findings

01. mL-BFGS achieves faster convergence than SGD and Adam.

02. It provides significant wall-clock speedup in training large neural models.

03. The method effectively reduces stochastic noise in Hessian approximations.

Abstract

Quasi-Newton methods still face significant challenges in training large-scale neural networks due to the additional compute costs of Hessian-related computations and instability issues in stochastic training. L-BFGS, a well-known method that efficiently approximates the Hessian using the history of parameter and gradient changes, suffers from convergence instability in stochastic training. So far, attempts to adapt L-BFGS to large-scale stochastic training have incurred considerable extra overhead, which offsets its convergence benefits in wall-clock time. In this paper, we propose mL-BFGS, a lightweight momentum-based L-BFGS algorithm that paves the way for quasi-Newton (QN) methods in large-scale distributed deep neural network (DNN) optimization. mL-BFGS introduces a nearly cost-free momentum scheme into the L-BFGS update and greatly reduces stochastic noise in the Hessian, therefore stabilizing…
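The idea in the abstract can be illustrated with a minimal sketch. The abstract as shown here does not give the momentum formula, so the code assumes one plausible reading: an exponential moving average is kept over parameters and gradients, and the L-BFGS history pairs (s, y) are formed from differences of those averages rather than from raw noisy iterates. The names `MomentumHistory`, `two_loop_direction`, `beta`, and `memory` are hypothetical and chosen for illustration; only the two-loop recursion is the standard L-BFGS procedure.

```python
from collections import deque


def dot(a, b):
    """Inner product of two equal-length sequences of floats."""
    return sum(x * y for x, y in zip(a, b))


def two_loop_direction(grad, pairs):
    """Standard L-BFGS two-loop recursion.

    Approximates H^{-1} @ grad from (s, y) pairs ordered oldest to newest,
    where s is a parameter change and y is the matching gradient change.
    """
    q = list(grad)
    coeffs = []
    for s, y in reversed(pairs):  # newest pair first
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
        coeffs.append((alpha, rho))
    if pairs:
        s, y = pairs[-1]
        gamma = dot(s, y) / dot(y, y)  # common initial Hessian scaling
    else:
        gamma = 1.0  # no history yet: fall back to the plain gradient
    r = [gamma * qi for qi in q]
    for (s, y), (alpha, rho) in zip(pairs, reversed(coeffs)):  # oldest first
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return r


class MomentumHistory:
    """ASSUMPTION: smooths parameters and gradients with an exponential
    moving average before forming (s, y); not taken from the abstract."""

    def __init__(self, beta=0.9, memory=10):
        self.beta = beta
        self.pairs = deque(maxlen=memory)  # limited-memory history
        self.avg_w = None
        self.avg_g = None

    def update(self, w, g):
        if self.avg_w is None:  # first call only initializes the averages
            self.avg_w, self.avg_g = list(w), list(g)
            return
        b = self.beta
        new_w = [b * a + (1.0 - b) * x for a, x in zip(self.avg_w, w)]
        new_g = [b * a + (1.0 - b) * x for a, x in zip(self.avg_g, g)]
        s = [n - o for n, o in zip(new_w, self.avg_w)]
        y = [n - o for n, o in zip(new_g, self.avg_g)]
        if dot(s, y) > 1e-10:  # skip pairs that violate the curvature condition
            self.pairs.append((s, y))
        self.avg_w, self.avg_g = new_w, new_g
```

In a training loop under these assumptions, `update(w, g)` would be called with the current parameters and stochastic gradient, and `two_loop_direction(g, history.pairs)` would supply the search direction in place of the raw gradient. Averaging before differencing is what damps the noise in each (s, y) pair, at the cost of two extra vectors of state.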

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Machine Learning and ELM

Methods: Adam