Lookahead Optimizer: k steps forward, 1 step back

Michael R. Zhang; James Lucas; Geoffrey Hinton; Jimmy Ba

arXiv:1907.08610 · cs.LG · December 4, 2019 · 382 cites


TL;DR

Lookahead is a new optimizer that enhances existing methods like SGD and Adam by iteratively updating two sets of weights, improving stability and performance with minimal additional cost.

Contribution

The paper introduces Lookahead, an optimization algorithm that wraps an existing optimizer and improves its training stability and performance. The approach is simple and orthogonal to both adaptive learning rate schemes and accelerated (momentum) schemes.

Findings

01. Lookahead improves training stability and lowers the variance of its inner optimizer.

02. It improves the performance of SGD and Adam on multiple benchmarks.

03. It adds minimal computational and memory overhead.

Abstract

The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of fast weights generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their…
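The two-weight scheme described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes plain gradient descent as the inner optimizer, a slow-weight update that interpolates a fraction `alpha` of the way toward the fast weights after every `k` inner steps, and a toy quadratic objective. The names `lookahead`, `sgd_step`, and `quadratic_grad`, and all default values, are hypothetical choices made for this example.

```python
from typing import Callable, List

Vector = List[float]


def sgd_step(weights: Vector, grad: Vector, lr: float) -> Vector:
    """One plain gradient-descent step; a stand-in for any inner optimizer."""
    return [w - lr * g for w, g in zip(weights, grad)]


def lookahead(
    grad_fn: Callable[[Vector], Vector],
    init: Vector,
    k: int = 5,          # inner (fast) steps per outer step; illustrative default
    alpha: float = 0.5,  # slow-weight step size; illustrative default
    lr: float = 0.1,     # inner optimizer learning rate; illustrative default
    outer_steps: int = 50,
) -> Vector:
    """Sketch of Lookahead around gradient descent (assumed update rule)."""
    slow = list(init)
    for _ in range(outer_steps):
        # "k steps forward": the fast weights start from the slow weights
        # and are advanced k times by the inner optimizer.
        fast = list(slow)
        for _ in range(k):
            fast = sgd_step(fast, grad_fn(fast), lr)
        # "1 step back": move the slow weights part of the way toward
        # the final fast weights.
        slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
    return slow


def quadratic_grad(w: Vector) -> Vector:
    """Gradient of the toy objective f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2)."""
    return [1.0 * w[0], 10.0 * w[1]]


final = lookahead(quadratic_grad, [3.0, -2.0])
print(final)
```

Under these assumptions, setting `alpha=1.0` makes the slow weights copy the fast weights, so the sketch reduces to running the inner optimizer alone. The only extra state is one additional copy of the weights, and the only extra work is one interpolation per `k` inner steps.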


Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: Average Pooling · Adam · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling