Scalable Second Order Optimization for Deep Learning

Rohan Anil; Vineet Gupta; Tomer Koren; Kevin Regan; Yoram Singer

arXiv:2002.09018 · cs.LG · March 8, 2021 · 29 cites

TL;DR

This paper introduces a scalable second-order optimization method tailored for deep learning, achieving faster convergence and better performance on large-scale tasks by leveraging heterogeneous hardware architectures.

Contribution

The paper presents a scalable implementation of a second-order preconditioned method, a variant of full-matrix Adagrad, that improves both convergence and wall-clock training time over conventional first-order methods on state-of-the-art deep models.

Findings

01 Outperforms first-order methods in convergence speed

02 Achieves better wall-clock time on large models

03 Demonstrates superior results on diverse large-scale tasks

Abstract

Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, that involve second derivatives and/or second order statistics of the data, are far less prevalent despite strong theoretical properties, due to their prohibitive computation, memory and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad), that along with several critical algorithmic and numerical improvements, provides significant convergence and wall-clock time improvements compared to conventional first-order methods on state-of-the-art deep models. Our novel design effectively utilizes the prevalent heterogeneous hardware…
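
To make the preconditioning idea in the abstract concrete, here is a minimal NumPy sketch of a textbook full-matrix Adagrad step and of a basic Shampoo-style step for a matrix-shaped parameter. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `eps` and `lr` values, and the toy quadratic loss are all hypothetical, and the sketch leaves out the algorithmic, numerical, and hardware-specific improvements the paper describes.

```python
import numpy as np


def inverse_pth_root(mat, p, eps=1e-6):
    """Return (mat + eps*I)^(-1/p) for a symmetric PSD matrix (hypothetical helper)."""
    dim = mat.shape[0]
    eigvals, eigvecs = np.linalg.eigh(mat + eps * np.eye(dim))
    eigvals = np.maximum(eigvals, eps)  # guard against round-off below eps
    # V diag(lambda^(-1/p)) V^T
    return (eigvecs * eigvals ** (-1.0 / p)) @ eigvecs.T


def full_matrix_adagrad_step(w, grad, stats, lr=0.1):
    """Textbook full-matrix Adagrad on a flat vector: a d x d statistic."""
    stats = stats + np.outer(grad, grad)
    w = w - lr * inverse_pth_root(stats, 2) @ grad
    return w, stats


def shampoo_step(W, G, L, R, lr=0.1):
    """Basic Shampoo-style step on an (m, n) matrix: m x m and n x n statistics."""
    L = L + G @ G.T
    R = R + G.T @ G
    W = W - lr * inverse_pth_root(L, 4) @ G @ inverse_pth_root(R, 4)
    return W, L, R


def run_demo(steps=50, seed=0):
    """Minimize the toy loss 0.5 * ||W - T||_F^2 (an assumed example problem)."""
    rng = np.random.default_rng(seed)
    T = rng.standard_normal((4, 3))
    W = np.zeros((4, 3))
    L, R = np.zeros((4, 4)), np.zeros((3, 3))
    initial_loss = 0.5 * np.sum((W - T) ** 2)
    for _ in range(steps):
        G = W - T  # gradient of the toy loss
        W, L, R = shampoo_step(W, G, L, R)
    final_loss = 0.5 * np.sum((W - T) ** 2)
    return initial_loss, final_loss


if __name__ == "__main__":
    before, after = run_demo()
    print(f"loss before: {before:.4f}, after: {after:.4f}")
```

The point of the second function is the memory trade-off: for an (m, n) parameter, full-matrix Adagrad on the flattened vector needs an (mn) x (mn) statistic, while the Shampoo-style step keeps only an m x m and an n x n matrix.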

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and Algorithms

Methods: Distributed Shampoo