Scalable Second Order Optimization for Deep Learning
Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, Yoram Singer

TL;DR
This paper introduces a scalable second-order optimization method tailored for deep learning, achieving faster convergence and better performance on large-scale tasks by leveraging heterogeneous hardware architectures.
Contribution
The paper presents a practical, scalable second-order optimization algorithm that significantly improves training efficiency and effectiveness for deep neural networks.
Findings
Outperforms first-order methods in convergence speed
Achieves better wall-clock time on large models
Demonstrates superior results on diverse large-scale tasks
Abstract
Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, that involve second derivatives and/or second order statistics of the data, are far less prevalent despite strong theoretical properties, due to their prohibitive computation, memory and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad), that along with several critical algorithmic and numerical improvements, provides significant convergence and wall-clock time improvements compared to conventional first-order methods on state-of-the-art deep models. Our novel design effectively utilizes the prevalent heterogeneous hardware…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and Algorithms
Methods: Distributed Shampoo
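
Illustrative sketch
The abstract describes the method only as a preconditioned variant of full-matrix Adagrad, and the taxonomy names it Distributed Shampoo. As a rough illustration of the underlying idea, the NumPy sketch below applies the basic single-matrix Shampoo update: accumulate the left and right gradient statistics, then precondition the gradient with their inverse fourth roots. It does not attempt the paper's scalable or distributed implementation. The function names (`inverse_pth_root`, `shampoo_step`), the toy quadratic objective, and the values of `lr` and `eps` are assumptions made for this example, not taken from the paper.

```python
import numpy as np


def inverse_pth_root(mat, p, eps=1e-12):
    """Return mat^(-1/p) for a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(mat)
    w = np.maximum(w, eps)  # guard against zero or slightly negative eigenvalues
    return (v * w ** (-1.0 / p)) @ v.T  # V diag(w^(-1/p)) V^T


def shampoo_step(W, G, L, R, lr=0.1):
    """One basic Shampoo update for a matrix parameter W with gradient G.

    L and R are the running left/right statistics (assumed names).
    """
    L = L + G @ G.T  # left statistics, shape (m, m)
    R = R + G.T @ G  # right statistics, shape (n, n)
    update = inverse_pth_root(L, 4) @ G @ inverse_pth_root(R, 4)
    return W - lr * update, L, R


# Toy problem (hypothetical): minimize 0.5 * ||W - T||_F^2, whose gradient is W - T.
rng = np.random.default_rng(0)
T = rng.normal(size=(4, 3))
W = np.zeros((4, 3))
init_eps = 1e-6  # small multiple of the identity keeps the statistics invertible
L = init_eps * np.eye(4)
R = init_eps * np.eye(3)

loss_start = 0.5 * np.sum((W - T) ** 2)
for _ in range(100):
    G = W - T
    W, L, R = shampoo_step(W, G, L, R)
loss_end = 0.5 * np.sum((W - T) ** 2)
print(f"loss: {loss_start:.4f} -> {loss_end:.4f}")
```

For an m × n parameter this keeps one m × m and one n × n statistics matrix instead of the (mn) × (mn) matrix that full-matrix Adagrad would need, which is the memory saving that makes this family of preconditioners tractable. The eigendecomposition used here for the inverse root is chosen for readability only.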
