Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions
Vladimir Feinberg, Xinyi Chen, Y. Jennifer Sun, Rohan Anil, Elad Hazan

TL;DR
This paper introduces Sketchy, a memory-efficient adaptive regularization method using Frequent Directions sketching, enabling near full-matrix performance with significantly reduced memory in deep learning optimization.
Contribution
We propose a novel low-rank sketching approach for adaptive regularization, allowing efficient approximation of second-order methods with reduced memory and computational costs.
Findings
Achieves regret guarantees close to those of full-matrix methods using only dk memory, for dimension d and sketch rank k
Extends to Shampoo, matching its quality with sub-linear memory requirements
Demonstrates performance competitive with Adam and Shampoo in experiments
Abstract
Adaptive regularization methods that exploit more than the diagonal entries exhibit state of the art performance for many tasks, but can be prohibitive in terms of memory and running time. We find the spectra of the Kronecker-factored gradient covariance matrix in deep learning (DL) training tasks are concentrated on a small leading eigenspace that changes throughout training, motivating a low-rank sketching approach. We describe a generic method for reducing memory and compute requirements of maintaining a matrix preconditioner using the Frequent Directions (FD) sketch. While previous approaches have explored applying FD for second-order optimization, we present a novel analysis which allows efficient interpolation between resource requirements and the degradation in regret guarantees with rank k: in the online convex optimization (OCO) setting over dimension d, we match…
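To make the Frequent Directions idea in the abstract concrete, here is a minimal NumPy sketch of a textbook FD update applied to a stream of gradient vectors. It is an illustration under stated assumptions, not the paper's algorithm: the names `fd_update` and `run_demo`, the synthetic low-rank gradients, and the sizes `d`, `ell`, and `steps` are all hypothetical. The sketch holds `ell * d` numbers instead of the `d * d` of the full covariance, and the accumulated shrinkage `rho` upper-bounds the spectral error of the approximation.

```python
import numpy as np


def fd_update(sketch, grad):
    """One Frequent Directions step on an (ell, d) sketch whose last row is zero.

    Returns the new sketch (last row zero again) and the shrinkage applied.
    Assumes ell <= d so the thin SVD keeps the (ell, d) shape.
    """
    ell, d = sketch.shape
    assert ell <= d, "this sketch assumes ell <= d"
    sketch = sketch.copy()
    sketch[-1] = grad  # the last row is free after the previous shrink
    _, s, vt = np.linalg.svd(sketch, full_matrices=False)
    delta = s[-1] ** 2  # smallest squared singular value
    # Subtract delta from every squared singular value; the smallest becomes 0.
    shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
    return shrunk[:, None] * vt, delta


def run_demo(d=50, ell=8, steps=200, seed=0):
    """Compare the sketch against the exact covariance on synthetic gradients."""
    rng = np.random.default_rng(seed)
    # Hypothetical data: gradients concentrated on a 3-dim subspace plus noise.
    basis = rng.standard_normal((3, d))
    sketch = np.zeros((ell, d))
    full_cov = np.zeros((d, d))  # kept only to measure the error
    rho = 0.0
    for _ in range(steps):
        g = rng.standard_normal(3) @ basis + 0.01 * rng.standard_normal(d)
        sketch, delta = fd_update(sketch, g)
        rho += delta
        full_cov += np.outer(g, g)
    err = np.linalg.eigvalsh(full_cov - sketch.T @ sketch)
    return sketch, err.min(), err.max(), rho


if __name__ == "__main__":
    sketch, lo, hi, rho = run_demo()
    print("sketch entries:", sketch.size, "vs full covariance entries:", 50 * 50)
    print("error eigenvalue range:", lo, hi, "cumulative shrinkage:", rho)
```

Each step removes `delta` times a projection from the sketched covariance, so the exact covariance minus the sketched one stays positive semidefinite and is bounded above by `rho` times the identity. When the gradient spectrum is concentrated on a few directions, as the abstract reports for DL training, `rho` stays small relative to the leading eigenvalues.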
Taxonomy
Topics: Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques · Advanced Adaptive Filtering Techniques
Methods: Adam
