Compressing Gradient Optimizers via Count-Sketches
Ryan Spring, Anastasios Kyrillidis, Vijai Mohan, Anshumali Shrivastava

TL;DR
This paper introduces a count-sketch-based method for compressing the auxiliary variables of gradient optimizers, reducing memory usage and training time for large-scale deep learning models without sacrificing convergence or accuracy.
Contribution
The authors propose a novel count-sketch technique to compress optimizer states, maintaining convergence rates while reducing memory and training time in large models.
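As a rough illustration of the data structure involved, the following is a minimal textbook count-sketch in pure Python, not the authors' implementation. The class name `CountSketch`, the tuple-based hashing scheme, and the `depth`/`width` parameters are assumptions made for this sketch. Each key is hashed into one bucket per row with a random sign, and a query returns the median of the signed bucket values.

```python
import random
import statistics


class CountSketch:
    """Textbook count-sketch over integer keys (illustrative only)."""

    def __init__(self, depth, width, seed=0):
        self.depth = depth
        self.width = width
        # depth x width table of floats; this is the only state that is stored.
        self.table = [[0.0] * width for _ in range(depth)]
        rng = random.Random(seed)
        # One (bucket seed, sign seed) pair per row; a hypothetical hashing scheme.
        self.seeds = [(rng.getrandbits(32), rng.getrandbits(32)) for _ in range(depth)]

    def _bucket(self, row, key):
        # Map the key to a column index in the given row.
        return hash((self.seeds[row][0], key)) % self.width

    def _sign(self, row, key):
        # Map the key to +1 or -1 in the given row.
        return 1.0 if hash((self.seeds[row][1], key)) & 1 else -1.0

    def update(self, key, delta):
        # Add the signed increment to one bucket in every row.
        for r in range(self.depth):
            self.table[r][self._bucket(r, key)] += self._sign(r, key) * delta

    def query(self, key):
        # Estimate the accumulated value as the median of the signed buckets.
        return statistics.median(
            self._sign(r, key) * self.table[r][self._bucket(r, key)]
            for r in range(self.depth)
        )


# Accumulate two updates for parameter index 42, as an auxiliary
# variable such as a momentum buffer might.
sketch = CountSketch(depth=3, width=16)
sketch.update(42, 2.5)
sketch.update(42, 1.0)
estimate = sketch.query(42)
```

The table holds `depth * width` values regardless of how many distinct parameter indices are updated, which is where the memory saving comes from. Estimates become approximate once different keys collide in the same bucket.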
Findings
Saved 25% of the memory used during training on the 1-Billion Word dataset with negligible accuracy loss.
Reduced training time by 38% on an Amazon classification task by increasing mini-batch size.
Theoretically proved that count-sketch optimization preserves SGD convergence rates.
Abstract
Many popular first-order optimization methods (e.g., Momentum, AdaGrad, Adam) accelerate the convergence rate of deep learning models. However, these algorithms require auxiliary parameters, which cost additional memory proportional to the number of parameters in the model. The problem is becoming more severe as deep learning models continue to grow larger in order to learn from complex, large-scale datasets. Our proposed solution is to maintain a linear sketch to compress the auxiliary variables. We demonstrate that our technique has the same performance as the full-sized baseline, while using significantly less space for the auxiliary variables. Theoretically, we prove that count-sketch optimization maintains the SGD convergence rate, while gracefully reducing memory usage for large models. On the large-scale 1-Billion Word dataset, we save 25% of the memory used during training (8.6…
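The memory argument in the abstract can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes 4-byte floats, one auxiliary value per parameter for Momentum and AdaGrad, two for Adam (first and second moment estimates), and a sketch of `depth * width` cells per auxiliary variable. The function names and the example sizes are hypothetical and are not figures from the paper.

```python
# Auxiliary values kept per model parameter by each optimizer.
AUX_PER_PARAM = {"momentum": 1, "adagrad": 1, "adam": 2}


def dense_aux_bytes(num_params, optimizer, bytes_per_value=4):
    """Memory for full-sized auxiliary variables: grows with num_params."""
    return num_params * AUX_PER_PARAM[optimizer] * bytes_per_value


def sketch_aux_bytes(depth, width, optimizer, bytes_per_value=4):
    """Memory when each auxiliary variable is stored in a depth x width sketch."""
    return depth * width * AUX_PER_PARAM[optimizer] * bytes_per_value


# Hypothetical sizes chosen only to illustrate the arithmetic.
dense = dense_aux_bytes(1_000_000, "adam")       # 1M params, two buffers
sketched = sketch_aux_bytes(3, 100_000, "adam")  # 3 rows x 100k columns each
ratio = sketched / dense
```

With these illustrative numbers, the dense Adam state takes 8,000,000 bytes and the sketched state takes 2,400,000 bytes. The dense cost scales with the parameter count, while the sketch cost is fixed by the chosen `depth` and `width`.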
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Machine Learning and Data Classification
Methods: AdaGrad · Adam · Softmax · Stochastic Gradient Descent
