Compressing Gradient Optimizers via Count-Sketches

Ryan Spring; Anastasios Kyrillidis; Vijai Mohan; Anshumali Shrivastava

arXiv:1902.00179 · cs.LG · February 27, 2019 · 5 cites

TL;DR

This paper introduces a count-sketch-based method for compressing the auxiliary variables of gradient optimizers, reducing memory usage and training time in large-scale deep learning models while maintaining the SGD convergence rate and matching the full-sized baseline's performance.

Contribution

The authors propose a novel count-sketch technique to compress optimizer states, maintaining convergence rates while reducing memory and training time in large models.

Findings

01. Saved 25% of the memory used during training on the 1-Billion Word dataset with negligible accuracy loss.

02. Reduced training time by 38% on an Amazon classification task by increasing the mini-batch size.

03. Proved that count-sketch optimization preserves the SGD convergence rate.

Abstract

Many popular first-order optimization methods (e.g., Momentum, AdaGrad, Adam) accelerate the convergence rate of deep learning models. However, these algorithms require auxiliary parameters, which cost additional memory proportional to the number of parameters in the model. The problem is becoming more severe as deep learning models continue to grow larger in order to learn from complex, large-scale datasets. Our proposed solution is to maintain a linear sketch to compress the auxiliary variables. We demonstrate that our technique has the same performance as the full-sized baseline, while using significantly less space for the auxiliary variables. Theoretically, we prove that count-sketch optimization maintains the SGD convergence rate, while gracefully reducing memory usage for large models. On the large-scale 1-Billion Word dataset, we save 25% of the memory used during training (8.6…
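The abstract describes keeping a linear sketch in place of the full-sized auxiliary variables. The following is a minimal illustration of that idea, assuming a plain-Python count-sketch with salted built-in hashing and median queries. The class name CountSketch, its methods, the salts, and the table sizes are hypothetical choices made for this example and are not taken from the authors' repository.

```python
import random
from statistics import median


class CountSketch:
    """Illustrative count-sketch for one auxiliary variable (e.g. momentum).

    Assumption: a plain-Python teaching sketch, not the authors' implementation.
    """

    def __init__(self, depth, width, seed=0):
        rng = random.Random(seed)
        self.depth = depth
        self.width = width
        # depth * width counters stand in for one counter per model parameter.
        self.table = [[0.0] * width for _ in range(depth)]
        # Random salts approximate independent hash functions (assumption).
        self.salts = [(rng.getrandbits(32), rng.getrandbits(32))
                      for _ in range(depth)]

    def _bucket_and_sign(self, row, index):
        bucket_salt, sign_salt = self.salts[row]
        bucket = hash((bucket_salt, index)) % self.width
        sign = 1.0 if hash((sign_salt, index)) & 1 else -1.0
        return bucket, sign

    def update(self, index, delta):
        """Add delta to the sketched value stored for parameter `index`."""
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, index)
            self.table[row][bucket] += sign * delta

    def query(self, index):
        """Estimate the value for `index` as the median across rows."""
        estimates = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, index)
            estimates.append(sign * self.table[row][bucket])
        return median(estimates)

    def scale(self, factor):
        """Multiply every counter by factor; valid because the sketch is linear."""
        for row in self.table:
            for j in range(len(row)):
                row[j] *= factor


# Momentum-style update m <- gamma * m + g for a sparse gradient,
# stored in 3 * 16 = 48 counters (hypothetical sizes).
momentum = CountSketch(depth=3, width=16, seed=1)
gamma = 0.9
sparse_grad = {7: 0.5, 123456: -0.25}  # parameter index -> gradient value
momentum.scale(gamma)
for idx, g in sparse_grad.items():
    momentum.update(idx, g)
print(momentum.query(7))
```

Because the table is linear in its inputs, decaying the whole table and then adding the new gradient mirrors the dense update, and the table size trades memory against estimation error from hash collisions.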

Code & Models

Repositories

rdspring1/Count-Sketch-Optimizers
PyTorch · Official

Taxonomy

Topics: Advanced Neural Network Applications · Topic Modeling · Machine Learning and Data Classification

Methods: AdaGrad · Adam · Softmax · Stochastic Gradient Descent