FlashOptim: Optimizers for Memory-Efficient Training
Jose Javier Gonzalez Ortiz, Abhay Gupta, Christopher Rinard, Davis Blalock

TL;DR
FlashOptim is a suite of memory-saving techniques for mixed-precision neural network training. It cuts per-parameter memory by over 50% while preserving model quality, which makes larger models trainable on limited hardware.
Contribution
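The paper presents two techniques: an improved master weight splitting scheme that exploits a tight bound on its quantization error, and companding functions that reduce the error of 8-bit optimizer state quantization. Together they shrink the training memory footprint while maintaining API compatibility and model quality.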
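To give a feel for why a nonlinear mapping can help low-bit quantization of optimizer state, the sketch below compares plain linear 8-bit quantization with classic mu-law companding. Mu-law is a textbook technique swapped in purely for illustration; it is not claimed to be the function FlashOptim uses. The constant `MU`, the symmetric code range, and the assumption that inputs are already scaled to [-1, 1] are all choices made for this example.

```python
import math

MU = 255.0    # mu-law parameter (illustrative choice, not from the paper)
LEVELS = 127  # symmetric signed 8-bit codes in [-127, 127]


def quantize_linear(x: float) -> int:
    """Map x in [-1, 1] to an 8-bit code with uniform spacing."""
    return round(x * LEVELS)


def dequantize_linear(q: int) -> float:
    return q / LEVELS


def compress(x: float) -> float:
    """Mu-law compression: stretches small magnitudes before quantizing."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y: float) -> float:
    """Inverse of compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize_companded(x: float) -> int:
    """Compress first, then quantize uniformly in the compressed domain."""
    return round(compress(x) * LEVELS)


def dequantize_companded(q: int) -> float:
    return expand(q / LEVELS)


x = 0.001  # a small value, the kind uniform quantization rounds to zero
err_linear = abs(x - dequantize_linear(quantize_linear(x)))
err_companded = abs(x - dequantize_companded(quantize_companded(x)))
print(f"linear error:    {err_linear:.6f}")
print(f"companded error: {err_companded:.6f}")
```

With uniform spacing, the small input collapses to code 0 and is lost entirely, while the companded version keeps most of its magnitude. The trade-off is coarser spacing near the ends of the range, so the shape of the companding function matters.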
Findings
Memory usage reduced by over 50%
No measurable quality degradation on benchmarks
Model checkpoint sizes halved
Abstract
Standard mixed-precision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just the parameter itself, but also its gradient and one or more optimizer state variables. With each of these values typically requiring 4 bytes, training even a 7 billion parameter model can be impractical for researchers with less than 100GB of accelerator memory. We introduce FlashOptim, a suite of optimizations that reduces per-parameter memory by over 50% while preserving model quality and API compatibility. Our approach introduces two key techniques. First, we improve master weight splitting by finding and exploiting a tight bound on its quantization error. Second, we design companding functions that greatly reduce the error in 8-bit optimizer state quantization. Together with 16-bit gradients, these techniques reduce AdamW memory from…
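The 100 GB figure in the abstract can be checked with back-of-the-envelope arithmetic. The sketch below assumes AdamW with two optimizer state variables per parameter (first and second moment) and 4 bytes per value, and it ignores activations and framework overhead. The names `VALUES_PER_PARAM` and `training_bytes` are hypothetical helpers for this example.

```python
BYTES_PER_VALUE = 4  # per the abstract: each value typically takes 4 bytes

# Values stored per parameter (assumed breakdown for AdamW):
# the weight, its gradient, and two optimizer state variables.
VALUES_PER_PARAM = {
    "parameter": 1,
    "gradient": 1,
    "adamw_first_moment": 1,
    "adamw_second_moment": 1,
}


def training_bytes(num_params: int) -> int:
    """Bytes for weights, gradients, and optimizer state (no activations)."""
    per_param = BYTES_PER_VALUE * sum(VALUES_PER_PARAM.values())
    return num_params * per_param


baseline_gb = training_bytes(7_000_000_000) / 1e9
print(f"baseline: {baseline_gb:.0f} GB")  # 16 bytes/param -> 112 GB

# A reduction of over 50% would bring this below half the baseline.
print(f"after a 50% reduction: {baseline_gb / 2:.0f} GB")
```

Under these assumptions a 7-billion-parameter model needs 112 GB before any activations are counted, which is consistent with the claim that less than 100 GB of accelerator memory is not enough.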
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Domain Adaptation and Few-Shot Learning
