FlashOptim: Optimizers for Memory-Efficient Training

Jose Javier Gonzalez Ortiz; Abhay Gupta; Christopher Rinard; Davis Blalock

arXiv:2602.23349 · cs.LG · March 13, 2026

Open Access

TL;DR

FlashOptim is a suite of optimizations for mixed-precision neural network training that reduces per-parameter accelerator memory by over 50% while preserving model quality and API compatibility, making larger models trainable on limited hardware.

Contribution

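The paper improves master weight splitting by exploiting a tight bound on its quantization error, and designs companding functions that greatly reduce the error in 8-bit optimizer state quantization. Combined with 16-bit gradients, these techniques shrink the training memory footprint while maintaining API compatibility and model quality.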
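To illustrate why companding helps 8-bit quantization, here is a minimal sketch that compares plain linear quantization with quantization applied after a compressing nonlinearity. It uses the classic mu-law curve as a stand-in; this is an assumption for illustration only and is not the companding function designed in the paper. The constant `MU` and all function names are hypothetical.

```python
import math

MU = 255.0  # illustrative companding strength (assumption, not from the paper)


def compress(x, scale):
    """Mu-law compress x in [-scale, scale] to [-1, 1]."""
    y = x / scale
    return math.copysign(math.log1p(MU * abs(y)) / math.log1p(MU), y)


def expand(c, scale):
    """Inverse of compress: map c in [-1, 1] back to [-scale, scale]."""
    return math.copysign(math.expm1(abs(c) * math.log1p(MU)) / MU, c) * scale


def quantize_int8(v):
    """Round v in [-1, 1] to a signed 8-bit code in [-127, 127]."""
    return max(-127, min(127, round(v * 127)))


def roundtrip_linear(x, scale):
    """Quantize x linearly to 8 bits and decode it again."""
    return quantize_int8(x / scale) / 127 * scale


def roundtrip_companded(x, scale):
    """Compress, quantize to 8 bits, then decode and expand."""
    return expand(quantize_int8(compress(x, scale)) / 127, scale)


if __name__ == "__main__":
    for value in (1.0, 0.1, 0.001):
        print(value, roundtrip_linear(value, 1.0), roundtrip_companded(value, 1.0))
```

With a scale of 1.0, linear quantization rounds a small value such as 0.001 to zero, while the companded path keeps it within a few percent. Values that span several orders of magnitude lose their small entries under a uniform 8-bit grid in exactly this way.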

Findings

01 Per-parameter memory reduced by over 50%

02 No measurable quality degradation on benchmarks

03 Model checkpoint sizes halved

Abstract

Standard mixed-precision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just the parameter itself, but also its gradient and one or more optimizer state variables. With each of these values typically requiring 4 bytes, training even a 7 billion parameter model can be impractical for researchers with less than 100GB of accelerator memory. We introduce FlashOptim, a suite of optimizations that reduces per-parameter memory by over 50% while preserving model quality and API compatibility. Our approach introduces two key techniques. First, we improve master weight splitting by finding and exploiting a tight bound on its quantization error. Second, we design companding functions that greatly reduce the error in 8-bit optimizer state quantization. Together with 16-bit gradients, these techniques reduce AdamW memory from…
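The memory arithmetic behind the abstract's 7-billion-parameter example can be sketched as follows. The sketch assumes the 4 bytes per value stated in the abstract and the two state variables that AdamW keeps per parameter (first and second moment). Function names are hypothetical, and activations and other buffers are ignored.

```python
def training_bytes_per_param(param_bytes=4, grad_bytes=4, state_bytes=4, num_states=2):
    """Bytes per parameter: the weight, its gradient, and the optimizer states."""
    return param_bytes + grad_bytes + state_bytes * num_states


def training_memory_gb(num_params, bytes_per_param):
    """Total memory in gigabytes (10^9 bytes) for weights, gradients, and states."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    baseline = training_bytes_per_param()  # 4 + 4 + 2 * 4 = 16 bytes
    print(baseline, training_memory_gb(7_000_000_000, baseline))  # 16 112.0
```

Under these assumptions a 7-billion-parameter model needs 112 GB for weights, gradients, and optimizer state alone, which is consistent with the abstract's point about researchers with less than 100GB of accelerator memory.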

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Domain Adaptation and Few-Shot Learning