GWT: Scalable Optimizer State Compression for Large Language Model Training
Ziqing Wen, Ping Luo, Jiahuan Wang, Kun Yuan, Dongsheng Li, Tao Sun

TL;DR
This paper introduces GWT, a wavelet-based compression method for optimizer states in large language model training, reducing memory usage while maintaining performance.
Contribution
GWT offers a novel wavelet-transform-based framework for optimizer state compression that goes beyond low-rank constraints and aims to avoid the performance loss that low-rank methods often incur relative to full-rank updates.
Findings
GWT achieves performance comparable to full-rank optimizers.
GWT significantly reduces memory overhead during training.
GWT integrates seamlessly with existing optimization protocols.
Abstract
Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse natural language processing benchmarks. However, the escalating scale of model parameters imposes prohibitive memory overheads during training, especially when employing stateful optimizers such as Adam. Conventional memory-efficient strategies, typically involving singular value decomposition (SVD) or weight freezing, often incur non-negligible performance degradation relative to full-rank updates. To address these limitations, this paper explores memory-efficient optimization beyond low-rank constraints and proposes the Gradient Wavelet Transform (GWT). GWT is a novel compression framework that projects gradients into wavelet subspaces, effectively compacting optimizer states while preserving essential update information. We theoretically and empirically demonstrate that GWT can be…
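To make the idea of projecting gradients into a wavelet subspace concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a single-level Haar transform along the last axis, keeps Adam moments only for the approximation coefficients, and discards the detail coefficients when reconstructing the update. The names `haar_forward`, `haar_inverse`, and `CompressedAdamSketch` are hypothetical, and the paper may use a different wavelet, more levels, or a different treatment of the detail coefficients.

```python
import numpy as np

def haar_forward(g):
    """One-level Haar transform along the last axis (its length must be even)."""
    even, odd = g[..., 0::2], g[..., 1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-frequency (approximation) part
    detail = (even - odd) / np.sqrt(2.0)   # high-frequency (detail) part
    return approx, detail

def haar_inverse(approx, detail):
    """Inverse of haar_forward: interleave the reconstructed even/odd samples."""
    even = (approx + detail) / np.sqrt(2.0)
    odd = (approx - detail) / np.sqrt(2.0)
    out = np.empty(approx.shape[:-1] + (2 * approx.shape[-1],), dtype=approx.dtype)
    out[..., 0::2] = even
    out[..., 1::2] = odd
    return out

class CompressedAdamSketch:
    """Adam whose moment buffers live in the half-size approximation subspace.

    Assumption for illustration only: detail coefficients are dropped, so the
    reconstructed update is a smoothed version of the full Adam update.
    """

    def __init__(self, shape, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        half = tuple(shape[:-1]) + (shape[-1] // 2,)
        self.m = np.zeros(half)  # first moment, half the size of the weights
        self.v = np.zeros(half)  # second moment, half the size of the weights
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.t = 0

    def step(self, w, grad):
        self.t += 1
        approx, _detail = haar_forward(grad)
        self.m = self.beta1 * self.m + (1 - self.beta1) * approx
        self.v = self.beta2 * self.v + (1 - self.beta2) * approx ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        update_c = m_hat / (np.sqrt(v_hat) + self.eps)
        # Map the compressed update back to the full parameter shape.
        update = haar_inverse(update_c, np.zeros_like(update_c))
        return w - self.lr * update

rng = np.random.default_rng(0)
w = np.zeros((4, 8))
g = rng.standard_normal((4, 8))
opt = CompressedAdamSketch(w.shape)
w_new = opt.step(w, g)
```

Under these assumptions, each of the two moment buffers holds half as many entries as the weight matrix. Applying the same transform recursively to the approximation coefficients for l levels would shrink each buffer to a 1/2^l fraction of the original size.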
