4-bit Shampoo for Memory-Efficient Network Training
Sike Wang, Pan Zhou, Jia Li, Hua Huang

TL;DR
This paper introduces 4-bit Shampoo, a memory-efficient second-order optimizer that keeps performance comparable to its 32-bit counterpart by quantizing the eigenvector matrix of the preconditioner rather than the preconditioner itself, which lets larger models be trained within the same memory budget.
Contribution
The first 4-bit second-order optimizer, showing that quantizing the eigenvector matrix of the preconditioner enables memory-efficient training with performance similar to the 32-bit version.
Findings
4-bit Shampoo matches 32-bit performance in image and language tasks.
Eigenvector matrix quantization outperforms direct preconditioner quantization.
Linear square quantization performs slightly better than dynamic tree quantization.
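The second finding can be illustrated with a toy NumPy experiment. This is a minimal sketch under stated assumptions, not the paper's implementation: `fake_quantize` is a hypothetical symmetric uniform 4-bit quantizer with a single scale per matrix (the paper's block-wise scheme and its quantization mappings are not reproduced here), and the test matrix is a synthetic ill-conditioned preconditioner. It compares two ways of approximating the inverse fourth root that Shampoo applies to gradients: quantizing the preconditioner directly, or keeping the eigenvalues in full precision and quantizing only the eigenvector matrix.

```python
import numpy as np


def fake_quantize(x, bits=4):
    """Hypothetical helper: round to a symmetric uniform grid, then dequantize.

    Uses one scale for the whole matrix; this is an assumption for the sketch.
    """
    levels = 2 ** (bits - 1) - 1          # 7 positive levels for 4 bits
    scale = np.max(np.abs(x))
    return np.round(x / scale * levels) / levels * scale


def inverse_fourth_root(eigvals, eigvecs, eps=1e-6):
    """Build V diag(lambda^(-1/4)) V^T, clipping eigenvalues below eps."""
    clipped = np.maximum(eigvals, eps)
    return (eigvecs * clipped ** -0.25) @ eigvecs.T


rng = np.random.default_rng(0)
n = 64

# Synthetic symmetric positive definite matrix with condition number 1e6.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.logspace(-3, 3, n)
A = (U * lam) @ U.T
A = (A + A.T) / 2                          # remove floating-point asymmetry

target = inverse_fourth_root(lam, U)

# Option 1: quantize the preconditioner itself, then take the inverse root.
lam_q, U_q = np.linalg.eigh(fake_quantize(A))
direct = inverse_fourth_root(lam_q, U_q)

# Option 2: keep eigenvalues in full precision, quantize only the eigenvectors.
via_eig = inverse_fourth_root(lam, fake_quantize(U))


def rel_err(approx):
    """Relative Frobenius error against the full-precision inverse root."""
    return np.linalg.norm(approx - target) / np.linalg.norm(target)


err_direct = rel_err(direct)
err_eig = rel_err(via_eig)
print(f"quantize preconditioner: relative error {err_direct:.3f}")
print(f"quantize eigenvectors:   relative error {err_eig:.3f}")
```

The intuition in this toy setting is that a coarse grid on the preconditioner wipes out its small eigenvalues, which dominate the inverse root, whereas eigenvector entries are bounded in magnitude by 1 and the eigenvalues themselves are left untouched.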
Abstract
Second-order optimizers, maintaining a matrix termed a preconditioner, are superior to first-order optimizers in both theory and practice. The states forming the preconditioner and its inverse root restrict the maximum size of models trained by second-order optimizers. To address this, compressing 32-bit optimizer states to lower bitwidths has shown promise in reducing memory usage. However, current approaches only pertain to first-order optimizers. In this paper, we propose the first 4-bit second-order optimizers, exemplified by 4-bit Shampoo, maintaining performance similar to that of 32-bit ones. We show that quantizing the eigenvector matrix of the preconditioner in 4-bit Shampoo is remarkably better than quantizing the preconditioner itself both theoretically and experimentally. By rectifying the orthogonality of the quantized eigenvector matrix, we enhance the approximation of the…
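The orthogonality rectification mentioned in the abstract can be sketched as follows. This is an illustration under assumptions, not the authors' code: it uses one step of the Björck orthonormalization iteration, V ← 1.5 V − 0.5 V VᵀV, as one standard way to pull a nearly orthogonal matrix back toward orthogonality, and `fake_quantize` is a hypothetical uniform 4-bit quantizer with a single scale per matrix.

```python
import numpy as np


def fake_quantize(x, bits=4):
    """Hypothetical helper: round to a symmetric uniform grid, then dequantize."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x))
    return np.round(x / scale * levels) / levels * scale


def orthogonality_error(V):
    """Frobenius norm of V^T V - I; zero for an exactly orthogonal matrix."""
    return np.linalg.norm(V.T @ V - np.eye(V.shape[1]))


def bjorck_step(V):
    """One Björck iteration: V <- 1.5 V - 0.5 V V^T V."""
    return 1.5 * V - 0.5 * V @ (V.T @ V)


rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((64, 64)))  # exactly orthogonal start

V = fake_quantize(U)            # quantization breaks orthogonality
err_before = orthogonality_error(V)

V_fixed = bjorck_step(V)        # rectify after dequantization
err_after = orthogonality_error(V_fixed)

print(f"orthogonality error before: {err_before:.4f}")
print(f"orthogonality error after:  {err_after:.4f}")
```

The iteration only helps when the dequantized matrix is already close to orthogonal (singular values near 1), which is the regime a quantized eigenvector matrix is expected to be in; further steps can be applied if a tighter tolerance is needed.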
Taxonomy
Topics: Experimental Learning in Engineering
