TL;DR
MuonQ is a low-bit quantization framework for the Muon optimizer's states. It uses directional-fidelity techniques to preserve training stability and accuracy while substantially reducing optimizer-state memory.
Contribution
It presents a 4-bit quantization method for Muon optimizer states that combines pre-quantization normalization, structural decomposition, and $\mu$-law companding, enabling memory-efficient training of large language models.
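As a rough illustration of the companding step only, the sketch below applies textbook $\mu$-law companding before rounding to signed 4-bit codes. It is not the paper's implementation: the function names, the per-tensor max-abs scaling, and the value `mu=255` are all assumptions chosen for the example, and the normalization and structural decomposition steps are omitted.

```python
import numpy as np

def mu_law_quantize(x, bits=4, mu=255.0):
    """Scale to [-1, 1], compand with mu-law, round to signed `bits`-bit codes.

    Hypothetical helper: per-tensor max-abs scaling and mu=255 are assumptions.
    """
    scale = float(np.max(np.abs(x)))
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for an all-zero tensor
    y = x / scale
    # mu-law compression: expands resolution near zero, compresses large values
    companded = np.sign(y) * np.log1p(mu * np.abs(y)) / np.log1p(mu)
    levels = 2 ** (bits - 1) - 1  # 7 for 4 bits, so codes lie in [-7, 7]
    codes = np.rint(companded * levels).astype(np.int8)
    return codes, scale

def mu_law_dequantize(codes, scale, bits=4, mu=255.0):
    """Invert mu_law_quantize up to rounding error."""
    levels = 2 ** (bits - 1) - 1
    companded = codes.astype(np.float64) / levels
    # mu-law expansion: inverse of the compression above
    y = np.sign(companded) * np.expm1(np.abs(companded) * np.log1p(mu)) / mu
    return y * scale

x = np.array([1.0, 0.01, 0.0, -0.5])
codes, scale = mu_law_quantize(x)
x_hat = mu_law_dequantize(codes, scale)
```

With uniform 4-bit rounding, the entry `0.01` would map to code 0 (since `0.01 * 7` rounds to 0) and be lost. Under companding it receives a nonzero code, which is the motivation for spending more quantization levels on small-magnitude values.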
Findings
MuonQ achieves stable 4-bit quantization of Muon optimizer states.
Pre-training results show MuonQ matches full-precision Muon in loss and accuracy.
Optimizer state memory is reduced by up to 7.3×.
Abstract
The Muon optimizer has emerged as a compelling alternative to Adam for training large language models, achieving remarkable computational savings through gradient orthogonalization. However, Muon's optimizer state is more sensitive to quantization errors: because the orthogonalization discards the magnitudes of singular values and retains only directional information, even small quantization errors in singular vector directions are amplified in the update. In this work, we propose MuonQ, a low-bit Muon training framework built on the principle of directional fidelity optimization. First, we apply a pre-quantization normalization so that each step introduces quantization errors of the same magnitude, preventing the accumulated error from developing a preferred direction. Second, we introduce a structural decomposition that separately quantizes the dominant singular components via power…
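To make the sensitivity argument in the abstract concrete, the sketch below orthogonalizes a matrix via an explicit SVD and checks that the result depends only on the singular vectors, not on the singular values. This is an illustration under assumptions: practical Muon implementations approximate this map with a Newton-Schulz iteration rather than an SVD, and the helper name `orthogonalize`, the matrix sizes, and the noise level are made up for the example.

```python
import numpy as np

def orthogonalize(m):
    """Return U @ V^T from the thin SVD of m, i.e. set every singular value to 1."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
# Shared singular vectors (orthonormal columns from QR of random matrices).
u, _ = np.linalg.qr(rng.standard_normal((6, 4)))
v, _ = np.linalg.qr(rng.standard_normal((4, 4)))

# Two matrices with the same singular vectors but different singular values.
m1 = u @ np.diag([10.0, 5.0, 1.0, 0.1]) @ v.T
m2 = u @ np.diag([3.0, 2.0, 1.5, 0.5]) @ v.T

o1 = orthogonalize(m1)
o2 = orthogonalize(m2)
magnitudes_discarded = np.allclose(o1, o2)  # both equal u @ v.T

# A small perturbation of m1 (a stand-in for quantization error) can still
# move the output, because weak directions are rescaled to unit strength.
noise = rng.standard_normal(m1.shape)
noise *= 0.01 / np.linalg.norm(noise)
rel_in = np.linalg.norm(noise) / np.linalg.norm(m1)
rel_out = np.linalg.norm(orthogonalize(m1 + noise) - o1) / np.linalg.norm(o1)
```

Comparing `rel_out` with `rel_in` gives a feel for how much the orthogonalized update moves relative to the size of the input error. No specific ratio is claimed here, since it depends on the spectrum and the noise draw.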
