MambaQuant: Quantizing the Mamba Family with Variance Aligned Rotation Methods
Zukang Xu, Yuxuan Yue, Xing Hu, Zhihang Yuan, Zixu Jiang, Zhixuan Chen, Jiangyong Yu, Chen Xu, Sifan Zhou, Dawei Yang

TL;DR
MambaQuant is a post-training quantization (PTQ) framework tailored to Mamba models. It reduces model size and latency with minimal accuracy loss by handling the outliers and uneven, heavy-tailed activation distributions that are specific to Mamba.
Contribution
The paper presents the first comprehensive PTQ method for Mamba models, utilizing variance-aligned rotation and variance equalization techniques to handle distribution outliers.
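To make the idea of a variance-aligned rotation concrete, here is a minimal NumPy sketch. It assumes one possible realization: first rotate into the eigenbasis of the calibration covariance, then apply a Hadamard rotation. The calibration data, the names `calib`, `mix`, and `K`, and the construction itself are illustrative assumptions, not details taken from this summary, and the paper's exact procedure may differ.

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    assert n > 0 and n & (n - 1) == 0
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
n = 16

# Hypothetical calibration activations: uneven per-channel scales, then mixed
# so that the covariance is not diagonal.
stds = np.linspace(0.1, 10.0, n)
mix = rng.normal(size=(n, n))
calib = (rng.normal(size=(4096, n)) * stds) @ mix

H = hadamard(n)

# Eigenvectors of the calibration covariance (columns of K).
cov = np.cov(calib, rowvar=False)
_, K = np.linalg.eigh(cov)

# Per-channel variance after a plain Hadamard rotation vs. the aligned rotation K @ H.
var_hadamard = np.var(calib @ H, axis=0)
var_aligned = np.var(calib @ K @ H, axis=0)

spread_hadamard = var_hadamard.max() / var_hadamard.min()
spread_aligned = var_aligned.max() / var_aligned.min()

print(f"max/min channel variance, Hadamard only: {spread_hadamard:.3f}")
print(f"max/min channel variance, aligned:       {spread_aligned:.3f}")
```

In this sketch the aligned rotation equalizes channel variances on the calibration data because every entry of the orthonormal Hadamard matrix has magnitude 1/sqrt(n), so each rotated channel receives the mean of the covariance eigenvalues. A plain Hadamard rotation gives no such guarantee when the covariance is not diagonal.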
Findings
Quantizes weights and activations to 8 bits (W8A8) with less than 1% accuracy loss.
Effectively handles outliers and distribution variance in Mamba models.
Enables efficient deployment of Mamba models in resource-constrained environments.
Abstract
Mamba is an efficient sequence model that rivals Transformers and demonstrates significant potential as a foundational architecture for various tasks. Quantization is commonly used in neural networks to reduce model size and computational latency. However, applying quantization to Mamba remains underexplored, and existing quantization methods, which have been effective for CNN and Transformer models, appear inadequate for Mamba models (e.g., QuaRot suffers a 21% accuracy drop on Vim-T even under W8A8). We have pioneered the exploration of this issue and identified several key challenges. First, significant outliers are present in gate projections, output projections, and matrix multiplications. Second, Mamba's unique parallel scan further amplifies these outliers, leading to uneven and heavy-tailed data distributions. Third, even with the application of the Hadamard transform,…
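As background for the abstract's mention of outliers and the Hadamard transform, the following NumPy sketch shows the baseline idea only: a Hadamard rotation spreads a single outlier channel across all channels before symmetric per-tensor 8-bit quantization. The synthetic activations, the outlier scale, and the helper names `hadamard` and `quantize_int8` are assumptions made for illustration. This is not MambaQuant's method, and it does not reproduce any result from the paper.

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    assert n > 0 and n & (n - 1) == 0
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_int8(x):
    """Symmetric per-tensor 8-bit fake quantization: snap to the int8 grid, then dequantize."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)

# Hypothetical activations: 256 tokens x 64 channels, with one outlier channel.
x = rng.normal(size=(256, 64))
x[:, 3] *= 50.0

H = hadamard(64)

# Quantize directly: the outlier channel dictates a coarse step for every channel.
err_plain = np.mean((quantize_int8(x) - x) ** 2)

# Rotate, quantize, rotate back: the outlier energy is spread over all channels.
x_rot = x @ H
err_rot = np.mean((quantize_int8(x_rot) @ H.T - x) ** 2)

print(f"MSE without rotation: {err_plain:.5f}")
print(f"MSE with Hadamard rotation: {err_rot:.5f}")
```

Because the rotation is orthogonal, it can be undone exactly after quantization, so the two errors are directly comparable. In this toy setting the rotated tensor has a much smaller dynamic range and therefore a finer quantization step.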
Taxonomy
Topics: Mathematics and Applications
Methods: Attention Is All You Need · Adam · Softmax · Absolute Position Encodings · Residual Connection · Dropout · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
