MambaQuant: Quantizing the Mamba Family with Variance Aligned Rotation Methods

Zukang Xu; Yuxuan Yue; Xing Hu; Zhihang Yuan; Zixu Jiang; Zhixuan Chen; Jiangyong Yu; Chen Xu; Sifan Zhou; Dawei Yang

arXiv:2501.13484 · cs.LG · March 12, 2025

TL;DR

MambaQuant is a post-training quantization (PTQ) framework tailored to Mamba models. It reduces model size and latency with minimal accuracy loss by addressing the outliers and uneven variance distributions specific to Mamba.

Contribution

The paper presents the first comprehensive PTQ method for Mamba models, utilizing variance-aligned rotation and variance equalization techniques to handle distribution outliers.

Findings

01

Quantizes both weights and activations to 8 bits (W8A8) with less than 1% accuracy loss.

02

Effectively handles outliers and distribution variance in Mamba models.

03

Enables efficient deployment of Mamba models in resource-constrained environments.

Abstract

Mamba is an efficient sequence model that rivals Transformers and demonstrates significant potential as a foundational architecture for various tasks. Quantization is commonly used in neural networks to reduce model size and computational latency. However, applying quantization to Mamba remains underexplored, and existing quantization methods, which have been effective for CNN and Transformer models, appear inadequate for Mamba models (e.g., QuaRot suffers a 21% accuracy drop on Vim-T$^{\dagger}$ even under W8A8). We have pioneered the exploration of this issue and identified several key challenges. First, significant outliers are present in gate projections, output projections, and matrix multiplications. Second, Mamba's unique parallel scan further amplifies these outliers, leading to uneven and heavy-tailed data distributions. Third, even with the application of the Hadamard transform,…
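To make the role of the Hadamard transform concrete, the sketch below shows the generic rotate-then-quantize idea used by rotation-based methods such as QuaRot. It is not MambaQuant's variance-aligned rotation, whose details are not given in the text above. The helper names (`hadamard`, `fake_quant_int8`), the vector length of 64, and the outlier value of 100 are assumptions chosen for illustration only.

```python
import math
import random


def hadamard(n):
    """Return the orthonormal n x n Sylvester Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # Sylvester construction: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def matvec(m, x):
    """Multiply matrix m (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def fake_quant_int8(x):
    """Symmetric per-tensor int8 quantization followed by dequantization."""
    max_abs = max(abs(v) for v in x)
    if max_abs == 0.0:
        return list(x)
    scale = max_abs / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in x]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


rng = random.Random(0)
n = 64  # hypothetical channel count
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
x[3] = 100.0  # hypothetical outlier channel that dominates the quantization scale

H = hadamard(n)

# Baseline: quantize the activation vector directly.
err_plain = mse(x, fake_quant_int8(x))

# Rotate, quantize, rotate back. The normalized Sylvester matrix is symmetric
# and orthogonal, so it is its own inverse.
x_rot = matvec(H, x)
x_back = matvec(H, fake_quant_int8(x_rot))
err_rot = mse(x, x_back)

print("max |x| before rotation:", max(abs(v) for v in x))
print("max |x| after rotation: ", max(abs(v) for v in x_rot))
print("int8 MSE without rotation:", err_plain)
print("int8 MSE with rotation:   ", err_rot)
```

The rotation spreads the energy of the single outlier across all channels, so the int8 scale shrinks and the remaining channels keep more resolution. In this toy setting only one channel is an outlier and the rest share the same variance; the abstract's third challenge concerns cases where a plain Hadamard transform is not sufficient.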

Taxonomy

Topics: Mathematics and Applications

Methods: Attention Is All You Need · Adam · Softmax · Absolute Position Encodings · Residual Connection · Dropout · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer