Quamba: A Post-Training Quantization Recipe for Selective State Space Models
Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, and Diana Marculescu

TL;DR
This paper introduces Quamba, a static 8-bit quantization method for State Space Models that reduces latency and model size while maintaining high accuracy, enabling efficient deployment on cloud and edge devices.
Contribution
The paper presents a novel 8-bit quantization technique tailored for SSMs, addressing their unique sensitivity and outlier issues, and demonstrates significant latency improvements with minimal accuracy loss.
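For readers unfamiliar with the term, "static" quantization means the scale factor is fixed ahead of time from calibration data instead of being recomputed for each input. The sketch below illustrates that general idea with plain symmetric per-tensor 8-bit quantization. It is not Quamba's recipe: the function names (`calibrate_scale`, `quantize`, `dequantize`) and the sample values are hypothetical and chosen only for illustration.

```python
from typing import List


def calibrate_scale(calibration_batches: List[List[float]], n_bits: int = 8) -> float:
    """Derive one fixed scale from calibration data (the 'static' part)."""
    qmax = 2 ** (n_bits - 1) - 1  # 127 for signed 8-bit
    max_abs = max(abs(x) for batch in calibration_batches for x in batch)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values: List[float], scale: float, n_bits: int = 8) -> List[int]:
    """Map floats to signed integers with the precomputed scale, clamping to range."""
    qmax = 2 ** (n_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values: List[int], scale: float) -> List[float]:
    """Map quantized integers back to approximate float values."""
    return [q * scale for q in q_values]


# Hypothetical calibration data: the largest magnitude (63.5) fixes the scale.
scale = calibrate_scale([[10.0, -63.5], [3.0, 20.25]])  # 63.5 / 127 = 0.5

# At inference time the same scale is reused, so an unseen outlier (100.0)
# is clipped to the largest representable value.
q = quantize([1.0, -2.3, 100.0], scale)   # [2, -5, 127]
x_hat = dequantize(q, scale)              # [1.0, -2.5, 63.5]
```

The clipped last value shows why outliers matter for a fixed scale: a wide calibration range wastes precision on typical values, while a narrow one clips large activations.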
Findings
Achieves 1.72x lower latency on Nvidia Orin Nano 8G
Incurs only a 0.9% accuracy drop on zero-shot tasks
Demonstrates effectiveness for cloud and edge deployment
Abstract
State Space Models (SSMs) have emerged as an appealing alternative to Transformers for large language models, achieving state-of-the-art accuracy with constant memory complexity which allows for holding longer context lengths than attention-based networks. The superior computational efficiency of SSMs in long sequence modeling positions them favorably over Transformers in many scenarios. However, improving the efficiency of SSMs on request-intensive cloud-serving and resource-limited edge applications is still a formidable task. SSM quantization is a possible solution to this problem, making SSMs more suitable for wide deployment, while still maintaining their accuracy. Quantization is a common technique to reduce the model size and to utilize the low bit-width acceleration features on modern computing units, yet existing quantization techniques are poorly suited for SSMs. Most notably,…
Taxonomy
Topics: Neural Networks and Applications
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
