Quamba: A Post-Training Quantization Recipe for Selective State Space Models

Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, and Diana Marculescu

arXiv:2410.13229 · cs.LG · December 10, 2024

TL;DR

This paper introduces Quamba, a static 8-bit quantization method for State Space Models that reduces latency and model size while maintaining high accuracy, enabling efficient deployment on cloud and edge devices.

Contribution

The paper presents a novel 8-bit quantization technique tailored for SSMs, addressing their unique sensitivity and outlier issues, and demonstrates significant latency improvements with minimal accuracy loss.

Findings

01 Achieves 1.72x lower latency on the Nvidia Orin Nano 8G

02 Incurs only a 0.9% accuracy drop on zero-shot tasks

03 Demonstrates effectiveness for both cloud and edge deployment

Abstract

State Space Models (SSMs) have emerged as an appealing alternative to Transformers for large language models, achieving state-of-the-art accuracy with constant memory complexity which allows for holding longer context lengths than attention-based networks. The superior computational efficiency of SSMs in long sequence modeling positions them favorably over Transformers in many scenarios. However, improving the efficiency of SSMs on request-intensive cloud-serving and resource-limited edge applications is still a formidable task. SSM quantization is a possible solution to this problem, making SSMs more suitable for wide deployment, while still maintaining their accuracy. Quantization is a common technique to reduce the model size and to utilize the low bit-width acceleration features on modern computing units, yet existing quantization techniques are poorly suited for SSMs. Most notably,…
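To make the term "static 8-bit quantization" concrete, here is a minimal sketch of generic symmetric per-tensor int8 quantization, in which the scale is fixed once from calibration data and then reused at inference time. It is not the Quamba recipe itself, which this page does not describe in detail; the function names and sample values are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: generic static, symmetric, per-tensor int8 quantization.
# This is NOT the Quamba method; all names and values here are hypothetical.

def calibrate_scale(calibration_values, num_bits=8):
    """Compute a fixed scale offline from calibration data ("static")."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize(values, scale, num_bits=8):
    """Map floats to signed integers, clipping to the representable range."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """Map integers back to approximate floats."""
    return [q * scale for q in q_values]

# Scale is derived once from calibration samples...
calibration = [-1.27, 0.5, 0.9, 1.0]
scale = calibrate_scale(calibration)

# ...and reused unchanged for inputs seen later.
runtime_input = [0.25, -0.6, 3.0]  # 3.0 lies outside the calibrated range
q = quantize(runtime_input, scale)
recovered = dequantize(q, scale)
```

In this sketch the value 3.0 is clipped to the largest int8 code because it exceeds the calibrated range. Conversely, a single large outlier in the calibration set would inflate the scale and coarsen the resolution for all other values, which is one general reason outliers make low-bit quantization difficult.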

Code & Models

Repositories

enyac-group/quamba (PyTorch, official)

Videos

Quamba: A Post-Training Quantization Recipe for Selective State Space Models · SlidesLive

Taxonomy

Topics: Neural Networks and Applications

Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces