Mamba-PTQ: Outlier Channels in Recurrent Large Language Models
Alessandro Pierro, Steven Abreu

TL;DR
This paper investigates the challenges of quantizing recurrent large language models, specifically Mamba, highlighting activation outliers as a key obstacle and proposing initial steps for outlier-aware quantization to improve deployment efficiency.
Contribution
This work is the first to analyze outlier channels in recurrent LLMs such as Mamba under post-training quantization, and it suggests methods to address these outliers.
Findings
Activation outliers are a major challenge in quantizing recurrent LLMs.
Activation outliers degrade baseline post-training quantization results.
Initial outlier-aware quantization strategies are proposed.
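
To illustrate why outlier channels matter for the findings above, here is a minimal sketch of symmetric per-tensor absmax int8 quantization applied to synthetic activations. It is an illustration under stated assumptions, not the paper's method: the activation shapes, the outlier channel index, the 50x magnitude, and all function names are hypothetical choices made for this example.

```python
import numpy as np

def quantize_absmax_int8(x):
    """Symmetric per-tensor int8 quantization with a single shared scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to floating point."""
    return q.astype(np.float32) * scale

def mean_abs_error(x, channels=None):
    """Mean absolute round-trip error, optionally restricted to some channels."""
    q, scale = quantize_absmax_int8(x)
    err = np.abs(x - dequantize(q, scale))
    if channels is not None:
        err = err[:, channels]
    return float(err.mean())

rng = np.random.default_rng(0)

# Synthetic activations of shape (tokens, channels); not taken from a real model.
acts = rng.normal(0.0, 1.0, size=(256, 64)).astype(np.float32)

# Hypothetical outlier channel: one channel with a much larger magnitude.
outlier_channel = 7
acts_outlier = acts.copy()
acts_outlier[:, outlier_channel] *= 50.0

normal_channels = [c for c in range(acts.shape[1]) if c != outlier_channel]

# Error on the same normal channels, without and with the outlier present.
err_clean = mean_abs_error(acts, normal_channels)
err_rest = mean_abs_error(acts_outlier, normal_channels)

print(f"error on normal channels, no outlier:   {err_clean:.4f}")
print(f"error on normal channels, with outlier: {err_rest:.4f}")
```

Because the single scale is set by the largest absolute value in the tensor, one high-magnitude channel stretches the quantization grid and leaves the remaining channels with only a few usable levels. Outlier-aware schemes aim to avoid this by treating such channels separately.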
Abstract
Modern recurrent layers are emerging as a promising path toward edge deployment of foundation models, especially in the context of large language models (LLMs). Compressing the whole input sequence in a finite-dimensional representation enables recurrent layers to model long-range dependencies while maintaining a constant inference cost for each token and a fixed memory requirement. However, the practical deployment of LLMs in resource-limited environments often requires further model compression, such as quantization and pruning. While these techniques are well-established for attention-based models, their effects on recurrent layers remain underexplored. In this preliminary work, we focus on post-training quantization for recurrent LLMs and show that Mamba models exhibit the same pattern of outlier channels observed in attention-based LLMs. We show that the reason for the difficulty…
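
The abstract's point about constant per-token cost and fixed memory can be sketched with a generic diagonal linear recurrence. This is a simplified stand-in, not Mamba's selective state-space layer: the dimensions, the parameter names `a`, `B`, `C`, and the function `recurrent_step` are assumptions made for illustration.

```python
import numpy as np

def recurrent_step(h, x_t, a, B, C):
    """One step of a diagonal linear recurrence.

    h_t = a * h_{t-1} + B @ x_t
    y_t = C @ h_t
    """
    h_new = a * h + B @ x_t
    y_t = C @ h_new
    return h_new, y_t

rng = np.random.default_rng(0)
d_in, d_state = 4, 8  # hypothetical sizes

a = rng.uniform(0.5, 0.99, size=d_state)   # per-dimension decay, kept below 1
B = rng.normal(size=(d_state, d_in))       # input projection
C = rng.normal(size=(d_in, d_state))       # output projection

h = np.zeros(d_state)
state_sizes = []
for t in range(1000):
    x_t = rng.normal(size=d_in)
    h, y_t = recurrent_step(h, x_t, a, B, C)
    state_sizes.append(h.size)

# The state holds the same number of values after 1 token or 1000 tokens.
print(sorted(set(state_sizes)))
```

Each step touches only the current input and a state of fixed size, so the work and memory per token do not grow with the length of the sequence already processed.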
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Focus
