Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models
J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain

TL;DR
This paper introduces Mamba-Shedder, a compression method for SSM-based models that reduces size and computation with minimal accuracy loss, achieving up to 1.4x speedup during inference.
Contribution
It presents a novel post-Transformer compression technique for SSM-based models that improves efficiency while preserving performance.
Findings
Achieves up to 1.4x inference speedup.
Reduces model size and computational overhead.
Maintains accuracy with minimal loss after removing selected components.
Abstract
Large pre-trained models have achieved outstanding results in sequence modeling. The Transformer block and its attention mechanism have been the main drivers of the success of these models. Recently, alternative architectures, such as Selective Structured State Space Models (SSMs), have been proposed to address the inefficiencies of Transformers. This paper explores the compression of SSM-based models, particularly Mamba and its hybrids. We study the sensitivity of these models to the removal of selected components at different granularities to reduce the model size and computational overhead, thus improving their efficiency while maintaining accuracy. The proposed solutions, collectively referred to as Mamba-Shedder, achieve a speedup of up to 1.4x during inference, demonstrating that model efficiency can be improved by eliminating several redundancies with minimal impact on the…
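The abstract describes studying how sensitive a model is to the removal of selected components. As a rough illustration of that idea only, and not the authors' actual procedure, the sketch below greedily removes residual blocks from a toy stack, each time dropping the block whose removal increases an error metric the least. Everything here is an assumption made for illustration: the scalar "blocks", the mean-squared-error metric, and the names `forward`, `mean_squared_error`, and `greedy_prune` are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a model block: returns the residual update added to x.
Block = Callable[[float], float]


def forward(blocks: Sequence[Block], active: Sequence[bool], x: float) -> float:
    """Run a residual stack, skipping every block whose flag is False."""
    for block, keep in zip(blocks, active):
        if keep:
            x = x + block(x)
    return x


def mean_squared_error(blocks: Sequence[Block], active: Sequence[bool],
                       inputs: Sequence[float], targets: Sequence[float]) -> float:
    """Assumed sensitivity metric: MSE between pruned outputs and reference targets."""
    errors = [(forward(blocks, active, x) - t) ** 2 for x, t in zip(inputs, targets)]
    return sum(errors) / len(errors)


def greedy_prune(blocks: Sequence[Block], inputs: Sequence[float],
                 targets: Sequence[float], num_to_remove: int) -> List[bool]:
    """Remove blocks one at a time, always dropping the least harmful one."""
    active = [True] * len(blocks)
    for _ in range(num_to_remove):
        best_idx, best_err = None, float("inf")
        for i, keep in enumerate(active):
            if not keep:
                continue
            trial = active.copy()
            trial[i] = False  # tentatively remove block i
            err = mean_squared_error(blocks, trial, inputs, targets)
            if err < best_err:
                best_idx, best_err = i, err
        if best_idx is None:  # nothing left to remove
            break
        active[best_idx] = False
    return active


# Toy stack: two blocks matter, one barely contributes, one contributes nothing.
blocks = [lambda x: 0.5 * x, lambda x: 0.001 * x, lambda x: 0.3 * x, lambda x: 0.0]
inputs = [1.0, 2.0, 3.0]
# The unpruned model's outputs serve as the reference.
targets = [forward(blocks, [True] * len(blocks), x) for x in inputs]

keep_mask = greedy_prune(blocks, inputs, targets, num_to_remove=2)
print(keep_mask)
```

In this toy setting the two near-redundant blocks are the ones removed. A real study of SSM-based models would swap the scalar blocks for actual model components and the error metric for a task-relevant measure; those choices are left open here.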
Taxonomy
Topics: Copper Interconnects and Reliability · Vibration and Dynamic Analysis
Methods: Attention Is All You Need · Softmax · Adam · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
