The Mamba in the Llama: Distilling and Accelerating Hybrid Models
Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, and Tri Dao

TL;DR
This paper shows how to distill large Transformer models into hybrid models built on linear RNNs such as Mamba, enabling more efficient deployment and faster inference while remaining competitive on language modeling and downstream benchmarks.
Contribution
The paper introduces a method for converting pretrained Transformers into hybrid models with linear RNN layers, and proposes a hardware-aware speculative decoding algorithm for faster inference.
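To make the decoding idea concrete, here is a minimal sketch of generic greedy speculative decoding: a cheap draft model proposes a few tokens, and the target model keeps the longest prefix it agrees with. This is not the paper's hardware-aware algorithm; the function names (`speculative_decode`, `greedy_decode`, `toy_target`, `toy_draft`) and the toy integer "models" are hypothetical and exist only for illustration.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function (assumed interface)


def greedy_decode(model: NextToken, prompt: List[Token], num_new: int) -> List[Token]:
    """Plain autoregressive greedy decoding, used as the reference."""
    seq = list(prompt)
    for _ in range(num_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], num_new: int, k: int = 4) -> List[Token]:
    """Greedy draft-and-verify decoding; the output equals target-only greedy decoding."""
    seq = list(prompt)
    goal = len(prompt) + num_new
    while len(seq) < goal:
        # 1) The draft model proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2) The target model checks each proposed position. A real system
        #    would score all positions in one batched pass; this loop is
        #    written for clarity, not speed.
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        seq.extend(accepted)
    return seq


# Hypothetical toy models over tokens 0..6.
def toy_target(seq: List[Token]) -> Token:
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq: List[Token]) -> Token:
    # Agrees with the target except after token 5, to force a rejection.
    return 0 if seq[-1] == 5 else toy_target(seq)
```

Because every rejected draft token is replaced by the target model's own choice, the result is identical to decoding with the target model alone; the potential speedup comes from verifying several positions at once. For a recurrent model, a practical implementation also has to restore the hidden state after a rejection, which this sketch does not model.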
Findings
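The distilled hybrid models outperform open-source hybrid Mamba models trained from scratch on both chat and general benchmarks.
A hybrid that keeps a quarter of the attention layers performs comparably to the original Transformer on chat benchmarks.
The distilled models exhibit natural length extrapolation and support accelerated inference.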
Abstract
Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment. We demonstrate that it is feasible to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks and outperforms open-source hybrid Mamba models trained from scratch with trillions of tokens in both chat benchmarks and general benchmarks. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and…
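One way to see why attention projection weights can be reused in a linear RNN: causal attention without the softmax can be computed exactly as a recurrence over a fixed-size state. The sketch below checks that equivalence on toy data. It is an illustration under stated assumptions, not the paper's distillation procedure: treating the query, key, and value projections (`Wq`, `Wk`, `Wv`) as the RNN's read-out and input projections is an assumption made for the example, the recurrence omits the input-dependent gating that Mamba uses, and all names and toy matrices are hypothetical.

```python
from typing import List

Vec = List[float]
Mat = List[List[float]]


def matvec(W: Mat, x: Vec) -> Vec:
    """Apply a linear projection (one weight row per output dimension)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def causal_linear_attention(xs: List[Vec], Wq: Mat, Wk: Mat, Wv: Mat) -> List[Vec]:
    """Attention without softmax: y_t = sum over s <= t of (q_t . k_s) * v_s."""
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    outputs = []
    for t in range(len(xs)):
        y = [0.0] * len(Wv)
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(qs[t], ks[s]))
            y = [yi + score * vi for yi, vi in zip(y, vs[s])]
        outputs.append(y)
    return outputs


def linear_rnn(xs: List[Vec], Wq: Mat, Wk: Mat, Wv: Mat) -> List[Vec]:
    """Same projections, run as a recurrence: h_t = h_{t-1} + outer(k_t, v_t)."""
    d_k, d_v = len(Wk), len(Wv)
    h = [[0.0] * d_v for _ in range(d_k)]  # fixed-size state, independent of sequence length
    outputs = []
    for x in xs:
        q, k, v = matvec(Wq, x), matvec(Wk, x), matvec(Wv, x)
        for i in range(d_k):
            for j in range(d_v):
                h[i][j] += k[i] * v[j]
        # Read out with the query projection: y_t = q_t^T h_t.
        outputs.append([sum(q[i] * h[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs


def max_abs_diff(a: List[Vec], b: List[Vec]) -> float:
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Hypothetical toy weights (d_model=3, d_k=2, d_v=2) and a 3-token input.
Wq = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
Wk = [[0.0, 1.0, 1.0], [1.0, 0.0, -1.0]]
Wv = [[1.0, 1.0, 0.0], [0.0, 2.0, 1.0]]
xs = [[1.0, 0.0, 1.0], [0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]

attn_out = causal_linear_attention(xs, Wq, Wk, Wv)
rnn_out = linear_rnn(xs, Wq, Wk, Wv)
```

The two functions return the same outputs, but the recurrent form only carries a `d_k × d_v` state from token to token, so per-token generation cost does not grow with context length. Softmax attention has no such exact recurrent form, which is why a conversion like the one described here needs a distillation step rather than a pure weight copy.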
Code & Models
- 🤗 JunxiongWang/mamba_0_5_dpo_ep1 · model · 1 dl
- 🤗 JunxiongWang/mamba_0_5_dpo_ep3 · model · 2 dl
- 🤗 JunxiongWang/mamba_0_875_dpo_ep3 · model · 3 dl · ♡ 1
- 🤗 JunxiongWang/mamba_0_875_dpo_ep1 · model · 1 dl
- 🤗 JunxiongWang/mamba_0_75_dpo_ep3 · model · 1 dl
- 🤗 JunxiongWang/mamba_0_75_dpo_ep1 · model · 4 dl
- 🤗 JunxiongWang/MambaInLlama_0_50 · model · 1 dl
- 🤗 JunxiongWang/Mamba2InLlama_0_50 · model · 4 dl
- 🤗 JunxiongWang/MambaInLlama_0_75 · model
- 🤗 JunxiongWang/Mamba2InLlama_0_75 · model · 1 dl
Taxonomy
Topics: Agriculture and Rural Development Research
Methods: Attention Is All You Need · Linear Layer · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Absolute Position Encodings
