The Mamba in the Llama: Distilling and Accelerating Hybrid Models

Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, and Tri Dao

arXiv:2408.15237 · cs.LG · June 30, 2025 · 2 cites

TL;DR

This paper shows how to distill large pretrained Transformers into hybrid models built on linear RNN layers such as Mamba. The resulting models are cheaper to deploy and faster at inference while remaining competitive with the original Transformer on chat benchmarks.

Contribution

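The paper introduces a method for converting a pretrained Transformer into a hybrid model by replacing most attention layers with linear RNN layers that reuse the attention layers' linear projection weights. It also proposes a hardware-aware speculative decoding algorithm for faster inference.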
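The conversion idea can be illustrated with a small sketch. Everything here is an assumption made for illustration: the class names, the parameter names (`w_c`, `w_b`, `w_x`, `w_out`), the specific query/key/value-to-RNN mapping, and the rule of keeping every fourth attention layer are hypothetical stand-ins, not the authors' implementation. The source states only that the linear projection weights are reused and that the hybrid keeps a quarter of the attention layers.

```python
from dataclasses import dataclass


@dataclass
class AttentionLayer:
    # Projection weights as nested lists (rows x cols), for illustration only.
    w_q: list
    w_k: list
    w_v: list
    w_o: list


@dataclass
class LinearRNNLayer:
    # Hypothetical parameter names; a real Mamba layer has further parameters
    # that would need their own initialization.
    w_c: list    # state read-out projection
    w_b: list    # input-to-state projection
    w_x: list    # input projection
    w_out: list  # output projection


def _copy(matrix):
    """Copy a nested-list matrix so the new layer owns its weights."""
    return [row[:] for row in matrix]


def init_rnn_from_attention(attn):
    """Assumed mapping for this sketch: Q -> C, K -> B, V -> x, O -> output."""
    return LinearRNNLayer(
        w_c=_copy(attn.w_q),
        w_b=_copy(attn.w_k),
        w_x=_copy(attn.w_v),
        w_out=_copy(attn.w_o),
    )


def build_hybrid(attn_layers, keep_every=4):
    """Keep one attention layer in every `keep_every`; convert the rest."""
    hybrid = []
    for index, layer in enumerate(attn_layers):
        if index % keep_every == 0:
            hybrid.append(layer)  # attention layer kept as-is
        else:
            hybrid.append(init_rnn_from_attention(layer))
    return hybrid
```

With eight layers and `keep_every=4`, the sketch keeps two attention layers and converts the other six. The converted layers would then be trained by distillation from the original model, a step this sketch leaves out.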

Findings

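01

The distilled hybrid models outperform open-source hybrid Mamba models trained from scratch on both chat benchmarks and general benchmarks.

02

The hybrid model that keeps a quarter of the attention layers performs comparably to the original Transformer on chat benchmarks.

03

The distilled models exhibit natural length extrapolation, and speculative decoding accelerates their inference.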

Abstract

Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment. We demonstrate that it is feasible to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks and outperforms open-source hybrid Mamba models trained from scratch with trillions of tokens in both chat benchmarks and general benchmarks. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and…
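For readers unfamiliar with speculative decoding, the following is a minimal sketch of the generic greedy variant, not the hardware-aware algorithm the paper proposes. The function names and the toy `target_next` / `draft_next` callables are hypothetical. The sketch ignores recurrent-state management and hardware concerns entirely, and it calls the target model once per token, where a real system would verify a draft in a single batched pass.

```python
def speculative_decode(target_next, draft_next, prompt, num_new_tokens, k=4):
    """Greedy speculative decoding with toy next-token functions.

    target_next, draft_next: callables mapping a token list to the next token.
    The output matches plain greedy decoding with target_next.
    """
    tokens = list(prompt)
    goal = len(prompt) + num_new_tokens
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2. The target model checks the proposals in order.
        accepted = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            if proposed == expected:
                accepted.append(proposed)
            else:
                # First mismatch: take the target's token and stop.
                accepted.append(expected)
                break
        else:
            # Every proposal matched, so the target adds one more token.
            accepted.append(target_next(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:goal]


def greedy_decode(target_next, prompt, num_new_tokens):
    """Reference: plain greedy decoding with the target model alone."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(target_next(tokens))
    return tokens
```

Because every accepted token equals the target model's own greedy choice for its prefix, the result is identical to `greedy_decode`. Any speed-up comes from how many draft tokens are accepted per verification step.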

Code & Models

Videos

The Mamba in the Llama: Distilling and Accelerating Hybrid Models · SlidesLive

Taxonomy

Topics: Machine Learning (cs.LG)

Methods: Attention Is All You Need · Linear Layer · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Absolute Position Encodings