Attention to Mamba: A Recipe for Cross-Architecture Distillation

Abhinav Moudgil; Ningyuan Huang; Eeshan Gunesh Dhekane; Pau Rodríguez; Luca Zappella; Federico Danieli

arXiv:2604.14191·cs.CL·April 17, 2026

TL;DR

This paper presents a two-stage distillation method to effectively transfer knowledge from Transformer models to Mamba State Space Models, preserving performance while reducing computational costs.

Contribution

The paper introduces a principled initialization and a two-stage distillation process that together enable Mamba models to match Transformer performance on downstream tasks.

Findings

01 Distilled Mamba models maintain perplexity close to that of their Transformer teachers.

02 The two-stage distillation improves performance over naive distillation.

03 The approach is validated through extensive ablations and scaling analyses.

Abstract

State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models, due to their reduced memory consumption and higher throughput at generation compared to their Attention-based counterparts. On the other hand, the community has built up a considerable body of knowledge on how to train Transformers, and many pretrained Transformer models are readily available. To facilitate the adoption of SSMs while leveraging existing pretrained Transformers, we aim to identify an effective recipe to distill an Attention-based model into a Mamba-like architecture. In prior work on cross-architecture distillation, however, it has been shown that a naïve distillation procedure from Transformers to Mamba fails to preserve the original teacher performance, a limitation often overcome with hybrid solutions combining Attention and SSM blocks. The key argument from our work is…

Code & Models

Models

🤗 akashicmarga/whisper-tiny-hedgemamba (model)
