Data Efficient Any Transformer-to-Mamba Distillation via Attention Bridge
Penghao Wang, Yuhao Zhou, Mengxuan Wu, Panpan Zhang, Zhangyang Wang, Kai Wang

TL;DR
This paper introduces Cross-architecture distillation via Attention Bridge (CAB), a data-efficient distillation framework that transfers attention knowledge from Transformer models to state-space models and improves their performance, especially when training data is limited.
Contribution
The paper proposes a new cross-architecture distillation method with token-level supervision and flexible layer alignment, enabling efficient knowledge transfer from Transformers to SSMs.
Findings
CAB outperforms existing distillation methods in vision and language tasks.
The method enhances SSM performance with limited training data.
Attention knowledge transfer is effective across diverse domains.
Abstract
State-space models (SSMs) have emerged as efficient alternatives to Transformers for sequence modeling, offering superior scalability through recurrent structures. However, their training remains costly and the ecosystem around them is far less mature than that of Transformers. Moreover, the structural heterogeneity between SSMs and Transformers makes it challenging to efficiently distill knowledge from pretrained attention models. In this work, we propose Cross-architecture distillation via Attention Bridge (CAB), a novel data-efficient distillation framework that efficiently transfers attention knowledge from Transformer teachers to state-space student models. Unlike conventional knowledge distillation that transfers knowledge only at the output level, CAB enables token-level supervision via a lightweight bridge and flexible layer-wise alignment, improving both efficiency and…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper has a combination of intuitive ideas (e.g. the Attention Bridge, the layerwise proportional indexing for asymmetric teacher and students), and these are introduced and demonstrated well.
- The chosen baselines make sense to compare against, and the proposed method seems generally superior to them.
- The paper is generally well written and is easy to read and understand.
- The paper restricts evaluation to settings in which the teacher model is the same size as the student model, or similar (e.g. DeiT-Tiny to Vim-Tiny and DeiT-Small to Vim-Small in most experiments, with only one experiment using the next-larger model variant as teacher). I think that distillation offers the most practical value when a larger teacher is used to train a smaller student model (e.g. DeiT-Base, Large, or Huge as a teacher for a smaller Vim model). The
1. Innovative cross-architecture distillation paradigm with structural alignment. The paper demonstrates strong originality through creative solutions to long-standing cross-architecture knowledge transfer challenges. CAB introduces the Attention Bridge—a lightweight MLP-based module that maps Transformer’s explicit Q/K representations to Mamba’s implicit B/C projections—addressing the core issue of structural heterogeneity between attention-based models and SSMs. Unlike prior works, this desig
1. Empirical Considerations for Attention Bridge Design: Lack of Systematic Validation. The paper designs the attention bridge as a "2-layer MLP + SiLU activation," but this design lacks theoretical basis and systematic comparison: (1) It does not test the impact of different network structures (1-layer MLP, 3-layer MLP, attention layer, etc.) on the alignment effect; (2) It does not analyze the trade-off between the bridge's parameter size and performance efficiency, if the bridge's parameters
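For readers unfamiliar with the design under discussion, the following is a minimal sketch of a "2-layer MLP + SiLU" bridge with a token-level loss, written in plain Python. It is not the authors' implementation: the class name `Bridge`, the hidden width, the initialization, the direction of the mapping (teacher key features projected into the student's $B$ space), and the use of mean squared error are all assumptions made for illustration.

```python
import math
import random


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def init_linear(d_in, d_out, rng):
    """Small random weights and zero bias (initialization scheme is an assumption)."""
    scale = 1.0 / math.sqrt(d_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def linear(vec, weight, bias):
    """Apply y = W v + b to a single token vector."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


class Bridge:
    """Hypothetical bridge: Linear -> SiLU -> Linear, applied per token."""

    def __init__(self, d_in, d_hidden, d_out, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(d_in, d_hidden, rng)
        self.w2, self.b2 = init_linear(d_hidden, d_out, rng)

    def __call__(self, token):
        hidden = [silu(h) for h in linear(token, self.w1, self.b1)]
        return linear(hidden, self.w2, self.b2)


def token_level_mse(projected_tokens, target_tokens):
    """Mean squared error over every feature of every token (loss choice is an assumption)."""
    total, count = 0.0, 0
    for projected, target in zip(projected_tokens, target_tokens):
        for a, b in zip(projected, target):
            total += (a - b) ** 2
            count += 1
    return total / count


# Toy usage: 3 teacher key tokens of width 4, student B features of width 2.
teacher_keys = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.5, 0.2], [0.3, 0.3, -0.2, 0.1]]
student_b = [[0.0, 0.1], [0.2, -0.1], [0.1, 0.1]]
phi_b = Bridge(d_in=4, d_hidden=8, d_out=2)
loss = token_level_mse([phi_b(k) for k in teacher_keys], student_b)
```

The ablations the reviewer asks for (1-layer or 3-layer variants) would amount to changing the number of `linear` calls inside `Bridge.__call__` in a sketch like this one.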
* The formulation of the "Attention Bridge" is a creative and novel solution to the cross-architecture distillation problem. The insight to treat Mamba's $B$ and $C$ projections as analogous to the Transformer's $K$ and $Q$ under a linear attention approximation is theoretically grounded and is a natural extension of State-Space Duality.
* The experimental evaluation is thorough, spanning multiple domains, model scales, and data regimes. The consistent and often substantial improvements over s
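The analogy between $B$, $C$ and $K$, $Q$ mentioned in this review can be illustrated with a short derivation. This is a sketch under simplifying assumptions that are not taken from the paper: a single input channel with scalar input $x_t$, input-dependent $B_t, C_t \in \mathbb{R}^N$, and a diagonal state transition $A_t$.

$$
h_t = A_t h_{t-1} + B_t x_t, \qquad y_t = C_t^\top h_t
\quad\Longrightarrow\quad
y_t = \sum_{s \le t} C_t^\top \Big( \prod_{r=s+1}^{t} A_r \Big) B_s \, x_s .
$$

If the transition is further assumed to be the identity ($A_r = I$), the decay product disappears and the output reduces to

$$
y_t = \sum_{s \le t} \big( C_t^\top B_s \big) \, x_s ,
$$

which has the same form as unnormalized causal linear attention, $y_t = \sum_{s \le t} (q_t^\top k_s)\, v_s$, with $C_t$ in the role of the query, $B_s$ in the role of the key, and $x_s$ in the role of the value. With a non-identity $A_r$, the product acts as an additional decay weight on each query-key score.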
* The authors use a 2-layer MLP with SiLU activations for the bridge modules $\phi_B$ and $\phi_C$. While effective, the paper does not ablate this choice against a simpler linear projection. A brief justification or experimental result showing the superiority of a non-linear transformation over a linear one would strengthen this design decision, especially given the focus on efficiency.
* The proportional layer mapping $g(l)$ is a simple and effective heuristic. However, the paper does not
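To make the idea of a proportional layer mapping concrete, here is a minimal sketch of one possible $g(l)$. The exact formula used in the paper is not given in this summary, so the floor-based rule below, the function name `proportional_layer_map`, and the 0-based indexing are assumptions.

```python
def proportional_layer_map(num_student_layers, num_teacher_layers):
    """Map each 0-based student layer l to a teacher layer at the same relative depth.

    Assumed rule: g(l) = floor(l * num_teacher_layers / num_student_layers).
    """
    if num_student_layers < 1 or num_teacher_layers < 1:
        raise ValueError("both models need at least one layer")
    return [
        (layer * num_teacher_layers) // num_student_layers
        for layer in range(num_student_layers)
    ]


# Illustrative depths only: a student twice as deep as its teacher.
mapping = proportional_layer_map(num_student_layers=4, num_teacher_layers=2)
```

Under this rule a deeper student reuses each teacher layer for several consecutive student layers, while a shallower student skips teacher layers at regular intervals.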
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Multimodal Machine Learning Applications
