TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba

Xiuwei Chen; Wentao Hu; Xiao Dong; Sihao Lin; Zisheng Chen; Meng Cao; Yina Zhuang; Jianhua Han; Hang Xu; Xiaodan Liang

arXiv:2502.15130 · cs.CV · October 10, 2025


TL;DR

TransMamba is a two-stage knowledge transfer framework that adapts pre-trained Transformer models to Mamba architectures, reducing the data and training time needed compared with training Mamba models from scratch.

Contribution

The paper proposes a cross-architecture transfer method that combines selective weight subcloning, layered initialization, and adaptive distillation to accelerate the training of Mamba models.
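
As a rough illustration of two of the ingredients named above, the following sketch shows one way weight subcloning and a layer-weighted distillation loss could look. It is not the paper's implementation: the function names (`subclone`, `adaptive_distill_loss`), the choice to copy the leading sub-block of the teacher matrix, and the rule that weights each layer by its share of the total misalignment are all assumptions made for illustration.

```python
import math


def subclone(teacher, rows, cols):
    """Copy the leading rows x cols sub-block of a teacher weight matrix.

    Assumption: the student layer is no larger than the teacher layer, and
    the leading block is the part worth reusing. The paper's "selective"
    rule may differ.
    """
    if rows > len(teacher) or cols > len(teacher[0]):
        raise ValueError("student shape must fit inside the teacher shape")
    return [row[:cols] for row in teacher[:rows]]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def adaptive_distill_loss(student_feats, teacher_feats):
    """Layer-weighted feature distillation loss.

    Each layer contributes its gap (1 - cosine similarity) between student
    and teacher features. Assumption: layers are weighted by their share of
    the total gap, so poorly aligned layers dominate the loss.
    """
    gaps = [1.0 - cosine(s, t) for s, t in zip(student_feats, teacher_feats)]
    total = sum(gaps)
    if total == 0.0:
        return 0.0
    weights = [g / total for g in gaps]
    return sum(w * g for w, g in zip(weights, gaps))


if __name__ == "__main__":
    teacher_weight = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(subclone(teacher_weight, 2, 2))

    student = [[1.0, 0.0], [0.0, 1.0]]
    teacher = [[1.0, 0.0], [1.0, 0.0]]
    print(adaptive_distill_loss(student, teacher))
```

In this toy setup the first layer is perfectly aligned and contributes nothing, so the loss is driven entirely by the second, misaligned layer. A real pipeline would operate on framework tensors and would need a projection wherever student and teacher feature widths differ.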

Findings

01. Outperforms baseline methods across multiple Mamba architectures.

02. Achieves high performance with less training data and fewer resources.

03. Effective on diverse downstream tasks like classification and multimodal reasoning.

Abstract

Transformer-based architectures have become the backbone of both uni-modal and multi-modal foundation models, largely due to their scalability via attention mechanisms, resulting in a rich ecosystem of publicly available pre-trained models such as LLaVA, CLIP, and DeiT. In parallel, emerging sub-quadratic architectures like Mamba offer promising efficiency gains by enabling global context modeling with linear complexity. However, training these architectures from scratch remains resource-intensive (e.g., in terms of data and time). Motivated by this challenge, we explore a cross-architecture knowledge transfer paradigm, termed TransMamba, that facilitates the reuse of Transformer pre-trained knowledge. We propose a two-stage framework to accelerate the training of Mamba-based models, ensuring their effectiveness across both uni-modal and multi-modal tasks. The first stage leverages…


Taxonomy

Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Layer Normalization · Byte Pair Encoding · Dense Connections · Residual Connection · Label Smoothing · Multi-Head Attention · Mamba: Linear-Time Sequence Modeling with Selective State Spaces