TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba
Xiuwei Chen, Wentao Hu, Xiao Dong, Sihao Lin, Zisheng Chen, Meng Cao, Yina Zhuang, Jianhua Han, Hang Xu, Xiaodan Liang

TL;DR
TransMamba introduces a two-stage knowledge transfer framework that enables fast and effective adaptation of pre-trained Transformer models to Mamba architectures, significantly reducing the training resources required.
Contribution
It proposes a novel cross-architecture transfer method with selective weight subcloning, layered initialization, and adaptive distillation to accelerate Mamba model training.
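This page does not spell out how selective weight subcloning works, so the following is only a minimal sketch of one plausible reading: where a Transformer (teacher) parameter and a Mamba (student) parameter share a name, copy the overlapping top-left block, and leave architecture-specific student parameters at their own initialization. The function names, the parameter names ("mlp.fc1", "ssm.A"), and the "ssm" skip rule are all hypothetical, chosen for illustration and not taken from the paper.

```python
def subclone_matrix(teacher, student):
    """Return a copy of `student` whose overlapping top-left block is
    overwritten with the matching entries of `teacher`.
    Both arguments are 2-D lists of floats (hypothetical weight layout)."""
    out = [row[:] for row in student]          # never mutate the input
    rows = min(len(teacher), len(student))
    for i in range(rows):
        cols = min(len(teacher[i]), len(student[i]))
        out[i][:cols] = teacher[i][:cols]      # copy only the shared columns
    return out


def selective_subclone(teacher_params, student_params, skip_substrings=("ssm",)):
    """Assumed selection rule: clone a parameter only if the teacher has a
    parameter with the same name and the name is not marked as
    architecture-specific via `skip_substrings`."""
    cloned = {}
    for name, student_w in student_params.items():
        skipped = any(s in name for s in skip_substrings)
        if skipped or name not in teacher_params:
            cloned[name] = [row[:] for row in student_w]   # keep student init
        else:
            cloned[name] = subclone_matrix(teacher_params[name], student_w)
    return cloned


# Toy example: a 3x3 teacher layer is subcloned into a 2x2 student layer,
# while the hypothetical state-space parameter "ssm.A" is left untouched.
teacher = {"mlp.fc1": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]}
student = {
    "mlp.fc1": [[0.0, 0.0], [0.0, 0.0]],
    "ssm.A": [[0.5, 0.5]],
}
result = selective_subclone(teacher, student)
print(result["mlp.fc1"])  # [[1.0, 2.0], [4.0, 5.0]]
print(result["ssm.A"])    # [[0.5, 0.5]]
```

In a real framework the same idea would operate on tensors in a model's state dictionary; plain lists are used here only to keep the sketch dependency-free.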
Findings
Outperforms baseline methods across multiple Mamba architectures.
Achieves high performance with less training data and fewer resources.
Effective on diverse downstream tasks like classification and multimodal reasoning.
Abstract
Transformer-based architectures have become the backbone of both uni-modal and multi-modal foundation models, largely due to their scalability via attention mechanisms, resulting in a rich ecosystem of publicly available pre-trained models such as LLaVA, CLIP, and DeiT. In parallel, emerging sub-quadratic architectures like Mamba offer promising efficiency gains by enabling global context modeling with linear complexity. However, training these architectures from scratch remains resource-intensive (e.g., in terms of data and time). Motivated by this challenge, we explore a cross-architecture knowledge transfer paradigm, termed TransMamba, that facilitates the reuse of Transformer pre-trained knowledge. We propose a two-stage framework to accelerate the training of Mamba-based models, ensuring their effectiveness across both uni-modal and multi-modal tasks. The first stage leverages…
Taxonomy
Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Layer Normalization · Byte Pair Encoding · Dense Connections · Residual Connection · Label Smoothing · Multi-Head Attention · Mamba: Linear-Time Sequence Modeling with Selective State Spaces
