TL;DR
This paper presents a two-stage distillation method that transfers knowledge from Transformer models to Mamba State Space Models, preserving performance while reducing computational cost.
Contribution
The paper introduces a principled initialization and a two-stage distillation process that together enable Mamba models to match Transformer performance on downstream tasks.
Findings
Distilled Mamba models reach perplexity close to that of their Transformer teachers.
The two-stage distillation outperforms naïve distillation.
The approach is validated through extensive ablations and scaling analyses.
Abstract
State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models, due to their reduced memory consumption and higher throughput at generation compared to their Attention-based counterparts. On the other hand, the community has built up a considerable body of knowledge on how to train Transformers, and many pretrained Transformer models are readily available. To facilitate the adoption of SSMs while leveraging existing pretrained Transformers, we aim to identify an effective recipe to distill an Attention-based model into a Mamba-like architecture. In prior work on cross-architecture distillation, however, it has been shown that a naïve distillation procedure from Transformers to Mamba fails to preserve the original teacher performance, a limitation often overcome with hybrid solutions combining Attention and SSM blocks. The key argument from our work is…
