TL;DR
Priming transforms pre-trained Transformers into Hybrid State Space Models, enabling faster, memory-efficient long-context reasoning with minimal additional training, and facilitates controlled architecture comparisons.
Contribution
Introduces Priming, a method for converting pre-trained Transformers into Hybrid models, reducing training costs and enabling architecture comparisons at scale.
Findings
Priming recovers downstream quality using less than 0.5% of the source model's pre-training token budget.
Primed Hybrid models outperform baseline Transformers on long-context reasoning tasks.
The expressiveness hierarchy GKA > GDN > Mamba-2 predicts downstream performance.
Abstract
Hybrid State-Space models combine Attention with recurrent State-Space Model (SSM) layers, balancing eidetic memory from Attention with compressed fading memory from SSMs. This yields smaller Key-Value caches and faster decoding than Transformers, along with a richer architectural design space. Exploring that design space at scale has so far required training from scratch, a barrier that has kept most large-model Hybrid research within a narrow range of architectures. We introduce Priming, a method that turns Hybrid architecture design from a pre-training problem into a knowledge transfer one. Priming initializes a Hybrid model from a pre-trained Transformer and, through short alignment and post-training phases, recovers downstream quality using less than 0.5% of the source model's pre-training token budget. Priming is agnostic to the source Transformer family (e.g., Qwen, Llama,…
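The abstract's claim that Hybrid models need smaller Key-Value caches can be made concrete with a rough memory estimate. The sketch below is illustrative only: the layer counts, head sizes, context length, and the per-layer SSM state shape are assumptions chosen for the example, not values from the paper, and the function names are hypothetical.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Factor 2: one tensor for keys and one for values in every attention layer.
    # This term grows linearly with seq_len.
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(n_ssm_layers, n_heads, head_dim, state_dim, bytes_per_elem=2):
    # Assumed recurrent state shape per layer: (n_heads, head_dim, state_dim).
    # There is no seq_len term: the state is a fixed-size compressed memory.
    return n_ssm_layers * n_heads * head_dim * state_dim * bytes_per_elem


def hybrid_cache_bytes(n_attn_layers, n_ssm_layers, n_kv_heads, n_heads,
                       head_dim, state_dim, seq_len, bytes_per_elem=2):
    # A Hybrid keeps a KV cache only for its remaining attention layers.
    return (kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem)
            + ssm_state_bytes(n_ssm_layers, n_heads, head_dim, state_dim, bytes_per_elem))


if __name__ == "__main__":
    # Hypothetical 32-layer model with a 128k-token context in 16-bit precision.
    cfg = dict(n_kv_heads=8, head_dim=128, seq_len=131072)
    full = kv_cache_bytes(n_attn_layers=32, **cfg)
    hybrid = hybrid_cache_bytes(n_attn_layers=8, n_ssm_layers=24,
                                n_heads=32, state_dim=128, **cfg)
    print(f"Transformer: {full / 2**30:.2f} GiB, Hybrid: {hybrid / 2**30:.2f} GiB")
```

Because the SSM state term does not depend on sequence length, the gap between the two estimates widens as the context grows. The actual savings depend on how many attention layers a given Hybrid retains.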
