Priming: Hybrid State Space Models From Pre-trained Transformers

Aditya Chattopadhyay; Elvis Nunez; Prannay Kaul; Benjamin Bowman; Evan Becker; Luca Zancato; David Thomas; Wei Xia; Stefano Soatto

arXiv:2605.08301·cs.LG·May 12, 2026

TL;DR

Priming converts pre-trained Transformers into Hybrid State Space Models with minimal additional training, enabling faster, more memory-efficient long-context reasoning and facilitating controlled architecture comparisons.

Contribution

Introduces Priming, a method for converting pre-trained Transformers into Hybrid models, reducing training costs and enabling architecture comparisons at scale.

Findings

01

Priming recovers downstream quality using less than 0.5% of the source model's pre-training token budget.

02

Primed Hybrid models outperform baseline Transformers on long-context reasoning tasks.

03

The expressiveness hierarchy GKA > GDN > Mamba-2 predicts downstream performance.
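To give the token-budget finding a sense of scale, here is a minimal arithmetic sketch. The function name `priming_token_cap` and the 10-trillion-token pre-training budget are illustrative assumptions, not figures from the paper. Only the 0.5% ceiling comes from the summary above.

```python
def priming_token_cap(pretraining_tokens: float, fraction: float = 0.005) -> float:
    """Upper bound on Priming tokens implied by a 'less than 0.5%' budget.

    `pretraining_tokens` is the source model's pre-training token count.
    `fraction` defaults to 0.005 (0.5%), the ceiling quoted in the findings.
    """
    if pretraining_tokens < 0:
        raise ValueError("pretraining_tokens must be non-negative")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    return pretraining_tokens * fraction


# Hypothetical source model pre-trained on 10 trillion tokens
# (an assumed number for illustration, not taken from the paper).
hypothetical_pretraining_tokens = 10e12
cap = priming_token_cap(hypothetical_pretraining_tokens)
print(f"Priming would use fewer than {cap:.2e} tokens")
```

Under that assumed budget, the ceiling works out to 50 billion tokens. The actual token counts depend on the source model and are not stated in this summary.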

Abstract

Hybrid State-Space models combine Attention with recurrent State-Space Model (SSM) layers, balancing eidetic memory from Attention with compressed fading memory from SSMs. This yields smaller Key-Value caches and faster decoding than Transformers, along with a richer architectural design space. Exploring that design space at scale has so far required training from scratch, a barrier that has kept most large-model Hybrid research within a narrow range of architectures. We introduce Priming, a method that turns Hybrid architecture design from a pre-training problem into a knowledge transfer one. Priming initializes a Hybrid model from a pre-trained Transformer and, through short alignment and post-training phases, recovers downstream quality using less than 0.5% of the source model's pre-training token budget. Priming is agnostic to the source Transformer family (e.g., Qwen, Llama,…
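The cache saving mentioned in the abstract can be illustrated with a rough back-of-the-envelope sketch. It rests on the standard observation that a Key-Value cache grows with the number of Attention layers and the sequence length, so replacing some Attention layers with SSM layers shrinks it proportionally. The layer counts, head sizes, and context length below are hypothetical and are not the paper's configurations. The sketch also ignores the fixed-size recurrent state that SSM layers keep, which does not grow with sequence length.

```python
def kv_cache_bytes(
    num_attention_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes per element, e.g. fp16 or bf16
) -> int:
    """Approximate KV cache size: one key and one value vector
    per token, per KV head, per Attention layer."""
    keys_and_values = 2
    return (
        keys_and_values
        * num_attention_layers
        * num_kv_heads
        * head_dim
        * seq_len
        * batch_size
        * bytes_per_value
    )


# Hypothetical 32-layer Transformer with 8 KV heads of dimension 128,
# serving a 131072-token context (assumed values for illustration only).
full = kv_cache_bytes(num_attention_layers=32, num_kv_heads=8,
                      head_dim=128, seq_len=131072)

# Hypothetical Hybrid that keeps Attention in 8 of the 32 layers and
# uses SSM layers elsewhere (the ratio is an assumption, not the paper's).
hybrid = kv_cache_bytes(num_attention_layers=8, num_kv_heads=8,
                        head_dim=128, seq_len=131072)

gib = 1024 ** 3
print(f"Transformer KV cache: {full / gib:.1f} GiB")
print(f"Hybrid KV cache:      {hybrid / gib:.1f} GiB")
```

In this assumed setting the cache drops from 16 GiB to 4 GiB, in direct proportion to the number of Attention layers retained.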
