Retrieval-Aware Distillation for Transformer-SSM Hybrids
Aviv Bick, Eric P. Xing, Albert Gu

TL;DR
This paper introduces retrieval-aware distillation, which builds efficient Transformer-SSM hybrid models that retain high retrieval performance while significantly reducing memory usage by preserving only the retrieval-critical attention heads.
Contribution
The paper proposes a method that distills a pretrained Transformer into a hybrid model, preserving only the retrieval-critical attention heads and distilling the rest into recurrent heads, which greatly reduces memory and computational costs.
Findings
Preserving just 2% of attention heads (10 heads in a 1B model) recovers over 95% of teacher performance on retrieval-heavy tasks.
Hybrid models with reduced states are 5-6 times more memory-efficient than comparable models.
Large recurrent states can compensate for missing retrieval, enabling simpler SSM backbones.
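To give a feel for why the number of retained heads matters for memory, the following is a rough back-of-the-envelope sketch of attention cache size for a hybrid that keeps 10 heads versus one that keeps 25% of its heads. Everything in it is an assumption for illustration: the total of 512 heads, the head dimension of 64, the 4096-token context, 2-byte values, and the simplification that each retained head stores its own key and value cache (no grouped-query sharing). It covers only the attention cache and is not an attempt to reproduce the reported 5-6 times figure.

```python
def kv_cache_bytes(num_heads: int, head_dim: int, seq_len: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `num_heads` attention heads."""
    # Factor 2: one key tensor and one value tensor per head,
    # each of shape (seq_len, head_dim).
    return 2 * num_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, chosen for illustration (not from the paper).
TOTAL_HEADS = 512
HEAD_DIM = 64
SEQ_LEN = 4096

sparse_heads = 10                  # roughly 2% of 512 heads
quarter_heads = TOTAL_HEADS // 4   # a hybrid that retains 25% of heads

sparse_bytes = kv_cache_bytes(sparse_heads, HEAD_DIM, SEQ_LEN)
quarter_bytes = kv_cache_bytes(quarter_heads, HEAD_DIM, SEQ_LEN)
cache_ratio = quarter_bytes / sparse_bytes

print(f"{sparse_bytes / 2**20:.1f} MiB vs "
      f"{quarter_bytes / 2**20:.1f} MiB ({cache_ratio:.1f}x)")
```

Under these assumptions the cache grows linearly with both the number of retained heads and the context length, so the gap between the two configurations stays proportional at any sequence length.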
Abstract
State-space models (SSMs) offer efficient sequence modeling but lag behind Transformers on benchmarks that require in-context retrieval. Prior work links this gap to a small set of attention heads, termed Gather-and-Aggregate (G&A), which SSMs struggle to reproduce. We propose *retrieval-aware distillation*, which converts a pretrained Transformer into a hybrid student by preserving only these retrieval-critical heads and distilling the rest into recurrent heads. We identify the essential heads via ablation on a synthetic retrieval task, producing a hybrid with sparse, non-uniform attention placement. We show that preserving **just 2% of attention heads recovers over 95% of teacher performance on retrieval-heavy tasks** (10 heads in a 1B model), requiring far fewer heads than hybrids that retain at least 25%. We further find that large recurrent states often compensate for missing…
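The head-selection step described in the abstract (ablate heads on a synthetic retrieval task, keep the essential ones) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the one-head-at-a-time ablation, the function names `rank_heads_by_ablation` and `select_retrieval_heads`, and the toy scoring function `toy_score` are all hypothetical stand-ins. In practice `score_fn` would run the teacher on the retrieval task with the given heads disabled.

```python
from typing import Callable, FrozenSet, List, Tuple

Head = Tuple[int, int]  # (layer index, head index)
ScoreFn = Callable[[FrozenSet[Head]], float]  # ablated heads -> task accuracy


def rank_heads_by_ablation(heads: List[Head], score_fn: ScoreFn) -> List[Head]:
    """Order heads by how much accuracy drops when each one is ablated alone."""
    baseline = score_fn(frozenset())
    drops = {h: baseline - score_fn(frozenset([h])) for h in heads}
    # Largest drop first; sorted() is stable, so ties keep their input order.
    return sorted(heads, key=lambda h: drops[h], reverse=True)


def select_retrieval_heads(heads: List[Head], score_fn: ScoreFn, k: int) -> List[Head]:
    """Keep the k heads whose ablation hurts the retrieval task the most."""
    return rank_heads_by_ablation(heads, score_fn)[:k]


def toy_score(ablated: FrozenSet[Head]) -> float:
    """Toy stand-in for a retrieval benchmark: two heads carry all the skill."""
    critical = {(3, 1): 0.5, (7, 0): 0.3}
    return 1.0 - sum(critical.get(h, 0.0) for h in ablated)


# Hypothetical 8-layer, 4-heads-per-layer model.
all_heads = [(layer, head) for layer in range(8) for head in range(4)]
kept = select_retrieval_heads(all_heads, toy_score, k=2)
print(kept)
```

The heads returned here would stay as attention in the student, while the remaining heads would be distilled into recurrent heads. Because selection is per head rather than per layer, the kept heads can land in arbitrary layers, which matches the sparse, non-uniform placement the abstract describes.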
Taxonomy
Topics: Machine Learning in Healthcare · Generative Adversarial Networks and Image Synthesis · Topic Modeling
