Retrieval-Aware Distillation for Transformer-SSM Hybrids

Aviv Bick; Eric P. Xing; Albert Gu

arXiv:2602.11374·cs.LG·February 13, 2026

TL;DR

This paper introduces retrieval-aware distillation, which converts a pretrained Transformer into a Transformer-SSM hybrid that keeps only the retrieval-critical attention heads, retaining most of the teacher's retrieval performance while significantly reducing memory usage.

Contribution

The paper proposes a method for distilling a pretrained Transformer into a hybrid model that preserves only the retrieval-critical attention heads and distills the remaining heads into recurrent heads, reducing memory and computational costs.

Findings

01. Preserving just 2% of attention heads (10 heads in a 1B model) recovers over 95% of teacher performance on retrieval-heavy tasks.

02. Hybrid models with reduced states are 5-6 times more memory-efficient than comparable models.

03. Large recurrent states can compensate for missing retrieval, enabling simpler SSM backbones.

Abstract

State-space models (SSMs) offer efficient sequence modeling but lag behind Transformers on benchmarks that require in-context retrieval. Prior work links this gap to a small set of attention heads, termed Gather-and-Aggregate (G&A), which SSMs struggle to reproduce. We propose *retrieval-aware distillation*, which converts a pretrained Transformer into a hybrid student by preserving only these retrieval-critical heads and distilling the rest into recurrent heads. We identify the essential heads via ablation on a synthetic retrieval task, producing a hybrid with sparse, non-uniform attention placement. We show that preserving **just 2% of attention heads recovers over 95% of teacher performance on retrieval-heavy tasks** (10 heads in a 1B model), requiring far fewer heads than hybrids that retain at least 25%. We further find that large recurrent states often compensate for missing…
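
The head-selection step described in the abstract (ablate heads on a synthetic retrieval task, keep the most critical ones as attention, distill the rest into recurrent heads) can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the abstract does not say whether heads are ablated one at a time or in groups, so single-head ablation is assumed here, and `score_fn`, `toy_score`, `rank_heads_by_ablation`, and `plan_hybrid` are hypothetical names introduced only for this example.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Head = Tuple[int, int]  # (layer index, head index)
ScoreFn = Callable[[FrozenSet[Head]], float]  # accuracy with the given heads ablated


def rank_heads_by_ablation(heads: List[Head], score_fn: ScoreFn) -> List[Head]:
    """Order heads by how much retrieval accuracy drops when each is ablated alone."""
    baseline = score_fn(frozenset())
    drops = {h: baseline - score_fn(frozenset({h})) for h in heads}
    # Largest drop first; ties keep their original order (stable sort).
    return sorted(heads, key=lambda h: drops[h], reverse=True)


def plan_hybrid(heads: List[Head], score_fn: ScoreFn, keep: int) -> Dict[Head, str]:
    """Mark the `keep` most critical heads as attention and the rest as recurrent."""
    kept = set(rank_heads_by_ablation(heads, score_fn)[:keep])
    return {h: ("attention" if h in kept else "recurrent") for h in heads}


def toy_score(ablated: FrozenSet[Head]) -> float:
    """Toy stand-in for evaluating a model on a synthetic retrieval task.

    Assumption for illustration: only two heads matter for retrieval.
    """
    critical = {(1, 0): 0.5, (3, 2): 0.3}
    return 1.0 - sum(critical.get(h, 0.0) for h in ablated)


# A hypothetical 4-layer, 4-heads-per-layer model (16 heads in total).
heads = [(layer, head) for layer in range(4) for head in range(4)]
plan = plan_hybrid(heads, toy_score, keep=2)
kept = sorted(h for h, kind in plan.items() if kind == "attention")
```

In this toy setup the two kept heads sit in different layers, which mirrors the sparse, non-uniform attention placement the abstract mentions. In practice `score_fn` would wrap an evaluation of the real teacher model with the chosen heads masked out.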

Taxonomy

Topics: Machine Learning in Healthcare · Generative Adversarial Networks and Image Synthesis · Topic Modeling