SR-TTT: Surprisal-Aware Residual Test-Time Training

Swamynathan V P

arXiv:2603.06642·cs.LG·March 10, 2026

TL;DR

SR-TTT enhances test-time training for language models by selectively preserving highly surprising tokens with exact attention, improving recall on tasks with rare or unique inputs while maintaining low memory usage.

Contribution

The paper introduces a surprisal-aware sparse memory mechanism that dynamically routes highly surprising tokens to an exact-attention cache, addressing the recall failures of pure TTT architectures.

Findings

01. Improved recall on Needle-in-a-Haystack tasks.

02. Maintains O(1) memory footprint for background context.

03. Open-source implementation available.

Abstract

Test-Time Training (TTT) language models achieve theoretically infinite context windows with an O(1) memory footprint by replacing the standard exact-attention KV-cache with hidden-state "fast weights" W_fast that are updated via self-supervised learning during inference. However, pure TTT architectures suffer catastrophic failures on exact-recall tasks (e.g., Needle-in-a-Haystack). Because the fast weights aggressively compress the context through an information bottleneck, highly surprising or unique tokens are rapidly overwritten and forgotten as gradient updates from subsequent tokens arrive. We introduce SR-TTT (Surprisal-Aware Residual Test-Time Training), which resolves this recall failure by augmenting the TTT backbone with a loss-gated sparse memory mechanism. By dynamically routing only incompressible, highly surprising tokens to a traditional exact-attention Residual Cache, SR-TTT preserves O(1)…
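The loss-gated routing described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the class name `ResidualCache`, the fixed `capacity`, the scalar `threshold`, and the eviction rule (drop the least surprising entry when full) are all assumptions made for illustration, and the per-token losses are hard-coded rather than produced by a TTT backbone. It only shows the general idea of keeping a small, bounded set of high-surprisal tokens verbatim while everything else would be left to the compressed fast weights.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class ResidualCache:
    """Hypothetical bounded store for the most surprising tokens seen so far."""
    capacity: int      # assumed hard cap on stored tokens
    threshold: float   # assumed loss gate; tokens below it are never stored
    _heap: list = field(default_factory=list)  # min-heap of (loss, position, token)

    def offer(self, position: int, token: str, loss: float) -> bool:
        """Apply the loss gate; return True if the token was stored."""
        if loss < self.threshold:
            return False  # compressible token: would be left to the fast weights
        entry = (loss, position, token)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        if loss > self._heap[0][0]:
            # Cache is full: evict the least surprising stored token.
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def tokens(self) -> list:
        """Stored tokens in their original sequence order."""
        return [tok for _, _, tok in sorted(self._heap, key=lambda e: e[1])]


# Toy stream of (token, per-token loss); the loss values are made up.
stream = [("the", 0.4), ("cat", 0.9), ("7391", 6.2), ("sat", 0.7),
          ("zxq", 5.1), ("on", 0.3), ("K-42", 7.0)]

cache = ResidualCache(capacity=2, threshold=3.0)
for pos, (tok, loss) in enumerate(stream):
    cache.offer(pos, tok, loss)

print(cache.tokens())
```

With a capacity of 2, the cache ends up holding the two highest-loss tokens, and the low-loss background tokens never enter it. Whether the actual Residual Cache is capacity-bounded, and how its gate is computed, is not stated in the text above.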

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications