SR-TTT: Surprisal-Aware Residual Test-Time Training
Swamynathan V P

TL;DR
SR-TTT enhances test-time training for language models by selectively preserving highly surprising tokens with exact attention, improving recall on tasks with rare or unique inputs while maintaining low memory usage.
Contribution
It introduces a surprisal-aware, sparse memory mechanism that dynamically routes critical tokens to exact attention, addressing recall failures in TTT architectures.
Findings
Improved recall on Needle-in-a-Haystack tasks.
Maintains O(1) memory footprint for background context.
Open-source implementation available.
Abstract
Test-Time Training (TTT) language models achieve theoretically infinite context windows with an O(1) memory footprint by replacing the standard exact-attention KV-cache with hidden-state "fast weights" W_fast updated via self-supervised learning during inference. However, pure TTT architectures suffer catastrophic failures on exact-recall tasks (e.g., Needle-in-a-Haystack). Because the fast weights aggressively compress the context into an information bottleneck, highly surprising or unique tokens are rapidly overwritten and forgotten by subsequent token gradient updates. We introduce SR-TTT (Surprisal-Aware Residual Test-Time Training), which resolves this recall failure by augmenting the TTT backbone with a loss-gated sparse memory mechanism. By dynamically routing only incompressible, highly surprising tokens to a traditional exact-attention Residual Cache, SR-TTT preserves O(1)…
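The loss-gated routing described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the reconstruction loss, the linear fast-weight model, the fixed threshold, and every name below (ttt_step, process_sequence, attend, lr, threshold) are assumptions chosen for illustration. The idea shown is only the general one: measure the self-supervised loss on each token before updating W_fast, and copy the token into an exact-attention cache when that loss is high.

```python
import math


def matvec(W, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ttt_step(W, x, lr):
    """One self-supervised fast-weight update on token embedding x.

    Assumed loss (hypothetical): 0.5 * ||W x - x||^2, plain reconstruction.
    Returns the updated weights and the loss measured BEFORE the update,
    which this sketch uses as the surprisal signal.
    """
    err = [p - xi for p, xi in zip(matvec(W, x), x)]
    loss = 0.5 * sum(e * e for e in err)
    # The gradient of the loss with respect to W is the outer product err * x^T.
    W_new = [[w - lr * e * xi for w, xi in zip(row, x)]
             for row, e in zip(W, err)]
    return W_new, loss


def process_sequence(tokens, dim, lr=0.5, threshold=0.3):
    """Stream tokens through the fast weights, caching surprising ones."""
    W_fast = [[0.0] * dim for _ in range(dim)]
    residual_cache = []  # (position, embedding) pairs kept exactly
    for pos, x in enumerate(tokens):
        W_fast, loss = ttt_step(W_fast, x, lr)
        if loss > threshold:  # loss gate: token is poorly compressed
            residual_cache.append((pos, x))
    return W_fast, residual_cache


def attend(query, residual_cache):
    """Exact softmax attention over cached tokens (keys equal values here)."""
    if not residual_cache:
        return [0.0] * len(query)
    scores = [sum(q * k for q, k in zip(query, x)) for _, x in residual_cache]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * x[i] for w, (_, x) in zip(weights, residual_cache)) / total
            for i in range(len(query))]


# A repeated background token with one rare "needle" at position 20.
background = [1.0, 0.0, 0.0]
needle = [0.0, 1.0, 0.0]
tokens = [background] * 20 + [needle] + [background] * 5

W_fast, cache = process_sequence(tokens, dim=3)
positions = [pos for pos, _ in cache]
recalled = attend(needle, cache)
```

In this toy run the cache holds positions 0 and 20: the very first token is surprising because the fast weights start at zero, and the needle is surprising because nothing like it has been seen. The 24 remaining background tokens are absorbed into W_fast and never stored. How the cached attention output is combined with the TTT output, and how the gate is calibrated, are left open here because the truncated abstract does not specify them.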
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
