TL;DR
This paper reinterprets test-time training (TTT) with KV binding as a form of learned linear attention rather than test-time memorization. The new view explains previously puzzling model behaviors and enables architectural simplifications and fully parallel, more efficient formulations.
Contribution
It demonstrates that a broad class of TTT architectures can be expressed as learned linear attention, challenging the traditional memorization perspective.
Findings
Reveals phenomena contradicting the memorization interpretation of TTT.
Shows TTT architectures can be reformulated as linear attention.
Enables principled architectural simplifications and fully parallel formulations that preserve performance while improving efficiency.
Abstract
Test-time training (TTT) with KV binding as a sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a learned linear attention operator. Beyond explaining previously puzzling model behaviors, this perspective yields multiple practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form. Overall, our results reframe TTT not as test-time memorization, but as learned linear attention…
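To make the claimed equivalence concrete, here is a minimal sketch of the simplest special case. It is not the paper's general reduction. The assumptions are ours: the inner model is a single linear map W initialized to zero, the inner loss is the negative inner product -<W k, v>, and each token triggers exactly one gradient step with a fixed learning rate `lr`. Under those assumptions, the per-token TTT update and readout coincide with causal, unnormalized linear attention. The function names are hypothetical and chosen for illustration.

```python
import random


def ttt_linear_outputs(keys, values, queries, lr):
    """Online TTT with a linear fast weight W (d_v x d_k), starting at zero.

    Assumed inner loss per token: -<W k, v>. One gradient step per token,
    then the updated W is read out with the query.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    W = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for k, v, q in zip(keys, values, queries):
        # The gradient of -<W k, v> w.r.t. W is -(v k^T), so a descent
        # step adds lr * v k^T to W.
        for i in range(d_v):
            for j in range(d_k):
                W[i][j] += lr * v[i] * k[j]
        outputs.append(
            [sum(W[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
        )
    return outputs


def linear_attention_outputs(keys, values, queries, lr):
    """Causal, unnormalized linear attention:
    o_t = sum over i <= t of lr * (k_i . q_t) * v_i.
    """
    outputs = []
    for t, q in enumerate(queries):
        o = [0.0] * len(values[0])
        for k, v in zip(keys[: t + 1], values[: t + 1]):
            score = lr * sum(kj * qj for kj, qj in zip(k, q))
            for i in range(len(o)):
                o[i] += score * v[i]
        outputs.append(o)
    return outputs


# Compare both formulations on small random inputs (sizes are arbitrary).
random.seed(0)
T, d_k, d_v = 5, 3, 2
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
values = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(T)]
queries = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]

ttt_out = ttt_linear_outputs(keys, values, queries, lr=0.1)
attn_out = linear_attention_outputs(keys, values, queries, lr=0.1)
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(ttt_out, attn_out)
    for a, b in zip(row_a, row_b)
)
print("max abs difference:", max_diff)
```

The two outputs agree up to floating-point rounding because the accumulated fast weight is just the running sum of lr * v_i k_i^T, which is the state a linear attention layer maintains. The abstract's claim covers a broader class of TTT architectures than this toy case, and the sketch says nothing about how that wider reduction works.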
