End-to-End Test-Time Training for Long Context

Arnuv Tandon; Karan Dalal; Xinhao Li; Daniel Koceja; Marcel Rød; Sam Buchanan; Xiaolong Wang; Jure Leskovec; Sanmi Koyejo; Tatsunori Hashimoto; Carlos Guestrin; Jed McCaleb; Yejin Choi; Yu Sun

arXiv:2512.23675 · cs.LG · January 1, 2026

Open Access

TL;DR

This paper introduces an end-to-end test-time training method (TTT-E2E) for long-context language modeling. It uses a standard Transformer with sliding-window attention that keeps learning during inference via next-token prediction on the given context, compressing that context into its weights.

Contribution

The paper presents an end-to-end test-time training approach whose initialization is meta-learned at training time, allowing a standard Transformer with sliding-window attention to adapt to long contexts without architectural changes.

Findings

01

For 3B models trained with 164B tokens, TTT-E2E scales with context length in the same way as a Transformer with full attention.

02

Achieves 2.7x faster inference than full attention at 128K context length.

03

Maintains constant inference latency regardless of context size.

Abstract

We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture -- a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention,…

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification · Topic Modeling