Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs

Rachit Bansal; Aston Zhang; Rishabh Tiwari; Lovish Madaan; Sai Surya Duvvuri; Devvrit Khatri; David Brandfonbrener; David Alvarez-Melis; Prajjwal Bhargava; Mihir Sanjay Kale; Samy Jelassi

arXiv:2512.13898·cs.LG·December 17, 2025

Open Access · 3 Reviews

TL;DR

This paper identifies limitations of static self-attention in long-context LLMs and proposes a gradient update method during inference that significantly improves performance on long-context tasks.

Contribution

It introduces a simple, provably effective inference-time training approach that overcomes static self-attention limitations in long-context LLMs.

Findings

01. Inference-time strategies show diminishing returns at long context.

02. Proposed method improves performance by 12.6-14.1 percentage points on benchmarks.

03. Targeted gradient updates outperform current inference scaling techniques.

Abstract

Progress on training and architecture strategies has enabled LLMs with millions of tokens in context length. However, empirical evidence suggests that such long-context LLMs can consume far more text than they can reliably use. On the other hand, it has been shown that inference-time compute can be used to scale performance of LLMs, often by generating thinking tokens, on challenging tasks involving multi-step reasoning. Through controlled experiments on sandbox long-context tasks, we find that such inference-time strategies show rapidly diminishing returns and fail at long context. We attribute these failures to score dilution, a phenomenon inherent to static self-attention. Further, we show that current inference-time strategies cannot retrieve relevant long-context signals under certain conditions. We propose a simple method that, through targeted gradient updates on the given…
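To make the "score dilution" claim concrete, here is a minimal numeric sketch. It rests on a toy setting that is an assumption of this illustration, not a construction taken from the abstract: one relevant key whose attention logit exceeds that of every distractor key by a fixed `margin`, with all distractors sharing the same logit. The function names are hypothetical.

```python
import math


def target_attention_mass(margin: float, num_distractors: int) -> float:
    """Softmax weight on a single target key.

    Toy assumption: the target logit exceeds each of `num_distractors`
    identical distractor logits by `margin`.
    """
    return 1.0 / (1.0 + num_distractors * math.exp(-margin))


def margin_needed(mass: float, num_distractors: int) -> float:
    """Logit margin required for the target to hold `mass` of the attention."""
    return math.log(num_distractors * mass / (1.0 - mass))


if __name__ == "__main__":
    # With a fixed margin, the target's share shrinks as the context grows.
    for n in (1_000, 10_000, 100_000, 1_000_000):
        share = target_attention_mass(5.0, n)
        needed = margin_needed(0.5, n)
        print(f"distractors={n:>9}  share={share:.4f}  margin for 50%={needed:.2f}")
```

In this toy setting the margin needed to keep half of the attention on the target grows like the logarithm of the number of distractors, so a model whose logits are fixed at inference time loses the target as the context lengthens.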

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Clever idea to learn only the test time "decoder", not the "encoder"
- Extremely strong performance improvement

Weaknesses

- Training is required during decoding.
- No detailed efficiency study is reported.
- No large model is tested.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

* It formally introduces "score dilution" to explain long-context LLM failures, turning vague issues (e.g., missed key info) into a quantifiable, solvable problem, filling a gap in prior research that lacked clear theoretical grounding for such limitations.
* The proposed query-only Test-Time Training (qTTT) is innovative in its frugality: it reuses frozen KV caches and only updates query projections, avoiding the high compute of full-model fine-tuning or ineffective "thinking tokens" for long
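The mechanism the reviewer describes (keys and values stay frozen, only the query projection receives gradient updates) can be sketched with a single attention head in plain Python. Everything here is a hypothetical illustration: the names `W_q`, `query_only_step`, and `run_demo`, the toy keys, and especially the loss (cross-entropy on a known target position) are stand-ins, since the text above does not state the actual training objective.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attention_probs(W_q, x, keys):
    """Attention weights of the query q = W_q @ x over the cached keys."""
    q = matvec(W_q, x)
    return softmax([sum(qi * ki for qi, ki in zip(q, k)) for k in keys])


def query_only_step(W_q, x, keys, target, lr):
    """One gradient step on -log p[target]; the cached keys are only read."""
    p = attention_probs(W_q, x, keys)
    dim = len(keys[0])
    # dLoss/dq = sum_j p_j * k_j - k_target
    grad_q = [
        sum(p[j] * keys[j][i] for j in range(len(keys))) - keys[target][i]
        for i in range(dim)
    ]
    # dLoss/dW_q = outer(grad_q, x); return an updated copy of W_q.
    return [
        [W_q[i][c] - lr * grad_q[i] * x[c] for c in range(len(x))]
        for i in range(dim)
    ]


def run_demo(steps=50, lr=0.5):
    # Assumed toy cache: one target key (index 0) plus 50 identical distractors.
    keys = [[1.0, 0.0]] + [[0.8, 0.6] for _ in range(50)]
    W_q = [[1.0, 0.0], [0.0, 1.0]]  # query projection, the only trainable part
    x = [1.0, 1.0]                  # hidden state at the querying position
    before = attention_probs(W_q, x, keys)[0]
    for _ in range(steps):
        W_q = query_only_step(W_q, x, keys, target=0, lr=lr)
    after = attention_probs(W_q, x, keys)[0]
    return before, after, keys


if __name__ == "__main__":
    before, after, _ = run_demo()
    print(f"target attention before: {before:.4f}, after: {after:.4f}")
```

Because the keys never change, nothing in the cache needs to be recomputed between steps, which is the source of the frugality the reviewer points to.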

Weaknesses

* While the paper highlights compute efficiency, it does not measure inference latency (critical for production) when qTTT is added, leaving it unclear whether its small compute overhead translates to acceptable delays for time-sensitive tasks (e.g., real-time code debugging).
* It does not explore how qTTT performs with noisy or low-quality long texts (e.g., unstructured logs, messy code), where distractors are more prevalent, which limits understanding of its robustness beyond clean benchmark datasets.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper proposes a compute-aware design. Prefilling once, reusing the KV cache, and matching FLOPs to thinking tokens make for a fair comparison and a practical recipe.
2. The benchmarks in the paper are sufficient. It evaluates the proposed method across model sizes and multiple long-context benchmarks.
3. The idea is really cute. Updating the parameters at inference time is novel in the community.

Weaknesses

1. The theoretical analysis is too naive to capture the main motivation. Concretely, it cannot prove that score dilution is the main reason for the poor performance.
   * For example, it is unclear whether the poor performance comes from the small number of training samples. Usually, learning more complex abilities, i.e., solving problems with longer context, requires more training samples than learning a simple ability. The poor performance could simply come from the relatively small number of samples.

Taxonomy

Topics: Topic Modeling · Software System Performance and Reliability · Advanced Graph Neural Networks