ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time

Kiarash Zahirnia; Zahra Golpayegani; Walid Ahmed; Yang Liu

arXiv:2507.06313·cs.CL·October 23, 2025

TL;DR

This paper introduces ETT, a method that extends the context length of Transformer-based LLMs at test time with minimal additional memory, significantly improving both their ability to process longer sequences and their accuracy.

Contribution

ETT enables test-time extension of LLM context length with constant memory and linear computation, using efficient fine-tuning on input chunks, and identifies optimal modules for fine-tuning.
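The constant-memory, linear-computation claim can be made concrete with a rough cost sketch. The notation here is an assumption introduced for illustration and is not taken from the paper: n is the input length in tokens, c the chunk size, o the overlap between consecutive chunks, s = c - o the stride, and E the number of fine-tuning epochs. The sketch also assumes chunks are processed one at a time and that per-chunk cost is dominated by quadratic self-attention.

```latex
% Number of overlapping chunks covering n tokens (for n >= c):
\[
  m = \left\lceil \frac{n - c}{s} \right\rceil + 1, \qquad s = c - o .
\]
% Each chunk costs O(c^2) in attention, so E epochs over m chunks cost
\[
  O\!\left(E \, m \, c^{2}\right) = O\!\left(\frac{E \, c^{2}}{s} \, n\right),
\]
% which is linear in n for fixed c, s, and E. Peak activation memory
% depends only on the chunk being processed:
\[
  O\!\left(c^{2}\right) \quad \text{versus} \quad O\!\left(n^{2}\right)
  \text{ for full attention over the whole input.}
\]
```

Under these assumptions the constant factor E c^2 / s can still be large, which is consistent with the reviewers' concern about the cost of fine-tuning for many epochs on every input.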

Findings

01

Extended the context length of GPT-Large and Phi-2 from 1k up to 32k tokens.

02

Achieved up to 30% accuracy improvement.

03

Fine-tuning the second FFN layer is most effective.

Abstract

The computation and memory overhead of Transformer-based language models increases quadratically with sequence length. This quadratic cost poses challenges when employing LLMs to process long sequences. In this work, we introduce ETT (Extend at Test-Time), a method for extending the context length of short-context Transformer-based LLMs with a constant memory requirement and linear computation overhead. ETT enables the extension of the context length at test time by efficiently fine-tuning the model's parameters on the input context, chunked into overlapping small subsequences. We evaluate ETT on LongBench by extending the context length of GPT-Large and Phi-2 up to 32 times, increasing from 1k to 32k tokens. This results in up to a 30 percent improvement in the model's accuracy. We also study how context can be stored in an LLM's weights effectively and efficiently. Through…
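The chunking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `chunk_with_overlap`, its parameters, and the choice to keep a shorter final chunk are all assumptions made for illustration.

```python
from typing import List, Sequence


def chunk_with_overlap(
    tokens: Sequence[int], chunk_size: int, overlap: int
) -> List[List[int]]:
    """Split a token sequence into overlapping chunks.

    Hypothetical helper: consecutive chunks share `overlap` tokens, and the
    final chunk may be shorter than `chunk_size` (an assumed policy).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if len(tokens) == 0:
        return []

    stride = chunk_size - overlap  # how far the window advances each step
    chunks: List[List[int]] = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the input
        start += stride
    return chunks


if __name__ == "__main__":
    # Illustrative sizes only; the paper's actual settings are not shown here.
    demo = chunk_with_overlap(list(range(10)), chunk_size=4, overlap=2)
    for chunk in demo:
        print(chunk)
```

Each chunk would then serve as one short training sequence for test-time fine-tuning, so the model never sees more than `chunk_size` tokens at once. As a rough sense of scale, a 32,768-token input with a hypothetical chunk size of 1,024 and overlap of 512 yields 63 chunks per pass over the context.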

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The method achieves constant memory footprint and linear computational complexity with respect to context length.
2. The systematic investigation of which modules to fine-tune (FFNs, attention, specific layers) provides valuable insights.
3. The paper provides sufficient implementation details (chunk size, overlap, learning rate, etc.) for reproducibility.

Weaknesses

1. The paper has limited novelty. Test-time training for language models is well-established. The primary contribution is applying this to long-context understanding, which is incremental.
2. Fine-tuning for 10 epochs at test time for every input incurs massive computational overhead, making the method not practical in real-world use cases.
3. The testing scope is very limited. Only two small models are evaluated, without any modern long-context technique baselines like the PE extrapolation meth

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. ETT offers a plug-and-play mechanism for improving long-context performance on existing models (e.g., Phi-2, GPT-Large) without re-training or additional memory, making it potentially valuable for deployment in resource-constrained environments.

Weaknesses

1. Experiments are restricted to small models (Phi-2, GPT-Large) and a single benchmark (LongBench). Results on stronger baselines (Llama3, Mistral, Qwen2.5) and more diverse tasks would significantly strengthen the paper. Small models often fail to show the same qualitative behavior in memory usage, optimization stability, or attention saturation seen in larger LLMs. More benchmarks are also needed.
2. This paper lacks baselines. Other methods like LongLoRA and Position Interpolation are typicall

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper compares against several baselines, including LIFT and SelfExtend.
2. It chooses a decently long generative task for evaluation (i.e., LongBench).
3. It proposes a novel idea of fine-tuning on a single test sample before generation, which leads to accuracy improvement.

Weaknesses

1. The baseline is rather weak as a “long context” model.
2. The experiments use 10 epochs to fine-tune the model for every single input, and the cost of doing so is not well studied or explained. The reviewer personally considers this less viable as a method for large models.
3. Neither GPT-Large nor Phi-2 uses any out-of-the-box tricks for context extension (e.g. tuning the RoPE parameters as suggested by https://arxiv.org/abs/2309.16039), and ETT should be at least compared with changing the RoPE and fi

Reviewer 04 · Rating 4 · Confidence 4

Strengths

ETT addresses a pressing practical bottleneck in using LLMs for long-context tasks, trading off memory inefficiency for test-time adaptation without architectural modification.

Weaknesses

1. The overall contribution of the paper shows limited originality. The proposed ETT method merely applies the existing Test-Time Training concept to long-context inference without introducing substantial theoretical analysis or methodological innovation.
2. The review of classic long-context Transformer variants is missing, such as Compressive Transformer [1], BigBird [2], and Transformer-XL [3].
3. The paper does not compare its models with recent state-of-the-art long-context large models suc

Taxonomy

Topics: Semantic Web and Ontologies · Scientific Computing and Data Management · Wikis in Education and Collaboration