PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning
Zeming Chen, Angelika Romanou, Gail Weiss, Antoine Bosselut

TL;DR
PERK introduces a scalable, parameter-efficient method for long-context reasoning that uses test-time learning with lightweight adapters, significantly improving performance over prompt-based methods while maintaining inference efficiency.
Contribution
PERK proposes a novel nested optimization approach with low-rank adapters for scalable, memory-efficient long-context reasoning during test time.
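As a rough illustration of why a low-rank adapter is parameter-efficient, the sketch below counts trainable parameters for a single weight matrix under the standard LoRA factorization (an update B·A, with B of shape d_out × r and A of shape r × d_in). The layer size and rank are hypothetical values chosen for illustration, not figures from the paper.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters updated when only the low-rank factors A and B are trained."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank (assumptions, not taken from the paper).
d_in, d_out, rank = 4096, 4096, 16
full = full_param_count(d_in, d_out)
lora = lora_param_count(d_in, d_out, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
```

Fewer trainable parameters mean fewer gradients and optimizer states to hold in memory per adapted layer, which is the general motivation for low-rank adapters in memory-constrained settings.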
Findings
PERK achieves up to 90% performance improvement on small models.
PERK outperforms prompt-based baselines in robustness and reasoning complexity.
It scales more efficiently at inference despite memory-intensive training.
Abstract
Long-context reasoning requires accurately identifying relevant information in extensive, noisy input contexts. Previous research shows that using test-time learning to encode context directly into model parameters can effectively enable reasoning over noisy information. However, meta-learning methods for enabling test-time learning are prohibitively memory-intensive, preventing their application to long context settings. In this work, we propose PERK (Parameter Efficient Reasoning over Knowledge), a scalable approach for learning to encode long input contexts using gradient updates to a lightweight model adapter at test time. Specifically, PERK employs two nested optimization loops in a meta-training phase. The inner loop rapidly encodes contexts into a low-rank adapter (LoRA) that serves as a parameter-efficient memory module for the base model. Concurrently, the outer loop learns to…
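The inner loop described above can be sketched on a toy problem. The example below is a minimal illustration, not the authors' implementation: it replaces the language model with a single frozen linear layer and the language-modeling loss with squared error, and the names (`encode_context`, `W0`, `A`, `B`) and hyperparameters are assumptions. Only the adapter factors receive gradient updates; the meta-trained outer loop is not shown.

```python
import numpy as np


def encode_context(W0, X, Y, rank=2, steps=300, lr=0.05, seed=0):
    """Fit a low-rank update B @ A on context pairs (X, Y); W0 stays frozen."""
    rng = np.random.default_rng(seed)
    d_out, d_in = W0.shape
    A = rng.normal(scale=0.1, size=(rank, d_in))
    B = np.zeros((d_out, rank))  # zero init: the adapter starts as a no-op
    n = X.shape[0]
    losses = []
    for _ in range(steps):
        err = X @ (W0 + B @ A).T - Y          # residuals, shape (n, d_out)
        losses.append(float(np.mean(np.sum(err ** 2, axis=1))))
        G = 2.0 * err.T @ X / n               # dL/d(W0 + B @ A), shape (d_out, d_in)
        grad_B, grad_A = G @ A.T, B.T @ G     # chain rule through the factorization
        B -= lr * grad_B                      # gradient step on the adapter only
        A -= lr * grad_A
    return A, B, losses


# Toy "context": targets come from the frozen layer plus a small rank-2 shift.
rng = np.random.default_rng(1)
W0 = rng.normal(size=(4, 8))
W0_before = W0.copy()
delta = 0.1 * rng.normal(size=(4, 2)) @ rng.normal(size=(2, 8))
X = rng.normal(size=(64, 8))
Y = X @ (W0 + delta).T

A, B, losses = encode_context(W0, X, Y)
print(f"context loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

After the loop, the context-specific information lives entirely in `A` and `B`, so the base weights can be reused unchanged for the next context.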
Peer Reviews
Decision: ICLR 2026 Poster
- Paper is well written, and the results are nice.
- PERK demonstrates strong generalization to long context extrapolation (e.g., training on 8K, testing on 128K) and superior robustness to positional biases in the relevant information.
- The idea of using a LoRA adapter as a differentiable memory module for context is a good alternative to ICL.
- The use of LoRA and TGU is nice, and seems well executed.
- The biggest concern is the added complexity. The gains are nice, but I'm not sure they justify using this method over FT-ICR.
- Related to the added complexity, in general the inference time is significantly longer than ICR (at least up to 34k according to fig 7).
- PERK is designed for length generalization, as noted in the experiments (train on 8k, eval on 32k). However, FT-ICR is not suited for this kind of evaluation, as is well known. So I'm not sure this is the best baseline to compare ag…
- The paper offers a wide-ranging experimental study across multiple long-context reasoning benchmarks, including Needle-in-a-Haystack, HotpotQA, TriviaQA, and the newly introduced Drops-in-the-Ocean (DIO) dataset. The introduction of DIO, which features distributionally similar distractors to better test reasoning precision, strengthens the evaluation by addressing limitations of prior benchmarks. Comparisons against both open-source and commercial LLMs convincingly demonstrate PERK’s superiori…
- The authors don't mention any public release of the code.
- In section 2 of the appendix I see you only ran your experiments setting temperature to 0. Have you tried other values and what results have you found? I'd be curious to see the robustness of PERK at different decoding parameters.
- While I didn't find any major weakness, what prevented me from giving a strong accept to this paper is the lack of a clear metric of inference cost for PERK. For this method to be useful to researchers and ML practitio…
* The test-time learning problem is relatively useful to further improve the model performance.
* Figure 1 presents the pipeline of this method, and the inner loop is clear.
* The method compares the performance with FT-ICR, proving that PERK is better.
* The experiments have various models, supporting that PERK is general.
* Is the method test-time training? According to the definition of the outer loop, Equation 3 needs the label of the question. However, during test time, there is no such label.
* Figure 4 is not related to length extrapolation. The Qwen-2.5-0.5B is trained with 32K by the Qwen Team. However, Figure 4 presents the performance with a maximum of 32K, which is NOT longer than the Qwen-2.5-0.5B training length. This is a misclaim.
* It should provide the training cost, such as time cost, for the compar…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Multimodal Machine Learning Applications
