NeedleChain: Measuring Intact Context Comprehension Capability of Large Language Models
Hyeonseok Moon, Heuiseok Lim

TL;DR
NeedleChain is a new benchmark designed to rigorously evaluate large language models' ability to fully understand and incorporate all provided context, addressing limitations of previous benchmarks that overemphasized snippet retrieval.
Contribution
The paper introduces NeedleChain, a benchmark with three variants for assessing context comprehension, and proposes ROPE contraction, a training-free strategy intended to help models integrate the full context.
Findings
Current benchmarks overestimate LLMs' true context understanding.
Even GPT-4o struggles with integrating 200 tokens of query-relevant text.
ROPE contraction enhances models' full-context integration capabilities.
Abstract
Recent reports suggest that LLMs can handle increasingly long contexts. However, many existing benchmarks for context understanding embed substantial query-irrelevant content, which shifts evaluation toward retrieving relevant snippets rather than fully integrating all provided information. We argue that, under this setting, current benchmarks can overestimate the true context-understanding ability of LLMs. In particular, we demonstrate that when the context consists entirely of query-relevant text, even advanced models such as GPT-4o fail to reliably integrate inputs as short as 200 tokens. To evaluate this capability more rigorously, we introduce NeedleChain, a benchmark designed to test whether models can faithfully incorporate all given evidence. NeedleChain includes three variants that differ in the required order of comprehension, along with a parallel benchmark based on the…
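To make the idea of a context that "consists entirely of query-relevant text" concrete, here is a toy sketch in Python. It is only an illustration, not the authors' data generator: the function name `build_chain`, the coin-counting sentences, and the `forward` / `backward` / `mixed` order labels are all assumptions introduced here, since the abstract above is cut off before the construction is described. Each sentence depends on the previous one, so dropping any single sentence makes the question unanswerable.

```python
import random


def build_chain(names, start_value, deltas, order="forward", seed=0):
    """Build a toy context in which every sentence is needed for the answer.

    Hypothetical illustration only; not the NeedleChain generator.
    """
    if len(deltas) != len(names) - 1:
        raise ValueError("need exactly one delta per link in the chain")

    # The first needle states a value directly.
    needles = [f"{names[0]} has {start_value} coins."]
    value = start_value

    # Every later needle is defined relative to the previous person.
    for prev, cur, delta in zip(names, names[1:], deltas):
        needles.append(f"{cur} has {delta} more coins than {prev}.")
        value += delta

    # Assumed order variants: the order the sentences appear in the context
    # versus the order in which they must be resolved.
    if order == "backward":
        needles.reverse()
    elif order == "mixed":
        random.Random(seed).shuffle(needles)
    elif order != "forward":
        raise ValueError(f"unknown order: {order}")

    question = f"How many coins does {names[-1]} have?"
    return " ".join(needles), question, value


context, question, answer = build_chain(["Ann", "Bob", "Cy"], 10, [3, 4])
print(context)
print(question, "->", answer)
```

A model would be scored on whether its answer matches `answer`. Because no sentence in the context is a distractor, a correct answer requires using all of them rather than retrieving a single snippet.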
