NeedleChain: Measuring Intact Context Comprehension Capability of Large Language Models

Hyeonseok Moon; Heuiseok Lim

arXiv:2507.22411·cs.CL·January 5, 2026

TL;DR

NeedleChain is a new benchmark designed to rigorously evaluate large language models' ability to fully understand and incorporate all provided context, addressing limitations of previous benchmarks that overemphasized snippet retrieval.

Contribution

The paper introduces NeedleChain, a novel benchmark with variants for assessing context comprehension, and proposes a training-free strategy called ROPE contraction to improve model reasoning.

Findings

01. Current benchmarks can overestimate LLMs' true context-understanding ability.

02. Even GPT-4o fails to reliably integrate inputs as short as 200 tokens when the context consists entirely of query-relevant text.

03. ROPE contraction improves models' ability to integrate the full context.

Abstract

Recent reports suggest that LLMs can handle increasingly long contexts. However, many existing benchmarks for context understanding embed substantial query-irrelevant content, which shifts evaluation toward retrieving relevant snippets rather than fully integrating all provided information. In this setting, we argue that current benchmarks can overestimate the true context-understanding ability of LLMs. In particular, we demonstrate that when the context consists entirely of query-relevant text, even advanced models such as GPT-4o fail to reliably integrate inputs as short as 200 tokens. To evaluate this capability more rigorously, we introduce NeedleChain, a benchmark designed to test whether models can faithfully incorporate all given evidence. NeedleChain includes three variants that differ in the required order of comprehension, along with a parallel benchmark based on the…

Code & Models

Datasets

hyeonsss/needlechain
dataset · 221 downloads
