TL;DR
This study systematically investigates how the size of the gold (answer-containing) context affects large language models' ability to perform long-context question answering across diverse domains, revealing that smaller gold contexts significantly hinder performance.
Contribution
This is the first comprehensive analysis of how gold context size affects LLMs, demonstrating that it plays a critical role independent of other factors across multiple benchmarks and models.
Findings
Smaller gold contexts lead to lower model accuracy.
Performance degradation is consistent across different domains and models.
Gold context size is an independent predictor of success.
Abstract
Large language models (LLMs) face significant challenges with needle-in-a-haystack tasks, where relevant information ("the needle") must be drawn from a large pool of irrelevant context ("the haystack"). Previous studies have highlighted positional bias and distractor quantity as critical factors affecting model performance, yet the influence of gold context size, the length of the answer-containing document, has received little attention. We present the first systematic study of gold context size in long-context question answering, spanning three diverse benchmarks (general knowledge, biomedical reasoning, and mathematical reasoning), eleven state-of-the-art LLMs (including recent reasoning models), and more than 150K controlled runs. Our experiments reveal that LLM performance drops sharply when the gold context is shorter, i.e., smaller gold contexts consistently degrade model…
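The setup the abstract describes (a gold document of varying length placed among distractors) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names `gold_variants` and `build_prompt`, the prompt template, and the idea of growing the gold passage by appending supporting sentences are all assumptions made for the example.

```python
def gold_variants(answer_sentence, padding_sentences, sizes):
    """Build gold passages of increasing length (hypothetical helper).

    Each variant is the answer-bearing sentence followed by the first n
    padding sentences, so every variant contains the answer exactly once.
    """
    return {n: " ".join([answer_sentence] + padding_sentences[:n]) for n in sizes}


def build_prompt(question, gold, distractors, gold_position):
    """Insert the gold passage at index gold_position among the distractors.

    The "Document i:" template is an assumption, not the paper's format.
    """
    if not 0 <= gold_position <= len(distractors):
        raise ValueError("gold_position must be between 0 and len(distractors)")
    docs = list(distractors)
    docs.insert(gold_position, gold)
    body = "\n\n".join(f"Document {i + 1}:\n{doc}" for i, doc in enumerate(docs))
    return f"{body}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    variants = gold_variants(
        "The capital of France is Paris.",
        ["It lies on the Seine.", "It hosts the Louvre."],
        sizes=[0, 1, 2],
    )
    # Same distractors and position; only the gold passage length changes.
    for n, gold in variants.items():
        prompt = build_prompt("What is the capital of France?", gold, ["d1", "d2", "d3"], 1)
        print(n, len(gold), len(prompt))
```

Holding the distractors and the gold position fixed while sweeping the gold length isolates one variable per run; sweeping `gold_position` as well would expose any interaction with positional bias.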
Peer Reviews
Decision·Submitted to ICLR 2026
- main finding is clear
- strong empirical evidence supporting the impact of needle size across different domains and models
- confounder checks (answer repetition, gold/distractor ratio, etc.) strengthen the core claim
- The proposed mitigation (balance sizes of needles, L469-470) seems oversimplified and unconfirmed. Since the gold needle isn't known a priori, we cannot enlarge only the gold; making all passages similarly long may be the only option. Because experiments vary only the gold size, it's unclear whether balancing helps in practice. Evaluating a setup where all passages (not just gold) are large (and/or of other but equal length) could clarify this.
- The single-needle setup mainly probes retrieval
1. The study is comprehensive. The sheer scale of the experiments stands out: the authors tested 11 modern LLMs (including strong proprietary and open-weight models) on three diverse tasks (biomedical, general QA, and math). This provides strong evidence that the findings are not a model-specific or task-specific fluke.
2. The paper's most valuable contribution is not just that "size matters," but its demonstration of the interaction between gold context size and positional bias.
3. The findings
My main concerns are with the experimental design, which seems to entangle the core variable of interest ("absolute size") with other, more powerful confounding variables.
1. Entangled variables (size vs. ratio): The study fixes the distractor (haystack) size. This means gold_context_size and gold_to_distractor_ratio are perfectly correlated. A "small needle" is always a "low signal-to-noise ratio," and a "large needle" is always a "high ratio." The post-hoc analysis in Sec 4.3 does not adequate
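The entanglement this reviewer describes can be made concrete with a small calculation. The sketch below is illustrative only: the token counts, the 1:20 target ratio, and the function names are hypothetical and not taken from the paper. It shows that with a fixed haystack the ratio moves in lockstep with gold size, whereas scaling the haystack with the gold passage would hold the ratio constant.

```python
from fractions import Fraction

GOLD_SIZES = [100, 500, 2500]       # hypothetical gold lengths, in tokens
FIXED_DISTRACTOR_TOKENS = 10_000    # hypothetical fixed haystack size
TARGET_RATIO = Fraction(1, 20)      # hypothetical constant gold:distractor ratio


def ratios_with_fixed_haystack(gold_sizes, distractor_tokens):
    """Ratio per gold size when the haystack is fixed: ratio grows with size."""
    return [Fraction(g, distractor_tokens) for g in gold_sizes]


def haystacks_for_constant_ratio(gold_sizes, target_ratio):
    """Haystack size per gold size that keeps gold:distractor at target_ratio."""
    return [int(g / target_ratio) for g in gold_sizes]


if __name__ == "__main__":
    for g, r in zip(GOLD_SIZES, ratios_with_fixed_haystack(GOLD_SIZES, FIXED_DISTRACTOR_TOKENS)):
        print(f"fixed haystack: gold={g:5d} ratio={float(r):.3f}")
    for g, d in zip(GOLD_SIZES, haystacks_for_constant_ratio(GOLD_SIZES, TARGET_RATIO)):
        print(f"constant ratio: gold={g:5d} distractors={d:6d}")
```

A constant-ratio design changes total input length instead, so it trades one confound for another; separating all three factors would need a crossed design.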
1. This paper proposes an interesting and intuitive factor affecting NIAH performance, i.e., the gold context length. Studying this factor challenges the notion that "longer inputs always lead to performance degradation" in practical scenarios.
2. The paper conducts extensive experiments to demonstrate that shorter gold contexts reduce LLM performance in NIAH tasks and that performance in this setting is more sensitive to position.
1. The most significant issue is that this paper only points out the problem without conducting a mechanistic analysis. This makes it difficult to determine whether this is a temporary issue or a fundamental deficiency in LLMs. Furthermore, the paper does not test on the latest models known for excellent long-context performance, such as Gemini 2.5 Pro and Claude. Notably, Figure 3 shows that o3-mini is less affected by changes in context length, which heightens my concern about the significance
