Not All Needles Are Found: How Fact Distribution and Don't Make It Up Prompts Shape Literal Extraction, Logical Inference, and Hallucination Risks in Long-Context LLMs

Amirali Ebrahimzadeh; Seyyed M. Salili

arXiv:2601.02023 · cs.CL · January 6, 2026

Open Access

TL;DR

This study examines how fact placement, corpus-level fact distribution, and anti-hallucination ("Don't Make It Up") prompts affect literal extraction, logical inference, and hallucination risk in four long-context large language models.

Contribution

The paper introduces an extended needle-in-a-haystack benchmark that evaluates literal extraction, logical inference, and hallucination risk separately, showing how context length and fact distribution affect model reliability.

Findings

01

Longer contexts do not always improve performance and can harm accuracy when evidence is dispersed.

02

Model robustness varies significantly, with some models degrading under realistic long-context conditions.

03

Anti-hallucination prompts can reduce hallucinations but may also decrease extraction and inference accuracy.

Abstract

Large language models (LLMs) increasingly support very long input contexts. Yet it remains unclear how reliably they extract and infer information at scale. Performance varies with context length and strongly interacts with how information is distributed in real-world corpora. Motivated by these observations, we study how fact placement, corpus-level fact distributions, and Don't Make It Up prompts influence model behavior. We introduce an extended needle-in-a-haystack benchmark across four production-scale models: Gemini-2.5-flash, ChatGPT-5-mini, Claude-4.5-haiku, and Deepseek-v3.2-chat. Unlike prior work, we separately evaluate literal extraction, logical inference, and hallucination risk. Our study considers both positional effects and realistic distributions of evidence across long contexts, as well as prompts that explicitly discourage fabrication. We find that longer contexts…
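To make the benchmark idea concrete, here is a minimal sketch of how a needle-in-a-haystack test case could be assembled: fact sentences ("needles") are inserted into filler text at chosen relative depths, and an optional instruction discouraging fabrication is appended to the prompt. This is an illustration under stated assumptions, not the authors' implementation. The function names `place_needles` and `build_prompt`, the depth convention (0.0 = start, 1.0 = end), and the wording of `GUARD_TEXT` are all hypothetical.

```python
# Hypothetical wording; the paper's actual "Don't Make It Up" prompt is not given here.
GUARD_TEXT = "If the answer is not in the context, say you do not know. Do not make it up."


def place_needles(filler, needles, depths):
    """Insert each needle before the filler sentence at its relative depth.

    filler  -- list of filler sentences (the haystack)
    needles -- list of fact sentences to hide
    depths  -- one float in [0.0, 1.0] per needle (0.0 = start, 1.0 = end)
    """
    if len(needles) != len(depths):
        raise ValueError("one depth per needle is required")
    slots = {}
    for needle, depth in zip(needles, depths):
        if not 0.0 <= depth <= 1.0:
            raise ValueError("depth must lie in [0.0, 1.0]")
        # Convert the relative depth to a gap index between filler sentences.
        index = int(depth * len(filler))
        slots.setdefault(index, []).append(needle)
    result = []
    for i in range(len(filler) + 1):
        result.extend(slots.get(i, []))  # needles assigned to this gap
        if i < len(filler):
            result.append(filler[i])
    return result


def build_prompt(context_sentences, question, guard=False):
    """Join context and question; optionally append the anti-fabrication text."""
    parts = [" ".join(context_sentences), "Question: " + question]
    if guard:
        parts.append(GUARD_TEXT)
    return "\n\n".join(parts)


if __name__ == "__main__":
    haystack = place_needles(
        ["Filler one.", "Filler two.", "Filler three.", "Filler four."],
        ["The code is 7421."],
        [0.5],
    )
    print(build_prompt(haystack, "What is the code?", guard=True))
```

Spreading several needles over different depths approximates dispersed evidence, while a single depth isolates a positional effect. Comparing responses with `guard=True` and `guard=False` is one way to separate the prompt's effect from the placement effect.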

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Computational and Text Analysis Methods