Beyond the Needle's Illusion: Decoupled Evaluation of Evidence Access and Use under Semantic Interference at 326M-Token Scale

Tianwei Lin; Zuyi Zhou; Xinda Zhao; Chenke Wang; Xiaohong Li; Yu Chen; Chuanrui Hu; Jian Pei; Yafeng Deng

arXiv:2601.20276·cs.CL·January 29, 2026

TL;DR

This paper introduces a benchmark and diagnostic protocol for evaluating evidence access and use in long-context language models, showing that benign NIAH-style evaluation overstates performance under semantic interference.

Contribution

The paper presents EverMemBench-S, an adversarial benchmark with a decoupled diagnostic protocol to better assess evidence retrieval and utilization in long-context models under semantic interference.

Findings

1. Evidence access degrades sharply under semantic interference.
2. Benign NIAH evaluations overestimate real-world performance.
3. Semantic discrimination is the main bottleneck for long-context memory.

Abstract

Long-context LLM agents must access the right evidence from large environments and use it faithfully. However, the popular Needle-in-a-Haystack (NIAH) evaluation mostly measures benign span localization. The needle is near-unique, and the haystack is largely irrelevant. We introduce EverMemBench-S (EMB-S), an adversarial NIAH-style benchmark built on a 326M-token MemoryBank. While the full MemoryBank spans 326M tokens for retrieval-based (RAG) evaluation, we evaluate native long-context models only at scales that fit within each model's context window (up to 1M tokens in this work) to ensure a fair comparison. EMB-S pairs queries with collision-tested near-miss hard negatives and gold evidence sets spanning one or more documents, validated via human screening and LLM verification. We also propose a decoupled diagnostic protocol that reports evidence access (document-ID localization)…
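The abstract describes evidence access as document-ID localization against a gold evidence set that may span one or more documents. The sketch below illustrates one way such a score could be computed. It is an assumption for illustration only: the function name `evidence_access_scores`, the choice of set-based recall, precision, and full coverage, and the document IDs are all hypothetical and are not taken from the paper, whose exact metric definition is not shown in this excerpt.

```python
from typing import Dict, Iterable, Union


def evidence_access_scores(
    predicted_ids: Iterable[str], gold_ids: Iterable[str]
) -> Dict[str, Union[float, bool]]:
    """Score document-ID localization against a gold evidence set.

    Hypothetical metric choices (not from the paper):
    - recall: fraction of gold documents the model cited
    - precision: fraction of cited documents that are gold
    - full_coverage: True only if every gold document was cited
    """
    predicted = set(predicted_ids)
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold evidence set must be non-empty")

    hits = predicted & gold
    recall = len(hits) / len(gold)
    # An empty prediction cites nothing, so precision is defined as 0.0 here.
    precision = len(hits) / len(predicted) if predicted else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "full_coverage": gold <= predicted,
    }


# Hypothetical example: two gold documents, plus one near-miss hard negative
# ("doc_99") that the model cited by mistake.
scores = evidence_access_scores(
    predicted_ids=["doc_17", "doc_42", "doc_99"],
    gold_ids=["doc_17", "doc_42"],
)
print(scores)
```

Scoring access separately from the final answer makes it possible to tell a model that never located the evidence apart from one that located it but then used it incorrectly, which is the distinction the decoupled protocol is meant to expose.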

Taxonomy

Topics: Topic Modeling · Information Retrieval and Search Behavior · Multimodal Machine Learning Applications