Needle in the Web: A Benchmark for Retrieving Targeted Web Pages in the Wild
Yumeng Wang, Tianyu Fan, Lingrui Xu, Chao Huang

TL;DR
This paper introduces Needle in the Web, a benchmark that evaluates how well language models and search agents retrieve the most relevant web page for vague, exploratory queries across multiple domains. Its results show that current systems handle such queries poorly.
Contribution
The paper presents Needle in the Web, a novel benchmark designed to assess fuzzy web retrieval capabilities of LLMs and agents on ambiguous, real-world queries, filling a gap in existing benchmarks.
Findings
Most models score below 35% accuracy on the benchmark.
Current systems struggle with fuzzy, ambiguous web queries.
The benchmark exposes significant challenges in semantic web retrieval.
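To make the accuracy figure concrete, here is a minimal sketch of how a page-level retrieval score could be computed. It assumes each query has exactly one target page and that a prediction counts as correct when its normalized URL matches the target's. That matching rule, and the names normalize_url and retrieval_accuracy, are illustrative assumptions and not the benchmark's actual evaluation protocol.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to host + path so trivially different forms compare equal.

    Assumption for illustration: scheme, a leading "www.", a trailing slash,
    query string, and fragment are all ignored.
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return f"{host}{path}"


def retrieval_accuracy(predicted: list[str], targets: list[str]) -> float:
    """Fraction of queries whose predicted page matches the target page."""
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        normalize_url(p) == normalize_url(t) for p, t in zip(predicted, targets)
    )
    return hits / len(targets)


if __name__ == "__main__":
    # Hypothetical predictions and targets, not data from the paper.
    preds = ["https://example.com/a", "https://example.com/x"]
    golds = ["http://www.example.com/a/", "https://example.com/b"]
    print(retrieval_accuracy(preds, golds))
```

Under this scheme, a score below 35% would mean that the system returned the target page for fewer than 35 of every 100 queries.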
Abstract
Large Language Models (LLMs) have evolved from simple chatbots into sophisticated agents capable of automating complex real-world tasks, where browsing and reasoning over live web content is key to assessing retrieval and cognitive skills. Existing benchmarks like BrowseComp and xBench-DeepSearch emphasize complex reasoning searches requiring multi-hop synthesis but neglect Fuzzy Exploratory Search, namely queries that are vague and multifaceted, where users seek the most relevant webpage rather than a single factual answer. To address this gap, we introduce Needle in the Web, a novel benchmark specifically designed to evaluate modern search agents and LLM-based systems on their ability to retrieve and reason over real-world web content in response to ambiguous, exploratory queries under varying levels of difficulty. Needle in the Web comprises 663 questions spanning seven distinct…
Taxonomy
Topics: Topic Modeling · Information Retrieval and Search Behavior · Multimodal Machine Learning Applications
