Evaluating the Search Agent in a Parallel World
Jiawei Chen, Xintian Shen, Lihao Zheng, Lifu Mu, Haoyi Sun, Ning Mao, Hao Ma, Tao Wei, Pan Zhou, Kun Zhan

TL;DR
This paper introduces a novel framework and benchmark for evaluating search agents in a controlled, synthetic environment called the Parallel World, addressing challenges of real-world evaluation such as data reliability and temporal obsolescence.
Contribution
The authors propose Mind-ParaWorld, an evaluation framework built on a synthetic environment, together with the MPW-Bench benchmark, enabling reproducible and reliable assessment of search agents without the limitations of real-world evaluation.
Findings
Search agents excel at evidence synthesis when given complete information
Performance is limited by evidence-sufficiency judgment and stopping decisions
The framework reveals bottlenecks in evidence collection and coverage
Abstract
Integrating web search tools has significantly extended the capability of LLMs to address open-world, real-time, and long-tail problems. However, evaluating these Search Agents presents formidable challenges. First, constructing high-quality deep search benchmarks is prohibitively expensive, while unverified synthetic data often suffers from unreliable sources. Second, static benchmarks face dynamic obsolescence: as internet information evolves, complex queries requiring deep research often degrade into simple retrieval tasks due to increased popularity, and ground truths become outdated due to temporal shifts. Third, attribution ambiguity confounds evaluation, as an agent's performance is often dominated by its parametric memory rather than its actual search and reasoning capabilities. Finally, reliance on specific commercial search engines introduces variability that hampers…
