iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics
Preetam Prabhu Srikar Dammu, Arnav Palkhiwala, Tanya Roosta, Chirag Shah

TL;DR
iAgentBench is a new benchmark designed to evaluate the ability of information-seeking agents to perform complex sensemaking tasks involving multiple sources, addressing limitations of existing QA benchmarks.
Contribution
The paper introduces iAgentBench, a dynamic, realistic benchmark that assesses higher-level information synthesis and reasoning in open-domain question answering systems.
Findings
Retrieval improves accuracy but is not sufficient on its own.
Existing benchmarks do not effectively measure multi-source sensemaking.
Evaluation of evidence use is crucial for understanding system capabilities.
Abstract
With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many widely used QA benchmarks remain answerable by retrieving a single relevant passage, making them poorly suited for measuring cross-source sensemaking, such as integrating evidence, tracking causal links, and resolving dependencies across facets of a topic. We present iAgentBench, a dynamic ODQA benchmark that targets these higher-level information needs while keeping questions natural and grounded in realistic information-seeking behavior. iAgentBench draws seed topics from real-world attention signals and uses common user intent patterns to construct user-like questions whose answers require combining evidence from multiple sources, not just extracting a single snippet. Each instance is…
Taxonomy
Topics: Topic Modeling · Information Retrieval and Search Behavior · Expert finding and Q&A systems
