DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

Nikita Gupta; Riju Chatterjee; Lukas Haas; Connie Tao; Andrew Wang; Chang Liu; Hidekazu Oiwa; Elena Gribovskaya; Jan Ackermann; John Blitzer; Sasha Goldshtein; Dipanjan Das

arXiv:2601.20975 · cs.CL · January 30, 2026

Open Access

TL;DR

DeepSearchQA introduces a challenging benchmark with 900 multi-step, open-web information-seeking tasks across 17 fields, exposing current agent limitations in complex search, reasoning, and answer precision.

Contribution

The paper presents a novel benchmark dataset designed to evaluate deep research agents on complex, multi-step information retrieval and reasoning tasks, highlighting current performance gaps.

Findings

01. State-of-the-art agents struggle to balance recall and precision.

02. Agents often stop prematurely or over-generate answers, revealing weaknesses in planning and confidence estimation.

03. DeepSearchQA serves as a diagnostic tool for improving deep research agent capabilities.

Abstract

We introduce DeepSearchQA, a 900-prompt benchmark for evaluating agents on difficult multi-step information-seeking tasks across 17 different fields. Unlike traditional benchmarks that target single answer retrieval or broad-spectrum factuality, DeepSearchQA features a dataset of challenging, handcrafted tasks designed to evaluate an agent's ability to execute complex search plans to generate exhaustive answer lists. This shift in design explicitly tests three critical, yet under-evaluated capabilities: 1) systematic collation of fragmented information from disparate sources, 2) de-duplication and entity resolution to ensure precision, and 3) the ability to reason about stopping criteria within an open-ended search space. Each task is structured as a causal chain, where discovering information for one step is dependent on the successful completion of the previous one, stressing…
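
To make the recall/precision tension in exhaustive answer lists concrete, here is a minimal Python sketch of set-based scoring. It is an illustration under stated assumptions, not the paper's evaluation protocol: the function names `normalize` and `score_answer_list`, the whitespace-and-case normalization standing in for entity resolution, and the exact-match comparison are all hypothetical choices.

```python
from typing import Dict, Iterable


def normalize(answer: str) -> str:
    """Crude stand-in for entity resolution (assumption): casefold and collapse whitespace."""
    return " ".join(answer.casefold().split())


def score_answer_list(predicted: Iterable[str], gold: Iterable[str]) -> Dict[str, float]:
    """Score a predicted answer list against a gold list using set overlap.

    Duplicates collapse after normalization, so repeating an answer earns no
    extra credit, while spurious answers lower precision and missing answers
    lower recall.
    """
    pred = {normalize(a) for a in predicted}
    ref = {normalize(a) for a in gold}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    gold = ["Paris", "Berlin"]
    # Stopping too early: everything returned is right, but the list is incomplete.
    print(score_answer_list(["Paris"], gold))
    # Over-generating: a duplicate and a wrong entry dilute precision.
    print(score_answer_list(["Paris", " paris ", "Rome"], gold))
```

Under this sketch, an agent that stops too early keeps precision high but loses recall, while one that over-generates does the opposite, which mirrors the two failure modes listed in the findings.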

Taxonomy

Topics: Topic Modeling · Expert finding and Q&A systems · Multimodal Machine Learning Applications