Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
Tanmay Asthana, Aman Saksena, Divyansh Sahu

TL;DR
This paper introduces a new benchmark for evaluating deep research agents on complex, multi-document tasks resembling management consulting, using structured rubrics and verifiers to assess their decision-making and reasoning abilities.
Contribution
It presents a comprehensive benchmark with detailed scoring methods, evaluates three state-of-the-art agents, and analyzes their strengths and weaknesses in expert-level research tasks.
Findings
Most agents have low acceptance rates under strict evaluation criteria.
VRS correlates with scores on existing benchmarks, which supports the rubric's validity.
Different agents exhibit distinct failure modes and reasoning patterns.
Abstract
Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. They are being deployed in enterprise workflows faster than they are being evaluated. Existing benchmarks measure factual recall, single-hop QA, or generic agentic skill, missing the multi-document, decision-grade work DRAs are deployed to produce. We introduce a benchmark targeting the structured analytical deliverables that fill a management consultant's typical week. We grade three frontier agents, namely Claude Opus 4.6 with web search, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research, on 42 SME-authored prompts. Each of the 126 responses is scored on two layers: deterministic ground-truth verifiers (mean 13.8 per task) and a five-criterion 0-3 SME rubric, composed into a Verifier-Rubric Score (VRS) on 0-100. Most prompts embed…
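As a rough illustration of the two-layer scoring described in the abstract, the sketch below combines a verifier pass rate with a five-criterion 0-3 rubric into a single 0-100 number. The abstract does not state how the two layers are composed, so the linear mix and the `verifier_weight` of 0.5 are assumptions made here for illustration, not the paper's formula. The names `ScoredResponse` and `vrs` are hypothetical as well.

```python
from dataclasses import dataclass
from typing import Sequence

RUBRIC_MAX = 3      # each rubric criterion is scored 0-3 (from the abstract)
NUM_CRITERIA = 5    # five-criterion rubric (from the abstract)


@dataclass
class ScoredResponse:
    """One agent response with both scoring layers (hypothetical container)."""
    verifier_results: Sequence[bool]  # pass/fail per deterministic verifier
    rubric_scores: Sequence[int]      # one 0-3 score per rubric criterion


def vrs(response: ScoredResponse, verifier_weight: float = 0.5) -> float:
    """Return a 0-100 score from the two layers.

    ASSUMPTION: a weighted average of the verifier pass rate and the
    normalized rubric total. The paper's actual composition may differ.
    """
    if not response.verifier_results:
        raise ValueError("at least one verifier result is required")
    if len(response.rubric_scores) != NUM_CRITERIA:
        raise ValueError(f"expected {NUM_CRITERIA} rubric scores")
    if any(not 0 <= s <= RUBRIC_MAX for s in response.rubric_scores):
        raise ValueError(f"rubric scores must lie in 0..{RUBRIC_MAX}")
    if not 0.0 <= verifier_weight <= 1.0:
        raise ValueError("verifier_weight must lie in [0, 1]")

    # Fraction of deterministic verifiers that passed.
    verifier_fraction = sum(response.verifier_results) / len(response.verifier_results)
    # Rubric total as a fraction of the maximum (5 criteria x 3 points = 15).
    rubric_fraction = sum(response.rubric_scores) / (NUM_CRITERIA * RUBRIC_MAX)

    mixed = verifier_weight * verifier_fraction + (1 - verifier_weight) * rubric_fraction
    return 100.0 * mixed


# Made-up example: 10 of 14 verifiers pass, rubric scores sum to 10 of 15.
example = ScoredResponse(
    verifier_results=[True] * 10 + [False] * 4,
    rubric_scores=[2, 3, 1, 2, 2],
)
print(round(vrs(example), 2))
```

The example values are invented and are not results from the paper. Under this assumed composition, a response that passes every verifier and earns 3 on every criterion scores 100, and one that fails every verifier and earns 0 throughout scores 0.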
