From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents
Shuoling Liu, Zhiquan Tan, Kun Yi, Hui Wu, Yihan Li, Jiangpeng Yan, Liyuan Chen, Kai Chen, Qiang Yang

TL;DR
This paper introduces a category-theoretic framework for evaluating deep research agents, presents a new benchmark for structural research skills, and demonstrates how theory-guided interventions can improve system reliability.
Contribution
The paper contributes a category-theoretic framework for evaluating and improving deep research agents, a 296-question bilingual benchmark targeting four structural research skills, and theory-guided interventions for improving reliability.
Findings
The best of the 16 evaluated systems reaches only 19.9% accuracy on the structural tasks
Agents struggle with long-horizon synthesis and verification
Interventions such as tracked search improve system performance
Abstract
Deep Research Agents (DRAs) aim to answer complex questions by searching the web, checking evidence, and synthesizing conclusions across heterogeneous sources. We introduce a category-theoretic framework for evaluating and improving such agents. The framework treats deep research as a structured mapping from user intent to evidence-grounded conclusions, making retrieval traces, cross-source alignment, and final synthesis explicit. Guided by this view, we derive a mechanism-aware benchmark of 296 bilingual questions. The benchmark targets four structural skills central to real research: following multi-hop evidence chains, verifying claims across sources, re-ordering fragmented information, and rejecting unsupported assumptions. We evaluate 16 frontier systems with human verification and find that these structural tasks remain highly challenging: the best system reaches only 19.9%…
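As a rough illustration of how results on a benchmark organized around four structural skills might be tallied, the sketch below aggregates human-verified verdicts into per-skill and overall accuracy. It is not the authors' evaluation code: the skill labels, the `Judgment` record, and the `accuracy_by_skill` helper are hypothetical names chosen here, and the toy records are invented purely to exercise the function.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical labels paraphrasing the four structural skills named in the abstract.
SKILLS = (
    "multi_hop",                  # following multi-hop evidence chains
    "cross_source_verification",  # verifying claims across sources
    "reordering",                 # re-ordering fragmented information
    "assumption_rejection",       # rejecting unsupported assumptions
)


@dataclass(frozen=True)
class Judgment:
    """One human-verified verdict on one agent answer (assumed schema)."""
    question_id: str
    skill: str
    correct: bool


def accuracy_by_skill(judgments):
    """Return accuracy per skill plus an 'overall' entry.

    Skills with no judgments are omitted rather than reported as 0.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for j in judgments:
        if j.skill not in SKILLS:
            raise ValueError(f"unknown skill: {j.skill}")
        totals[j.skill] += 1
        hits[j.skill] += int(j.correct)

    report = {s: hits[s] / totals[s] for s in SKILLS if totals[s]}
    n = sum(totals.values())
    report["overall"] = sum(hits.values()) / n if n else 0.0
    return report


# Toy records, invented for illustration only (not data from the paper).
toy = [
    Judgment("q1", "multi_hop", True),
    Judgment("q2", "multi_hop", False),
    Judgment("q3", "reordering", False),
    Judgment("q4", "assumption_rejection", True),
]
report = accuracy_by_skill(toy)
```

Reporting per-skill numbers alongside the overall figure makes it easier to see which structural skill drives a low aggregate score, which is the kind of mechanism-aware reading the abstract describes.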
