From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents

Shuoling Liu; Zhiquan Tan; Kun Yi; Hui Wu; Yihan Li; Jiangpeng Yan; Liyuan Chen; Kai Chen; Qiang Yang

arXiv:2603.25342 · cs.LG · April 30, 2026


TL;DR

This paper introduces a category-theoretic framework for evaluating deep research agents, presents a new benchmark for structural research skills, and demonstrates how theory-guided interventions can improve system reliability.

Contribution

The paper contributes a category-theoretic approach to evaluating and improving deep research agents, a 296-question bilingual benchmark targeting four structural research skills, and theory-guided interventions.

Findings

01  Best system achieves only 19.9% accuracy on structural tasks

02  Agents struggle with long-horizon synthesis and verification

03  Interventions like tracked search improve system performance

Abstract

Deep Research Agents (DRAs) aim to answer complex questions by searching the web, checking evidence, and synthesizing conclusions across heterogeneous sources. We introduce a category-theoretic framework for evaluating and improving such agents. The framework treats deep research as a structured mapping from user intent to evidence-grounded conclusions, making retrieval traces, cross-source alignment, and final synthesis explicit. Guided by this view, we derive a mechanism-aware benchmark of 296 bilingual questions. The benchmark targets four structural skills central to real research: following multi-hop evidence chains, verifying claims across sources, re-ordering fragmented information, and rejecting unsupported assumptions. We evaluate 16 frontier systems with human verification and find that these structural tasks remain highly challenging: the best system reaches only 19.9%…
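
To make the benchmark's structure concrete, the following is a minimal sketch of how graded answers could be tallied per structural skill. It is not the authors' evaluation code: the skill identifiers paraphrase the four skills named in the abstract, and `GradedItem`, `accuracy_by_skill`, and `overall_accuracy` are hypothetical names introduced only for illustration. The `correct` flag is assumed to come from a human verification step like the one the abstract mentions.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical labels paraphrasing the four structural skills in the abstract.
SKILLS = (
    "multi_hop",                  # following multi-hop evidence chains
    "cross_source_verification",  # verifying claims across sources
    "reordering",                 # re-ordering fragmented information
    "assumption_rejection",       # rejecting unsupported assumptions
)


@dataclass(frozen=True)
class GradedItem:
    """One benchmark question after grading (assumed schema, not the paper's)."""
    skill: str
    correct: bool


def accuracy_by_skill(items):
    """Return {skill: fraction correct} for every skill that appears in items."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for item in items:
        if item.skill not in SKILLS:
            raise ValueError(f"unknown skill: {item.skill}")
        totals[item.skill] += 1
        hits[item.skill] += int(item.correct)
    return {skill: hits[skill] / totals[skill] for skill in totals}


def overall_accuracy(items):
    """Return the fraction of all items graded correct (0.0 for no items)."""
    items = list(items)
    if not items:
        return 0.0
    return sum(int(item.correct) for item in items) / len(items)
```

With toy data (made up here, not results from the paper), two `multi_hop` items of which one is correct yield a per-skill accuracy of 0.5 for that skill. Reporting per-skill scores alongside the overall figure shows which structural skill drives a low aggregate number.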
