Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification
Yuxuan Wan, Tianqing Fang, Zaitang Li, Yintong Huo, Wenxuan Wang, Haitao Mi, Dong Yu, Michael R. Lyu

TL;DR
This paper introduces a self-evolving approach for Deep Research Agents that iteratively verify and refine their outputs at inference time using rubric-guided feedback, improving accuracy without retraining.
Contribution
The paper proposes DeepVerifier, a plug-and-play, rubric-based verification module that improves agent performance through test-time feedback, and introduces the DeepVerifier-4K dataset to support verification capabilities in open-source models.
Findings
DeepVerifier outperforms baseline judges by 12%-48% in meta-evaluation F1 score.
Test-time verification yields 8%-11% accuracy improvements on challenging datasets.
The dataset DeepVerifier-4K supports open-source development of verification capabilities.
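For readers unfamiliar with meta-evaluation, the F1 score in the first finding can be read as how well a judge's verdicts agree with ground-truth labels. The sketch below is illustrative only: the function name, the toy labels, and the choice of which class counts as "positive" are assumptions, not details taken from the paper.

```python
from typing import Sequence


def f1_score(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """F1 of binary verdicts against ground truth (True = positive class)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: the judge's verdicts vs. human labels for five answers.
judge_verdicts = [True, True, False, False, True]
gold_labels = [True, False, False, True, True]

score = f1_score(judge_verdicts, gold_labels)
print(f"meta-evaluation F1: {score:.3f}")
```

Here the judge has two true positives, one false positive, and one false negative, so precision and recall are both 2/3.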
Abstract
Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing efforts focus on enhancing policy capabilities via post-training, we propose an alternative paradigm: self-evolving the agent's ability by iteratively verifying the policy model's outputs, guided by meticulously crafted rubrics. This approach gives rise to the inference-time scaling of verification, wherein an agent self-improves by evaluating its generated answers to produce iterative feedback and refinements. We derive the rubrics based on an automatically constructed DRA Failure Taxonomy, which systematically classifies agent failures into five major categories and thirteen sub-categories. We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification and outperforms vanilla agent-as-judge and…
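The verify-then-refine loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Verdict`, `rubric_verify`, `verify_and_refine`, the toy generator, and the two rubric items are all hypothetical names and stand-ins, and a real system would call a language-model agent where the toy functions appear.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    passed: bool
    feedback: str


# Assumed rubric shape: (description, check); check returns True when satisfied.
Rubric = Tuple[str, Callable[[str, str], bool]]


def rubric_verify(question: str, answer: str, rubrics: List[Rubric]) -> Verdict:
    """Check the answer against every rubric item and collect the failures."""
    failed = [desc for desc, check in rubrics if not check(question, answer)]
    return Verdict(passed=not failed, feedback="; ".join(failed))


def verify_and_refine(
    question: str,
    generate: Callable[[str, str], str],
    rubrics: List[Rubric],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Generate an answer, verify it, and regenerate with feedback until it passes."""
    feedback = ""
    answer = generate(question, feedback)
    attempts = [answer]
    for _ in range(max_rounds):
        verdict = rubric_verify(question, answer, rubrics)
        if verdict.passed:
            break
        feedback = verdict.feedback
        answer = generate(question, feedback)
        attempts.append(answer)
    return answer, attempts


# Toy stand-in for the policy model: it adds a citation only when asked to.
def toy_generate(question: str, feedback: str) -> str:
    return "Paris [source: example.org]" if "cite" in feedback else "Paris"


toy_rubrics: List[Rubric] = [
    ("answer must not be empty", lambda q, a: bool(a.strip())),
    ("answer must cite a source", lambda q, a: "[source:" in a),
]

final, attempts = verify_and_refine("Capital of France?", toy_generate, toy_rubrics)
print(final, len(attempts))
```

In this toy run the first attempt fails the citation rubric, the feedback is passed back to the generator, and the second attempt passes. Spending more verification rounds at inference time (a larger `max_rounds`) is the knob that the "inference-time scaling of verification" framing refers to.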
