Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification

Yuxuan Wan; Tianqing Fang; Zaitang Li; Yintong Huo; Wenxuan Wang; Haitao Mi; Dong Yu; Michael R. Lyu

arXiv:2601.15808 · cs.AI · April 30, 2026

TL;DR

This paper introduces a self-evolving approach for Deep Research Agents that iteratively verify and refine their outputs at inference time using rubric-guided feedback, improving accuracy without retraining.

Contribution

The paper proposes DeepVerifier, a plug-and-play verification module that improves agent performance through test-time feedback, and releases DeepVerifier-4K, a dataset for building verification capabilities into open-source models.

Findings

01. DeepVerifier outperforms baseline judges by 12%-48% in meta-evaluation F1 score.

02. Test-time verification yields 8%-11% accuracy improvements on challenging datasets.

03. The dataset DeepVerifier-4K supports open-source development of verification capabilities.

Abstract

Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing efforts focus on enhancing policy capabilities via post-training, we propose an alternative paradigm: self-evolving the agent's ability by iteratively verifying the policy model's outputs, guided by meticulously crafted rubrics. This approach gives rise to the inference-time scaling of verification, wherein an agent self-improves by evaluating its generated answers to produce iterative feedback and refinements. We derive the rubrics based on an automatically constructed DRA Failure Taxonomy, which systematically classifies agent failures into five major categories and thirteen sub-categories. We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification and outperforms vanilla agent-as-judge and…
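
The verify-and-refine loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `RubricItem`, `verify`, `refine_with_verification`, `toy_policy`, and the two rubric checks are hypothetical names invented for illustration, and the toy policy stands in for a real agent. The sketch assumes that a verifier checks an answer against rubric items and returns feedback for each failed item, and that the policy is re-invoked with that feedback until the answer passes or a round budget runs out.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RubricItem:
    """One hypothetical rubric criterion (not taken from the paper)."""
    name: str
    check: Callable[[str, str], bool]  # (question, answer) -> True if satisfied
    feedback: str                      # message sent back to the policy on failure


def verify(question: str, answer: str, rubric: List[RubricItem]) -> List[str]:
    """Return feedback for every rubric item the answer fails."""
    return [item.feedback for item in rubric if not item.check(question, answer)]


def refine_with_verification(
    question: str,
    policy: Callable[[str, List[str]], str],
    rubric: List[RubricItem],
    max_rounds: int = 3,
) -> Tuple[str, int, bool]:
    """Alternate verification and refinement for up to max_rounds rounds.

    Returns (final answer, verification rounds used, accepted by verifier).
    """
    answer = policy(question, [])
    for round_idx in range(1, max_rounds + 1):
        feedback = verify(question, answer, rubric)
        if not feedback:
            return answer, round_idx, True
        if round_idx < max_rounds:
            # Ask the policy to revise its answer using the verifier feedback.
            answer = policy(question, feedback)
    return answer, max_rounds, False


# Toy stand-ins, assumed purely for demonstration.
RUBRIC = [
    RubricItem("non_empty", lambda q, a: bool(a.strip()), "Answer must not be empty."),
    RubricItem("cites_source", lambda q, a: "source:" in a, "Add a source for the claim."),
]


def toy_policy(question: str, feedback: List[str]) -> str:
    """Pretend agent: adds a source only after being told to."""
    answer = "Paris"
    if any("source" in note.lower() for note in feedback):
        answer += " (source: hypothetical reference)"
    return answer


if __name__ == "__main__":
    print(refine_with_verification("What is the capital of France?", toy_policy, RUBRIC))
```

Raising `max_rounds` is the knob that corresponds to spending more inference-time compute on verification. In a real system, both the policy and the rubric checks would be model calls rather than string tests.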

Code & Models

Repositories

yxwan123/DeepVerifier
github

Datasets

iforgott/DeepVerifier-4K
dataset · 17 downloads
