AgentV-RL: Scaling Reward Modeling with Agentic Verifier

Jiazheng Zhang; Ziche Fu; Zhiheng Xi; Wenqing Jing; Mingxu Chai; Wei He; Guoqiang Zhang; Chenghao Fan; Chenxin An; Wenxiang Chen; Zhicheng Liu; Haojie Pan; Dingwei Zhu; Tao Gui; Qi Zhang; Xuanjing Huang

arXiv:2604.16004·cs.CL·April 20, 2026

TL;DR

AgentV-RL introduces an agentic verifier framework that improves reward modeling for large language models by combining tool-augmented, bidirectional reasoning with reinforcement learning, surpassing state-of-the-art outcome reward models (ORMs) by 25.2%.

Contribution

This work presents a novel agentic verifier with tool-augmented, bidirectional reasoning and reinforcement learning, improving reward modeling in complex, knowledge-intensive tasks.

Findings

01. AgentV-RL surpasses state-of-the-art outcome reward models (ORMs) by 25.2%.

02. The framework improves consistency in parallel and sequential test-time scaling.

03. Bidirectional reasoning enhances the reliability and interpretability of verifiers.

Abstract

Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while a lack of external grounding makes verifiers unreliable on computation- or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration…
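The forward/backward idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: in the paper the two checks are multi-turn, tool-using LLM agents trained with reinforcement learning, whereas here they are plain Python functions. All names (`Solution`, `calculator_check`, `answer_check`, `bidirectional_score`, `best_of_n`), the averaging rule, and the example problem are assumptions made only for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical container: a candidate solution is a list of steps plus a final answer.
@dataclass
class Solution:
    steps: List[str]
    answer: str

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def calculator_check(step: str) -> bool:
    """Toy stand-in for a tool call: recompute a step of the form 'a op b = c'."""
    lhs, rhs = step.split("=")
    a, op, b = lhs.split()
    return OPS[op](int(a), int(b)) == int(rhs)

def answer_check(answer: str) -> bool:
    """Toy backward check for the assumed problem 'find x with x / 4 - 3 = 2':
    substitute the final answer back into the original constraint."""
    return int(answer) / 4 - 3 == 2

def forward_verify(sol: Solution, check_step: Callable[[str], bool]) -> bool:
    """Forward direction: trace from premises to conclusion, step by step."""
    return all(check_step(s) for s in sol.steps)

def backward_verify(sol: Solution, check_answer: Callable[[str], bool]) -> bool:
    """Backward direction: re-check the conclusion against the premises."""
    return check_answer(sol.answer)

def bidirectional_score(sol, check_step, check_answer) -> float:
    """Assumed aggregation: average the two binary verdicts."""
    f = forward_verify(sol, check_step)
    b = backward_verify(sol, check_answer)
    return (f + b) / 2.0

def best_of_n(candidates, check_step, check_answer) -> Solution:
    """Parallel test-time scaling sketch: keep the highest-scoring candidate."""
    return max(
        candidates,
        key=lambda s: bidirectional_score(s, check_step, check_answer),
    )

# Three candidates for the assumed problem.
good = Solution(steps=["2 + 3 = 5", "5 * 4 = 20"], answer="20")
# Every step is valid, but the stated answer does not follow from them.
plausible = Solution(steps=["2 + 3 = 5", "5 * 4 = 20"], answer="25")
# The first step is already wrong, and the error propagates.
bad = Solution(steps=["2 + 3 = 6", "6 * 4 = 24"], answer="24")

chosen = best_of_n([bad, plausible, good], calculator_check, answer_check)
```

In this toy setup, the `plausible` candidate passes the forward check but fails the backward one, which is the kind of false positive a forward-only verifier would accept.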
