AgentV-RL: Scaling Reward Modeling with Agentic Verifier
Jiazheng Zhang, Ziche Fu, Zhiheng Xi, Wenqing Jing, Mingxu Chai, Wei He, Guoqiang Zhang, Chenghao Fan, Chenxin An, Wenxiang Chen, Zhicheng Liu, Haojie Pan, Dingwei Zhu, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR
AgentV-RL is an agentic verifier framework that improves reward modeling for large language models through tool-augmented, bidirectional reasoning and reinforcement learning, surpassing state-of-the-art outcome reward models (ORMs) by 25.2%.
Contribution
This work presents an agentic verifier that combines tool-augmented, bidirectional reasoning with reinforcement learning to improve reward modeling on complex, knowledge-intensive tasks.
Findings
AgentV-RL surpasses state-of-the-art outcome reward models (ORMs) by 25.2%.
The framework improves consistency in parallel and sequential test-time scaling.
Bidirectional reasoning enhances the reliability and interpretability of verifiers.
Abstract
Verifiers have been shown to enhance LLM reasoning via test-time scaling (TTS), yet they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while a lack of external grounding makes verifiers unreliable on computation- or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration…
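The forward/backward idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the calculator tool, the Candidate structure, the equal-weight score, and the best_of_n selector are all hypothetical names and choices made for illustration, and the "agents" here are plain functions rather than LLMs. The forward check recomputes each step from the premises with a tool, the backward check substitutes the final answer back into the problem, and parallel test-time scaling is approximated by picking the highest-scoring candidate.

```python
import ast
import operator
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical "tool": a safe evaluator for + - * / arithmetic.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expr: str) -> float:
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError(f"unsupported expression: {expr!r}")
    return ev(ast.parse(expr, mode="eval").body)

@dataclass
class Candidate:
    steps: List[Tuple[str, float]]  # (expression, claimed value) pairs
    answer: float

def forward_check(c: Candidate) -> bool:
    """Premises -> conclusion: recompute every step with the tool."""
    return all(abs(calculator(e) - v) < 1e-9 for e, v in c.steps)

def backward_check(c: Candidate, constraint: Callable[[float], bool]) -> bool:
    """Conclusion -> premises: does the answer satisfy the original problem?"""
    return constraint(c.answer)

def verify(c: Candidate, constraint: Callable[[float], bool]) -> float:
    # Assumed scoring rule: equal weight for each direction.
    return (forward_check(c) + backward_check(c, constraint)) / 2

def best_of_n(cands: List[Candidate], constraint) -> Candidate:
    """Parallel test-time scaling: keep the highest-scoring candidate."""
    return max(cands, key=lambda c: verify(c, constraint))

# Toy problem: solve 3x + 4 = 19.
constraint = lambda x: abs(3 * x + 4 - 19) < 1e-9
good = Candidate([("19 - 4", 15), ("15 / 3", 5)], answer=5)
bad_arith = Candidate([("19 - 4", 15), ("15 / 3", 6)], answer=6)
# Every step is arithmetically valid, but the first step uses the wrong sign.
plausible = Candidate([("19 + 4", 23), ("23 / 3", 23 / 3)], answer=23 / 3)

scores = [verify(c, constraint) for c in (good, bad_arith, plausible)]
chosen = best_of_n([plausible, bad_arith, good], constraint)
```

In this toy setting the third candidate passes the forward check because each step is internally consistent, and only the backward check exposes it. That mirrors the false-positive failure mode the abstract attributes to verifiers without bidirectional checking.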
