Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents

Reuben Tan; Baolin Peng; Zhengyuan Yang; Hao Cheng; Oier Mees; Theodore Zhao; Andrea Tupini; Isar Meijier; Qianhui Wu; Yuncong Yang; Lars Liden; Yu Gu; Sheng Zhang; Xiaodong Liu; Lijuan Wang; Marc Pollefeys; Yong Jae Lee; Jianfeng Gao

arXiv:2512.03438·cs.AI·April 21, 2026

TL;DR

This paper introduces Argos, an adaptive verifier that enhances multimodal reinforcement learning by providing fine-grained, task-specific rewards, leading to state-of-the-art results and reduced reward hacking.

Contribution

The paper presents Argos, a novel reward agent that evaluates multiple aspects of reasoning, improving training and robustness of multimodal AI agents.

Findings

01. Achieves state-of-the-art results on multiple agentic tasks.

02. Reduces reward hacking in multimodal reinforcement learning.

03. Online verification prevents agents from collapsing to ungrounded solutions.

Abstract

Agentic reasoning models trained with multimodal reinforcement learning (MMRL) have become increasingly capable, yet they are almost universally optimized with sparse, outcome-based rewards computed from the final answers. Richer rewards computed from the reasoning tokens can improve learning significantly by providing more fine-grained guidance. However, computing rewards in MMRL that are more informative than outcome-based ones is challenging, because different samples may require different scoring functions and teacher models may themselves provide noisy reward signals. In this paper, we introduce Argos (Agentic Reward for Grounded & Objective Scoring), a principled reward agent for training multimodal reasoning models on agentic tasks. For each sample, Argos selects from a pool of teacher-model derived and rule-based scoring functions to simultaneously evaluate: (i) final response…
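The per-sample selection idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `Sample` fields, the `task_type` key used for selection, the two scoring functions, and the equal-weight average are all hypothetical. In particular, `reasoning_mentions_reference` is only a rule-based stand-in for what would be a teacher-model derived scorer, and the abstract is truncated here, so the aspects Argos actually evaluates beyond the final response are not reproduced.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    task_type: str   # hypothetical selection key, e.g. "qa"
    reasoning: str   # reasoning tokens produced by the model
    answer: str      # final response produced by the model
    reference: str   # ground-truth answer


Scorer = Callable[[Sample], float]


def exact_match(sample: Sample) -> float:
    """Rule-based outcome score: 1.0 if the final answer matches the reference."""
    return float(sample.answer.strip().lower() == sample.reference.strip().lower())


def reasoning_mentions_reference(sample: Sample) -> float:
    """Placeholder for a teacher-derived score on the reasoning tokens (assumption)."""
    return float(sample.reference.strip().lower() in sample.reasoning.lower())


# Hypothetical pool: which scoring functions apply to which kind of sample.
SCORER_POOL: Dict[str, List[Scorer]] = {
    "qa": [exact_match, reasoning_mentions_reference],
    "default": [exact_match],
}


def select_scorers(sample: Sample) -> List[Scorer]:
    """Pick the scoring functions for this sample, falling back to the default set."""
    return SCORER_POOL.get(sample.task_type, SCORER_POOL["default"])


def reward(sample: Sample) -> float:
    """Equal-weight mean of the selected scores (the weighting is an assumption)."""
    scorers = select_scorers(sample)
    return sum(score(sample) for score in scorers) / len(scorers)


if __name__ == "__main__":
    grounded = Sample("qa", "The sign reads STOP, so the answer is stop.", "Stop", "stop")
    lucky = Sample("qa", "I will guess.", "stop", "stop")
    print(reward(grounded), reward(lucky))
```

In this toy setup, a correct answer whose reasoning never mentions the reference earns a lower reward than one whose reasoning supports it, which is the kind of finer-grained signal an outcome-only reward cannot give.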
