Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness
Mengzhao Jia, Zhihan Zhang, Meng Jiang

TL;DR
This paper introduces a new reward mechanism called Groupwise Ranking Reward to improve the reliability of multimodal reasoning in reinforcement learning by better aligning reasoning validity with answer correctness.
Contribution
It proposes Groupwise Ranking Reward, a novel trajectory supervision method that enhances the alignment of reasoning validity with answer correctness in multimodal RL.
Findings
Groupwise Ranking Reward outperforms reward models and generative rewards.
Trajectory supervision reduces reasoning-answer inconsistency.
Reliability-conditioned accuracy improved from 47.4% to 54.7%.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) improves multimodal reasoning by rewarding verifiable final answers. Yet answer-correct trajectories may still rely on incomplete derivations, weak evidence, or statements that contradict their conclusions. This gap between answer correctness and reasoning validity, which we call reasoning-answer inconsistency, motivates trajectory supervision in multimodal RL. We compare two main approaches: reward models (RMs) and Generative Rewards (GRs). RMs are efficient and help early in training, but their gains weaken as the policy distribution shifts; GRs improve performance, but may give unstable rewards and are computationally expensive. We therefore propose Groupwise Ranking Reward, which ranks verifier-passed trajectories for the same prompt in one pass and redistributes reward accordingly. Groupwise comparison better separates stronger and…
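The abstract describes the mechanism only at a high level: trajectories that pass the verifier are ranked together for one prompt, and reward is redistributed by rank. A minimal sketch of that redistribution step follows. The function name `redistribute_rewards`, the `spread` parameter, and the linear, mean-preserving schedule are all assumptions made for illustration, not the authors' formulation; the ranking itself is assumed to come from some external judge and is passed in as a best-to-worst list of indices.

```python
from typing import List


def redistribute_rewards(
    num_trajectories: int,
    ranked_passed: List[int],
    spread: float = 0.5,
) -> List[float]:
    """Assign rank-dependent rewards to verifier-passed trajectories.

    num_trajectories: size of the sampled group for one prompt.
    ranked_passed: indices of trajectories that passed the answer
        verifier, ordered from best to worst reasoning quality
        (hypothetical output of a single groupwise ranking call).
    spread: assumed hyperparameter; the best trajectory receives
        1 + spread and the worst receives 1 - spread.
    """
    # Trajectories that failed the verifier keep a reward of 0.
    rewards = [0.0] * num_trajectories
    k = len(ranked_passed)
    for position, idx in enumerate(ranked_passed):
        if k == 1:
            # Nothing to compare against: keep the plain verifier reward.
            offset = 0.0
        else:
            # Linear schedule from +spread (best) to -spread (worst);
            # offsets sum to zero, so the mean passed reward stays 1.
            offset = spread * (1.0 - 2.0 * position / (k - 1))
        rewards[idx] = 1.0 + offset
    return rewards


# Group of four samples: index 1 failed the verifier; among the rest,
# the assumed ranking is 2 (best), then 0, then 3 (worst).
example = redistribute_rewards(4, [2, 0, 3])
```

Under this assumed schedule, the total reward given to answer-correct trajectories is unchanged, so the signal differs from plain RLVR only in how it separates stronger from weaker reasoning within the group.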
