Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness

Mengzhao Jia; Zhihan Zhang; Meng Jiang

arXiv:2604.18892·cs.CL·April 22, 2026

TL;DR

This paper introduces a new reward mechanism called Groupwise Ranking Reward to improve the reliability of multimodal reasoning in reinforcement learning by better aligning reasoning validity with answer correctness.

Contribution

It proposes Groupwise Ranking Reward, a novel trajectory supervision method that enhances the alignment of reasoning validity with answer correctness in multimodal RL.

Findings

01. Groupwise Ranking Reward outperforms reward models and generative rewards.

02. Trajectory supervision reduces reasoning-answer inconsistency.

03. Reliability-conditioned accuracy improved from 47.4% to 54.7%.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) improves multimodal reasoning by rewarding verifiable final answers. Yet answer-correct trajectories may still rely on incomplete derivations, weak evidence, or statements that contradict their conclusions. This gap between answer correctness and reasoning validity, which we call reasoning-answer inconsistency, motivates trajectory supervision in multimodal RL. We compare two main approaches: reward models (RMs) and Generative Rewards (GRs). RMs are efficient and help early in training, but their gains weaken as the policy distribution shifts; GRs improve performance, but may give unstable rewards and are computationally expensive. We therefore propose Groupwise Ranking Reward, which ranks verifier-passed trajectories for the same prompt in one pass and redistributes reward accordingly. Groupwise comparison better separates stronger and…
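
The abstract says that verifier-passed trajectories for one prompt are ranked together and that reward is redistributed by rank, but it does not give the redistribution rule. The Python sketch below is an illustration under assumptions, not the authors' method: the function name `groupwise_ranking_reward`, the `spread` parameter, and the linear, mean-preserving rule are all hypothetical. It assumes some external ranker has already ordered the verifier-passed trajectories from best to worst.

```python
from typing import List, Sequence


def groupwise_ranking_reward(
    passed: Sequence[bool],
    ranking: Sequence[int],
    spread: float = 0.5,
) -> List[float]:
    """Hypothetical rank-based reward redistribution for one prompt group.

    passed[i]  -- whether trajectory i passed the answer verifier.
    ranking    -- indices of the verifier-passed trajectories, best first
                  (assumed to come from a single groupwise ranking pass).
    spread     -- assumed hyperparameter: the best trajectory receives
                  1 + spread and the worst receives 1 - spread.
    """
    passed_indices = [i for i, ok in enumerate(passed) if ok]
    if sorted(ranking) != passed_indices:
        raise ValueError("ranking must cover exactly the verifier-passed trajectories")

    # Trajectories that fail the verifier keep a reward of 0.
    rewards = [0.0] * len(passed)
    k = len(ranking)
    for position, idx in enumerate(ranking):
        if k == 1:
            # Nothing to compare against, so keep the plain verifier reward.
            rewards[idx] = 1.0
        else:
            # Linear interpolation from 1 + spread (best) to 1 - spread (worst).
            # The mean over passed trajectories stays at 1, so the total
            # reward is redistributed rather than inflated.
            rewards[idx] = 1.0 + spread * (1.0 - 2.0 * position / (k - 1))
    return rewards


# Four sampled trajectories; trajectory 1 failed the verifier.
# The assumed ranker judged trajectory 2 best, then 0, then 3.
example = groupwise_ranking_reward([True, False, True, True], ranking=[2, 0, 3])
print(example)  # [1.0, 0.0, 1.5, 0.5]
```

In this sketch the total reward handed to the group equals the number of verifier-passed trajectories, so ranking only shifts reward among answer-correct trajectories and never rewards an incorrect one.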
