Ranking-Aware Calibration for Reliable Multimodal Reinforcement Learning
Peng Cui, Boyao Yang, Jun Zhu

TL;DR
This paper introduces Ranking-Aware Calibration (RAC), a training framework for multimodal reinforcement learning that improves model calibration and accuracy by using ranking and corruption-based confidence supervision without external annotations.
Contribution
The paper proposes RAC, a novel training-time calibration method that leverages ranking signals and visual evidence degradation to enhance confidence calibration and reasoning accuracy.
Findings
RAC significantly improves task accuracy across multiple benchmarks.
The pairwise corruption loss reduces calibration error under degraded inputs.
Combining both losses achieves the best calibration and often improves accuracy.
Abstract
Reinforcement learning post-training has substantially improved the reasoning accuracy of vision-language models, yet the resulting policies remain poorly calibrated. Terminal correctness rewards provide no gradient that penalizes confident errors more than uncertain ones and no signal that ties confidence to the quality of visual evidence, a gap that becomes especially severe under corrupted or ambiguous inputs where models continue to report high confidence on incorrect answers. We introduce Ranking-Aware Calibration (RAC), a training-time framework that supervises confidence using two comparison signals that group-based RL already produces at no additional labeling cost. The ranking-aware group loss enforces that a better rollout receives higher confidence than a worse one within the same prompt. The clean-corrupted pairwise loss enforces that confidence attenuates as visual…
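The two comparison signals described in the abstract can be illustrated with a small sketch. The abstract does not give the loss formulas, so the hinge form, the `margin` parameter, and the function names `group_ranking_loss` and `clean_corrupted_loss` below are all assumptions chosen for illustration, not the authors' definitions.

```python
from itertools import combinations


def group_ranking_loss(rewards, confidences, margin=0.1):
    """Illustrative ranking loss over rollouts sampled for one prompt.

    ASSUMPTION: a pairwise hinge penalty; the paper's actual loss may differ.
    For every pair with different rewards, the higher-reward rollout should
    report confidence at least `margin` above the lower-reward one.
    """
    if len(rewards) != len(confidences):
        raise ValueError("rewards and confidences must have the same length")
    total, n_pairs = 0.0, 0
    for i, j in combinations(range(len(rewards)), 2):
        if rewards[i] == rewards[j]:
            continue  # tied rollouts carry no ranking signal
        better, worse = (i, j) if rewards[i] > rewards[j] else (j, i)
        total += max(0.0, margin - (confidences[better] - confidences[worse]))
        n_pairs += 1
    # Average over ranked pairs; a group with only ties contributes nothing.
    return total / n_pairs if n_pairs else 0.0


def clean_corrupted_loss(conf_clean, conf_corrupted, margin=0.1):
    """Illustrative pairwise loss for a clean input and its corrupted copy.

    ASSUMPTION: the same hinge form; confidence on the corrupted input
    should fall at least `margin` below confidence on the clean input.
    """
    return max(0.0, margin - (conf_clean - conf_corrupted))


if __name__ == "__main__":
    # Correct rollout (reward 1) is more confident than the wrong one: no penalty.
    print(group_ranking_loss([1, 0], [0.9, 0.2]))
    # Confidence does not drop under corruption: penalized.
    print(clean_corrupted_loss(0.8, 0.8))
```

In this sketch both terms need only rollout rewards and a corrupted copy of each input, which matches the abstract's point that the supervision comes at no additional labeling cost. How the two terms are weighted against each other and against the RL objective is not specified in the text above.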
