Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling
Jiahao Wang, Weiye Xu, Aijun Yang, Wengang Zhou, Lewei Lu, Houqiang Li, Xiaohua Wang, Jinguo Zhu

TL;DR
This paper introduces Self-Consistency Sampling (SCS), a method that improves outcome-reward reinforcement learning for multimodal large language models by down-weighting unfaithful reasoning trajectories, yielding accuracy gains of up to 7.7 percentage points on six benchmarks.
Contribution
The paper proposes SCS, a new sampling technique that enhances outcome-reward RL training of MLLMs by identifying and down-weighting unreliable reasoning trajectories.
Findings
SCS improves accuracy by up to 7.7 percentage points on six benchmarks.
SCS yields notable gains across different model sizes and training methods.
SCS introduces minimal additional computation.
Abstract
Outcome-reward reinforcement learning (RL) is a common and increasingly important way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting, a dominant format for multimodal reasoning benchmarks, the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. Based on Qwen2.5-VL-7B-Instruct, plugging SCS into…
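The truncate-and-resample idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the scoring formula, so the plain agreement fraction below (which is not differentiable, unlike the score the abstract describes) is an assumption, as are the names `consistency_score`, `weighted_reward`, and the `resample` callable. The callable stands in for the policy model and is assumed to apply any visual perturbation internally before continuing from the truncated prefix.

```python
import random
from typing import Callable, Optional, Sequence


def consistency_score(
    steps: Sequence[str],
    answer: str,
    resample: Callable[[Sequence[str]], str],
    num_resamples: int = 4,
    rng: Optional[random.Random] = None,
) -> float:
    """Fraction of truncate-and-resample continuations that reach `answer`.

    `steps` is the initial reasoning trajectory split into steps, and
    `resample` is a hypothetical stand-in for the policy: it receives a
    truncated prefix and returns the final option of a fresh continuation.
    """
    rng = rng or random.Random(0)
    if not steps or num_resamples <= 0:
        # Nothing to check; leaving the reward unchanged is an arbitrary choice.
        return 1.0
    agree = 0
    for _ in range(num_resamples):
        cut = rng.randrange(len(steps))  # keep steps[:cut], dropping at least one step
        if resample(steps[:cut]) == answer:
            agree += 1
    return agree / num_resamples


def weighted_reward(outcome_reward: float, score: float) -> float:
    """Down-weight the outcome reward of a trajectory by its consistency score."""
    return outcome_reward * score


if __name__ == "__main__":
    trajectory = ["read the chart", "compare the bars", "pick the tallest"]
    # Toy policy stub that always lands on option "B".
    stable = consistency_score(trajectory, "B", lambda prefix: "B")
    print(weighted_reward(1.0, stable))
```

Under this assumed scoring, a trajectory whose answer survives resampling keeps its full reward, while a lucky guess that the resampled continuations rarely reproduce contributes little to the policy update.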
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
