Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

Jiahao Wang; Weiye Xu; Aijun Yang; Wengang Zhou; Lewei Lu; Houqiang Li; Xiaohua Wang; Jinguo Zhu

arXiv:2511.10648 · cs.CV · November 14, 2025

TL;DR

This paper introduces Self-Consistency Sampling (SCS), which improves outcome-reward reinforcement learning for multimodal large language models by down-weighting unfaithful reasoning trajectories, yielding accuracy gains of up to 7.7 percentage points across six benchmarks.

Contribution

The paper proposes SCS, a new sampling technique that enhances outcome-reward RL training of MLLMs by identifying and down-weighting unreliable reasoning trajectories.

Findings

01. SCS improves accuracy by up to 7.7 percentage points on six benchmarks.
02. SCS yields notable gains across different model sizes and training methods.
03. SCS introduces minimal additional computation.

Abstract

Outcome-reward reinforcement learning (RL) is a common and increasingly important way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting, a dominant format for multimodal reasoning benchmarks, the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. Based on Qwen2.5-VL-7B-Instruct, plugging SCS into…
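
The down-weighting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `consistency_score` and `weighted_reward` are hypothetical, the score here is a plain agreement frequency rather than the differentiable score the abstract mentions, and the resampled answers are assumed to have been produced already (by perturbing the image and by truncating and resampling the initial trajectory).

```python
from collections import Counter
from typing import Sequence


def consistency_score(resampled_answers: Sequence[str]) -> float:
    """Share of resampled trajectories that agree on the most common answer.

    Assumption: agreement is measured as a simple majority fraction.
    """
    if not resampled_answers:
        raise ValueError("need at least one resampled answer")
    counts = Counter(resampled_answers)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(resampled_answers)


def weighted_reward(initial_answer: str, gold_answer: str,
                    resampled_answers: Sequence[str]) -> float:
    """Outcome reward (1 if correct, else 0) scaled by the consistency score.

    Assumption: the two signals are combined by multiplication.
    """
    outcome = 1.0 if initial_answer == gold_answer else 0.0
    return outcome * consistency_score(resampled_answers)


# A trajectory whose resamples all land on the same option keeps full reward.
faithful = weighted_reward("B", "B", ["B", "B", "B", "B"])

# A correct answer whose resamples scatter across options is down-weighted.
lucky_guess = weighted_reward("B", "B", ["B", "A", "C", "B"])

print(faithful, lucky_guess)  # 1.0 0.5
```

Under these assumptions, a correct answer reached by a stable chain of thought is rewarded in full, while a correct answer that does not survive resampling earns only part of the reward.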

Code & Models

Datasets

GenuineWWD/SCS_data (dataset, 78 downloads)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques