Stable and Efficient Single-Rollout RL for Multimodal Reasoning
Rui Liu, Dian Yu, Lei Ke, Haolin Liu, Yujun Zhou, Zhenwen Liang, Haitao Mi, Pratap Tokekar, Dong Yu

TL;DR
This paper introduces MSSR, a novel single-rollout reinforcement learning framework for multimodal reasoning that enhances training stability and efficiency, outperforming existing group-based methods across multiple benchmarks.
Contribution
MSSR is a group-free RLVR method with an entropy-based advantage-shaping mechanism that stabilizes training and improves performance in multimodal reasoning tasks.
Findings
MSSR reaches accuracy similar to group-based methods in half the training steps.
MSSR surpasses group-based baselines when trained for the same number of steps.
MSSR shows consistent generalization across five reasoning benchmarks.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this training efficiency-stability trade-off, we introduce MSSR (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such…
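The contrast between group-based and single-rollout advantages can be made concrete with a small sketch. The first function is the usual GRPO-style normalization, which needs several rollouts per prompt. The second is only an illustration of what an entropy-based shaping of a single-rollout advantage might look like: the baseline, the scaling rule, and the names `shaped_single_rollout_advantage`, `max_entropy`, and `floor` are assumptions made here, since this summary does not state the paper's actual formula.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for ONE prompt with several rollouts.

    Each reward is normalized by the mean and (population) standard
    deviation of its own rollout group, so at least two rollouts are
    needed for the result to be informative.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def shaped_single_rollout_advantage(reward, baseline, mean_entropy,
                                    max_entropy, floor=0.1):
    """Hypothetical entropy-based shaping for a single rollout.

    ASSUMPTION: this rule is illustrative and is not taken from the paper.
    The raw advantage (reward minus some external baseline) is scaled by
    the rollout's mean token entropy relative to `max_entropy`, clipped
    to the range [floor, 1], so the update magnitude is bounded.
    """
    raw = reward - baseline
    scale = max(floor, min(1.0, mean_entropy / max_entropy))
    return raw * scale


# Group-based: four rollouts for one prompt, binary verifiable rewards.
group_adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Single-rollout: one reward, a baseline supplied from elsewhere
# (for example a running mean), and the rollout's mean token entropy.
entropy = token_entropy([0.5, 0.5])
single_adv = shaped_single_rollout_advantage(
    reward=1.0, baseline=0.5, mean_entropy=entropy, max_entropy=math.log(4)
)
```

The single-rollout variant samples once per prompt, which is where the efficiency gain comes from; the cost is that the advantage no longer has a group to normalize against, so some other mechanism has to keep its magnitude in check.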
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
