TL;DR
This paper investigates how reinforcement learning influences vision-language models' reasoning, identifies limitations like diversity collapse, and proposes MUPO to promote divergent thinking for improved performance.
Contribution
It reveals the behavioral differences between RL and base models, analyzes training dynamics, and introduces MUPO to enhance reasoning diversity in VLMs.
Findings
GRPO is prone to diversity collapse: models converge prematurely on a small subset of reasoning strategies.
MUPO effectively incentivizes divergent thinking.
MUPO improves performance on established benchmarks.
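For context on the first finding, GRPO (Group Relative Policy Optimization) scores each sampled response relative to the other responses drawn for the same prompt. The sketch below illustrates that group-relative advantage under the commonly used mean and standard-deviation normalization. The function name `group_relative_advantages` and the `eps` parameter are illustrative assumptions, not taken from the paper, and the sketch does not attempt to show MUPO.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Hypothetical helper for illustration only: each reward is centered
    on the group mean and scaled by the group's population standard
    deviation. `eps` (an assumed constant) avoids division by zero
    when all rewards in the group are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Mixed group: correct responses get positive advantage, incorrect negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Uniform group: every response earned the same reward.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Under this formulation, a group whose responses all receive the same reward yields zero advantage for every response, so only groups with mixed outcomes provide a learning signal.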
Abstract
Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). Despite this promise, the mechanisms that drive the effectiveness of RL models, as well as their limitations, remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL and base models: the former engage in deeper yet narrower reasoning, while base models, despite being less refined along any individual path, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, leading to local optima and poor…
