TL;DR
OpenVLThinkerV2 is a versatile multimodal reasoning model that employs a novel RL training objective, G$^2$RPO, to improve multi-domain visual task performance through balanced perception and reasoning.
Contribution
The paper introduces G$^2$RPO, a distributional matching RL objective, and task-level shaping mechanisms, enabling robust, general-purpose multimodal models for diverse visual tasks.
Findings
Outperforms strong open-source and proprietary models on 18 benchmarks.
G$^2$RPO improves training stability and inter-task gradient equity.
Task shaping mechanisms effectively balance perception and reasoning.
Abstract
Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G$^2$RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, $\mathcal{N}(0,1)$, G$^2$RPO theoretically ensures inter-task gradient equity, mitigates vulnerabilities to heavy-tail…
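To make the contrast between linear scaling and distributional matching concrete, here is a minimal Python sketch. It is not the paper's formulation: the abstract does not specify how G$^2$RPO performs the matching, so the rank-based inverse-normal transform below is an assumed stand-in, and the function names `linear_advantages` and `gaussian_rank_advantages` are hypothetical. The linear variant is the familiar mean-and-standard-deviation normalization of rewards within a group.

```python
from statistics import NormalDist, mean, pstdev


def linear_advantages(rewards):
    """Linear scaling: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rewards are equal, so no sample is preferred over another.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def gaussian_rank_advantages(rewards):
    """Assumed illustration of non-linear distributional matching:
    map each reward's average rank to a standard normal quantile.
    This is a stand-in, not the method defined in the paper."""
    n = len(rewards)
    if len(set(rewards)) == 1:
        return [0.0] * n
    order = sorted(range(n), key=lambda i: rewards[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied rewards starting at sorted position i.
        j = i
        while j + 1 < n and rewards[order[j + 1]] == rewards[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (rank - 0.5) / n lies strictly inside (0, 1), as inv_cdf requires.
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


if __name__ == "__main__":
    heavy_tailed = [0.0, 0.1, 0.2, 100.0]  # one outlier reward
    print(linear_advantages(heavy_tailed))
    print(gaussian_rank_advantages(heavy_tailed))
```

Under linear scaling the single outlier dominates the group statistics, so the three small rewards receive nearly identical advantages. The rank-based mapping depends only on ordering, so the same group yields the same advantages as `[0.0, 0.1, 0.2, 0.3]` would, which is one way the output scale can be kept comparable across tasks with very different reward ranges.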
