OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

Wenbo Hu; Xin Chen; Yan Gao-Tian; Yihe Deng; Nanyun Peng; Kai-Wei Chang

arXiv:2604.08539 · cs.CV · April 21, 2026

TL;DR

OpenVLThinkerV2 is a generalist multimodal reasoning model trained with Gaussian GRPO (G$^2$RPO), an RL objective that matches each task's advantage distribution to a standard normal, to improve performance across multi-domain visual tasks while balancing perception and reasoning.

Contribution

The paper introduces G$^2$RPO, a distributional matching RL objective, and task-level shaping mechanisms, enabling robust, general-purpose multimodal models for diverse visual tasks.

Findings

01. Outperforms strong open-source and proprietary models on 18 benchmarks.

02. G$^2$RPO improves training stability and inter-task gradient equity.

03. Task shaping mechanisms effectively balance perception and reasoning.

Abstract

Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G$^2$RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, $\mathcal{N}(0, 1)$, G$^2$RPO theoretically ensures inter-task gradient equity, mitigates vulnerabilities to heavy-tail…
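
To make the contrast between "linear scaling" and "distributional matching" concrete, here is a minimal sketch under stated assumptions. The abstract as shown here does not give the exact G$^2$RPO transform, so the second function uses a rank-based inverse-normal transform as a stand-in technique. It is one common way to map a group of rewards onto approximately standard-normal scores, and is not claimed to be the paper's formula. The function names `grpo_advantages` and `gaussian_rank_advantages` are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style linear scaling: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def gaussian_rank_advantages(rewards):
    """Illustrative stand-in (an assumption, not the paper's method):
    map each reward's within-group rank to a standard-normal quantile."""
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied rewards starting at sorted position i.
        j = i
        while j + 1 < n and rewards[order[j + 1]] == rewards[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank shared by the ties
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (rank - 0.5) / n lies strictly inside (0, 1), so inv_cdf is always defined.
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


if __name__ == "__main__":
    # A heavy-tailed group: one rollout receives a far larger reward.
    group = [0.0, 0.1, 0.2, 100.0]
    print("linear scaling:", grpo_advantages(group))
    print("rank-Gaussian: ", gaussian_rank_advantages(group))
```

With a single outlier in the group, linear scaling compresses the three small rewards into nearly identical advantages, while the rank-based mapping keeps them separated and symmetric around zero. With a finite group the result is only approximately normal, and a group with all-equal rewards maps to zero advantage everywhere.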

Code & Models

Repositories

uclanlp/OpenVLThinker (GitHub)
