One RL to See Them All: Visual Triple Unified Reinforcement Learning

Yan Ma; Linge Du; Xuyang Shen; Shaoxiang Chen; Pengfei Li; Qibing Ren; Lizhuang Ma; Yuchao Dai; Pengfei Liu; Junjie Yan

arXiv:2505.18129 · cs.CV · April 17, 2026

1 repository, 6 models, 1 dataset

TL;DR

This paper introduces V-Triune, a unified multimodal reinforcement learning framework, and uses it to train the Orsta models (7B, 32B), which match or outperform specialist models across reasoning and perception-heavy vision-language tasks.

Contribution

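The paper proposes V-Triune, a methodology for unified multimodal RL organized around Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics, and demonstrates its effectiveness with the Orsta models on diverse benchmarks.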

Findings

01. Unified training matches or outperforms specialist models.

02. Orsta models improve over backbones on MEGA-Bench.

03. Unified RL enhances reasoning and perception in VLMs.

Abstract

Reinforcement learning (RL) is becoming an important direction for post-training vision-language models (VLMs), but public training methodologies for unified multimodal RL remain much less mature, especially for heterogeneous reasoning and perception-heavy tasks. We propose V-Triune, a Visual Triple Unified Reinforcement Learning methodology for unified multimodal RL. It organizes training around three coordinated abstractions: Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics. Within this methodology, Dynamic IoU provides localization-specific reward shaping that avoids reward ambiguity under loose thresholds and reward sparsity under strict ones. Built on V-Triune, we develop Orsta (7B, 32B), a family of models jointly trained on eight reasoning and perception tasks. Under matched budgets, unified training matches or outperforms specialist…
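To make the Dynamic IoU idea concrete, here is a minimal sketch of a localization reward whose IoU threshold tightens as training progresses. It is an illustration under stated assumptions, not the authors' implementation: the schedule values in `HYPOTHETICAL_SCHEDULE`, the function names, and the choice to return the raw IoU once the threshold is met are all hypothetical. The abstract states only that the reward shaping avoids ambiguity under loose thresholds and sparsity under strict ones.

```python
from typing import Sequence

Box = Sequence[float]  # (x1, y1, x2, y2), assumed axis-aligned


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical schedule: (fraction of training completed, IoU threshold).
# These numbers are illustrative assumptions, not values from the paper.
HYPOTHETICAL_SCHEDULE = ((0.10, 0.50), (0.25, 0.75), (1.00, 0.90))


def dynamic_threshold(progress: float) -> float:
    """Return the IoU threshold for a training progress value in [0, 1]."""
    for stage_end, threshold in HYPOTHETICAL_SCHEDULE:
        if progress <= stage_end:
            return threshold
    return HYPOTHETICAL_SCHEDULE[-1][1]


def dynamic_iou_reward(pred: Box, gt: Box, progress: float) -> float:
    """Reward is the IoU if it clears the current threshold, else 0.

    Early on, the loose threshold keeps rewards dense; later, the strict
    threshold stops rewarding imprecise boxes.
    """
    score = iou(pred, gt)
    return score if score >= dynamic_threshold(progress) else 0.0
```

With a prediction `(0, 0, 10, 10)` and ground truth `(0, 0, 10, 8)`, the IoU is 0.8, so under this assumed schedule the box is rewarded early in training and receives zero reward once the threshold reaches 0.9.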

Code & Models

Repositories

MiniMax-AI/One-RL-to-See-Them-All
github

Models

Datasets

One-RL-to-See-Them-All/Orsta-Data-47k
dataset · 507 downloads
