Vero: An Open RL Recipe for General Visual Reasoning
Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu

TL;DR
Vero is a family of fully open vision-language models trained with reinforcement learning on a diverse 600K-sample dataset, matching or exceeding existing open-weight models on visual reasoning across 30 challenging benchmarks.
Contribution
The paper introduces Vero, a family of fully open RL-trained vision-language models, together with Vero-600K, a 600K-sample dataset built from 59 datasets across six task categories, and task-routed rewards that handle heterogeneous answer formats.
Findings
Vero improves over four base models by 3.6-5.3 points on average across VeroEval, a suite of 30 benchmarks.
Vero outperforms open-weight models such as Qwen3-VL-8B-Thinking on most benchmarks.
Diverse task categories are key to effective RL scaling and reasoning transfer.
Abstract
What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) show such broad visual reasoning is within reach, but the recipe behind them remains unclear, locked behind proprietary reinforcement learning (RL) pipelines with non-public data. We introduce Vero, a family of fully open VLMs that matches or exceeds existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answer formats. Vero achieves state-of-the-art performance, improving over four base models by 3.6-5.3 points on average across VeroEval, our suite of 30 challenging benchmarks. Starting from Qwen3-VL-8B-Instruct, Vero…
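The abstract mentions task-routed rewards that handle heterogeneous answer formats but does not spell out how they work. The Python sketch below illustrates the general idea only: each sample carries a task type, and a router dispatches to a scoring function suited to that answer format. The route names (`multiple_choice`, `numeric`, `grounding`), the scoring rules, the tolerance, and the box string format are all assumptions made for illustration; they are not taken from the paper and may differ from Vero's actual reward design.

```python
import re
from typing import Callable, Dict


def exact_match_reward(prediction: str, target: str) -> float:
    """Return 1.0 when the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == target.strip().lower())


def numeric_reward(prediction: str, target: str, rel_tol: float = 0.01) -> float:
    """Return 1.0 when the last number in the prediction is within rel_tol of the target.

    Assumes the target string parses as a number (hypothetical convention).
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", prediction)
    if not numbers:
        return 0.0
    value, gold = float(numbers[-1]), float(target)
    return float(abs(value - gold) <= rel_tol * max(abs(gold), 1e-9))


def iou_reward(prediction: str, target: str) -> float:
    """Return the intersection-over-union of two boxes given as 'x1,y1,x2,y2'."""
    try:
        px1, py1, px2, py2 = (float(v) for v in prediction.split(","))
        tx1, ty1, tx2, ty2 = (float(v) for v in target.split(","))
    except ValueError:
        # Malformed box strings earn no reward.
        return 0.0
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical routing table: task type -> reward function.
REWARD_ROUTES: Dict[str, Callable[[str, str], float]] = {
    "multiple_choice": exact_match_reward,
    "numeric": numeric_reward,
    "grounding": iou_reward,
}


def routed_reward(task_type: str, prediction: str, target: str) -> float:
    """Dispatch to the reward function registered for this task type."""
    if task_type not in REWARD_ROUTES:
        raise ValueError(f"No reward registered for task type: {task_type!r}")
    return REWARD_ROUTES[task_type](prediction, target)
```

Keeping the routes in a dictionary means a new answer format only needs one new scoring function and one new table entry, which is one plausible way to keep a mixed-format RL dataset manageable.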
