Vision-G1: Towards General Vision Language Reasoning with Multi-Domain Data Curation

Yuheng Zha; Kun Zhou; Yujia Wu; Yushu Wang; Jie Feng; Zhi Xu; Shibo Hao; Zhengzhong Liu; Eric P. Xing; Zhiting Hu

arXiv:2508.12680 · cs.CV · August 19, 2025

TL;DR

Vision-G1 builds an RL-ready visual reasoning dataset from 46 data sources across 8 dimensions and trains with multi-round RL, improving general reasoning across diverse tasks and outperforming similar-sized VLMs as well as proprietary models such as GPT-4o and Gemini-1.5 Flash.

Contribution

The paper presents a large multi-domain dataset and an influence-function-based data selection method for training a general vision language model with reinforcement learning.

Findings

01. Achieves state-of-the-art results on various visual reasoning benchmarks.
02. Outperforms similar-sized VLMs and proprietary models like GPT-4o and Gemini-1.5 Flash.
03. Demonstrates effective multi-domain generalization in visual reasoning.

Abstract

Despite their success, current training pipelines for reasoning VLMs focus on a limited range of tasks, such as mathematical and logical reasoning. As a result, these models have difficulty generalizing their reasoning capabilities to a wide range of domains, primarily because readily available data with verifiable rewards is scarce beyond these narrowly defined areas. Moreover, integrating data from multiple domains is challenging, as the compatibility between domain-specific datasets remains uncertain. To address these limitations, we build a comprehensive RL-ready visual reasoning dataset from 46 data sources across 8 dimensions, covering a wide range of tasks such as infographic, mathematical, spatial, cross-image, graphical user interface, medical, common sense, and general science. We propose an influence-function-based data selection and difficulty-based filtering strategy…
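The abstract names a difficulty-based filtering strategy but is cut off before describing it. As a rough illustration only, one common way to filter RL training data by difficulty is to sample several answers per question, score them with a verifier, and drop questions the model always or never solves. The sketch below assumes that setup; the `Sample` class, the `rollout_correct` field, and the `low`/`high` thresholds are hypothetical names and values chosen for the example, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One training question plus verifier outcomes (hypothetical structure)."""
    sample_id: str
    rollout_correct: List[bool]  # one entry per sampled answer: did it pass the verifier?


def pass_rate(sample: Sample) -> float:
    """Fraction of sampled answers that the verifier marked correct."""
    return sum(sample.rollout_correct) / len(sample.rollout_correct)


def filter_by_difficulty(samples: List[Sample],
                         low: float = 0.0,
                         high: float = 1.0) -> List[Sample]:
    """Keep samples whose pass rate lies strictly between `low` and `high`.

    With the default thresholds (an assumption for this sketch), questions
    that are always solved or never solved are dropped, since they give no
    contrast between good and bad answers. Samples with no rollouts are
    skipped to avoid dividing by zero.
    """
    return [
        s for s in samples
        if s.rollout_correct and low < pass_rate(s) < high
    ]


if __name__ == "__main__":
    data = [
        Sample("too_easy", [True, True, True, True]),
        Sample("useful", [True, False, True, False]),
        Sample("too_hard", [False, False, False, False]),
    ]
    for kept in filter_by_difficulty(data):
        print(kept.sample_id, pass_rate(kept))
```

Tightening `low` and `high` narrows the retained band of difficulty; whether the paper uses pass rates, these thresholds, or a different difficulty signal cannot be determined from the truncated abstract.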

Code & Models

Models

🤗 yzha/vision-g1 · model · 14 dl

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Logic, Reasoning, and Knowledge