Vision-G1: Towards General Vision Language Reasoning with Multi-Domain Data Curation
Yuheng Zha, Kun Zhou, Yujia Wu, Yushu Wang, Jie Feng, Zhi Xu, Shibo Hao, Zhengzhong Liu, Eric P. Xing, Zhiting Hu

TL;DR
Vision-G1 pairs an RL-ready visual reasoning dataset, built from 46 data sources across 8 dimensions, with multi-round RL training, improving general reasoning across diverse tasks and outperforming similar-sized VLMs as well as proprietary models such as GPT-4o and Gemini-1.5 Flash.
Contribution
The paper presents a large multi-domain dataset and a novel influence-function-based data selection method for training a general vision-language model with reinforcement learning.
Findings
Achieves state-of-the-art results on various visual reasoning benchmarks.
Outperforms similar-sized VLMs and proprietary models like GPT-4o and Gemini-1.5 Flash.
Demonstrates effective multi-domain generalization in visual reasoning.
Abstract
Despite their success, current training pipelines for reasoning VLMs focus on a limited range of tasks, such as mathematical and logical reasoning. As a result, these models face difficulties in generalizing their reasoning capabilities to a wide range of domains, primarily due to the scarcity of readily available and verifiable reward data beyond these narrowly defined areas. Moreover, integrating data from multiple domains is challenging, as the compatibility between domain-specific datasets remains uncertain. To address these limitations, we build a comprehensive RL-ready visual reasoning dataset from 46 data sources across 8 dimensions, covering a wide range of tasks such as infographic, mathematical, spatial, cross-image, graphical user interface, medical, common sense and general science. We propose an influence-function-based data selection and difficulty-based filtering strategy…
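The abstract is cut off before it explains how difficulty-based filtering works, so the following is only a minimal sketch of one common reading of the idea, not the authors' procedure. It assumes each training sample has been attempted several times by a model and scored with a binary verifiable reward, and it keeps samples that are neither always solved nor never solved. The names `pass_rate`, `filter_by_difficulty`, `min_rate`, `max_rate`, and the sample ids are hypothetical.

```python
from typing import Dict, List


def pass_rate(rewards: List[int]) -> float:
    """Fraction of sampled responses that a verifier marked correct (1) vs. wrong (0)."""
    return sum(rewards) / len(rewards) if rewards else 0.0


def filter_by_difficulty(
    rollouts: Dict[str, List[int]],
    min_rate: float = 0.0,
    max_rate: float = 1.0,
) -> List[str]:
    """Keep sample ids whose pass rate lies strictly between the two thresholds.

    Assumption (not from the paper): samples that are always solved or never
    solved carry little learning signal for RL, so they are dropped.
    Samples with no rollouts get a pass rate of 0.0 and are dropped as well.
    """
    return [
        sample_id
        for sample_id, rewards in rollouts.items()
        if min_rate < pass_rate(rewards) < max_rate
    ]


# Hypothetical rollout results: one list of binary rewards per sample.
rollouts = {
    "chart_qa_001": [1, 1, 1, 1],  # always solved: treated as too easy
    "geometry_017": [0, 1, 0, 1],  # sometimes solved: kept
    "medical_042": [0, 0, 0, 0],   # never solved: treated as too hard
}
kept = filter_by_difficulty(rollouts)
print(kept)
```

Tightening `min_rate` and `max_rate` would narrow the retained difficulty band; the thresholds the authors actually use, if any, are not stated in the text available here.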
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Logic, Reasoning, and Knowledge
