G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning

Liang Chen; Hongcheng Gao; Tianyu Liu; Zhiqi Huang; Flood Sung; Xinyu Zhou; Yuxin Wu; Baobao Chang

arXiv:2505.13426·cs.CV·May 20, 2025

TL;DR

This paper introduces VLM-Gym, a reinforcement learning environment for vision-language models, demonstrating that perception and reasoning abilities mutually improve through self-evolution and fine-tuning, leading to superior performance in visual games.

Contribution

The paper presents VLM-Gym for scalable multi-game training and introduces G1 models with perception-enhanced priors, achieving state-of-the-art results and revealing mutual bootstrap effects between perception and reasoning.

Findings

01. G1 models outperform their teachers and proprietary models.

02. Perception and reasoning abilities mutually bootstrap during RL training.

03. VLM-Gym enables scalable multi-game reinforcement learning for vision-language models.

Abstract

Vision-Language Models (VLMs) excel in many direct multimodal tasks but struggle to translate this prowess into effective decision-making within interactive, visually rich environments like games. This "knowing-doing" gap significantly limits their potential as autonomous agents, as leading VLMs often perform poorly in simple games. To address this, we introduce VLM-Gym, a curated reinforcement learning (RL) environment featuring diverse visual games with unified interfaces and adjustable, compositional difficulty, specifically designed for scalable multi-game parallel training. Leveraging VLM-Gym, we train G0 models using pure RL-driven self-evolution, which demonstrate emergent perception and reasoning patterns. To further mitigate challenges arising from game diversity, we develop G1 models. G1 incorporates a perception-enhanced cold start prior to RL fine-tuning. Our resulting…

Code & Models

Repositories

chenllliang/g1
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Language and cultural evolution