Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning
Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Zhiheng Xi, Changhao Jiang, Zhangyue Yin, Yining Zheng, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu

TL;DR
This paper introduces Game-RL, a method that trains vision-language models (VLMs) with reinforcement learning on synthesized, verifiable game data, improving their general reasoning across multiple benchmarks.
Contribution
It combines game-code-based data synthesis (Code2Logic) with reinforcement learning to enhance VLM reasoning, extending RL training beyond narrow domains such as geometry or chart reasoning.
Findings
VLMs trained with GameQA outperform baseline models on 7 benchmarks.
Game-RL demonstrates the effectiveness of video games as training resources.
The approach achieves controllable difficulty in reasoning tasks.
Abstract
Vision-language reinforcement learning (RL) has primarily focused on narrow domains (e.g., geometry or chart reasoning). This leaves broader training scenarios and resources underexplored, limiting the exploration and learning of Vision Language Models (VLMs) through RL. We find video games inherently provide rich visual elements and mechanics that are easy to verify. To fully use the multimodal and verifiable rewards in video games, we propose Game-RL, constructing diverse game tasks for RL training to boost VLMs' general reasoning ability. To obtain training data, we propose Code2Logic, a novel approach that adapts game code to synthesize game reasoning task data, thus obtaining the GameQA dataset of 30 games and 158 tasks with controllable difficulty gradation. Unexpectedly, RL training solely on GameQA enables multiple VLMs to achieve performance improvements across 7 diverse…
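To make the idea of synthesizing verifiable QA data from game code more concrete, here is a minimal sketch. It is not the authors' Code2Logic pipeline: the toy grid game, the names `QASample`, `simulate`, `synthesize`, and `reward`, and the exact-match reward are all hypothetical choices made for illustration. The sketch assumes that a game's transition logic can be executed to compute the ground-truth answer, and that a single `difficulty` parameter scales both board size and the number of moves.

```python
import random
from dataclasses import dataclass


@dataclass
class QASample:
    """One synthesized sample (hypothetical schema, not the GameQA format)."""
    game: str
    difficulty: int
    state: list      # grid that a renderer would turn into the image
    question: str
    answer: str


# Row/column offsets for each move symbol.
DELTAS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def simulate(grid, start, moves):
    """Toy game logic: moves into walls (1) or off the board are ignored."""
    r, c = start
    n = len(grid)
    for m in moves:
        dr, dc = DELTAS[m]
        nr, nc = r + dr, c + dc
        if 0 <= nr < n and 0 <= nc < n and grid[nr][nc] == 0:
            r, c = nr, nc
    return r, c


def synthesize(difficulty, seed):
    """Generate one QA pair; the answer comes from running the game code."""
    rng = random.Random(seed)
    size = 3 + difficulty  # assumed difficulty knob: larger board
    grid = [[1 if rng.random() < 0.2 else 0 for _ in range(size)]
            for _ in range(size)]
    grid[0][0] = 0  # the start cell is always free
    moves = "".join(rng.choice("UDLR") for _ in range(2 * difficulty))
    end = simulate(grid, (0, 0), moves)
    question = (
        f"The player starts at (0, 0) and plays the moves {moves}. "
        "Moves into walls or off the board are ignored. "
        "Which cell does the player end on?"
    )
    return QASample("toy_grid", difficulty, grid, question,
                    f"({end[0]}, {end[1]})")


def reward(sample, model_output):
    """Assumed rule-based reward: 1.0 for an exact answer match, else 0.0."""
    return 1.0 if model_output.strip() == sample.answer else 0.0
```

Because the answer is computed by the same code that defines the game, each sample can be checked automatically, and raising `difficulty` gives longer move sequences on larger boards. A real pipeline would also render `state` to an image and, as the reviews below point out, may rely on a model-based judge rather than exact matching.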
Peer Reviews
Decision: ICLR 2026 Poster
- Scalable data-generation pipeline: Code2Logic programmatically maps game code to reasoning logic and auto-generates verifiable multimodal QA data.
- Demonstrates that purely synthetic, self-verifiable environments can modestly improve general VLM reasoning (important for RL reproducibility).
- Diverse benchmark coverage (3D perception, pattern matching, planning, reasoning).
- Clear, reproducible methodology; good visualizations and qualitative examples.

- Small gains on external benchmarks are statistically and practically modest; no significance tests or efficiency comparisons.
- Lack of ablation isolating contributions of Code2Logic data vs RL itself (no SFT vs RL comparison on the same data).
- Evaluator bias: rewards rely on QwQ-32B, potentially aligning to its own style and inflating self-consistency.
- Unclear verification metrics: “verifiable” is claimed, but no automated correctness guarantees are quantified.
* This paper is clearly written and easy to follow.
* Dataset contribution: GameQA spans 30 games / 158 tasks with explicit difficulty control and verifiable answers.
* GRPO on GameQA yields consistent improvements on diverse general benchmarks.
* The Code2Logic pipeline is presented as highly scalable, but Section 2.4 and Appendix F.4 reveal a significant reliance on manual verification at every step (code, data engine, and augmented samples). Furthermore, the data augmentation relies on paraphrasing from InternVL2.5-78B, and data quality checks use commercial LLMs. This "human-in-the-loop" and "proprietary-LLM-in-the-loop" requirement makes the process less automated and scalable than implied.
* The work only generates massive amount
- The motivation is clear and sound. The authors correctly identify a critical limitation in the current VLM training paradigm, the over-reliance on narrow, static domains like geometry or chart reasoning. The proposal to use video games as a more dynamic, verifiable, and diverse training environment for fostering general reasoning is an interesting idea.
- The GameQA dataset could be a substantial contribution to the community. Its scale (30 games, 158 tasks, 140K samples), diversity across fo
1. The novelty is somewhat incremental and limited. The Code2Logic pipeline, while well-executed, is fundamentally a specific instance of LLM-based synthetic data generation. This approach is becoming increasingly common, and the paper does not sufficiently differentiate its technical contribution from existing work in program-aided or LLM-driven data synthesis. It seems to be more of an extensive engineering protocol than a novel, generalizable method. Similarly, Game-RL is an application of an
