Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents

Zihao Wang; Xujing Li; Yining Ye; Junjie Fang; Haoming Wang; Longxiang Liu; Shihao Liang; Junting Lu; Zhiyong Wu; Jiazhan Feng; Wanjun Zhong; Zili Li; Yu Wang; Yu Miao; Bo Zhou; Yuanfan Li; Hao Wang; Zhongkai Zhao; Faming Wu; Zhengxuan Jiang; Weihao Tan; Heyuan Yao; Shi Yan; Xiangyang Li; Yitao Liang; Yujia Qin; Guang Shi

arXiv:2510.23691·cs.AI·October 29, 2025

TL;DR

Game-TARS introduces a scalable, multimodal pre-trained model for generalist game agents that outperforms previous models and approaches human-level performance across diverse gaming environments.

Contribution

The paper presents a unified, scalable action space and a pre-training methodology that together enable large-scale continual pre-training across heterogeneous game domains.

Findings

01

Achieves about twice the success rate of the previous state-of-the-art model on open-world Minecraft tasks.

02

Comes close to the generality of humans new to the games in unseen web 3D games.

03

Outperforms GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet in FPS benchmarks.

Abstract

We present Game-TARS, a generalist game agent trained with a unified, scalable action space anchored to human-aligned native keyboard-mouse inputs. Unlike API- or GUI-based approaches, this paradigm enables large-scale continual pre-training across heterogeneous domains, including OS, web, and simulation games. Game-TARS is pre-trained on over 500B tokens with diverse trajectories and multimodal data. Key techniques include a decaying continual loss to reduce causal confusion and an efficient Sparse-Thinking strategy that balances reasoning depth and inference cost. Experiments show that Game-TARS achieves about twice the success rate of the previous state-of-the-art model on open-world Minecraft tasks, is close to the generality of fresh humans in unseen web 3D games, and outperforms GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet in FPS benchmarks. Scaling results on training-time and test-time…
