Pixels to Play: A Foundation Model for 3D Gameplay
Yuguang Yue, Chris Green, Samuel Hunt, Irakli Salia, Wenzhe Shi, Jonathan J Hunt

TL;DR
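Pixels2Play-0.1 is a foundation model that learns to play a range of 3D video games from the same pixel stream a human player sees. It is trained with behavior cloning on labeled human demonstrations and on unlabeled public videos whose actions are imputed by an inverse-dynamics model, and it aims to generalize to new titles with minimal game-specific engineering.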
Contribution
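The paper introduces Pixels2Play-0.1, a decoder-only transformer trained end-to-end with behavior cloning to play multiple 3D games from pixels. Training combines labeled demonstrations from instrumented human gameplay with unlabeled public videos, for which actions are imputed by an inverse-dynamics model.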
Findings
Competent play on Roblox and MS-DOS titles
Effective use of unlabeled videos for training
Potential for reaching expert-level control with further scaling
Abstract
We introduce Pixels2Play-0.1 (P2P0.1), a foundation model that learns to play a wide range of 3D video games with recognizable human-like behavior. Motivated by emerging consumer and developer use cases - AI teammates, controllable NPCs, personalized live-streamers, assistive testers - we argue that an agent must rely on the same pixel stream available to players and generalize to new titles with minimal game-specific engineering. P2P0.1 is trained end-to-end with behavior cloning: labeled demonstrations collected from instrumented human game-play are complemented by unlabeled public videos, to which we impute actions via an inverse-dynamics model. A decoder-only transformer with auto-regressive action output handles the large action space while remaining latency-friendly on a single consumer GPU. We report qualitative results showing competent play across simple Roblox and classic…
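The action-imputation step mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: frames are reduced to integers, the inverse-dynamics model is a simple lookup table fitted by majority vote, and the names `fit_idm`, `impute_actions`, and the action labels are hypothetical, chosen only for illustration. The idea shown is that a model fitted on labeled (frame, action, next frame) triples can assign pseudo-labels to consecutive frames of an unlabeled video, and the two sources can then be merged into one behavior-cloning dataset.

```python
from collections import Counter, defaultdict


def fit_idm(labeled):
    """Fit a toy inverse-dynamics model (hypothetical stand-in).

    Maps each observed frame change to the action that most often
    produced it in the labeled demonstrations.
    """
    votes = defaultdict(Counter)
    for frame, action, next_frame in labeled:
        votes[next_frame - frame][action] += 1
    return {delta: counts.most_common(1)[0][0] for delta, counts in votes.items()}


def impute_actions(idm, video, default="noop"):
    """Pseudo-label each consecutive frame pair of an unlabeled video."""
    return [(f, idm.get(g - f, default), g) for f, g in zip(video, video[1:])]


# Labeled demonstrations: (frame, action, next_frame), with frames as toy integers.
labeled = [(0, "right", 1), (1, "right", 2), (2, "left", 1), (1, "noop", 1)]
idm = fit_idm(labeled)

# An unlabeled video is just a sequence of frames with no recorded actions.
video = [5, 6, 6, 5]
pseudo = impute_actions(idm, video)

# Behavior cloning would then train on the union of both sources.
bc_dataset = labeled + pseudo
```

In a real system the lookup table would be replaced by a learned model over image frames, and the pseudo-labels would carry noise that the downstream policy has to tolerate.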
Taxonomy
Topics: Digital Games and Media · Augmented Reality Applications
