WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

Jisu Nam; Yicong Hong; Chun-Hao Paul Huang; Feng Liu; JoungBin Lee; Jiyoung Kim; Siyoon Jin; Yunsung Lee; Jaeyoon Jung; Suhwan Choi; Seungryong Kim; Yang Zhou

arXiv:2603.16871 · cs.CV · March 18, 2026

TL;DR

WorldCam uses camera pose as a unifying geometric representation, enabling precise action control and consistent long-horizon 3D exploration in interactive gaming worlds.

Contribution

The paper proposes a new method leveraging camera pose as a geometric anchor, integrating physics-based action spaces and global camera poses for improved 3D consistency and control.

Findings

01 Outperforms state-of-the-art models in action controllability

02 Enhances long-horizon visual quality in 3D environments

03 Improves 3D spatial consistency during navigation

Abstract

Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure…
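The geometric coupling described in the abstract, where each action induces a relative camera motion and these motions accumulate into a global camera pose, can be illustrated with a short sketch. This is not the paper's implementation: the twist layout `(v, w)`, the function names `hat`, `se3_exp`, and `accumulate_poses`, and the choice to compose increments by right-multiplication (motions expressed in the current camera frame) are all assumptions made for illustration. The sketch maps a 6-vector in the Lie algebra se(3) to a 6-DoF pose in SE(3) with the standard closed-form exponential map, then chains the per-step poses.

```python
import numpy as np

def hat(w):
    """Skew-symmetric (cross-product) matrix of a 3-vector."""
    x, y, z = w
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def se3_exp(twist):
    """Map an se(3) twist (v, w) to a 4x4 SE(3) matrix.

    Assumed layout (hypothetical, not from the paper):
    twist[:3] is the translational part v, twist[3:] the rotational part w.
    """
    v = np.asarray(twist[:3], dtype=float)
    w = np.asarray(twist[3:], dtype=float)
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-8:
        # First-order approximation for very small rotations.
        R = np.eye(3) + W
        V = np.eye(3) + 0.5 * W
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + A * W + B * (W @ W)   # Rodrigues' formula
        V = np.eye(3) + B * W + C * (W @ W)   # couples rotation into translation
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T

def accumulate_poses(twists, T0=None):
    """Chain relative motions into global poses.

    Assumption: each twist is expressed in the current camera frame,
    so the increment is applied by right-multiplication.
    """
    T = np.eye(4) if T0 is None else np.array(T0, dtype=float)
    poses = [T.copy()]
    for xi in twists:
        T = T @ se3_exp(xi)
        poses.append(T.copy())
    return poses

# Hypothetical action sequence: step forward along x, yaw 90 degrees, step forward again.
actions = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, np.pi / 2],
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
trajectory = accumulate_poses(actions)
final_position = trajectory[-1][:3, 3]
```

Under these assumptions, the second forward step is taken along the rotated heading, which is the sense in which relative motions only become meaningful once accumulated into a global pose. How the resulting poses are embedded and injected into the generative model is described in the paper itself and is not sketched here.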


Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Advanced Vision and Imaging