VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model

Xiangyu Sun; Shijie Wang; Fengyi Zhang; Lin Liu; Caiyan Jia; Ziying Song; Zi Huang; Yadan Luo

arXiv:2603.12655 · cs.CV · March 16, 2026

TL;DR

VGGT-World is a geometry-focused world model that forecasts the evolution of frozen geometry-foundation-model (GFM) features instead of generating video frames. It outperforms strong baselines in depth forecasting while running faster and training fewer parameters.

Contribution

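The paper repurposes the latent tokens of a frozen VGGT as the predictive world state and develops a training approach, including a clean-target (z-prediction) parameterization, that makes forecasting in this high-dimensional feature space workable.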

Findings

01. Outperforms strong baselines in depth forecasting.

02. Runs 3.6 to 5 times faster than existing models.

03. Uses only 0.43B trainable parameters.

Abstract

World models that forecast scene evolution by generating future video frames devote the bulk of their capacity to photometric details, yet the resulting predictions often remain geometrically inconsistent. We present VGGT-World, a geometry world model that side-steps video generation entirely and instead forecasts the temporal evolution of frozen geometry-foundation-model (GFM) features. Concretely, we repurpose the latent tokens of a frozen VGGT as the world state and train a lightweight temporal flow transformer to autoregressively predict their future trajectory. Two technical challenges arise in this high-dimensional (d=1024) feature space: (i) standard velocity-prediction flow matching collapses, and (ii) autoregressive rollout suffers from compounding exposure bias. We address the first with a clean-target (z-prediction) parameterization that yields a substantially higher…
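The difference between velocity prediction and the clean-target (z-prediction) parameterization named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a linear interpolation path between noise and the clean feature, which the abstract does not specify, and every function name is hypothetical. It only shows that a clean-target prediction determines a velocity, so the two parameterizations describe the same flow but regress different targets.

```python
import random

def interpolate(noise, clean, t):
    """Assumed linear path: x_t = (1 - t) * noise + t * clean."""
    return [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]

def velocity_target(noise, clean):
    """Velocity-prediction target for the assumed path: dx_t/dt = clean - noise."""
    return [c - n for n, c in zip(noise, clean)]

def velocity_from_clean(x_t, clean_pred, t, eps=1e-6):
    """Velocity implied by a clean-target (z) prediction: (z - x_t) / (1 - t)."""
    denom = max(1.0 - t, eps)  # guard against division by zero as t approaches 1
    return [(c - x) / denom for x, c in zip(x_t, clean_pred)]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
d = 1024  # feature dimension quoted in the abstract
clean = [rng.gauss(0.0, 1.0) for _ in range(d)]  # stand-in for one frozen feature token
noise = [rng.gauss(0.0, 1.0) for _ in range(d)]
t = 0.3

x_t = interpolate(noise, clean, t)
v_true = velocity_target(noise, clean)

# With a perfect clean-target prediction, the implied velocity matches the
# velocity target up to floating-point error.
v_implied = velocity_from_clean(x_t, clean, t)
gap = mse(v_true, v_implied)
```

In a z-prediction setup of this kind, the network would regress `clean` directly and the sampler would recover the velocity with `velocity_from_clean` at each integration step. Why this behaves better at d=1024 is the paper's claim and is not demonstrated by the sketch.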

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · 3D Shape Modeling and Analysis