How Much 3D Do Video Foundation Models Encode?

Zixuan Huang; Xiang Li; Zhaoyang Lv; James M. Rehg

arXiv:2512.19949 · cs.CV · December 24, 2025

Open Access

TL;DR

This paper investigates how much 3D understanding Video Foundation Models learn implicitly from large-scale video data, and finds that they encode substantial 3D knowledge despite never being trained on 3D data.

Contribution

The paper introduces the first model-agnostic framework for quantifying 3D awareness in Video Foundation Models (VidFMs) and uses it to benchmark major VidFMs.

Findings

01. State-of-the-art video models encode strong 3D understanding.

02. Video models can outperform specialized 3D models in certain tasks.

03. 3D awareness emerges naturally in large-scale video training.

Abstract

Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for…
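The "shallow read-out" idea in the abstract can be illustrated with a linear probe: features are held fixed and only a small regressor is trained to predict a 3D quantity from them. The following is a minimal sketch under stated assumptions, not the paper's implementation. The feature vectors are synthetic stand-ins rather than activations from any real VidFM, the scalar target is a hypothetical depth-like value, and every name (`fit_linear_probe`, `mean_squared_error`, `probe_error`) is invented for illustration. The read-out architecture and 3D targets actually used are not specified on this page.

```python
import random


def predict(w, b, x):
    """Linear read-out: dot(w, x) + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def fit_linear_probe(features, targets, lr=0.05, epochs=500):
    """Fit a linear probe by gradient descent on mean squared error.

    The features are treated as frozen: only w and b are updated.
    """
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, targets):
            err = predict(w, b, x) - y
            for j in range(dim):
                grad_w[j] += 2.0 * err * x[j] / n
            grad_b += 2.0 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def mean_squared_error(w, b, features, targets):
    """Average squared error of the probe on a held-out set."""
    total = sum((predict(w, b, x) - y) ** 2 for x, y in zip(features, targets))
    return total / len(targets)


# Synthetic stand-in for frozen model features (assumption: 4-dim vectors).
rng = random.Random(0)
true_w = [0.5, -1.0, 2.0, 0.25]  # hypothetical ground-truth mapping
features = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(200)]
# Hypothetical depth-like target: linear in the features plus small noise.
targets = [
    sum(a * v for a, v in zip(true_w, x)) + 1.5 + rng.gauss(0.0, 0.01)
    for x in features
]

# Train on one split, evaluate on a held-out split.
train_x, test_x = features[:150], features[150:]
train_y, test_y = targets[:150], targets[150:]
w, b = fit_linear_probe(train_x, train_y)
probe_error = mean_squared_error(w, b, test_x, test_y)
print(f"held-out probe MSE: {probe_error:.6f}")
```

The design point is that the probe has very little capacity of its own, so a low held-out error suggests the target is already close to linearly recoverable from the features rather than learned by the read-out.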

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · 3D Shape Modeling and Analysis