TL;DR
This paper introduces Geometry Forcing, a method that guides video diffusion models to learn 3D-aware representations by aligning their intermediate features with features from a geometric foundation model, improving 3D consistency in video generation.
Contribution
It proposes a novel alignment-based approach to embed geometric awareness into video diffusion models, bridging the gap between 2D video data and 3D world understanding.
Findings
Enhanced 3D consistency in generated videos.
Improved visual quality over baseline models.
Effective alignment of features with geometric cues.
Abstract
Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometry-aware structure in their learned representations. To bridge the gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing geometric features from…
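The two alignment objectives named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the plain-list feature layout (one vector per token), and the exact loss forms are assumptions made for illustration. The abstract is truncated before it says what Scale Alignment regresses from, so the sketch simply compares the two feature sets with a mean squared error as a placeholder.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def angular_alignment_loss(diffusion_feats, geometry_feats):
    """Assumed form: 1 minus the mean cosine similarity over token pairs.

    The loss is zero when every diffusion feature points in the same
    direction as its geometric counterpart, whatever their magnitudes.
    """
    sims = [cosine_similarity(d, g)
            for d, g in zip(diffusion_feats, geometry_feats)]
    return 1.0 - sum(sims) / len(sims)


def scale_alignment_loss(predicted_feats, geometry_feats):
    """Assumed placeholder: mean squared error against geometric features.

    Unlike the angular term, this penalizes magnitude mismatches.
    """
    total, count = 0.0, 0
    for p, g in zip(predicted_feats, geometry_feats):
        for a, b in zip(p, g):
            total += (a - b) ** 2
            count += 1
    return total / count


# Hypothetical toy features: same directions, different magnitudes.
diffusion = [[1.0, 0.0], [0.0, 2.0]]
geometry = [[2.0, 0.0], [0.0, 4.0]]
angular = angular_alignment_loss(diffusion, geometry)
scale = scale_alignment_loss(diffusion, geometry)
```

In this toy case the angular term vanishes because the directions match, while the regression term stays positive because the magnitudes differ. That is one way to read why the abstract calls the two objectives complementary.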