GeoVideo: Introducing Geometric Regularization into Video Generation Model
Yunpeng Bai, Shaoheng Fang, Chaohui Yu, Fan Wang, Qixing Huang

TL;DR
This paper enhances video generation models by integrating geometric regularization through depth prediction, significantly improving temporal consistency and 3D structural coherence in synthesized videos.
Contribution
The paper introduces a geometric regularization framework for diffusion-based video generation: the model predicts per-frame depth, and a multi-view geometric loss enforces 3D structural consistency across frames.
Findings
Improved temporal stability and geometric consistency in generated videos.
Enhanced shape and structure coherence across frames.
Significant performance gains over baseline models.
Abstract
Recent advances in video generation have enabled the synthesis of high-quality and visually realistic clips using diffusion transformer models. However, most existing approaches operate purely in the 2D pixel space and lack explicit mechanisms for modeling 3D structures, often resulting in temporally inconsistent geometries, implausible motions, and structural artifacts. In this work, we introduce geometric regularization losses into video generation by augmenting latent diffusion models with per-frame depth prediction. We adopt depth as the geometric representation because of the great progress in depth prediction and its compatibility with image-based latent encoders. Specifically, to enforce structural consistency over time, we propose a multi-view geometric loss that aligns the predicted depth maps across frames within a shared 3D coordinate system. Our method bridges the gap…
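To make the idea of aligning depth maps across frames in a shared 3D coordinate system concrete, here is a minimal NumPy sketch of one possible multi-view depth consistency loss. It is not the paper's implementation. The function names (`unproject`, `multiview_depth_loss`), the pinhole intrinsics `K`, the known relative pose `T_i_to_j`, the nearest-neighbour sampling, and the L1 residual are all assumptions made for illustration; the abstract does not specify any of them.

```python
import numpy as np

def unproject(depth, K):
    """Lift an (H, W) depth map to (H*W, 3) camera-space points (pinhole model, assumed)."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # back-project pixels to rays with z = 1
    return rays * depth.reshape(-1, 1)       # scale each ray by its depth

def multiview_depth_loss(depth_i, depth_j, K, T_i_to_j):
    """Mean absolute depth disagreement after warping frame i into frame j.

    T_i_to_j is a hypothetical 4x4 rigid transform from camera i to camera j.
    """
    h, w = depth_j.shape
    pts_i = unproject(depth_i, K)
    pts_j = pts_i @ T_i_to_j[:3, :3].T + T_i_to_j[:3, 3]   # points in frame j coordinates
    z = pts_j[:, 2]
    front = z > 1e-6                                       # keep points in front of camera j
    proj = pts_j[front] @ K.T
    u = np.rint(proj[:, 0] / proj[:, 2]).astype(int)       # nearest-neighbour pixel lookup
    v = np.rint(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    if not inside.any():
        return 0.0
    residual = np.abs(depth_j[v[inside], u[inside]] - z[front][inside])
    return float(residual.mean())

# Toy check: a fronto-parallel plane seen by a camera that moves 0.5 units forward.
K = np.array([[4.0, 0.0, 3.5], [0.0, 4.0, 3.5], [0.0, 0.0, 1.0]])
T = np.eye(4)
T[2, 3] = -0.5                      # points end up 0.5 closer in frame j
depth_i = np.full((8, 8), 2.0)
consistent = multiview_depth_loss(depth_i, np.full((8, 8), 1.5), K, T)
inconsistent = multiview_depth_loss(depth_i, np.full((8, 8), 2.0), K, T)
```

In this toy setup the loss is near zero when frame j reports the geometrically correct depth and grows when it does not. A training-time version would need a differentiable framework and differentiable (for example bilinear) sampling, plus some handling of occlusions, none of which this sketch attempts.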
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · 3D Shape Modeling and Analysis
