UniGeo: Taming Video Diffusion for Unified Consistent Geometry Estimation

Yang-Tian Sun; Xin Yu; Zehuan Huang; Yi-Hua Huang; Yuan-Chen Guo; Ziyi Yang; Yan-Pei Cao; Xiaojuan Qi

arXiv:2505.24521 · cs.CV · June 2, 2025

TL;DR

UniGeo fine-tunes a video diffusion model to estimate geometry consistently across video frames in a global coordinate system. It introduces a conditioning method that reuses positional encodings and uses joint training to improve accuracy and generalization to dynamic scenes.

Contribution

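The paper shows that, with appropriate design and fine-tuning, the intrinsic consistency of video generation models can be harnessed for unified, consistent geometry estimation across video frames. It predicts geometric attributes in the global coordinate system and conditions the model by reusing positional encodings.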

Findings

01. Achieves superior global geometric attribute prediction in videos.
02. Demonstrates potential to generalize from static to dynamic scenes.
03. Enables direct application to reconstruction tasks.

Abstract

Recently, methods leveraging diffusion model priors to assist monocular geometric estimation (e.g., depth and normal) have gained significant attention due to their strong generalization ability. However, most existing works focus on estimating geometric properties within the camera coordinate system of individual video frames, neglecting the inherent ability of diffusion models to determine inter-frame correspondence. In this work, we demonstrate that, through appropriate design and fine-tuning, the intrinsic consistency of video generation models can be effectively harnessed for consistent geometric estimation. Specifically, we 1) select geometric attributes in the global coordinate system that share the same correspondence with video frames as the prediction targets, 2) introduce a novel and efficient conditioning method by reusing positional encodings, and 3) enhance performance…
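To make the distinction between per-frame camera coordinates and a global coordinate system concrete, the following is a minimal sketch of the standard rigid transform that maps camera-space points into a shared frame. It is not taken from the paper: the function name `camera_to_global`, the camera-to-world pose convention, and the toy poses are all assumptions chosen for illustration.

```python
# Illustrative sketch only: names and conventions are assumptions, not UniGeo's code.

def camera_to_global(points_cam, rotation, translation):
    """Map camera-space 3D points into a shared global frame.

    points_cam:  list of (x, y, z) tuples in one frame's camera coordinates.
    rotation:    3x3 camera-to-world rotation as nested lists (assumed convention).
    translation: camera centre expressed in the global frame, as (tx, ty, tz).
    """
    points_global = []
    for p in points_cam:
        # X_global = R @ X_cam + t, written out without external libraries.
        g = tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        points_global.append(g)
    return points_global


# Two frames observing the same physical point from different camera poses.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Frame 0: camera at the origin, so camera and global coordinates coincide.
frame0 = camera_to_global([(0.0, 0.0, 2.0)], identity, (0.0, 0.0, 0.0))

# Frame 1: camera moved 1 unit along +x, so the point sits at x = -1 in camera space.
frame1 = camera_to_global([(-1.0, 0.0, 2.0)], identity, (1.0, 0.0, 0.0))
```

In camera coordinates the same point has a different value in each frame, while in the global frame both observations map to one location. That shared value per scene point is what lets a prediction target stay consistent across frames.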

Taxonomy

Topics: Video Coding and Compression Technologies · Advanced Vision and Imaging · Advanced Image Processing Techniques

Methods: Softmax · Attention Is All You Need · Focus · Diffusion