DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation

Yue-Jiang Dong; Wang Zhao; Jiale Xu; Ying Shan; Song-Hai Zhang

arXiv:2507.01603·cs.CV·August 8, 2025

Open Access

TL;DR

DepthSync introduces a diffusion guidance framework that keeps depth scale and geometry consistent when estimating depth for long videos, addressing the scale drift that accumulates in previous sliding-window approaches.

Contribution

The paper proposes a training-free, diffusion guidance-based method that synchronizes depth scale across windows and enforces geometric alignment within windows, improving depth estimation for long videos.

Findings

01 Enhanced scale consistency across video windows
02 Improved geometric accuracy in depth predictions
03 Effective on various datasets for long videos

Abstract

Diffusion-based video depth estimation methods have achieved remarkable success with strong generalization ability. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods rely solely on 2D diffusion priors, overlooking the inherent 3D geometric structure of video depths, which results in geometrically inconsistent predictions. In this paper, we propose DepthSync, a novel, training-free framework using diffusion guidance to achieve scale- and geometry-consistent depth predictions for long videos. Specifically, we introduce scale guidance to synchronize the depth scale across windows and geometry guidance to enforce geometric alignment within windows based on the…
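
To make the sliding-window setting in the abstract concrete, the following is a minimal sketch of splitting a video into overlapping windows and chaining a least-squares scale between neighbouring windows on their shared frames. It is an illustrative assumption about how windows can be stitched after prediction, not DepthSync's guidance procedure, which acts during diffusion sampling. All names (`sliding_windows`, `overlap_scale`, `align_windows`) are hypothetical, and each frame is represented as a flat list of depth values for simplicity.

```python
def sliding_windows(num_frames, window, overlap):
    """Return (start, end) frame-index pairs of overlapping windows."""
    if not 0 < overlap < window:
        raise ValueError("overlap must satisfy 0 < overlap < window")
    stride = window - overlap
    spans, start = [], 0
    while True:
        end = min(start + window, num_frames)
        spans.append((start, end))
        if end == num_frames:
            return spans
        start += stride


def overlap_scale(prev_frames, curr_frames):
    """Least-squares scale s minimizing ||s * curr - prev||^2 on shared frames."""
    num = sum(p * c
              for pf, cf in zip(prev_frames, curr_frames)
              for p, c in zip(pf, cf))
    den = sum(c * c for cf in curr_frames for c in cf)
    return num / den


def align_windows(window_depths, spans):
    """Rescale each window to agree with its already-aligned predecessor."""
    aligned = [window_depths[0]]
    for i in range(1, len(window_depths)):
        # Number of frames shared by window i-1 and window i.
        shared = spans[i - 1][1] - spans[i][0]
        s = overlap_scale(aligned[-1][-shared:], window_depths[i][:shared])
        aligned.append([[s * d for d in frame] for frame in window_depths[i]])
    return aligned
```

Because each scale in this sketch is estimated relative to the previous window, any estimation error is carried into every later window. This is one way to picture the accumulated scale discrepancy the abstract describes as the number of windows grows.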

Taxonomy

Topics: Advanced Vision and Imaging · Video Coding and Compression Technologies · Human Pose and Action Recognition