DVD: Deterministic Video Depth Estimation with Generative Priors
Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Jing He, Zixin Zhang, Haodong Li, Yihao Liang, Kanghao Chen, Bin Ren, Xu Zheng, Shuai Yang, Kun Zhou, Yinchuan Li, Nicu Sebe, Ying-Cong Chen

TL;DR
DVD is a deterministic framework that adapts pre-trained video diffusion models into single-pass depth regressors, achieving state-of-the-art zero-shot performance with 163x less task-specific data than previous methods while keeping long videos temporally coherent.
Contribution
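The paper presents the first framework that deterministically adapts pre-trained video diffusion models into single-pass depth regressors. It rests on three designs: the diffusion timestep repurposed as a structural anchor, latent manifold rectification (LMR) to counter regression-induced over-smoothing, and global affine coherence, which bounds inter-window divergence for long-video inference.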
Findings
State-of-the-art zero-shot performance on benchmarks
Uses 163x less task-specific data than previous methods
Maintains seamless long-video inference without complex alignment
Abstract
Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. Specifically, DVD features three core designs: (i) repurposing the diffusion timestep as a structural anchor to balance global stability with high-frequency details; (ii) latent manifold rectification (LMR) to mitigate regression-induced over-smoothing, enforcing differential constraints to restore sharp boundaries and coherent motion; and (iii) global affine coherence, an inherent property bounding inter-window divergence, which enables seamless long-video inference without requiring complex…
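The global affine coherence claim in (iii) can be illustrated with a small sketch. It assumes, as a reading of the abstract and not as the authors' implementation, that predictions for adjacent windows differ only by a scale and a shift, so a least-squares fit on the overlapping frames is enough to stitch them. The names `fit_scale_shift` and `stitch_windows`, the frame layout (each frame a flat list of depth values), and the `overlap` parameter are all hypothetical.

```python
from statistics import fmean


def fit_scale_shift(src, ref):
    """Least-squares (s, t) minimising sum((s * src[i] + t - ref[i]) ** 2)."""
    if len(src) != len(ref) or len(src) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_src, mean_ref = fmean(src), fmean(ref)
    var = sum((x - mean_src) ** 2 for x in src)
    if var == 0.0:
        raise ValueError("src is constant; scale is undetermined")
    cov = sum((x - mean_src) * (y - mean_ref) for x, y in zip(src, ref))
    s = cov / var
    t = mean_ref - s * mean_src
    return s, t


def stitch_windows(windows, overlap):
    """Chain per-window depth predictions into one sequence of frames.

    Assumption (hypothetical layout): each window is a list of frames, each
    frame is a flat list of depth values, and consecutive windows share
    `overlap` frames. Each new window is mapped onto the already stitched
    frames with a single scale and shift fitted on the shared frames.
    """
    if overlap < 1:
        raise ValueError("overlap must be at least one frame")
    stitched = [list(frame) for frame in windows[0]]
    for win in windows[1:]:
        ref = [v for frame in stitched[-overlap:] for v in frame]
        src = [v for frame in win[:overlap] for v in frame]
        s, t = fit_scale_shift(src, ref)
        # Only the frames not already covered by the overlap are appended.
        stitched.extend([s * v + t for v in frame] for frame in win[overlap:])
    return stitched
```

If the affine assumption holds exactly, the stitched sequence matches a single consistent prediction; how closely a real model satisfies it is what the paper's bound on inter-window divergence addresses.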
Taxonomy
Topics: Advanced Vision and Imaging · Video Coding and Compression Technologies · Human Pose and Action Recognition
