DVD: Deterministic Video Depth Estimation with Generative Priors

Hongfei Zhang; Harold Haodong Chen; Chenfei Liao; Jing He; Zixin Zhang; Haodong Li; Yihao Liang; Kanghao Chen; Bin Ren; Xu Zheng; Shuai Yang; Kun Zhou; Yinchuan Li; Nicu Sebe; Ying-Cong Chen

arXiv:2603.12250·cs.CV·March 13, 2026

TL;DR

DVD introduces a novel deterministic framework that adapts pre-trained video diffusion models into depth regressors, achieving state-of-the-art zero-shot performance with minimal data and maintaining temporal coherence.

Contribution

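The paper presents the first method to convert pre-trained video diffusion models into deterministic, single-pass depth estimators. It targets the stochastic hallucinations and scale drift of generative models and the heavy labeled-data demands of discriminative models, using three designs: the diffusion timestep as a structural anchor, latent manifold rectification (LMR), and global affine coherence.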

Findings

01. State-of-the-art zero-shot performance on benchmarks

02. Uses 163× less task-specific data than previous methods

03. Maintains seamless long-video inference without complex alignment

Abstract

Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. Specifically, DVD features three core designs: (i) repurposing the diffusion timestep as a structural anchor to balance global stability with high-frequency details; (ii) latent manifold rectification (LMR) to mitigate regression-induced over-smoothing, enforcing differential constraints to restore sharp boundaries and coherent motion; and (iii) global affine coherence, an inherent property bounding inter-window divergence, which enables seamless long-video inference without requiring complex…
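The abstract describes global affine coherence as a property that bounds inter-window divergence, which suggests that depth predicted for neighbouring windows differs by roughly one scale and one shift. The sketch below illustrates that idea only and is not the authors' procedure: it assumes each window is a list of frames, each frame a flat list of depth values, and that consecutive windows share `overlap` frames. The names `fit_affine` and `stitch_windows` are hypothetical.

```python
from statistics import fmean


def fit_affine(src, ref):
    """Least-squares scale and shift so that scale * src + shift approximates ref."""
    if len(src) != len(ref) or len(src) < 2:
        raise ValueError("need two equally sized samples with at least 2 values")
    mean_src, mean_ref = fmean(src), fmean(ref)
    var = sum((x - mean_src) ** 2 for x in src)
    if var == 0.0:
        raise ValueError("source values are constant; scale is undefined")
    cov = sum((x - mean_src) * (y - mean_ref) for x, y in zip(src, ref))
    scale = cov / var
    return scale, mean_ref - scale * mean_src


def stitch_windows(windows, overlap):
    """Chain per-window depth predictions into one sequence.

    Assumption (not from the paper): each window is aligned to the sequence
    stitched so far with a single global scale and shift, estimated on the
    `overlap` frames the two share.
    """
    stitched = [list(frame) for frame in windows[0]]
    for window in windows[1:]:
        # Flatten the shared frames into paired 1-D samples.
        ref = [v for frame in stitched[-overlap:] for v in frame]
        src = [v for frame in window[:overlap] for v in frame]
        scale, shift = fit_affine(src, ref)
        aligned = [[scale * v + shift for v in frame] for frame in window]
        # Keep the already-stitched copy of the shared frames.
        stitched.extend(aligned[overlap:])
    return stitched
```

If the windows really do differ by a single affine map, two parameters per window boundary are enough to remove the seam, which is why no per-frame or per-pixel alignment appears in this sketch.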

Code & Models

Models

🤗 FayeHongfeiZhang/DVD (model, ♡ 10)

Taxonomy

Topics: Advanced Vision and Imaging · Video Coding and Compression Technologies · Human Pose and Action Recognition