Learning Temporally Consistent Video Depth from Video Diffusion Priors

Jiahao Shao; Yuanbo Yang; Hongyu Zhou; Youmin Zhang; Yujun Shen; Vitor Guizilini; Yue Wang; Matteo Poggi; Yiyi Liao

arXiv:2406.01493 · cs.CV · June 10, 2025 · 1 citation

TL;DR

This paper introduces ChronoDepth, a novel method for streamed video depth estimation that ensures cross-frame consistency by reformulating depth prediction as a conditional generation problem with a context-aware training strategy.

Contribution

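The paper proposes a framework that shares context both within a clip and across clips to improve temporal consistency in video depth estimation. Depth prediction is cast as conditional generation, each frame in a clip receives an independently sampled noise level during training, and a sliding-window strategy initializes overlapping frames with previously predicted frames, without adding noise.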

Findings

01. Outperforms existing methods in maintaining temporal consistency.
02. Effective in arbitrarily long videos with cross-clip context.
03. Validated through extensive experiments.

Abstract

This work addresses the challenge of streamed video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. We argue that sharing contextual information between frames or clips is pivotal in fostering temporal consistency. Therefore, we reformulate depth prediction into a conditional generation problem to provide contextual information within a clip and across clips. Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context. We sample independent noise levels for each frame within a clip during training while using a sliding window strategy and initializing overlapping frames with previously predicted frames without adding noise. Moreover, we design an effective training strategy to provide context within a clip. Extensive experimental results validate…
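
The cross-clip strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `sample_frame_noise_levels`, `sliding_window_depth`, and `toy_predict_clip` are hypothetical, frames are stood in for by plain floats, the uniform noise distribution is an assumption, and the toy predictor replaces the actual diffusion denoiser. The sketch only shows the bookkeeping: one independent noise level per frame during training, and sliding-window inference in which overlapping frames are passed in as clean context taken from the previous window.

```python
import random
from typing import Callable, List, Optional, Sequence

# A clip predictor maps (frames, per-frame context) to per-frame depth.
# Real frames would be image tensors; floats keep the sketch self-contained.
ClipPredictor = Callable[[Sequence[float], List[Optional[float]]], List[float]]


def sample_frame_noise_levels(clip_len: int, rng: random.Random) -> List[float]:
    """Training side: draw one noise level per frame, independently.

    Uniform(0, 1) is an assumption made for illustration only.
    """
    return [rng.random() for _ in range(clip_len)]


def sliding_window_depth(
    frames: Sequence[float],
    predict_clip: ClipPredictor,
    clip_len: int = 4,
    overlap: int = 1,
) -> List[float]:
    """Inference side: slide a window over the video, reusing earlier predictions."""
    if not 0 <= overlap < clip_len:
        raise ValueError("overlap must satisfy 0 <= overlap < clip_len")
    stride = clip_len - overlap
    depths: List[float] = []
    start = 0
    while start < len(frames):
        end = min(start + clip_len, len(frames))
        clip = frames[start:end]
        # Frames already predicted are passed as clean context (no noise added);
        # frames not yet predicted get None and must be generated.
        context = [depths[i] if i < len(depths) else None for i in range(start, end)]
        pred = predict_clip(clip, context)
        # Keep earlier predictions and append only the newly predicted frames.
        depths.extend(pred[len(depths) - start:])
        if end == len(frames):
            break
        start += stride
    return depths


def toy_predict_clip(clip: Sequence[float], context: List[Optional[float]]) -> List[float]:
    """Hypothetical stand-in for the denoiser: keeps context, fakes the rest."""
    return [c if c is not None else 0.5 * f for f, c in zip(clip, context)]


frames = [float(i) for i in range(10)]
depths = sliding_window_depth(frames, toy_predict_clip, clip_len=4, overlap=1)
```

In this sketch each window after the first receives `overlap` already-predicted frames as context, so the predictor can condition the new frames on them. Larger overlaps give more shared context at the cost of more windows per video.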

Taxonomy

Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques

Methods: Contrastive Language-Image Pre-training · Diffusion