Depth Any Video with Scalable Synthetic Data
Honghui Yang, Di Huang, Wei Yin, Chunhua Shen, Haifeng Liu, Xiaofei He, Binbin Lin, Wanli Ouyang, Tong He

TL;DR
This paper presents Depth Any Video, a scalable approach combining synthetic data generation and generative diffusion models to improve video depth estimation across diverse real-world videos with high spatial and temporal accuracy.
Contribution
It introduces a scalable synthetic data pipeline and a novel mixed-duration training strategy for generative video diffusion models in depth estimation.
Findings
Outperforms previous models in spatial accuracy
Achieves superior temporal consistency
Handles videos up to 150 frames with high resolution
Abstract
Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse virtual environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying…
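The abstract names flow matching as one of the techniques integrated into the model. As a rough illustration only, the sketch below shows a generic straight-path (rectified-flow style) formulation on plain lists of floats. The convention that noise sits at t=0 and data at t=1, the function names, and the Euler sampler are all assumptions made for this example, not details confirmed by the paper.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Constant velocity of the straight path; this is the regression target."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)  # hypothetical model callable
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = predict_velocity(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x
```

Because the target velocity is constant along a straight path, a well-trained model can in principle be sampled with very few Euler steps, which is one common motivation for choosing flow matching when inference speed matters.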
Peer Reviews
Decision: ICLR 2025 Poster
* Good empirical results: The method shows very good empirical results compared with the previous methods. The model outputs consistent depth estimates without temporal flickering, and good accuracy on public benchmarks. The ablation study in Table 3 validates the technical design choices of the method.
* Clarity: It's easy to follow the paper. The paper provides sufficient technical details on the datasets, training, architecture, etc.
* A bit of engineering work: The paper is mostly about engineering. It adopts conditional flow matching, uses large-scale synthetic datasets to boost accuracy, and introduces mixed-duration training to improve memory usage. All these aspects contribute to better accuracy and performance, but they do not necessarily provide novel findings. If one had to single them out, what would be the most interesting, novel findings of the paper?
* Dataset licence and reproducibility: It's curious if the collecte
- The paper constructs a large-scale synthetic dataset of 40,000 video depth clips from 12 diverse modern video games. This dataset provides a scalable and cost-effective way to gather ground-truth video depth data and helps the model generalize to real-world scenarios.
- A mixed-duration training strategy is proposed. It includes frame dropout augmentation with rotary position encoding and a video packing technique.
- Effective Model Design
- The method achieves state-of-the-art performance amon
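The review above mentions frame dropout augmentation combined with rotary position encoding. A minimal sketch of how the two could interact is given below: frames are dropped at random, and each surviving frame keeps its original index as its rotary position, so the model still sees the true temporal gaps. Whether the paper assigns positions this way is an assumption of this example; `drop_frames`, `rotary_angles`, and `apply_rotary` are hypothetical helper names.

```python
import math
import random


def drop_frames(num_frames, keep, rng):
    """Pick `keep` distinct frame indices at random, returned in temporal order."""
    return sorted(rng.sample(range(num_frames), keep))


def rotary_angles(position, dim, base=10000.0):
    """Rotation angle for each channel pair i: position * base^(-2i/dim)."""
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def apply_rotary(vec, position, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by position-dependent angles."""
    out = []
    for i, theta in enumerate(rotary_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


# Usage: keep 4 of 16 frames; the kept indices double as rotary positions.
rng = random.Random(0)
kept_positions = drop_frames(16, 4, rng)
```

The property that makes this pairing attractive is that the score between a rotated query and a rotated key depends only on the difference of their positions, so a clip with dropped frames and a contiguous clip are encoded in a consistent way.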
The work in the article is very solid, with good model performance and efficiency, and comprehensive evaluation. The only concern for the reviewer is how the author ensures that the dataset, which is a major contribution, will be open-sourced as promised. This is very important for the community, but there are many difficulties regarding copyright and other aspects. In addition, it is necessary to evaluate and compare the diversity of the dataset.
1. The motivation for the framework is clear and reasonable, considering the limited data, inference speed and the long video in the applications.
2. Collecting and annotating high-quality data can improve the model and also inspire the community. It would be more beneficial if the data or collection pipeline can be open-sourced.
3. Extensive experiments and ablation studies demonstrate the effectiveness of the proposed method.
1. **The representation is more image depth estimation rather than video depth estimation.** If I understand correctly, although the paper focuses on video depth estimation, the predicted relative depth maps are independent for each frame, which is demonstrated in the input normalization and alignment during evaluation. Specifically, each frame is normalized based on the depth range of itself and the scale and shift are also aligned for each frame during inference. In my view, this is incorrect
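The reviewer's point about per-frame alignment can be made concrete with a small worked example. The sketch below fits a least-squares scale and shift either separately for each frame or once for the whole clip. It is an illustration of the concern only, using toy numbers chosen for this example, and is not the paper's evaluation code.

```python
def fit_scale_shift(pred, gt):
    """Closed-form least-squares s, t minimizing sum((s * p + t - g) ** 2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    s = cov / var_p if var_p > 0 else 1.0
    return s, mean_g - s * mean_p


def align_per_frame(pred_frames, gt_frames):
    """Fit a separate scale and shift for every frame."""
    aligned = []
    for p, g in zip(pred_frames, gt_frames):
        s, t = fit_scale_shift(p, g)
        aligned.append([s * v + t for v in p])
    return aligned


def align_per_video(pred_frames, gt_frames):
    """Fit one shared scale and shift over all frames of the clip."""
    flat_p = [v for f in pred_frames for v in f]
    flat_g = [v for f in gt_frames for v in f]
    s, t = fit_scale_shift(flat_p, flat_g)
    return [[s * v + t for v in f] for f in pred_frames]


def mean_abs_error(a_frames, b_frames):
    """Mean absolute difference over all pixels of all frames."""
    diffs = [abs(a - b) for fa, fb in zip(a_frames, b_frames) for a, b in zip(fa, fb)]
    return sum(diffs) / len(diffs)


# Toy clip: the true depth range doubles in frame 2, but each predicted
# frame is normalized to [0, 1] on its own, so the predictions look identical.
gt = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
pred = [[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]]
err_per_frame = mean_abs_error(align_per_frame(pred, gt), gt)
err_per_video = mean_abs_error(align_per_video(pred, gt), gt)
```

With per-frame alignment the error is zero even though the predictions carry no information about how depth changes between frames; a single shared scale and shift exposes that inconsistency as a nonzero error.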
Taxonomy
Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques
Methods: Diffusion
