Depth Any Video with Scalable Synthetic Data

Honghui Yang; Di Huang; Wei Yin; Chunhua Shen; Haifeng Liu; Xiaofei He; Binbin Lin; Wanli Ouyang; Tong He

arXiv:2410.10815·cs.CV·March 13, 2025

TL;DR

This paper presents Depth Any Video, a scalable approach combining synthetic data generation and generative diffusion models to improve video depth estimation across diverse real-world videos with high spatial and temporal accuracy.

Contribution

It introduces a scalable synthetic data pipeline and a novel mixed-duration training strategy for generative video diffusion models in depth estimation.

Findings

01. Outperforms previous models in spatial accuracy

02. Achieves superior temporal consistency

03. Handles high-resolution videos of up to 150 frames

Abstract

Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse virtual environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying…
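The abstract names flow matching as one of the techniques used. As a rough illustration only, the sketch below assumes the common straight-line formulation, in which a sample moves linearly from noise (t = 0) to data (t = 1) and the network is trained to predict the constant velocity along that path. The abstract does not state which path or time convention the paper uses, so the convention, the function names `flow_matching_pair` and `euler_sample`, and the oracle velocity function are all assumptions made for this example.

```python
def flow_matching_pair(noise, data, t):
    """Point on the straight path at time t, plus the velocity target.

    Assumed convention (not taken from the paper): t=0 is pure noise,
    t=1 is the clean target.
    """
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity


def euler_sample(velocity_fn, noise, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.0, 1.0]
data = [2.0, -1.0]
x_half, target = flow_matching_pair(noise, data, 0.5)

# An oracle that returns the true velocity stands in for a trained network.
recovered = euler_sample(lambda x, t: target, noise, steps=4)
```

Because the velocity along a straight path is constant, a perfect predictor would let even a few Euler steps land on the target, which is one reason this family of methods is often described as efficient at inference time.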

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 4

Strengths

* **Good empirical results.** The method shows very good empirical results compared with previous methods. The model outputs consistent depth estimates without temporal flickering and achieves good accuracy on public benchmarks. The ablation study in Table 3 validates the technical design choices of the method.
* **Clarity.** The paper is easy to follow. It provides sufficient technical details on the datasets, training, architecture, etc.

Weaknesses

* **A bit of engineering work.** The paper is mostly about engineering. It adopts conditional flow matching, uses large-scale synthetic datasets to boost accuracy, and introduces mixed-duration training to improve memory usage. All these aspects contribute to better accuracy and performance, but they do not necessarily provide novel findings. If the authors had to emphasize one, what would be the most interesting, novel finding of the paper?
* **Dataset licence and reproducibility.** It's curious if the collecte

Reviewer 02 · Rating 10 · Confidence 5

Strengths

- The paper constructs a large-scale synthetic dataset of 40,000 video depth clips from 12 diverse modern video games. This dataset provides a scalable and cost-effective way to gather ground-truth video depth data and helps the model generalize to real-world scenarios.
- A mixed-duration training strategy is proposed. It includes frame dropout augmentation with rotary position encoding and a video packing technique.
- Effective Model Design
- The method achieves state-of-the-art performance amon
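The review above mentions frame dropout augmentation combined with rotary position encoding. A minimal sketch of why the two fit together, under stated assumptions: frames are dropped at random but keep their original temporal indices, and a rotary encoding rotates each feature pair by an angle proportional to that index, so attention scores depend only on the true gap between frames. The helper names (`drop_frames`, `rope_rotate`), the single frequency of 0.1, and the keep probability are hypothetical choices for this example and are not taken from the paper.

```python
import math
import random


def drop_frames(num_frames, keep_prob, rng):
    """Randomly keep frames and return the ORIGINAL indices of the kept ones."""
    kept = [i for i in range(num_frames) if rng.random() < keep_prob]
    return kept if kept else [0]  # always keep at least one frame


def rope_rotate(pair, position, freq=0.1):
    """Rotate one 2-D feature pair by the angle position * freq."""
    x, y = pair
    angle = position * freq
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


rng = random.Random(0)
kept = drop_frames(num_frames=16, keep_prob=0.5, rng=rng)

# Two query/key placements with the same temporal gap of 4 frames.
q, k = (1.0, 0.5), (0.3, -0.7)
score_a = dot(rope_rotate(q, 3), rope_rotate(k, 7))
score_b = dot(rope_rotate(q, 10), rope_rotate(k, 14))
```

Here `score_a` and `score_b` agree up to floating-point error, since only the difference between positions matters. Feeding the surviving original indices in `kept` to the encoding, rather than renumbering them consecutively, would therefore let a model see the real spacing between the frames that remain.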

Weaknesses

The work in the article is very solid, with good model performance and efficiency, and comprehensive evaluation. The only concern for the reviewer is how the author ensures that the dataset, which is a major contribution, will be open-sourced as promised. This is very important for the community, but there are many difficulties regarding copyright and other aspects. In addition, it is necessary to evaluate and compare the diversity of the dataset.

Reviewer 03 · Rating 5 · Confidence 3

Strengths

1. The motivation for the framework is clear and reasonable, considering the limited data, inference speed and the long video in the applications.
2. Collecting and annotating high-quality data can improve the model and also inspire the community. It would be more beneficial if the data or collection pipeline can be open-sourced.
3. Extensive experiments and ablation studies demonstrate the effectiveness of the proposed method.

Weaknesses

1. **The representation is more image depth estimation rather than video depth estimation.** If I understand correctly, although the paper focuses on video depth estimation, the predicted relative depth maps are independent for each frame, which is demonstrated in the input normalization and alignment during evaluation. Specifically, each frame is normalized based on the depth range of itself and the scale and shift are also aligned for each frame during inference. In my view, this is incorrect
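One way to read the concern about per-frame scale and shift alignment is sketched below. It fits a least-squares scale and shift between predicted and ground-truth depth, either separately for each frame or once for the whole clip. The toy depth values and the function names (`fit_scale_shift`, `video_error`) are assumptions made for illustration. This is not the paper's evaluation code.

```python
def fit_scale_shift(pred, gt):
    """Least-squares scale s and shift t minimizing sum((s * pred + t - gt)^2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    var = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    scale = cov / var if var > 0 else 1.0  # fallback for a constant prediction
    return scale, mean_g - scale * mean_p


def mean_abs_error(pred, gt, scale, shift):
    return sum(abs(scale * p + shift - g) for p, g in zip(pred, gt)) / len(pred)


def video_error(frames_pred, frames_gt, per_frame):
    """Mean absolute error after per-frame or clip-level alignment."""
    if per_frame:
        errors = []
        for p, g in zip(frames_pred, frames_gt):
            s, t = fit_scale_shift(p, g)
            errors.append(mean_abs_error(p, g, s, t))
        return sum(errors) / len(errors)
    flat_p = [v for frame in frames_pred for v in frame]
    flat_g = [v for frame in frames_gt for v in frame]
    s, t = fit_scale_shift(flat_p, flat_g)
    return mean_abs_error(flat_p, flat_g, s, t)


# Toy clip: the prediction is identical in both frames, but the true depth
# range doubles from the first frame to the second.
frames_pred = [[0.0, 1.0], [0.0, 1.0]]
frames_gt = [[1.0, 3.0], [2.0, 6.0]]

per_frame_err = video_error(frames_pred, frames_gt, per_frame=True)
shared_err = video_error(frames_pred, frames_gt, per_frame=False)
```

In this toy case per-frame alignment drives the error to zero, while a single alignment shared across the clip leaves a residual. That gap is the kind of temporal scale inconsistency a per-frame protocol cannot reveal.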

Code & Models

Repositories

Nightmare-n/DepthAnyVideo
PyTorch · Official

Models

🤗 hhyangcs/depth-any-video
model · ♡ 7

Taxonomy

Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques

Methods: Diffusion