4DVD: Cascaded Dense-view Video Diffusion Model for High-quality 4D Content Generation
Shuzhou Yang, Xiaodong Cun, Xiaoyu Li, Yaowei Li, Jian Zhang

TL;DR
4DVD introduces a cascaded diffusion model that decouples 4D content generation into layout prediction and structure-aware refinement, achieving high-quality 4D video synthesis with superior consistency and practical applicability.
Contribution
The paper proposes a novel cascaded diffusion approach for 4D content generation that separates layout prediction from detailed synthesis, improving quality and consistency over prior methods.
Findings
Achieves state-of-the-art results in 4D video synthesis.
Demonstrates superior cross-view and temporal consistency.
Introduces a new dataset, D-Objaverse, for training and evaluation.
Abstract
Given the high complexity of directly generating high-dimensional data such as 4D, we present 4DVD, a cascaded video diffusion model that generates 4D content in a decoupled manner. Unlike previous multi-view video methods that directly model 3D space and temporal features simultaneously with stacked cross-view/temporal attention modules, 4DVD decouples this into two subtasks, coarse multi-view layout generation and structure-aware conditional generation, and effectively unifies them. Specifically, given a monocular video, 4DVD first predicts the dense-view content of its layout with superior cross-view and temporal consistency. Based on the produced layout priors, a structure-aware spatio-temporal generation branch is developed, combining these coarse structural priors with the exquisite appearance content of the input monocular video to generate the final high-quality dense-view videos.…
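The two-stage data flow described in the abstract can be illustrated with a small shape-bookkeeping sketch. This is not the authors' implementation: the function names, the (T, H, W, C) video layout, the single-channel layout maps, and the `num_views` parameter are all assumptions made purely for illustration, and no diffusion model is involved. It only shows that stage 1 expands one monocular video into a coarse layout for V views, and stage 2 combines that layout with the input's appearance channels.

```python
from typing import Tuple

# A tensor shape, e.g. (T, H, W, C) for a video (hypothetical convention).
Shape = Tuple[int, ...]


def layout_stage_shape(video: Shape, num_views: int) -> Shape:
    """Stage 1 (hypothetical): monocular video -> coarse dense-view layout.

    Assumes the layout is a single-channel map per view and frame,
    which the source does not specify.
    """
    t, h, w, _channels = video
    return (num_views, t, h, w, 1)


def refinement_stage_shape(layout: Shape, video: Shape) -> Shape:
    """Stage 2 (hypothetical): layout priors + input appearance -> dense-view videos."""
    v, t, h, w, _ = layout
    if (t, h, w) != video[:3]:
        raise ValueError("layout and input video must share T, H, W")
    # The output keeps the appearance channels of the input video.
    return (v, t, h, w, video[3])


def cascade_shape(video: Shape, num_views: int) -> Shape:
    """Chain the two stages: layout prediction, then structure-aware refinement."""
    layout = layout_stage_shape(video, num_views)
    return refinement_stage_shape(layout, video)
```

For example, under these assumptions `cascade_shape((16, 64, 64, 3), num_views=8)` describes eight 16-frame RGB videos at 64x64 resolution, one per generated view.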
