IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation
Yuanhao Zhai, Kevin Lin, Linjie Li, Chung-Ching Lin, Jianfeng Wang,, Zhengyuan Yang, David Doermann, Junsong Yuan, Zicheng Liu, Lijuan Wang

TL;DR
IDOL introduces a unified dual-modal latent diffusion framework for high-quality, human-centric joint video-depth generation, effectively aligning spatial and motion features for improved synthesis quality.
Contribution
The paper proposes a novel dual-modal U-Net and motion consistency loss to enhance joint video-depth generation and spatial alignment in a unified framework.
Findings
Outperforms existing methods in video FVD and depth accuracy
Achieves superior spatial and motion consistency in generated outputs
Demonstrates effectiveness on TikTok and NTU120 datasets
Abstract
Significant advances have been made in human-centric video generation, yet the joint video-depth generation problem remains underexplored. Most existing monocular depth estimation methods may not generalize well to synthesized images or videos, and multi-view-based methods have difficulty controlling the human appearance and motion. In this work, we present IDOL (unIfied Dual-mOdal Latent diffusion) for high-quality human-centric joint video-depth generation. Our IDOL consists of two novel designs. First, to enable dual-modal generation and maximize the information exchange between video and depth generation, we propose a unified dual-modal U-Net, a parameter-sharing framework for joint video and depth denoising, wherein a modality label guides the denoising target, and cross-modal attention enables the mutual information flow. Second, to ensure a precise video-depth spatial alignment,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Vision and Imaging · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
Methods*Communicated@Fast*How Do I Communicate to Expedia? · Softmax · Attention Is All You Need · Max Pooling · Concatenated Skip Connection · Convolution · U-Net · ALIGN
