IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation

Yuanhao Zhai; Kevin Lin; Linjie Li; Chung-Ching Lin; Jianfeng Wang; Zhengyuan Yang; David Doermann; Junsong Yuan; Zicheng Liu; Lijuan Wang

arXiv:2407.10937·cs.CV·July 16, 2024

TL;DR

IDOL introduces a unified dual-modal latent diffusion framework for high-quality, human-centric joint video-depth generation, effectively aligning spatial and motion features for improved synthesis quality.

Contribution

The paper proposes a novel dual-modal U-Net and motion consistency loss to enhance joint video-depth generation and spatial alignment in a unified framework.

Findings

01 Outperforms existing methods in video FVD and depth accuracy

02 Achieves superior spatial and motion consistency in generated outputs

03 Demonstrates effectiveness on TikTok and NTU120 datasets

Abstract

Significant advances have been made in human-centric video generation, yet the joint video-depth generation problem remains underexplored. Most existing monocular depth estimation methods may not generalize well to synthesized images or videos, and multi-view-based methods have difficulty controlling the human appearance and motion. In this work, we present IDOL (unIfied Dual-mOdal Latent diffusion) for high-quality human-centric joint video-depth generation. Our IDOL consists of two novel designs. First, to enable dual-modal generation and maximize the information exchange between video and depth generation, we propose a unified dual-modal U-Net, a parameter-sharing framework for joint video and depth denoising, wherein a modality label guides the denoising target, and cross-modal attention enables the mutual information flow. Second, to ensure a precise video-depth spatial alignment,…
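
The modality label and cross-modal attention described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the two-dimensional label embeddings, and the residual update are assumptions made for illustration, and the actual model works on latent features inside a parameter-sharing U-Net rather than on small lists of numbers.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys/values."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out


# Hypothetical modality label embeddings (assumed, not taken from the paper).
VIDEO_LABEL = [1.0, 0.0]
DEPTH_LABEL = [0.0, 1.0]


def add_modality_label(tokens, label):
    """Tag every token with its modality by adding the label embedding."""
    return [[t + e for t, e in zip(tok, label)] for tok in tokens]


def joint_step(video_tokens, depth_tokens):
    """One toy exchange step: each modality queries the other one."""
    v = add_modality_label(video_tokens, VIDEO_LABEL)
    d = add_modality_label(depth_tokens, DEPTH_LABEL)
    v_att = cross_attention(v, d, d)  # video queries, depth keys/values
    d_att = cross_attention(d, v, v)  # depth queries, video keys/values
    # Residual update (an assumption): keep own features, add cross-modal context.
    v_out = [[a + b for a, b in zip(x, y)] for x, y in zip(v, v_att)]
    d_out = [[a + b for a, b in zip(x, y)] for x, y in zip(d, d_att)]
    return v_out, d_out


if __name__ == "__main__":
    video = [[0.5, -0.5], [1.0, 2.0], [0.0, 0.0]]
    depth = [[1.0, 1.0], [2.0, 0.0]]
    v_out, d_out = joint_step(video, depth)
    print(len(v_out), len(d_out))
```

In this sketch the same `cross_attention` function serves both directions, loosely mirroring the parameter sharing mentioned in the abstract, while the label embedding is the only thing that tells the two token sets apart.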

Code & Models

Repositories

yhZhai/idol (Official)

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis

Methods: Softmax · Attention Is All You Need · Max Pooling · Concatenated Skip Connection · Convolution · U-Net · ALIGN