TL;DR
CoMoVi is a unified diffusion-based framework that generates 3D human motions and realistic videos synchronously. It projects 3D motion into a 2D representation aligned with video and couples the two modalities in a dual-branch diffusion model.
Contribution
The paper proposes a novel co-generation framework with a dual-branch diffusion model and a new dataset, enabling high-quality, synchronized 3D motion and video generation.
Findings
Generated 3D human motions with improved generalization.
Produced high-quality human-centric videos without external motion references.
Curated the large-scale CoMoVi-Dataset for training and evaluation.
Abstract
In this paper, we find that the generation of 3D human motions and 2D human videos is intrinsically coupled. 3D motions provide the structural prior for plausibility and consistency in videos, while pre-trained video models offer strong generalization capabilities for motions. Based on this, we present CoMoVi, a co-generative framework that generates 3D human motions and videos synchronously within a single diffusion denoising loop. However, because 3D human motions and 2D human-centric videos are separated by a modality gap, we propose to project the 3D human motion into a 2D human motion representation that aligns effectively with the 2D videos. Then, we design a dual-branch diffusion model to couple human motion and the video generation process with mutual feature interaction and 3D-2D cross-attention. To train and evaluate our model, we curate…
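To make the idea of two branches sharing one denoising loop more concrete, here is a minimal toy sketch in pure Python. It is not the paper's architecture: the names `cross_attend`, `toy_denoise`, and `co_generate` are hypothetical, the attention has no learned weights, and the "denoiser" simply nudges each token toward what it attends to in the other branch. It only illustrates the control flow in which motion tokens and video tokens are updated in the same iteration while each reads the other's current state.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Toy single-head dot-product cross-attention (no learned weights).

    Each query token returns a softmax-weighted average of the context tokens.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)])
    return out


def toy_denoise(tokens, context, step_size=0.1):
    """Hypothetical stand-in for one branch's denoiser.

    Moves each token a small step toward what it attends to in the other branch.
    """
    attended = cross_attend(tokens, context)
    return [
        [t + step_size * (a - t) for t, a in zip(tok, att)]
        for tok, att in zip(tokens, attended)
    ]


def co_generate(num_steps=10, n_motion=4, n_video=6, dim=8, seed=0):
    """Run both branches inside a single loop, starting from Gaussian noise."""
    rng = random.Random(seed)
    motion = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_motion)]
    video = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_video)]
    for _ in range(num_steps):
        # Both updates read the state from the start of this step, so the
        # two branches advance together rather than one after the other.
        new_motion = toy_denoise(motion, video)
        new_video = toy_denoise(video, motion)
        motion, video = new_motion, new_video
    return motion, video


if __name__ == "__main__":
    motion_tokens, video_tokens = co_generate()
    print(len(motion_tokens), len(video_tokens))
```

A real system would replace `toy_denoise` with learned networks conditioned on a diffusion timestep and a noise schedule. The sketch keeps only the structural point that neither modality is generated first and then used as a fixed reference for the other.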
