VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis
Enric Corona, Andrei Zanfir, Eduard Gabriel Bazavan, Nikos Kolotouros, Thiemo Alldieck, Cristian Sminchisescu

TL;DR
VLOGGER introduces a novel multimodal diffusion approach for generating high-quality, controllable, and identity-preserving human videos from a single image without person-specific training, advancing the realism and diversity of avatar synthesis.
Contribution
The paper presents a new diffusion-based architecture for audio-driven human video synthesis that does not require per-person training and introduces a large, diverse dataset called MENTOR.
Findings
Outperforms state-of-the-art methods in image quality, identity preservation, and temporal consistency.
Supports generation of upper-body gestures and diverse scenarios.
Enables applications in video editing and personalization.
Abstract
We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3d-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high quality video of variable length, easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate. We also curate MENTOR,…
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Social Robot Interaction and HRI
Methods: Diffusion
