VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis

Enric Corona; Andrei Zanfir; Eduard Gabriel Bazavan; Nikos Kolotouros; Thiemo Alldieck; Cristian Sminchisescu

arXiv:2403.08764 · cs.CV · March 14, 2024 · 1 citation

Open Access

TL;DR

VLOGGER introduces a novel multimodal diffusion approach for generating high-quality, controllable, and identity-preserving human videos from a single image without person-specific training, advancing the realism and diversity of avatar synthesis.

Contribution

The paper presents a new diffusion-based architecture for audio-driven human video synthesis that does not require per-person training and introduces a large, diverse dataset called MENTOR.

Findings

01 Outperforms state-of-the-art methods in image quality, identity preservation, and temporal consistency.

02 Supports generation of upper-body gestures and diverse scenarios.

03 Enables applications in video editing and personalization.

Abstract

We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3d-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high quality video of variable length, easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate. We also curate MENTOR,…

Taxonomy

Topics: Human Motion and Animation · Human Pose and Action Recognition · Social Robot Interaction and HRI

Methods: Diffusion