AI killed the video star. Audio-driven diffusion model for expressive talking head generation
Baptiste Chopin, Tashvik Dhamija, Pranav Balaji, Yaohui Wang, Antitza Dantcheva

TL;DR
Dimitra++ is an audio-driven framework for generating realistic talking head videos. It models lip motion, facial expression, and head pose with a diffusion transformer conditioned on an audio sequence and a reference facial image.
Contribution
Introduces a conditional Motion Diffusion Transformer (cMDT) that models facial motion sequences in a 3D representation, conditioned on a reference facial image (appearance) and an audio sequence (motion).
Findings
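Reported to outperform existing approaches at generating realistic talking heads with lip motion, facial expression, and head pose
Evaluated on two widely used datasets, VoxCeleb2 and CelebV-HQ
Supported by quantitative and qualitative experiments and a user study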
Abstract
We propose Dimitra++, a novel framework for audio-driven talking head generation, streamlined to learn lip motion, facial expression, as well as head pose motion. Specifically, we propose a conditional Motion Diffusion Transformer (cMDT) to model facial motion sequences, employing a 3D representation. The cMDT is conditioned on two inputs: a reference facial image, which determines appearance, as well as an audio sequence, which drives the motion. Quantitative and qualitative experiments, as well as a user study on two widely employed datasets, i.e., VoxCeleb2 and CelebV-HQ, suggest that Dimitra++ is able to outperform existing approaches in generating realistic talking heads imparting lip motion, facial expression, and head pose.
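The abstract does not say how the cMDT is trained or how the conditioning enters the network, so the following is only a minimal sketch of how a conditional motion diffusion model is commonly set up, not the authors' implementation. It assumes a standard DDPM-style linear noise schedule and simple per-frame concatenation of the conditioning signals. All names (`add_noise`, `build_denoiser_input`), the feature sizes, and the schedule values are hypothetical. The denoising transformer itself is omitted.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common DDPM default; assumed here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def add_noise(motion, alpha_bar_t, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    noise = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in motion]
    noisy = [[signal * x + sigma * e for x, e in zip(frame, eps)]
             for frame, eps in zip(motion, noise)]
    return noisy, noise

def build_denoiser_input(noisy_motion, audio_feats, ref_embedding):
    """One token per frame: noisy motion + aligned audio feature + the
    time-invariant reference embedding (concatenation is an assumption)."""
    if len(noisy_motion) != len(audio_feats):
        raise ValueError("audio features must be aligned to motion frames")
    return [m + a + ref_embedding for m, a in zip(noisy_motion, audio_feats)]

# Toy example: 4 frames, 6 hypothetical 3D motion parameters per frame.
rng = random.Random(0)
motion = [[0.0] * 6 for _ in range(4)]
audio = [[0.1] * 3 for _ in range(4)]   # hypothetical per-frame audio features
ref = [1.0, 2.0]                        # hypothetical reference-image embedding
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
noisy, eps = add_noise(motion, alpha_bars[499], rng)
tokens = build_denoiser_input(noisy, audio, ref)
```

In a setup like this, a denoiser would be trained to predict `eps` from `tokens` and the timestep. At inference the reference embedding stays fixed while the audio features vary per frame, which mirrors the split the abstract describes between appearance and motion.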
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Facial Rejuvenation and Surgery Techniques
