AI killed the video star. Audio-driven diffusion model for expressive talking head generation

Baptiste Chopin; Tashvik Dhamija; Pranav Balaji; Yaohui Wang; Antitza Dantcheva

arXiv:2511.22488 · cs.CV · December 1, 2025

Open Access

TL;DR

Dimitra++ is a new audio-driven framework for generating realistic talking head videos that accurately model lip motion, facial expressions, and head movements using a novel diffusion transformer conditioned on audio and reference images.

Contribution

The paper introduces a conditional Motion Diffusion Transformer (cMDT) that models facial motion sequences in a 3D representation, conditioned on a reference facial image and an audio sequence.

Findings

01. Outperforms existing methods in realism and accuracy

02. Effective on VoxCeleb2 and CelebV-HQ datasets

03. Validated by quantitative, qualitative, and user studies

Abstract

We propose Dimitra++, a novel framework for audio-driven talking head generation, streamlined to learn lip motion, facial expression, and head pose motion. Specifically, we propose a conditional Motion Diffusion Transformer (cMDT) to model facial motion sequences, employing a 3D representation. The cMDT is conditioned on two inputs: a reference facial image, which determines appearance, and an audio sequence, which drives the motion. Quantitative and qualitative experiments, as well as a user study, on two widely employed datasets, VoxCeleb2 and CelebV-HQ, suggest that Dimitra++ is able to outperform existing approaches in generating realistic talking heads that convey lip motion, facial expression, and head pose.
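
To make the conditioning idea in the abstract concrete, here is a minimal sketch of how a motion sequence might be noised and paired with audio and reference-image conditioning before being passed to a denoiser. It is not the authors' implementation: it uses the standard DDPM forward process with a linear noise schedule, and every name and size in it (`make_alpha_bar`, `add_noise`, `build_condition`, the feature dimensions, the concatenation-based conditioning) is an assumption chosen for illustration. The abstract does not specify the noise schedule, the feature extractors, or how the cMDT consumes its conditions.

```python
import numpy as np


def make_alpha_bar(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> np.ndarray:
    """Cumulative product of (1 - beta_t) for a linear noise schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(motion: np.ndarray, t: int, alpha_bar: np.ndarray, rng: np.random.Generator):
    """Sample x_t from q(x_t | x_0) for a clean motion sequence x_0."""
    noise = rng.standard_normal(motion.shape)
    noisy = np.sqrt(alpha_bar[t]) * motion + np.sqrt(1.0 - alpha_bar[t]) * noise
    return noisy, noise


def build_condition(audio_feats: np.ndarray, ref_embedding: np.ndarray) -> np.ndarray:
    """Concatenate per-frame audio features with a reference embedding repeated per frame."""
    frames = audio_feats.shape[0]
    ref = np.broadcast_to(ref_embedding, (frames, ref_embedding.shape[0]))
    return np.concatenate([audio_feats, ref], axis=1)


# Hypothetical sizes: 25 frames, 64-D motion, 32-D audio features, 16-D reference embedding.
rng = np.random.default_rng(0)
frames, motion_dim, audio_dim, ref_dim = 25, 64, 32, 16
motion = rng.standard_normal((frames, motion_dim))        # stand-in for 3D facial motion parameters
audio_feats = rng.standard_normal((frames, audio_dim))    # stand-in for per-frame audio features
ref_embedding = rng.standard_normal(ref_dim)              # stand-in for a reference-image embedding

alpha_bar = make_alpha_bar(1000)
noisy, noise = add_noise(motion, t=500, alpha_bar=alpha_bar, rng=rng)
cond = build_condition(audio_feats, ref_embedding)

# A denoiser (for example, a transformer) would be trained to predict `noise`
# from this input; the denoiser itself is omitted here.
denoiser_input = np.concatenate([noisy, cond], axis=1)
```

In this sketch the audio features vary per frame while the reference embedding is constant across frames, mirroring the abstract's split between audio driving the motion and the reference image determining appearance. Concatenation is only one way to inject conditions; cross-attention is a common alternative in transformer-based diffusion models.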

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Facial Rejuvenation and Surgery Techniques