Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors

Zhentao Yu; Zixin Yin; Deyu Zhou; Duomin Wang; Finn Wong; Baoyuan Wang

arXiv:2212.04248 · cs.GR · December 9, 2022 · 1 citation

Open Access

TL;DR

This paper presents a probabilistic diffusion-based framework for one-shot audio-driven talking head generation that produces diverse, natural facial motions aligned with input audio, outperforming previous auto-regressive methods.

Contribution

The proposed diffusion prior enables diverse facial motion synthesis in talking head generation, improving realism and semantic consistency over prior deterministic approaches.
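
To make the idea of probabilistic sampling concrete, here is a minimal sketch of DDPM-style ancestral sampling conditioned on audio features. It is not the authors' implementation: the linear noise schedule, the fixed random projection in `audio_to_mean`, the closed-form `toy_noise_predictor` (which stands in for a trained denoising network by assuming the clean motion is Gaussian around an audio-dependent mean), and the 6-dimensional motion vector are all assumptions chosen so the example runs without training. It only illustrates why the same audio clip can yield different motion sequences under different random seeds.

```python
import numpy as np


def make_schedule(num_steps=200, beta_start=1e-4, beta_end=0.05):
    """Linear DDPM noise schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def audio_to_mean(audio_feat, motion_dim=6):
    """Hypothetical audio conditioning: a fixed random projection to motion space."""
    proj_rng = np.random.default_rng(0)
    w = proj_rng.standard_normal((audio_feat.shape[1], motion_dim))
    return np.tanh(audio_feat @ w / np.sqrt(audio_feat.shape[1]))


def toy_noise_predictor(x_t, t, mean, alpha_bars, data_std=0.5):
    """Closed-form E[noise | x_t] when clean motion ~ N(mean, data_std^2 I).

    A trained network would replace this function in a real system.
    """
    ab = alpha_bars[t]
    var_t = ab * data_std**2 + (1.0 - ab)
    return np.sqrt(1.0 - ab) * (x_t - np.sqrt(ab) * mean) / var_t


def sample_motion(audio_feat, seed, num_steps=200):
    """Draw one motion sequence of shape (frames, motion_dim) for the given audio."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    mean = audio_to_mean(audio_feat)
    x = rng.standard_normal(mean.shape)  # start from pure Gaussian noise
    for t in range(num_steps - 1, -1, -1):
        eps_hat = toy_noise_predictor(x, t, mean, alpha_bars)
        # Posterior mean of the reverse step, then fresh noise except at t = 0.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x


if __name__ == "__main__":
    # 20 frames of 16-dimensional audio features (synthetic placeholder data).
    audio = np.random.default_rng(1).standard_normal((20, 16))
    first = sample_motion(audio, seed=0)
    second = sample_motion(audio, seed=1)
    print("shape:", first.shape)
    print("mean abs difference between seeds:", np.abs(first - second).mean())
```

Calling `sample_motion` twice with the same audio and the same seed reproduces the sequence exactly, while changing the seed gives a different sequence drawn around the same audio-dependent mean. A deterministic regressor, by contrast, would return a single output per audio clip.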

Findings

01. Outperforms auto-regressive prior on most metrics

02. Maintains audio-lip synchronization comparable to prior methods

03. Generates rich, natural lip-irrelevant facial motions

Abstract

In this paper, we introduce a simple and novel framework for one-shot audio-driven talking head generation. Unlike prior works that require additional driving sources for controlled synthesis in a deterministic manner, we instead probabilistically sample all the holistic lip-irrelevant facial motions (i.e., pose, expression, blink, gaze, etc.) to semantically match the input audio while still maintaining both the photo-realism of audio-lip synchronization and the overall naturalness. This is achieved by our newly proposed audio-to-visual diffusion prior trained on top of the mapping between audio and disentangled non-lip facial representations. Thanks to the probabilistic nature of the diffusion prior, one big advantage of our framework is that it can synthesize diverse facial motion sequences given the same audio clip, which is quite user-friendly for many real applications. Through…

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion