DreamHead: Learning Spatial-Temporal Correspondence via Hierarchical Diffusion for Audio-driven Talking Head Synthesis

Fa-Ting Hong; Yunfei Liu; Yu Li; Changyin Zhou; Fei Yu; Dan Xu

arXiv:2409.10281·cs.MM·September 17, 2024

TL;DR

DreamHead introduces a hierarchical diffusion framework that learns spatial-temporal facial correspondences from audio to generate realistic talking head videos, improving consistency and quality.

Contribution

It presents a novel hierarchical diffusion approach that predicts facial landmarks from audio and then synthesizes facial images, enhancing spatial-temporal coherence in talking head synthesis.
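The two-stage flow described above (audio to landmarks, then landmarks to images) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`toy_denoise_step`, `sample`, `audio_to_landmarks`, `landmarks_to_frame`, `synthesize`) is hypothetical, the "denoiser" is a hand-written placeholder rather than a trained network, and the conditioning is a stand-in for real audio and landmark encoders. It only shows how one diffusion sampler could feed its output to a second one as a condition.

```python
import random

def toy_denoise_step(x, cond, t):
    # Placeholder for a learned denoiser (assumption): move x toward cond,
    # reaching it (up to rounding) at t == 0.
    return [xi + (ci - xi) / (t + 1) for xi, ci in zip(x, cond)]

def sample(cond, steps, rng):
    # Start from Gaussian noise and iterate the toy reverse process.
    x = [rng.gauss(0.0, 1.0) for _ in cond]
    for t in reversed(range(steps)):
        x = toy_denoise_step(x, cond, t)
    return x

def audio_to_landmarks(audio_frames, n_points, steps, rng):
    # Stage 1: one landmark vector (x0, y0, x1, y1, ...) per audio frame.
    landmarks = []
    for a in audio_frames:
        cond = [a] * (2 * n_points)  # toy conditioning, not a real audio encoder
        landmarks.append(sample(cond, steps, rng))
    return landmarks

def landmarks_to_frame(points, size, steps, rng):
    # Stage 2: rasterize landmarks into a conditioning map, then sample a frame.
    cond = [0.0] * (size * size)
    for i in range(0, len(points), 2):
        col = min(size - 1, max(0, int(points[i] * size)))
        row = min(size - 1, max(0, int(points[i + 1] * size)))
        cond[row * size + col] = 1.0
    return sample(cond, steps, rng)

def synthesize(audio_frames, n_points=4, size=8, steps=10, seed=0):
    # Chain the two stages: landmarks from stage 1 condition stage 2.
    rng = random.Random(seed)
    landmarks = audio_to_landmarks(audio_frames, n_points, steps, rng)
    return [landmarks_to_frame(p, size, steps, rng) for p in landmarks]

video = synthesize([0.2, 0.5, 0.8])  # three scalar "audio features"
print(len(video), len(video[0]))     # number of frames, pixels per frame
```

In this sketch the landmark sequence is the only information passed between stages, which mirrors the summary's description of landmarks as the intermediate signal; a real system would replace both placeholder samplers with trained diffusion models.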

Findings

1. Produces high-fidelity talking head videos for multiple identities.

2. Effectively models spatial-temporal correspondence without sacrificing quality.

3. Outperforms existing methods in realism and consistency.

Abstract

Audio-driven talking head synthesis strives to generate lifelike video portraits from provided audio. The diffusion model, recognized for its superior quality and robust generalization, has been explored for this task. However, establishing a robust correspondence between temporal audio cues and corresponding spatial facial expressions with diffusion models remains a significant challenge in talking head generation. To bridge this gap, we present DreamHead, a hierarchical diffusion framework that learns spatial-temporal correspondences in talking head synthesis without compromising the model's intrinsic quality and adaptability. DreamHead learns to predict dense facial landmarks from audio as intermediate signals to model the spatial and temporal correspondences. Specifically, a first hierarchy of audio-to-landmark diffusion is designed to predict temporally smooth and accurate…

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Diffusion