Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

Jintao Tan; Xize Cheng; Lingyu Xiong; Lei Zhu; Xiandong Li; Xianjia Wu; Kai Gong; Minglei Li; Yi Cai

arXiv:2408.01732·cs.CV·August 6, 2024

Open Access

TL;DR

This paper presents a two-stage diffusion model that generates high-quality, temporally coherent talking head videos synchronized with speech by guiding the process with facial landmarks.

Contribution

It introduces a landmark-guided diffusion framework that improves lip synchronization and visual quality in talking head generation, addressing limitations of prior GAN and diffusion models.

Findings

1. Achieves superior lip synchronization and visual fidelity.
2. Reduces mouth jitter and enhances temporal coherence.
3. Outperforms existing methods in extensive experiments.

Abstract

Audio-driven talking head generation is a significant and challenging task applicable to various fields such as virtual avatars, film production, and online conferences. However, the existing GAN-based models emphasize generating well-synchronized lip shapes but overlook the visual quality of generated frames, while diffusion-based models prioritize generating high-quality frames but neglect lip shape matching, resulting in jittery mouth movements. To address the aforementioned problems, we introduce a two-stage diffusion-based model. The first stage involves generating synchronized facial landmarks based on the given speech. In the second stage, these generated landmarks serve as a condition in the denoising process, aiming to optimize mouth jitter issues and generate high-fidelity, well-synchronized, and temporally coherent talking head videos. Extensive experiments demonstrate that…
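
The two-stage idea in the abstract can be illustrated with a toy sketch: stage one turns speech into landmarks, and stage two runs a reverse diffusion loop in which every denoising step is conditioned on those landmarks. This is not the authors' implementation. All function names here (`audio_to_landmarks`, `toy_noise_predictor`, `sample_frame`, `generate_frames`) are hypothetical, the "frame" is just a short vector, stage one is a fixed formula rather than a learned model, and the noise predictor is an oracle that assumes the clean frame equals the landmark vector. The sampler follows the standard DDPM ancestral update with a linear beta schedule, which is an assumption, since the abstract does not state the schedule or sampler.

```python
import math
import random


def audio_to_landmarks(audio_energy):
    """Stage 1 stand-in (hypothetical): per-frame audio energy -> two lip offsets."""
    return [[0.5 * e, -0.5 * e] for e in audio_energy]


def toy_noise_predictor(x_t, abar_t, landmarks):
    """Oracle noise estimate that treats the landmark vector as the clean frame.

    A real model would be a trained network taking (x_t, t, landmarks).
    """
    return [
        (x - math.sqrt(abar_t) * c) / math.sqrt(1.0 - abar_t)
        for x, c in zip(x_t, landmarks)
    ]


def sample_frame(landmarks, steps=50, rng=None):
    """Stage 2 stand-in: DDPM-style reverse process conditioned on landmarks."""
    rng = rng or random.Random(0)
    # Linear beta schedule (an assumption, not taken from the paper).
    betas = [1e-4 + (0.02 - 1e-4) * i / (steps - 1) for i in range(steps)]
    abars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        abars.append(prod)

    x = [rng.gauss(0.0, 1.0) for _ in landmarks]  # start from pure noise
    for t in reversed(range(steps)):
        eps = toy_noise_predictor(x, abars[t], landmarks)
        coef = betas[t] / math.sqrt(1.0 - abars[t])
        # Posterior mean of the reverse step.
        x = [(xi - coef * ei) / math.sqrt(1.0 - betas[t]) for xi, ei in zip(x, eps)]
        if t > 0:
            # Inject noise on every step except the last.
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x


def generate_frames(audio_energy, steps=50, seed=0):
    """Run both stages: landmarks first, then one conditioned sample per frame."""
    rng = random.Random(seed)
    return [sample_frame(lm, steps, rng) for lm in audio_to_landmarks(audio_energy)]
```

Because the toy predictor is an oracle, each sampled frame lands on its landmark vector, which makes the role of the condition easy to see: the landmarks fix where each frame ends up regardless of the starting noise. In a trained model, the predictor would only approximate this, and the temporal coherence described in the abstract would depend on how consistent the stage-one landmarks are across frames.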

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Domain Adaptation and Few-Shot Learning