High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

Weizhi Zhong; Junfan Lin; Peixin Chen; Liang Lin; Guanbin Li

arXiv:2408.05416 · cs.CV · August 13, 2024

Open Access

TL;DR

This paper introduces a landmark-based diffusion model for generating high-fidelity, lip-synced talking face videos from audio, using end-to-end optimization and a novel TalkFormer module to improve synchronization and appearance detail preservation.

Contribution

It proposes a novel diffusion-based framework with a new TalkFormer module for end-to-end talking face synthesis, addressing limitations of previous GAN-based and multi-stage methods.

Findings

01 Produces high-quality, lip-synced videos with preserved appearance details

02 Outperforms previous methods in lip synchronization accuracy

03 Demonstrates robustness across diverse subjects and expressions

Abstract

Audio-driven talking face video generation has attracted increasing attention due to its huge industrial potential. Some previous methods focus on learning a direct mapping from audio to visual content. Despite progress, they often struggle with the ambiguity of the mapping process, leading to flawed results. An alternative strategy involves facial structural representations (e.g., facial landmarks) as intermediaries. This multi-stage approach better preserves the appearance details but suffers from error accumulation due to the independent optimization of different stages. Moreover, most previous methods rely on generative adversarial networks, prone to training instability and mode collapse. To address these challenges, our study proposes a novel landmark-based diffusion model for talking face generation, which leverages facial landmarks as intermediate representations while enabling…

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis

Methods: Softmax · Attention Is All You Need · Diffusion · Focus · ALIGN