High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model
Weizhi Zhong, Junfan Lin, Peixin Chen, Liang Lin, Guanbin Li

TL;DR
This paper introduces a landmark-based diffusion model for generating high-fidelity, lip-synced talking face videos from audio, using end-to-end optimization and a novel TalkFormer module to improve synchronization and appearance detail preservation.
Contribution
It proposes a novel diffusion-based framework with a new TalkFormer module for end-to-end talking face synthesis, addressing limitations of previous GAN-based and multi-stage methods.
Findings
Produces high-quality, lip-synced videos with preserved appearance details
Outperforms previous methods in lip synchronization accuracy
Demonstrates robustness across diverse subjects and expressions
Abstract
Audio-driven talking face video generation has attracted increasing attention due to its huge industrial potential. Some previous methods focus on learning a direct mapping from audio to visual content. Despite progress, they often struggle with the ambiguity of the mapping process, leading to flawed results. An alternative strategy involves facial structural representations (e.g., facial landmarks) as intermediaries. This multi-stage approach better preserves the appearance details but suffers from error accumulation due to the independent optimization of different stages. Moreover, most previous methods rely on generative adversarial networks, which are prone to training instability and mode collapse. To address these challenges, our study proposes a novel landmark-based diffusion model for talking face generation, which leverages facial landmarks as intermediate representations while enabling…
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
Methods: Softmax · Attention Is All You Need · Diffusion · Focus · ALIGN
