FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model

Ziyu Yao; Xuxin Cheng; Zhiqi Huang

arXiv:2408.09384 · cs.CV · August 20, 2024

Open Access

TL;DR

FD2Talk introduces a facial decoupled diffusion model for talking head generation, effectively separating motion and appearance to improve quality, diversity, and accuracy over previous methods.

Contribution

The paper proposes a novel multi-stage diffusion framework that decouples facial motion and appearance, enhancing generation quality and detail preservation in talking head synthesis.

Findings

01. Outperforms previous state-of-the-art methods in quality and diversity.

02. Accurately predicts facial motion from audio using a Diffusion Transformer.

03. Effectively encodes appearance to guide realistic frame generation.

Abstract

Talking head generation is a significant research topic that still faces numerous challenges. Previous works often adopt generative adversarial networks or regression models, which suffer from limited generation quality and the average facial shape problem. Although diffusion models show impressive generative ability, their application to talking head generation remains unsatisfactory: existing approaches either use the diffusion model only to obtain an intermediate representation and then rely on a separate pre-trained renderer, or they overlook the decoupling of complex facial details such as expressions, head poses, and appearance textures. We therefore propose FD2Talk, a Facial Decoupled Diffusion model for Talking head generation, which fully leverages the advantages of diffusion models and decouples these complex facial details across multiple stages. Specifically, we separate…

Taxonomy

Topics: Face recognition and analysis · Social Robot Interaction and HRI · Speech and Audio Processing

Methods: Attention Is All You Need · Linear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax