Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model

Fei Shen; Cong Wang; Junyao Gao; Qin Guo; Jisheng Dang; Jinhui Tang; Tat-Seng Chua

arXiv:2502.09533·cs.CV·February 14, 2025·3 cites

Open Access

TL;DR

This paper introduces MCDM, a diffusion-based model that uses archived-clip and present-clip motion priors together with a memory-efficient temporal attention mechanism to generate long TalkingFace videos with consistent head movement, synchronized facial expressions, and accurate lip synchronization.

Contribution

The paper presents MCDM, a diffusion model that combines an archived-clip motion prior, a present-clip motion-prior diffusion model, and a memory-efficient temporal attention mechanism to improve long-term TalkingFace generation.

Findings

01 MCDM effectively maintains identity and motion continuity in long-term videos.

02 The model achieves accurate lip sync and facial expressions over extended sequences.

03 The TalkingFace-Wild dataset provides a large multilingual benchmark for future research.

Abstract

Recent advances in conditional diffusion models have shown promise for generating realistic TalkingFace videos, yet challenges persist in achieving consistent head movement, synchronized facial expressions, and accurate lip synchronization over extended generations. To address these, we introduce the Motion-priors Conditional Diffusion Model (MCDM), which utilizes both archived and current clip motion priors to enhance motion prediction and ensure temporal consistency. The model consists of three key elements: (1) an archived-clip motion-prior that incorporates historical frames and a reference frame to preserve identity and context; (2) a present-clip motion-prior diffusion model that captures multimodal causality for accurate predictions of head movements, lip sync, and expressions; and (3) a memory-efficient temporal attention mechanism…
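
As a rough illustration of the general idea of letting present-clip frames attend to earlier frames, the following is a minimal sketch of plain scaled dot-product attention along the time axis. It is not the paper's memory-efficient mechanism, whose description is truncated above. The function names, the toy two-dimensional frame features, the choice to attend over archived plus current frames, and the absence of learned projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(current, archived):
    """Generic attention from each current-clip frame to archived + current frames.

    current, archived: lists of equal-length feature vectors, one per frame.
    Keys and values are the raw features (no learned projections; an assumption).
    Returns one attended feature vector per current frame.
    """
    keys = archived + current  # hypothetical choice: past frames plus the present clip
    dim = len(current[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product factor
    outputs = []
    for query in current:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in keys]
        weights = softmax(scores)
        # Weighted average of the frame features, one output dimension at a time.
        attended = [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]
        outputs.append(attended)
    return outputs


# Toy example: two archived frames and one current frame, 2-D features.
archived = [[1.0, 0.0], [0.0, 1.0]]
current = [[1.0, 1.0]]
result = temporal_attention(current, archived)
```

Each output is a convex combination of the frame features, so the attended vector stays within the range spanned by the archived and current frames. The cost of this naive form grows with the number of archived frames, which is the kind of overhead a memory-efficient variant would aim to reduce.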

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Softmax · Attention Is All You Need · Diffusion · Contrastive Language-Image Pre-training