Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model
Fei Shen, Cong Wang, Junyao Gao, Qin Guo, Jisheng Dang, Jinhui Tang, Tat-Seng Chua

TL;DR
This paper introduces MCDM, a diffusion-based model that uses motion priors and a memory-efficient temporal attention mechanism to generate TalkingFace videos with consistent head movement, synchronized facial expressions, and accurate lip sync over long durations.
Contribution
The paper presents MCDM, a new diffusion model utilizing archived and current motion priors with a temporal attention mechanism for improved long-term TalkingFace generation.
Findings
MCDM effectively maintains identity and motion continuity in long-term videos.
The model achieves accurate lip sync and facial expressions over extended sequences.
The TalkingFace-Wild dataset provides a large multilingual benchmark for future research.
Abstract
Recent advances in conditional diffusion models have shown promise for generating realistic TalkingFace videos, yet challenges persist in achieving consistent head movement, synchronized facial expressions, and accurate lip synchronization over extended generations. To address these, we introduce the Motion-priors Conditional Diffusion Model (MCDM), which utilizes both archived and current clip motion priors to enhance motion prediction and ensure temporal consistency. The model consists of three key elements: (1) an archived-clip motion-prior that incorporates historical frames and a reference frame to preserve identity and context; (2) a present-clip motion-prior diffusion model that captures multimodal causality for accurate predictions of head movements, lip sync, and expressions; and (3) a memory-efficient temporal attention mechanism…
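The clip-by-clip conditioning described in element (1) can be illustrated with a minimal sketch. It is not the authors' implementation: the names `generate_long_video`, `denoise_clip`, `toy_denoise_clip`, and `archive_size` are hypothetical, frames are stand-in lists of floats, and the toy sampler replaces the actual diffusion model. The sketch only shows how each new clip might be conditioned on a fixed reference frame plus a bounded archive of previously generated frames; it makes no attempt to reproduce the paper's temporal attention mechanism.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Frame = List[float]  # stand-in for an image or latent (assumption)


def generate_long_video(
    reference: Frame,
    audio_clips: Sequence[Sequence[float]],
    denoise_clip: Callable[[Frame, List[Frame], Sequence[float]], List[Frame]],
    archive_size: int = 4,
) -> List[Frame]:
    """Generate a long video one clip at a time.

    Each clip is conditioned on the reference frame (identity) and on a
    bounded archive of the most recent generated frames (context).
    """
    # Bounded archive: old frames are dropped automatically once it is full.
    archive: Deque[Frame] = deque(maxlen=archive_size)
    video: List[Frame] = []
    for audio in audio_clips:
        clip = denoise_clip(reference, list(archive), audio)
        video.extend(clip)
        archive.extend(clip)  # newly generated frames become history
    return video


def toy_denoise_clip(
    reference: Frame, archived: List[Frame], audio: Sequence[float]
) -> List[Frame]:
    """Hypothetical stand-in for a diffusion sampler.

    Continues from the last archived frame if one exists, otherwise
    from the reference frame, and offsets it by each audio value.
    """
    anchor = archived[-1] if archived else reference
    return [[anchor[0] + a] for a in audio]


if __name__ == "__main__":
    frames = generate_long_video(
        [0.0], [[1.0, 2.0], [3.0]], toy_denoise_clip, archive_size=2
    )
    print(frames)
```

The bounded `deque` keeps the conditioning cost constant as the video grows, which is one simple way to carry history across clips under the assumptions above.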
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Softmax · Attention Is All You Need · Diffusion · Contrastive Language-Image Pre-training
