MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control
Fatemeh Nazarieh, Zhenhua Feng, Diptesh Kanojia, Muhammad Awais, and Josef Kittler

TL;DR
MAGIC-Talk is a diffusion-based framework that generates customizable, temporally consistent talking faces from a single image, improving identity preservation, motion coherence, and long video quality.
Contribution
It introduces a one-shot diffusion approach built on two components, ReferenceNet and AnimateNet, enabling fine-grained facial editing and stable long-form video generation without multiple reference images or fine-tuning.
Findings
Outperforms state-of-the-art methods in visual quality and audio-lip synchronization
Effectively maintains identity from a single reference image
Reduces flickering and motion inconsistencies in long videos
Abstract
Audio-driven talking face generation has gained significant attention for applications in digital media and virtual avatars. While recent methods improve audio-lip synchronization, they often struggle with temporal consistency, identity preservation, and customization, especially in long video generation. To address these issues, we propose MAGIC-Talk, a one-shot diffusion-based framework for customizable and temporally stable talking face generation. MAGIC-Talk consists of ReferenceNet, which preserves identity and enables fine-grained facial editing via text prompts, and AnimateNet, which enhances motion coherence using structured motion priors. Unlike previous methods requiring multiple reference images or fine-tuning, MAGIC-Talk maintains identity from a single image while ensuring smooth transitions across frames. Additionally, a progressive latent fusion strategy is introduced to…
