ATL-Diff: Audio-Driven Talking Head Generation with Early Landmarks-Guide Noise Diffusion
Hoang-Son Vo, Quang-Vinh Nguyen, Seungwon Kim, Hyung-Jeong Yang, Soonja Yeom, and Soo-Hyung Kim

TL;DR
ATL-Diff is a new audio-driven talking head generation method that improves synchronization, reduces noise, and is computationally efficient, enabling near real-time high-quality facial animations for various applications.
Contribution
It introduces a landmark-guided noise diffusion framework that decouples audio and preserves identity, outperforming existing methods in synchronization and quality.
Findings
Outperforms state-of-the-art on MEAD and CREMA-D datasets.
Achieves near real-time processing with high-quality animations.
Effectively preserves facial identity and nuances.
Abstract
Audio-driven talking head generation requires precise synchronization between facial animations and audio signals. This paper introduces ATL-Diff, a novel approach addressing synchronization limitations while reducing noise and computational costs. Our framework features three key components: a Landmark Generation Module converting audio to facial landmarks, a Landmarks-Guide Noise approach that decouples audio by distributing noise according to landmarks, and a 3D Identity Diffusion network preserving identity characteristics. Experiments on MEAD and CREMA-D datasets demonstrate that ATL-Diff outperforms state-of-the-art methods across all metrics. Our approach achieves near real-time processing with high-quality animations, computational efficiency, and exceptional preservation of facial nuances. This advancement offers promising applications for virtual assistants, education, medical…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
