ATL-Diff: Audio-Driven Talking Head Generation with Early Landmarks-Guide Noise Diffusion

Hoang-Son Vo, Quang-Vinh Nguyen, Seungwon Kim, Hyung-Jeong Yang, Soonja Yeom, and Soo-Hyung Kim

arXiv:2507.12804 · cs.CV · July 18, 2025

TL;DR

ATL-Diff is an audio-driven talking head generation method that targets tighter synchronization between audio and facial animation, with less noise and lower computational cost. The authors report near real-time generation of high-quality facial animations.

Contribution

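It introduces a landmark-guided noise diffusion framework that decouples audio by distributing noise according to audio-derived facial landmarks, and that preserves the speaker's identity. The authors report that it outperforms existing methods in synchronization and quality.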

Findings

01 Outperforms state-of-the-art methods across all reported metrics on the MEAD and CREMA-D datasets.

02 Achieves near real-time processing with high-quality animations.

03 Preserves facial identity and fine facial nuances.

Abstract

Audio-driven talking head generation requires precise synchronization between facial animations and audio signals. This paper introduces ATL-Diff, a novel approach addressing synchronization limitations while reducing noise and computational costs. Our framework features three key components: a Landmark Generation Module converting audio to facial landmarks, a Landmarks-Guide Noise approach that decouples audio by distributing noise according to landmarks, and a 3D Identity Diffusion network preserving identity characteristics. Experiments on MEAD and CREMA-D datasets demonstrate that ATL-Diff outperforms state-of-the-art methods across all metrics. Our approach achieves near real-time processing with high-quality animations, computational efficiency, and exceptional preservation of facial nuances. This advancement offers promising applications for virtual assistants, education, medical…
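As a rough illustration of what "distributing noise according to landmarks" could look like, the sketch below scales Gaussian noise by a weight map that peaks at landmark positions. It is not the authors' implementation: the function names, the Gaussian falloff, and the `sigma` and `floor` parameters are all assumptions made for illustration, and the abstract does not specify how the noise is actually shaped.

```python
import math
import random


def landmark_weight_map(landmarks, height, width, sigma=2.0):
    """Per-pixel weight in (0, 1]: 1 at a landmark, decaying with distance.

    The Gaussian falloff is an assumption, not taken from the paper.
    """
    if not landmarks:
        raise ValueError("at least one landmark is required")
    weights = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Keep the strongest response over all landmarks for this pixel.
            weights[y][x] = max(
                math.exp(-((x - lx) ** 2 + (y - ly) ** 2) / (2.0 * sigma ** 2))
                for lx, ly in landmarks
            )
    return weights


def landmark_guided_noise(landmarks, height, width, sigma=2.0, floor=0.1, seed=0):
    """Gaussian noise whose per-pixel scale follows the landmark weight map.

    `floor` (hypothetical) keeps a small amount of noise far from landmarks.
    """
    rng = random.Random(seed)
    weights = landmark_weight_map(landmarks, height, width, sigma)
    noise = [
        [
            rng.gauss(0.0, 1.0) * (floor + (1.0 - floor) * weights[y][x])
            for x in range(width)
        ]
        for y in range(height)
    ]
    return noise, weights


# Hypothetical (x, y) landmark positions on a 16x16 grid.
landmarks = [(4, 4), (11, 9)]
noise, weights = landmark_guided_noise(landmarks, height=16, width=16)
```

In this sketch the noise scale is highest at the landmark pixels and falls toward `floor` elsewhere, so regions the landmarks mark as moving receive most of the stochastic variation. In the pipeline the abstract describes, the landmarks would come from the Landmark Generation Module rather than being hard-coded.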

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis