LPIPS-AttnWav2Lip: Generic Audio-Driven lip synchronization for Talking Head Generation in the Wild

Zhipeng Chen; Xinheng Wang; Lun Xie; Haijie Yuan; Hang Pan

arXiv:2602.00189 · cs.SD · February 3, 2026

Open Access

TL;DR

This paper introduces LPIPS-AttnWav2Lip, a novel audio-driven lip synchronization method that leverages a U-Net architecture with residual CBAM and semantic alignment to produce realistic talking head videos with precise lip sync.

Contribution

The paper presents a generic approach that combines a U-Net generator built on residual CBAM with a semantic alignment module to improve lip synchronization accuracy and visual quality in talking head generation.

Findings

1. Achieves high lip synchronization accuracy
2. Generates high-quality, realistic images
3. Outperforms existing methods in evaluations

Abstract

Researchers have shown growing interest in audio-driven talking head generation. The primary challenge in talking head generation is achieving audio-visual coherence between the lips and the audio, known as lip synchronization. This paper proposes a generic method, LPIPS-AttnWav2Lip, for reconstructing face images of any speaker based on audio. We use a U-Net architecture based on residual CBAM to better encode and fuse audio and visual modal information. Additionally, the semantic alignment module extends the receptive field of the generator network to efficiently obtain the spatial and channel information of the visual features, and matches the statistical information of the visual features with the audio latent vector to adjust and inject the audio content information into the visual information. To achieve exact lip synchronization and to generate realistic high-quality…
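The abstract's description of matching the statistics of visual features to an audio latent vector resembles adaptive instance normalization (AdaIN). The truncated abstract does not specify the exact operation, so the following is only a minimal NumPy sketch under that assumption: the function name `adain_modulate`, the projection matrices `w_scale` and `w_shift`, and all shapes are hypothetical and not taken from the paper.

```python
import numpy as np

def adain_modulate(visual, audio_latent, w_scale, w_shift, eps=1e-5):
    """AdaIN-style modulation (an assumed stand-in for the paper's module).

    visual:        (C, H, W) visual feature map
    audio_latent:  (D,) audio latent vector
    w_scale/shift: (C, D) hypothetical linear projections of the audio latent
    """
    # Per-channel statistics of the visual features.
    mean = visual.mean(axis=(1, 2), keepdims=True)
    std = visual.std(axis=(1, 2), keepdims=True)
    normalized = (visual - mean) / (std + eps)

    # Per-channel scale and shift predicted from the audio latent.
    scale = (w_scale @ audio_latent)[:, None, None]
    shift = (w_shift @ audio_latent)[:, None, None]

    # Inject audio information by replacing the visual statistics.
    return scale * normalized + shift

rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8, 8))
audio_latent = rng.normal(size=16)
w_scale = rng.normal(size=(4, 16))
w_shift = rng.normal(size=(4, 16))
out = adain_modulate(visual, audio_latent, w_scale, w_shift)
```

After modulation, each output channel has a mean equal to the audio-derived shift and a standard deviation approximately equal to the magnitude of the audio-derived scale, which is one plausible reading of "adjustment and injection" of audio content into the visual features.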

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing