Style-Preserving Lip Sync via Audio-Aware Style Reference

Weizhi Zhong; Jichang Li; Yinqi Cai; Ming Li; Feng Gao; Liang Lin; and Guanbin Li

arXiv:2408.05412 · cs.CV · June 19, 2025

TL;DR

This paper introduces an audio-aware style reference method for lip sync that preserves individual speaking styles. It pairs Transformer-based lip motion prediction with a conditional diffusion model to produce more realistic, style-consistent talking face videos.

Contribution

It proposes a novel audio-aware style reference scheme combining Transformer-based lip motion prediction with a conditional diffusion model for realistic video synthesis.

Findings

01 Effective preservation of speaking styles in lip sync

02 High-fidelity realistic talking face generation

03 Superior lip sync accuracy compared to prior methods

Abstract

Audio-driven lip sync has recently drawn significant attention due to its widespread application in the multimedia domain. Individuals exhibit distinct lip shapes when speaking the same utterance because each has a unique speaking style, which poses a notable challenge for audio-driven lip sync. Earlier methods for this task often bypassed the modeling of personalized speaking styles, resulting in sub-optimal lip sync that conforms to general styles. Recent lip sync techniques attempt to guide the lip sync for arbitrary audio by aggregating information from a style reference video, yet they cannot preserve speaking styles well because their style aggregation is inaccurate. This work proposes an innovative audio-aware style reference scheme that effectively leverages the relationships between input audio and reference audio from style reference video to address the…
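As a rough illustration of what leveraging the relationship between input audio and reference audio could look like, the sketch below uses scaled dot-product attention: each input-audio frame scores every reference-audio frame, and the resulting softmax weights average the reference lip-motion frames. This is an assumption made for illustration only. The abstract shown here is truncated and does not confirm the exact mechanism, and the function name `aggregate_style`, the feature vectors, and their dimensions are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_style(input_audio, ref_audio, ref_lip_motion):
    """Hypothetical audio-aware style aggregation (not the paper's confirmed method).

    input_audio:    list of feature vectors, one per input-audio frame (queries)
    ref_audio:      list of feature vectors, one per reference-audio frame (keys)
    ref_lip_motion: list of lip-motion vectors aligned with ref_audio (values)

    Returns one aggregated lip-motion vector per input-audio frame.
    """
    dim = len(input_audio[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product factor
    motion_dim = len(ref_lip_motion[0])

    aggregated = []
    for query in input_audio:
        # Similarity between this input frame and every reference-audio frame.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in ref_audio]
        weights = softmax(scores)
        # Weighted average of the reference lip motion, frame by frame.
        frame = [
            sum(w * motion[j] for w, motion in zip(weights, ref_lip_motion))
            for j in range(motion_dim)
        ]
        aggregated.append(frame)
    return aggregated
```

Under these assumptions, an input frame whose audio features closely match one reference frame draws its lip motion almost entirely from that frame, while an input frame that matches nothing in particular receives a plain average of the reference motion.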

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis

Methods: Softmax · Attention Is All You Need · Diffusion