Style-Preserving Lip Sync via Audio-Aware Style Reference
Weizhi Zhong, Jichang Li, Yinqi Cai, Ming Li, Feng Gao, Liang Lin, and Guanbin Li

TL;DR
This paper introduces an audio-aware style reference method for lip sync that preserves individual speaking styles. It pairs Transformer-based lip motion prediction with a conditional diffusion model to produce more realistic, style-consistent talking face videos.
Contribution
It proposes a novel audio-aware style reference scheme combining Transformer-based lip motion prediction with a conditional diffusion model for realistic video synthesis.
Findings
Effective preservation of speaking styles in lip sync
High-fidelity realistic talking face generation
Superior lip sync accuracy compared to prior methods
Abstract
Audio-driven lip sync has recently drawn significant attention due to its widespread application in the multimedia domain. Individuals exhibit distinct lip shapes when speaking the same utterance, attributed to their unique speaking styles, which poses a notable challenge for audio-driven lip sync. Earlier methods for this task often bypassed the modeling of personalized speaking styles, resulting in sub-optimal lip sync that conforms to general styles. Recent lip sync techniques attempt to guide the lip sync for arbitrary audio by aggregating information from a style reference video, yet they cannot preserve the speaking styles well due to their inaccuracy in style aggregation. This work proposes an innovative audio-aware style reference scheme that effectively leverages the relationships between input audio and reference audio from style reference video to address the…
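The idea of relating input audio to reference audio can be pictured with a small cross-attention sketch. This is an illustrative toy under stated assumptions, not the authors' implementation: the function names (`aggregate_style`, `softmax`, `dot`), the use of scaled dot-product attention, and the feature shapes are all hypothetical. Each input-audio frame acts as a query over the reference-audio frames, and the resulting weights are used to average the frame-aligned reference lip motion.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def aggregate_style(input_audio, ref_audio, ref_motion):
    """Hypothetical audio-aware style aggregation.

    input_audio: list of feature vectors, one per input-audio frame (queries)
    ref_audio:   list of feature vectors, one per reference frame (keys)
    ref_motion:  list of lip-motion vectors aligned with ref_audio (values)
    Returns one aggregated lip-motion vector per input-audio frame.
    """
    if len(ref_audio) != len(ref_motion):
        raise ValueError("reference audio and motion must be frame-aligned")
    scale = math.sqrt(len(ref_audio[0]))  # assumed scaled dot-product attention
    motion_dim = len(ref_motion[0])
    styled = []
    for query in input_audio:
        # Similarity between this input frame and every reference-audio frame.
        weights = softmax([dot(query, key) / scale for key in ref_audio])
        # Weighted average of reference lip motion, one value per dimension.
        styled.append([
            sum(w * frame[d] for w, frame in zip(weights, ref_motion))
            for d in range(motion_dim)
        ])
    return styled


# Toy data: the input frame resembles the first reference frame, so the
# aggregated motion is pulled toward that frame's lip motion.
example = aggregate_style(
    input_audio=[[5.0, 0.0]],
    ref_audio=[[1.0, 0.0], [0.0, 1.0]],
    ref_motion=[[0.2], [0.8]],
)
```

In this toy setup, reference frames whose audio sounds similar to the input contribute most to the aggregated motion, which is one plausible reading of "audio-aware" style reference; the paper's actual architecture may differ.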
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis
Methods: Softmax · Attention Is All You Need · Diffusion
