SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory
Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro

TL;DR
SyncTalkFace introduces an audio-lip memory mechanism that enhances talking face generation by accurately aligning lip movements with input speech, capturing fine details at the phoneme level for superior lip-sync quality.
Contribution
The paper proposes Audio-Lip Memory to improve lip detail synthesis and introduces a visual-visual synchronization loss for better lip-sync accuracy in talking face generation.
Findings
Outperforms previous methods in lip-sync accuracy.
Generates high-quality videos with detailed lip movements.
Stores phoneme-level lip features in memory for precise synthesis.
Abstract
The challenge of talking face generation from speech lies in aligning information from two different modalities, audio and video, such that the mouth region corresponds to the input audio. Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models. However, they struggle to synthesize fine details of the lips that vary at the phoneme level, because they do not provide sufficient visual information about the lips at the video synthesis step. To overcome this limitation, our work proposes Audio-Lip Memory, which brings in visual information of the mouth region corresponding to the input audio and enforces fine-grained audio-visual coherence. It stores lip motion features from sequential ground truth images in the value memory and aligns them with corresponding audio features so that they can be retrieved using audio input at…
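The retrieval idea described in the abstract can be illustrated with a generic key-value memory lookup. The following is a minimal sketch, not the authors' implementation: the key memory, the cosine-similarity addressing, the temperature, the slot count, and all function names are assumptions made for illustration, and the memories are filled with random numbers where a trained model would hold learned features.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def address(audio_feat, key_memory, temperature=0.1):
    """Turn an audio feature into attention weights over memory slots.

    Assumption: slots are addressed by temperature-scaled cosine similarity.
    """
    return softmax([cosine(audio_feat, key) / temperature for key in key_memory])


def retrieve_lip_feature(audio_feat, key_memory, value_memory, temperature=0.1):
    """Read a lip feature as the attention-weighted sum of value-memory slots."""
    weights = address(audio_feat, key_memory, temperature)
    dim = len(value_memory[0])
    lip_feat = [
        sum(w * slot[d] for w, slot in zip(weights, value_memory))
        for d in range(dim)
    ]
    return lip_feat, weights


# Hypothetical sizes and random contents, standing in for learned memories.
random.seed(0)
NUM_SLOTS, DIM = 8, 16
key_memory = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_SLOTS)]
value_memory = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_SLOTS)]
audio_feat = [random.gauss(0, 1) for _ in range(DIM)]

lip_feat, weights = retrieve_lip_feature(audio_feat, key_memory, value_memory)
```

In this sketch the value memory plays the role of the stored lip motion features, and only the audio feature is needed at read time, which mirrors the abstract's point that lip information can be retrieved from audio input alone.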
Taxonomy
Topics: Speech and Audio Processing · Face Recognition and Analysis · Generative Adversarial Networks and Image Synthesis
Methods: ALIGN
