Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert
Jiadong Wang, Xinyuan Qian, Malu Zhang, Robby T. Tan, Haizhou Li

TL;DR
This paper introduces a novel approach for talking face generation that incorporates a lip-reading expert to improve the intelligibility of lip movements, achieving state-of-the-art results in lip-reading accuracy and synchronization.
Contribution
It proposes using a lip-reading expert with contrastive learning and a transformer to enhance lip movement intelligibility and synchronization in speech-driven face generation.
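For readers unfamiliar with contrastive learning, the general idea behind using it for lip-speech synchronization can be sketched as follows. This is a generic InfoNCE-style loss over paired audio and video embeddings, not the loss defined in the paper. The function name `info_nce_loss`, the embedding shapes, and the temperature value are all illustrative assumptions.

```python
import numpy as np


def info_nce_loss(audio_emb, video_emb, temperature=0.07):
    """Generic InfoNCE loss (illustrative, not the paper's formulation).

    Row i of audio_emb and row i of video_emb are assumed to come from the
    same time window (a positive pair); every other row in the batch is
    treated as a negative.
    """
    # L2-normalize so the dot product is a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = (a @ v.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy with the matching index (the diagonal) as the target.
    return float(-np.mean(np.diag(log_probs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))
    print("aligned pairs:   ", info_nce_loss(emb, emb))
    print("misaligned pairs:", info_nce_loss(emb, emb[::-1]))
```

The loss is low when each audio embedding is most similar to its own video embedding and high when the pairs are misaligned, which is the property a synchronization objective relies on.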
Findings
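Improves reading intelligibility over prior state-of-the-art methods by more than 38% in word error rate (WER) on the LRS2 dataset
Improves reading intelligibility over prior state-of-the-art methods by 27.8% in accuracy on the LRW dataset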
State-of-the-art lip-speech synchronization
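Word error rate (WER), the metric behind the LRS2 finding above, is conventionally the word-level edit distance between a reference transcript and a predicted transcript, divided by the number of reference words. The sketch below shows that standard definition. It is not the paper's evaluation code, and the function name `word_error_rate` and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 / 6.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Lower WER means a lip-reading model recovers more of the spoken words from the generated video, so it serves here as a proxy for visual intelligibility.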
Abstract
Talking face generation, also known as speech-to-lip generation, reconstructs lip-related facial motions given coherent speech input. Previous studies have shown the importance of lip-speech synchronization and visual quality. Despite much progress, they rarely focus on the content of lip movements, i.e., the visual intelligibility of the spoken words, which is an important aspect of generation quality. To address this problem, we propose using a lip-reading expert to improve the intelligibility of the generated lip regions by penalizing incorrect generation results. Moreover, to compensate for data scarcity, we train the lip-reading expert in an audio-visual self-supervised manner. With the lip-reading expert, we propose a novel contrastive learning method to enhance lip-speech synchronization, and a transformer to encode audio synchronously with video, while considering global…
Taxonomy
Topics: Speech and Audio Processing · Face Recognition and Analysis · Generative Adversarial Networks and Image Synthesis
Methods: Contrastive Learning
