LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition
Bowen Hao, Dongliang Zhou, Xiaojie Li, Xingyu Zhang, Liang Xie, Jianlong Wu, Erwei Yin

TL;DR
LipGen is a novel framework that enhances visual speech recognition by generating synthetic lip videos guided by visemes, improving robustness and accuracy, especially in challenging real-world scenarios.
Contribution
The paper introduces LipGen, a new approach that uses speech-driven synthetic data and viseme classification to improve lip reading models' robustness and discriminative power.
Findings
Outperforms state-of-the-art methods on the LRW dataset
Shows significant improvements under challenging conditions
Utilizes viseme-guided synthetic data for robustness
Abstract
Visual speech recognition (VSR), commonly known as lip reading, has garnered significant attention due to its wide-ranging practical applications. The advent of deep learning techniques and advancements in hardware capabilities have significantly enhanced the performance of lip reading models. Despite these advancements, existing datasets predominantly feature stable video recordings with limited variability in lip movements. This limitation results in models that are highly sensitive to variations encountered in real-world scenarios. To address this issue, we propose a novel framework, LipGen, which aims to improve model robustness by leveraging speech-driven synthetic visual data, thereby mitigating the constraints of current datasets. Additionally, we introduce an auxiliary task that incorporates viseme classification alongside attention mechanisms. This approach facilitates the…
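The abstract is cut off before it describes how the auxiliary viseme task is trained, so the following is only a minimal sketch of one common way to combine a main word-classification loss with an auxiliary per-frame viseme-classification loss. It assumes softmax cross-entropy for both heads and a simple weighted sum; the function names, the `aux_weight` parameter, and its default value are hypothetical and not taken from the paper, and the attention mechanism mentioned in the abstract is not modeled here.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Softmax cross-entropy for one example, computed stably via log-sum-exp."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def joint_loss(
    word_logits: Sequence[float],
    word_target: int,
    viseme_logits: Sequence[Sequence[float]],
    viseme_targets: Sequence[int],
    aux_weight: float = 0.5,  # hypothetical weighting, not a value from the paper
) -> float:
    """Word-level loss plus a weighted, frame-averaged viseme loss (illustrative only)."""
    if not viseme_targets or len(viseme_logits) != len(viseme_targets):
        raise ValueError("need one viseme target per frame, and at least one frame")
    word_term = cross_entropy(word_logits, word_target)
    # Average the auxiliary loss over frames so clip length does not change its scale.
    viseme_term = sum(
        cross_entropy(frame, label)
        for frame, label in zip(viseme_logits, viseme_targets)
    ) / len(viseme_targets)
    return word_term + aux_weight * viseme_term
```

In a sketch like this, setting `aux_weight` to 0 reduces the objective to plain word classification, which makes it easy to check how much the auxiliary viseme signal contributes.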
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis
Methods: Softmax · Attention Is All You Need · Focus
