LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition

Bowen Hao; Dongliang Zhou; Xiaojie Li; Xingyu Zhang; Liang Xie; Jianlong Wu; Erwei Yin

arXiv:2501.04204·cs.CV·January 9, 2025

Open Access

TL;DR

LipGen is a novel framework that enhances visual speech recognition by generating synthetic lip videos guided by visemes, improving robustness and accuracy, especially in challenging real-world scenarios.

Contribution

The paper introduces LipGen, a new approach that uses speech-driven synthetic data and viseme classification to improve lip reading models' robustness and discriminative power.

Findings

01. Outperforms the state of the art on the LRW dataset

02. Shows significant improvements under challenging conditions

03. Uses viseme-guided synthetic data to improve robustness

Abstract

Visual speech recognition (VSR), commonly known as lip reading, has garnered significant attention due to its wide-ranging practical applications. The advent of deep learning techniques and advancements in hardware capabilities have significantly enhanced the performance of lip reading models. Despite these advancements, existing datasets predominantly feature stable video recordings with limited variability in lip movements. This limitation results in models that are highly sensitive to variations encountered in real-world scenarios. To address this issue, we propose a novel framework, LipGen, which aims to improve model robustness by leveraging speech-driven synthetic visual data, thereby mitigating the constraints of current datasets. Additionally, we introduce an auxiliary task that incorporates viseme classification alongside attention mechanisms. This approach facilitates the…
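
The auxiliary viseme-classification task mentioned in the abstract can be illustrated with a generic multi-task loss. The abstract does not say how the two objectives are combined, so the following is only a minimal sketch under stated assumptions: a word-level cross-entropy term plus a weighted, frame-averaged viseme cross-entropy term. The function names, the per-frame viseme labels, and the `aux_weight` parameter are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class index."""
    return -math.log(softmax(logits)[target])


def joint_loss(word_logits, word_target, viseme_logits, viseme_targets,
               aux_weight=0.5):
    """Word-classification loss plus a weighted auxiliary viseme loss.

    Assumption: one viseme label per frame, averaged over frames.
    aux_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    main = cross_entropy(word_logits, word_target)
    per_frame = [cross_entropy(frame, label)
                 for frame, label in zip(viseme_logits, viseme_targets)]
    aux = sum(per_frame) / len(per_frame)
    return main + aux_weight * aux
```

In a sketch like this, setting `aux_weight` to 0 recovers plain word classification, which makes it easy to check how much the auxiliary term contributes.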

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis

Methods: Softmax · Attention Is All You Need · Focus