Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization

Linzhi Wu; Xingyu Zhang; Yakun Zhang; Changyan Zheng; Tiejun Liu; Liang Xie; Ye Yan; Erwei Yin

arXiv:2403.16071 · cs.AI · May 3, 2024 · 1 cites

TL;DR

This paper introduces a landmark-guided lip reading method with mutual information regularization to improve cross-speaker robustness by reducing speaker-specific visual variations and capturing speaker-invariant features.

Contribution

The paper proposes using lip landmark-guided visual clues as input features, in place of the frequently used mouth-cropped images, together with mutual information regularization, to make lip reading models robust to inter-speaker variability.
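To make the idea of landmark-guided input concrete, here is a minimal sketch of one way to turn lip landmarks into fine-grained visual clues: cropping a small patch around each landmark instead of taking one large mouth crop. This is an illustrative assumption, not the authors' pipeline. The function name `crop_landmark_patches`, the `half_size` parameter, and the border-clamping rule are all hypothetical, and the landmark detector that would supply the coordinates is left out.

```python
from typing import List, Sequence, Tuple

Frame = List[List[int]]  # grayscale frame: rows of pixel intensities


def crop_landmark_patches(
    frame: Frame,
    landmarks: Sequence[Tuple[int, int]],
    half_size: int = 1,
) -> List[Frame]:
    """Crop one square patch per (x, y) landmark (hypothetical helper).

    Each patch has side 2 * half_size + 1. Near the frame border the
    window is shifted inward so every patch keeps the same shape.
    """
    height, width = len(frame), len(frame[0])
    size = 2 * half_size + 1
    if height < size or width < size:
        raise ValueError("frame is smaller than the requested patch size")

    patches: List[Frame] = []
    for x, y in landmarks:
        # Clamp the top-left corner so the window stays inside the frame.
        top = min(max(y - half_size, 0), height - size)
        left = min(max(x - half_size, 0), width - size)
        patches.append([row[left:left + size] for row in frame[top:top + size]])
    return patches


# Toy 5x5 frame whose pixel value encodes its position (row * 10 + column).
toy_frame = [[r * 10 + c for c in range(5)] for r in range(5)]
toy_patches = crop_landmark_patches(toy_frame, [(2, 2), (0, 0)], half_size=1)
```

In this toy setup the first patch is centred on pixel (2, 2) and the second is shifted inward from the corner. A real system would stack such patches per video frame before feeding them to the visual front-end; how the paper actually encodes them is not stated in this excerpt.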

Findings

01 Improved accuracy in cross-speaker lip reading tasks.

02 Effective reduction of speaker-specific appearance influence.

03 Enhanced model generalization across different speakers.

Abstract

Lip reading, the process of interpreting silent speech from visual lip movements, has gained rising attention for its wide range of realistic applications. Deep learning approaches have greatly improved current lip reading systems. However, lip reading in cross-speaker scenarios, where the speaker identity changes, poses a challenging problem due to inter-speaker variability. A well-trained lip reading system may perform poorly when handling a brand new speaker. To learn a speaker-robust lip reading model, a key insight is to reduce visual variations across speakers, so that the model does not overfit to specific speakers. In this work, in view of both input visual clues and latent representations based on a hybrid CTC/attention architecture, we propose to exploit the lip landmark-guided fine-grained visual clues instead of frequently-used mouth-cropped images as input features, diminishing…

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis