Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning

Sara Kashiwagi; Keitaro Tanaka; Qi Feng; Shigeo Morishima

arXiv:2305.14203 · eess.AS · October 17, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a metric learning approach for visual speech recognition that reduces the performance gap between normal and silent speech by aligning viseme representations in a shared latent space, improving silent speech recognition accuracy.

Contribution

The paper proposes a novel metric learning method based on visemes to enhance silent speech recognition, addressing data scarcity and model performance issues.

Findings

01. Improved silent speech recognition accuracy.

02. Effective viseme-based latent space mapping.

03. Robust performance with limited training data.

Abstract

This paper presents a novel metric learning approach to address the performance gap between normal and silent speech in visual speech recognition (VSR). The difference in lip movements between the two poses a challenge for existing VSR models, which exhibit degraded accuracy when applied to silent speech. To solve this issue and tackle the scarcity of training data for silent speech, we propose to leverage the shared literal content between normal and silent speech and present a metric learning approach based on visemes. Specifically, we aim to map the input of two speech types close to each other in a latent space if they have similar viseme representations. By minimizing the Kullback-Leibler divergence of the predicted viseme probability distributions between and within the two speech types, our model effectively learns and predicts viseme identities. Our evaluation demonstrates that…
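The objective described in the abstract can be illustrated with a minimal sketch. The abstract states only that the Kullback-Leibler divergence between predicted viseme probability distributions is minimized between and within the two speech types. Everything else here is an assumption made for illustration: frame-level logits as input, pairing frames that share a viseme label, the symmetric form of the divergence, and the hypothetical names `viseme_metric_loss`, `normal_logits`, and `silent_logits`. It is not the authors' implementation.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax over the viseme-class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) along the last axis; inputs broadcast against each other."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)


def viseme_metric_loss(normal_logits, silent_logits, normal_labels, silent_labels):
    """Mean symmetric KL over all pairs of frames that share a viseme label.

    Assumption: pooling normal and silent frames into one set means the
    pairs cover both "between" (normal vs. silent) and "within"
    (normal vs. normal, silent vs. silent) comparisons.
    """
    probs = softmax(np.concatenate([normal_logits, silent_logits], axis=0))
    labels = np.concatenate([normal_labels, silent_labels], axis=0)

    # same[i, j] is True when frames i and j carry the same viseme label.
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)  # a frame is not paired with itself
    if not same.any():
        return 0.0

    # kl[i, j] = KL(p_i || p_j); adding the transpose makes it symmetric.
    kl = kl_divergence(probs[:, None, :], probs[None, :, :])
    symmetric = kl + kl.T
    return float(symmetric[same].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_visemes = 5  # hypothetical class count, chosen only for the demo
    normal = rng.normal(size=(4, num_visemes))
    silent = rng.normal(size=(4, num_visemes))
    labels = np.array([0, 1, 2, 3])
    print(viseme_metric_loss(normal, silent, labels, labels))
```

The loss is zero when every pair of same-viseme frames has identical predicted distributions and grows as they diverge. How such a term is weighted against other training objectives is not stated in the abstract.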

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Music and Audio Processing