Visual-Only Recognition of Normal, Whispered and Silent Speech
Stavros Petridis, Jie Shen, Doruk Cetin, Maja Pantic

TL;DR
This paper introduces a new audiovisual database and investigates visual speech recognition across normal, whispered, and silent speech modes, revealing significant differences that challenge the assumption of data transferability between modes.
Contribution
It provides what the authors believe to be the first visual-only analysis of the differences among normal, whispered, and silent speech, and it demonstrates the impact of those differences on recognition accuracy, highlighting the need for mode-specific training data.
Findings
Classification rate drops by up to 3.7% (absolute) when training on normal speech and testing on whispered speech, or vice versa.
Silent speech shows the largest decrease in recognition performance.
Visual differences between speech modes are significant and affect system design.
Abstract
Silent speech interfaces have recently been proposed as a way to enable communication when the acoustic signal is not available. This introduces the need to build visual speech recognition systems for silent and whispered speech. However, almost all recently proposed systems have been trained on vocalised data only. This is in contrast with evidence in the literature which suggests that lip movements change depending on the speech mode. In this work, we introduce a new, publicly available audiovisual database which contains normal, whispered and silent speech. To the best of our knowledge, this is the first study which investigates the differences between the three speech modes using the visual modality only. We show that an absolute decrease in classification rate of up to 3.7% is observed when training on normal speech and testing on whispered speech, and vice versa. An…
