Silent versus modal multi-speaker speech recognition from ultrasound and video
Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

TL;DR
This study compares silent and modal multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. It attributes the weaker silent speech results to a speaking-mode mismatch between training and testing, and applies adaptation techniques to improve silent speech recognition accuracy.
Contribution
The paper applies fMLLR and unsupervised model adaptation to address the domain mismatch in silent speech recognition, and analyzes how silent and modal speech differ in utterance duration and articulatory space.
Findings
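Silent speech recognition underperforms modal speech recognition when systems are trained on modal speech only.
Techniques that address the domain mismatch, such as fMLLR and unsupervised model adaptation, improve silent speech recognition.
Silent speech has longer utterance durations and a smaller articulatory space than modal speech.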
Abstract
We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing. We improve silent speech recognition performance using techniques that address the domain mismatch, such as fMLLR and unsupervised model adaptation. We also analyse the properties of silent and modal speech in terms of utterance duration and the size of the articulatory space. To estimate the articulatory space, we compute the convex hull of tongue splines, extracted from ultrasound tongue images. Overall, we observe that the duration of silent speech is longer…
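The abstract describes estimating the articulatory space from the convex hull of tongue splines. The following is a minimal sketch of that idea, not the authors' implementation. It assumes each spline is a list of (x, y) points, that points from all splines are pooled, and that the hull's area serves as the size measure. The function names (`convex_hull`, `polygon_area`, `articulatory_space_area`) are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        # Drop points that would make a clockwise or collinear turn.
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices: List[Point]) -> float:
    """Shoelace formula for the area of a simple polygon."""
    area = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0


def articulatory_space_area(splines: List[List[Point]]) -> float:
    """Assumed measure: area of the convex hull of all pooled spline points."""
    pooled = [pt for spline in splines for pt in spline]
    return polygon_area(convex_hull(pooled))


# Toy splines (illustrative coordinates, not real ultrasound data).
toy_splines = [
    [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0)],
    [(0.0, 2.0), (2.0, 2.0), (1.0, 0.5)],
]
toy_area = articulatory_space_area(toy_splines)
```

Under these assumptions, comparing the hull area computed over a speaker's silent utterances with the area over the same speaker's modal utterances would give one way to quantify a smaller articulatory space.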
