Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces
Amin Honarmandi Shandiz, László Tóth, Gábor Gosztolya, Alexandra Markó, Tamás Gábor Csapó

TL;DR
This paper introduces multi-speaker ultrasound-based speaker embeddings using an adapted x-vector framework, demonstrating low recognition error rates and potential for improving silent speech interface accuracy across speakers.
Contribution
It presents the first multi-speaker ultrasound speaker embeddings with effective speaker recognition and explores their application in multi-speaker silent speech synthesis.
Findings
Speaker recognition error rates below 3%
Embeddings generalize well to unseen speakers
Marginal error rate reduction in ultrasound-to-speech conversion
Abstract
Articulatory-to-acoustic mapping seeks to reconstruct speech from a recording of the articulatory movements, for example, an ultrasound video. Just like speech signals, these recordings represent not only the linguistic content, but are also highly specific to the actual speaker. Hence, due to the lack of multi-speaker data sets, researchers have so far concentrated on speaker-dependent modeling. Here, we present multi-speaker experiments using the recently published TaL80 corpus. To model speaker characteristics, we adjusted the x-vector framework popular in speech processing to operate with ultrasound tongue videos. Next, we performed speaker recognition experiments using 50 speakers from the corpus. Then, we created speaker embedding vectors and evaluated them on the remaining speakers. Finally, we examined how the embedding vector influences the accuracy of our ultrasound-to-speech…
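To make the x-vector idea mentioned in the abstract more concrete, here is a minimal sketch of its statistics-pooling step, which turns a variable-length sequence of frame-level vectors into one fixed-length vector, together with a cosine comparison of two such vectors. This is an illustration under stated assumptions, not the authors' implementation: the function names `stats_pooling` and `cosine_similarity` and the toy numbers are hypothetical. In a full x-vector system, pooling is applied to the outputs of a frame-level neural network and followed by further layers that produce the embedding; both are omitted here.

```python
import math
from typing import List, Sequence


def stats_pooling(frames: Sequence[Sequence[float]]) -> List[float]:
    """Pool frame-level vectors into one fixed-length vector.

    Returns the per-dimension mean followed by the per-dimension
    (population) standard deviation, so the output has 2 * dim entries
    regardless of how many frames the recording contains.
    """
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for frame-level features of two recordings (made-up values).
recording_a = [[1.0, 2.0], [3.0, 6.0]]
recording_b = [[1.0, 2.0], [3.0, 6.0], [2.0, 4.0]]

pooled_a = stats_pooling(recording_a)
pooled_b = stats_pooling(recording_b)

# Both pooled vectors have the same length even though the recordings differ
# in frame count, which is what allows them to be compared directly.
similarity = cosine_similarity(pooled_a, pooled_b)
```

The design point is that pooling removes the time axis, so recordings of any duration map to vectors of the same size; a similarity score between two such vectors can then serve as a simple stand-in for comparing speakers.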
