Adaptation of Tongue Ultrasound-Based Silent Speech Interfaces Using Spatial Transformer Networks
László Tóth, Amin Honarmandi Shandiz, Gábor Gosztolya, Tamás Gábor Csapó

TL;DR
This paper adds a spatial transformer network (STN) module to ultrasound tongue imaging-based silent speech interfaces. Adapting only this module, which makes up about 10% of the network, makes cross-session and cross-speaker adaptation more efficient and lowers mean squared error with little additional training.
Contribution
The study demonstrates that adapting only the spatial transformer network component enables substantial performance gains, simplifying speaker and session adaptation in silent speech interfaces.
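The idea of adapting only one component can be illustrated with a minimal sketch. It assumes a hypothetical parameter layout in which STN weights are stored under names starting with "stn/" and everything else belongs to the rest of the network. The names, shapes, and plain gradient-descent update are illustrative assumptions, not the authors' training setup.

```python
import numpy as np

def adapt_step(params, grads, lr=0.01, trainable_prefix="stn/"):
    """One gradient-descent step that updates only the parameters whose
    name starts with trainable_prefix; all others stay frozen."""
    updated = {}
    for name, value in params.items():
        if name.startswith(trainable_prefix):
            updated[name] = value - lr * grads[name]  # adapted
        else:
            updated[name] = value  # frozen: reused unchanged
    return updated

# Hypothetical parameters, for illustration only.
params = {
    "stn/theta": np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
    "backbone/w": np.ones((4, 4)),
}
# Dummy gradients of all ones, standing in for real backpropagated values.
grads = {name: np.ones_like(value) for name, value in params.items()}
new_params = adapt_step(params, grads, lr=0.1)
```

Only the "stn/theta" entry changes here; "backbone/w" is carried over as-is, which is what keeps the number of adapted parameters small.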
Findings
88% reduction in mean squared error when adapting the STN alone
92% error reduction across different recording sessions for the same speaker
Minimal additional training required for effective adaptation
Abstract
Thanks to the latest deep learning algorithms, silent speech interfaces (SSI) are now able to synthesize intelligible speech from articulatory movement data under certain conditions. However, the resulting models are rather speaker-specific, making a quick switch between users troublesome. Even for the same speaker, these models perform poorly cross-session, i.e., after dismounting and re-mounting the recording equipment. To aid quick speaker and session adaptation of ultrasound tongue imaging-based SSI models, we extend our deep networks with a spatial transformer network (STN) module, capable of performing an affine transformation on the input images. Although the STN part takes up only about 10% of the network, our experiments show that adapting just the STN module might reduce MSE by 88% on average, compared to retraining the whole network. The improvement is even larger…
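To make the affine transformation on the input images concrete, here is a minimal NumPy sketch of the warping step: a 2x3 matrix theta maps each output pixel to a source location, and the image is resampled there by bilinear interpolation. In a typical STN, theta is predicted by a small localization network, which is omitted here. The function names, the normalized [-1, 1] coordinate convention, and the zero padding outside the image are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def affine_grid(theta, height, width):
    """For every output pixel, compute the source (x, y) location in
    normalized [-1, 1] coordinates using the 2x3 affine matrix theta."""
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    coords = np.stack([grid_x, grid_y, np.ones_like(grid_x)], axis=-1)  # (H, W, 3)
    return coords @ theta.T  # (H, W, 2)

def bilinear_sample(image, grid):
    """Sample a 2-D image at the grid locations with bilinear interpolation.
    Locations outside the image contribute zeros."""
    height, width = image.shape
    # Convert normalized coordinates back to pixel coordinates.
    x = (grid[..., 0] + 1.0) * (width - 1) / 2.0
    y = (grid[..., 1] + 1.0) * (height - 1) / 2.0
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0

    def pixel(yi, xi):
        inside = (yi >= 0) & (yi < height) & (xi >= 0) & (xi < width)
        values = image[np.clip(yi, 0, height - 1), np.clip(xi, 0, width - 1)]
        return np.where(inside, values, 0.0)

    top = (1 - wx) * pixel(y0, x0) + wx * pixel(y0, x1)
    bottom = (1 - wx) * pixel(y1, x0) + wx * pixel(y1, x1)
    return (1 - wy) * top + wy * bottom

# Toy 3x4 "image" standing in for an ultrasound frame.
image = np.arange(12, dtype=float).reshape(3, 4)

# The identity transform leaves the image unchanged.
identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
same = bilinear_sample(image, affine_grid(identity, 3, 4))

# A horizontal offset of 2 / (width - 1) samples one pixel to the right.
shift = np.array([[1.0, 0.0, 2.0 / 3.0], [0.0, 1.0, 0.0]])
shifted = bilinear_sample(image, affine_grid(shift, 3, 4))
```

Because theta has only six entries, small changes to it can model shifts, rotations, and scalings of the whole image, which is one plausible reason such a module suits the probe repositioning described above.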
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Voice and Speech Disorders
Methods: Spatial Transformer
