Adaptation of Tongue Ultrasound-Based Silent Speech Interfaces Using Spatial Transformer Networks

László Tóth; Amin Honarmandi Shandiz; Gábor Gosztolya; Tamás Gábor Csapó

arXiv:2305.19130·cs.SD·October 18, 2023·1 cites

TL;DR

This paper adds a spatial transformer network (STN) module to ultrasound tongue imaging-based silent speech interfaces. Adapting only this module, which makes up about 10% of the network, reduces mean squared error by 88% on average, and by 92% across recording sessions of the same speaker.

Contribution

The study demonstrates that adapting only the spatial transformer network component enables substantial performance gains, simplifying speaker and session adaptation in silent speech interfaces.

Findings

1. 88% reduction in mean squared error when adapting the STN alone
2. 92% error reduction across different recording sessions for the same speaker
3. Minimal additional training required for effective adaptation

Abstract

Thanks to the latest deep learning algorithms, silent speech interfaces (SSI) are now able to synthesize intelligible speech from articulatory movement data under certain conditions. However, the resulting models are rather speaker-specific, making a quick switch between users troublesome. Even for the same speaker, these models perform poorly cross-session, i.e. after dismounting and re-mounting the recording equipment. To aid quick speaker and session adaptation of ultrasound tongue imaging-based SSI models, we extend our deep networks with a spatial transformer network (STN) module, capable of performing an affine transformation on the input images. Although the STN part takes up only about 10% of the network, our experiments show that adapting just the STN module can reduce MSE by 88% on average, compared to retraining the whole network. The improvement is even larger…
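To make the affine transformation step concrete, here is a minimal pure-Python sketch of warping a small image with a 2x3 affine matrix and bilinear sampling. It is not the authors' implementation: the function name `affine_warp`, the normalized [-1, 1] coordinate convention, and the zero padding outside the image are assumptions borrowed from common STN formulations. In an STN the matrix `theta` would be produced by a small trainable sub-network; here it is supplied by hand.

```python
import math


def affine_warp(image, theta):
    """Warp a 2-D image (list of rows) with a 2x3 affine matrix theta.

    Each output pixel is mapped through theta to a source location
    (in normalized [-1, 1] coordinates) and sampled bilinearly.
    Samples falling outside the image contribute 0.
    """
    h, w = len(image), len(image[0])

    def pixel(r, c):
        # Zero padding for out-of-range indices (an assumption of this sketch).
        return image[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Normalized coordinates of the output pixel.
            x = 2.0 * j / (w - 1) - 1.0 if w > 1 else 0.0
            y = 2.0 * i / (h - 1) - 1.0 if h > 1 else 0.0
            # Affine map to normalized source coordinates.
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            # Convert back to (fractional) pixel indices.
            c = (xs + 1.0) * (w - 1) / 2.0
            r = (ys + 1.0) * (h - 1) / 2.0
            c0, r0 = math.floor(c), math.floor(r)
            dc, dr = c - c0, r - r0
            # Bilinear interpolation over the four neighbouring pixels.
            out[i][j] = (
                pixel(r0, c0) * (1 - dr) * (1 - dc)
                + pixel(r0, c0 + 1) * (1 - dr) * dc
                + pixel(r0 + 1, c0) * dr * (1 - dc)
                + pixel(r0 + 1, c0 + 1) * dr * dc
            )
    return out


# Toy 3x3 "image" standing in for an ultrasound frame (hypothetical data).
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # leaves the image unchanged
shift_x = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]    # samples one pixel to the right

same = affine_warp(image, identity)
shifted = affine_warp(image, shift_x)
```

Because the whole warp is governed by six numbers, a module that predicts them can compensate for probe repositioning or speaker differences while the rest of the network stays fixed, which is the intuition behind adapting only the STN part.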

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Voice and Speech Disorders

Methods: Spatial Transformer