Conformer-based Ultrasound-to-Speech Conversion
Ibrahim Ibrahimov, Zainkó Csaba, Gábor Gosztolya

TL;DR
This study applies two Conformer-based neural networks to ultrasound-to-speech conversion and compares them with a 2D-CNN baseline. The variant with bi-LSTM improves perceptual quality, while Conformer Base matches the baseline and trains about three times faster.
Contribution
Introduces Conformer-based models for ultrasound-to-speech conversion and compares their performance to CNN baselines, highlighting advantages in perceptual quality and training speed.
Findings
Conformer with bi-LSTM improves perceptual speech quality.
Conformer Base matches CNN baseline performance.
Conformer Base trains three times faster than the CNN baseline.
Abstract
Deep neural networks have shown promising potential for the ultrasound-to-speech conversion task towards Silent Speech Interfaces. In this work, we applied two Conformer-based DNN architectures (Base and one with bi-LSTM) to this task. Speaker-specific models were trained on the data of four speakers from the Ultrasuite-Tal80 dataset, and the generated mel spectrograms were synthesized into audio waveforms using a HiFi-GAN vocoder. Compared to a standard 2D-CNN baseline, objective measurements (MSE and mel cepstral distortion) showed no statistically significant improvement for either model. However, a MUSHRA listening test revealed that Conformer with bi-LSTM provided better perceptual quality, while Conformer Base matched the performance of the baseline with a 3x faster training time due to its simpler architecture. These findings suggest that Conformer-based models, especially the…
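The abstract names MSE and mel cepstral distortion (MCD) as the objective measures but does not give their formulas. The following is a minimal sketch of how such measures are commonly computed, assuming time-aligned frame sequences of equal length and the widely used MCD definition that skips the 0th (energy) coefficient. The function names `mse` and `mel_cepstral_distortion` are hypothetical, and the paper's own feature extraction and alignment may differ.

```python
import numpy as np


def mse(pred, target):
    """Mean squared error over all elements of two equally shaped arrays."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    return float(np.mean((pred - target) ** 2))


def mel_cepstral_distortion(ref_mcep, gen_mcep):
    """Average MCD in dB between two (frames, coefficients) arrays.

    Assumption: frames are already time-aligned and column 0 holds the
    energy coefficient, which is excluded, as is common practice.
    """
    ref = np.asarray(ref_mcep, dtype=float)
    gen = np.asarray(gen_mcep, dtype=float)
    if ref.shape != gen.shape:
        raise ValueError("ref_mcep and gen_mcep must have the same shape")
    diff = ref[:, 1:] - gen[:, 1:]
    # Per-frame distortion: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

Lower values indicate a closer match for both measures. As the abstract notes, such objective scores need not agree with listening-test results.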
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Hearing Loss and Rehabilitation
