EMA2S: An End-to-End Multimodal Articulatory-to-Speech System
Yu-Wen Chen, Kuo-Hsuan Hung, Shang-Yi Chuang, Jonathan Sherman, Wen-Chin Huang, Xugang Lu, Yu Tsao

TL;DR
EMA2S is an end-to-end multimodal system that converts articulatory movements into speech, improving synthesis quality through joint training of spectrogram and deep features, with potential applications in medical and silent speech scenarios.
Contribution
This work introduces EMA2S, a novel multimodal neural network that directly maps articulatory movements to speech signals using joint training of multiple audio features.
Findings
EMA2S outperforms the baseline system in both objective and subjective evaluations.
Joint mel-spectrogram and deep feature loss training enhances performance.
Experimental results confirm the effectiveness of multimodal joint-training.
Abstract
Synthesized speech from articulatory movements can have real-world use for patients with vocal cord disorders, situations requiring silent speech, or in high-noise environments. In this work, we present EMA2S, an end-to-end multimodal articulatory-to-speech system that directly converts articulatory movements to speech signals. We use a neural-network-based vocoder combined with multimodal joint-training, incorporating spectrogram, mel-spectrogram, and deep features. The experimental results confirm that the multimodal approach of EMA2S outperforms the baseline system in terms of both objective evaluation and subjective evaluation metrics. Moreover, results demonstrate that joint mel-spectrogram and deep feature loss training can effectively improve system performance.
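The joint-training idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the L1 form of each term, the weights `w_spec` and `w_deep`, and the `toy_features` extractor (a fixed random projection standing in for a pretrained network) are all assumptions made for illustration, and a linear magnitude spectrogram is used in place of a mel-spectrogram to keep the example short.

```python
import numpy as np

def magnitude_spectrogram(wave, n_fft=256, hop=128):
    """Frame the waveform, apply a Hann window, and take |rFFT| per frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft // 2 + 1)

def joint_loss(pred_wave, target_wave, feature_fn, w_spec=1.0, w_deep=1.0):
    """Weighted sum of an L1 spectrogram loss and an L1 deep-feature loss.

    The weights and the L1 distance are illustrative assumptions.
    """
    spec_pred = magnitude_spectrogram(pred_wave)
    spec_tgt = magnitude_spectrogram(target_wave)
    spec_loss = np.mean(np.abs(spec_pred - spec_tgt))
    deep_loss = np.mean(np.abs(feature_fn(spec_pred) - feature_fn(spec_tgt)))
    return w_spec * spec_loss + w_deep * deep_loss

# Hypothetical stand-in for a pretrained feature extractor:
# a fixed random projection followed by tanh.
rng = np.random.default_rng(0)
projection = rng.standard_normal((129, 16))  # 129 = 256 // 2 + 1 frequency bins

def toy_features(spec):
    return np.tanh(spec @ projection)

# Synthetic example: a 220 Hz tone sampled at 8 kHz and a noisy copy of it.
t = np.arange(4000) / 8000.0
target = np.sin(2 * np.pi * 220 * t)
pred = target + 0.1 * rng.standard_normal(t.shape)

loss_same = joint_loss(target, target, toy_features)   # identical signals
loss_noisy = joint_loss(pred, target, toy_features)    # noisy prediction
```

In a real training setup the loss would be computed with a differentiable framework so gradients can flow back to the synthesis network. The sketch only shows how several feature-domain losses can be combined into one objective.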
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
