EMA2S: An End-to-End Multimodal Articulatory-to-Speech System

Yu-Wen Chen; Kuo-Hsuan Hung; Shang-Yi Chuang; Jonathan Sherman; Wen-Chin Huang; Xugang Lu; Yu Tsao

arXiv:2102.03786 · eess.AS · June 10, 2021

TL;DR

EMA2S is an end-to-end multimodal system that converts articulatory movements into speech, improving synthesis quality through joint training on spectrogram, mel-spectrogram, and deep features, with potential applications in medical and silent speech scenarios.

Contribution

This work introduces EMA2S, a novel multimodal neural network that directly maps articulatory movements to speech signals using joint training of multiple audio features.

Findings

01

EMA2S outperforms the baseline system on both objective and subjective evaluation metrics.

02

Joint mel-spectrogram and deep feature loss training enhances performance.

03

Experimental results confirm the effectiveness of multimodal joint-training.

Abstract

Synthesized speech from articulatory movements can have real-world use for patients with vocal cord disorders, situations requiring silent speech, or in high-noise environments. In this work, we present EMA2S, an end-to-end multimodal articulatory-to-speech system that directly converts articulatory movements to speech signals. We use a neural-network-based vocoder combined with multimodal joint-training, incorporating spectrogram, mel-spectrogram, and deep features. The experimental results confirm that the multimodal approach of EMA2S outperforms the baseline system in terms of both objective evaluation and subjective evaluation metrics. Moreover, results demonstrate that joint mel-spectrogram and deep feature loss training can effectively improve system performance.
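
The joint-training idea in the abstract can be illustrated with a small sketch of a combined loss. This is not the authors' implementation: the L1 distance, the equal default weights, the toy mel filterbank matrix, and the `toy_deep_features` function (a stand-in for a pretrained feature extractor) are all assumptions made for illustration, and the code uses plain Python lists rather than a deep learning framework.

```python
import math


def l1(a, b):
    """Mean absolute error between two equally shaped 2-D lists (frames x bins)."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    count = sum(len(row) for row in a)
    return total / count


def project(frames, matrix):
    """Apply a matrix (n_out rows of n_in weights) to every frame of length n_in."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in matrix]
            for frame in frames]


def toy_deep_features(frames, matrix):
    """Hypothetical stand-in for a pretrained extractor: linear layer plus tanh."""
    return [[math.tanh(v) for v in row] for row in project(frames, matrix)]


def joint_loss(pred_spec, ref_spec, mel_fb, deep_w, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of spectrogram, mel-spectrogram, and deep-feature L1 terms.

    pred_spec / ref_spec: magnitude spectrograms as lists of frames.
    mel_fb: assumed mel filterbank, one row of weights per mel band.
    deep_w: weights of the hypothetical deep-feature layer.
    weights: (w_spec, w_mel, w_deep); the values are assumptions, not from the paper.
    """
    w_spec, w_mel, w_deep = weights
    spec_term = l1(pred_spec, ref_spec)
    mel_term = l1(project(pred_spec, mel_fb), project(ref_spec, mel_fb))
    deep_term = l1(toy_deep_features(pred_spec, deep_w),
                   toy_deep_features(ref_spec, deep_w))
    return w_spec * spec_term + w_mel * mel_term + w_deep * deep_term
```

Setting a weight to zero drops the corresponding term, which is one simple way to compare single-feature training against the joint configurations the abstract describes.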

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing