Deep Speech Synthesis from Multimodal Articulatory Representations
Peter Wu, Bohan Yu, Kevin Scheck, Alan W Black, Aditi S. Krishnapriyan, Irene Y. Chen, Tanja Schultz, Shinji Watanabe, Gopala K. Anumanchipalli

TL;DR
This paper introduces a multimodal pre-training framework for articulatory-to-acoustic speech synthesis that significantly improves performance in low-resource settings by leveraging MRI and electromyography data.
Contribution
The paper presents a novel multimodal pre-training approach that enhances speech synthesis quality from limited articulatory data, outperforming unimodal baselines.
Findings
36% word error rate improvement on MRI-to-speech synthesis with transfer learning, compared to prior work
Outperforms unimodal baselines on multiple metrics
Improves intelligibility of synthesized speech
Abstract
The amount of articulatory data available for training deep learning models is much less compared to acoustic speech data. In order to improve articulatory-to-acoustic synthesis performance in these low-resource settings, we propose a multimodal pre-training framework. On single-speaker speech synthesis tasks from real-time magnetic resonance imaging and surface electromyography inputs, the intelligibility of synthesized outputs improves noticeably. For example, compared to prior work, utilizing our proposed transfer learning methods improves the MRI-to-speech performance by 36% word error rate. In addition to these intelligibility results, our multimodal pre-trained models consistently outperform unimodal baselines on three objective and subjective synthesis quality metrics.
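Since the headline intelligibility result is reported as word error rate (WER), a minimal sketch of how WER is conventionally computed may help: transcribe the synthesized audio with a speech recognizer, then take the word-level edit distance between that transcript and the reference text, divided by the reference length. The function name and example sentences below are illustrative assumptions, not taken from the paper, and the abstract does not say whether the 36% figure is a relative or an absolute change.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up transcripts, only to show the call.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Lower is better; a value of 0 means the recognizer's transcript of the synthesized speech matches the reference exactly.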
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Phonetics and Phonology Research
