Deep Speech Synthesis from Multimodal Articulatory Representations

Peter Wu; Bohan Yu; Kevin Scheck; Alan W Black; Aditi S. Krishnapriyan; Irene Y. Chen; Tanja Schultz; Shinji Watanabe; Gopala K. Anumanchipalli

arXiv:2412.13387·eess.AS·December 19, 2024

Open Access

TL;DR

This paper introduces a multimodal pre-training framework for articulatory-to-acoustic speech synthesis that significantly improves performance in low-resource settings by leveraging MRI and electromyography data.

Contribution

The paper presents a novel multimodal pre-training approach that enhances speech synthesis quality from limited articulatory data, outperforming unimodal baselines.

Findings

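01 Transfer learning improves MRI-to-speech word error rate by 36% compared to prior work

02 Multimodal pre-trained models consistently outperform unimodal baselines on three objective and subjective synthesis quality metrics

03 Intelligibility of speech synthesized from real-time MRI and surface electromyography inputs improves noticeably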

Abstract

Far less articulatory data than acoustic speech data is available for training deep learning models. To improve articulatory-to-acoustic synthesis performance in these low-resource settings, we propose a multimodal pre-training framework. On single-speaker speech synthesis tasks from real-time magnetic resonance imaging and surface electromyography inputs, the intelligibility of synthesized outputs improves noticeably. For example, compared to prior work, our proposed transfer learning methods improve MRI-to-speech performance by 36% in word error rate. In addition to these intelligibility results, our multimodal pre-trained models consistently outperform unimodal baselines on three objective and subjective synthesis quality metrics.
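
For readers unfamiliar with the intelligibility metric, word error rate (WER) is the word-level edit distance between a reference transcript and a hypothesis transcript, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the paper's evaluation code. The function names and the example sentences are hypothetical, and the abstract does not say whether the 36% figure is an absolute or a relative change, so `relative_reduction` is included only to show how a relative figure would be computed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers `baseline` (hypothetical helper)."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Made-up example sentences, not data from the paper.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    print(relative_reduction(0.5, 0.25))
```

In practice the hypothesis would typically be a transcript of the synthesized audio, and both strings would be normalized (case, punctuation) before scoring; those steps are assumptions here and are left out of the sketch.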

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Phonetics and Phonology Research