Pronunciation Dictionary-Free Multilingual Speech Synthesis by Combining Unsupervised and Supervised Phonetic Representations

Chang Liu; Zhen-Hua Ling; Ling-Hui Chen

arXiv:2206.00951 · eess.AS · June 3, 2022

TL;DR

This paper introduces a multilingual speech synthesis approach that combines unsupervised and supervised phonetic representations to eliminate the need for pronunciation dictionaries, improving synthesis quality across six languages.

Contribution

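The paper presents a method that integrates unsupervised phonetic representations (UPRs) and supervised phonetic representations (SPRs) for multilingual speech synthesis. It relies on a pretrained wav2vec 2.0 model to extract UPRs, a language-independent ASR model trained with a CTC loss to extract SPRs, and a new acoustic model that predicts both representations from text before generating mel-spectrograms.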

Findings

01 Outperforms methods that predict mel-spectrograms directly from character or phoneme sequences.

02 Achieves better results than models using only UPRs or SPRs.

03 Effective across six diverse languages.

Abstract

This paper proposes a multilingual speech synthesis method which combines unsupervised phonetic representations (UPR) and supervised phonetic representations (SPR) to avoid reliance on the pronunciation dictionaries of target languages. In this method, a pretrained wav2vec 2.0 model is adopted to extract UPRs and a language-independent automatic speech recognition (LI-ASR) model is built with a connectionist temporal classification (CTC) loss to extract segment-level SPRs from the audio data of target languages. Then, an acoustic model is designed, which first predicts UPRs and SPRs from texts separately and then combines the predicted UPRs and SPRs to generate mel-spectrograms. The results of our experiments on six languages show that the proposed method outperformed the methods that directly predicted mel-spectrograms from character or phoneme sequences and the ablated models that…
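The abstract says that a CTC-trained model is used to obtain segment-level representations from audio. As a rough illustration of how frame-level CTC outputs can be turned into segments, the sketch below applies generic greedy CTC decoding: take the best label per frame, merge consecutive repeats, and drop blanks. This is not the paper's extraction procedure, which the abstract does not detail. The function name `ctc_greedy_segments`, the blank index `BLANK = 0`, and the toy scores are all assumptions made for the example.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_greedy_segments(frame_scores):
    """Collapse per-frame label scores into (label, start_frame, end_frame) segments.

    frame_scores: list of per-frame score lists, one score per label.
    """
    # Greedy step: pick the highest-scoring label for every frame.
    best = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]

    segments = []
    frame = 0
    # Merge runs of identical consecutive labels, then discard blank runs.
    for label, run in groupby(best):
        length = len(list(run))
        if label != BLANK:
            segments.append((label, frame, frame + length - 1))
        frame += length
    return segments


# Toy input: 5 frames, 3 labels (label 0 is the blank).
scores = [
    [0.10, 0.80, 0.10],  # frame 0: label 1
    [0.20, 0.70, 0.10],  # frame 1: label 1 (merged with frame 0)
    [0.90, 0.05, 0.05],  # frame 2: blank
    [0.10, 0.10, 0.80],  # frame 3: label 2
    [0.10, 0.80, 0.10],  # frame 4: label 1
]
segments = ctc_greedy_segments(scores)
print(segments)
```

Each tuple gives a label and the frame span it covers, so a frame-level feature sequence could in principle be pooled within each span to yield one vector per segment. Whether the paper pools in this way is not stated in the abstract.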

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing