Unsupervised word-level prosody tagging for controllable speech synthesis

Yiwei Guo; Chenpeng Du; Kai Yu

arXiv:2202.07200·eess.AS·February 17, 2022

Open Access

TL;DR

This paper introduces an unsupervised method for word-level prosody tagging in neural TTS. The derived tags allow speech prosody to be controlled manually without a reference signal, and a TTS model trained with them improves naturalness and controllability.

Contribution

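The paper proposes a novel two-stage unsupervised approach to word-level prosody tagging in TTS: words are first grouped into types by a decision tree over their phonetic content, and prosodies are then clustered with a GMM within each word type.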

Findings

1. TTS with word-level prosody tags outperforms FastSpeech2 in naturalness.

2. The method enables effective manipulation of word-level prosody.

3. Controllability of speech synthesis improves without requiring reference signals.

Abstract

Although word-level prosody modeling in neural text-to-speech (TTS) has been investigated in recent research for diverse speech synthesis, it is still challenging to control speech synthesis manually without a specific reference. This is largely due to the lack of word-level prosody tags. In this work, we propose a novel approach for unsupervised word-level prosody tagging with two stages: we first group the words into different types with a decision tree according to their phonetic content, and then cluster the prosodies using a GMM within each type of word separately. This design is based on the assumption that the prosodies of different types of words, such as long or short words, should be tagged with different label sets. Furthermore, a TTS system with the derived word-level prosody tags is trained for controllable speech synthesis. Experiments on LJSpeech show that the TTS model…
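The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and several pieces are assumptions made for illustration only: `word_type` is a hand-written rule standing in for the decision tree over phonetic content, each word's prosody is reduced to a single hypothetical scalar (the paper does not specify this representation here), the GMM is a small one-dimensional EM fit written with the standard library, and the toy data and the choice of two tags per word type are invented.

```python
import math
from collections import defaultdict


def word_type(phonemes):
    # ASSUMPTION: stand-in for the paper's decision tree over phonetic content.
    # Here a word is simply "short" or "long" by phoneme count.
    return "short" if len(phonemes) <= 3 else "long"


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(xs, k, iters=50):
    """Fit a k-component 1-D GMM with EM; returns (mean, var, weight) sorted by mean."""
    n = len(xs)
    ordered = sorted(xs)
    # Deterministic initialisation: spread the initial means over the sorted data.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(xs) / n
    var0 = sum((x - grand_mean) ** 2 for x in xs) / n + 1e-6
    variances = [var0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means, and variances (with a small floor).
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj + 1e-6
    return sorted(zip(means, variances, weights))


def tag_words(words, k=2):
    """Stage 1: group words by type. Stage 2: cluster prosody within each type."""
    by_type = defaultdict(list)
    for phonemes, prosody in words:
        by_type[word_type(phonemes)].append(prosody)
    models = {t: fit_gmm_1d(xs, k) for t, xs in by_type.items()}
    tags = []
    for phonemes, prosody in words:
        t = word_type(phonemes)
        scores = [w * gaussian_pdf(prosody, m, v) for m, v, w in models[t]]
        # Each word type has its own label set, e.g. "short_0" vs "long_0".
        tags.append(f"{t}_{scores.index(max(scores))}")
    return tags


# Toy data (invented): (phoneme list, hypothetical scalar prosody feature).
WORDS = [
    (["DH", "AH"], 4.80), (["K", "AE", "T"], 4.90), (["AA", "N"], 4.85),
    (["IH", "T"], 5.60), (["G", "OW"], 5.70), (["N", "AW"], 5.65),
    (["T", "AE", "G", "IH", "NG"], 5.00), (["M", "AA", "D", "AH", "L"], 5.05),
    (["K", "L", "AH", "S", "T", "ER"], 4.95),
    (["S", "P", "IY", "CH"], 5.40), (["P", "R", "AA", "S", "AH", "D", "IY"], 5.45),
    (["S", "IH", "N", "TH", "AH", "S", "AH", "S"], 5.50),
]

for (phonemes, prosody), tag in zip(WORDS, tag_words(WORDS)):
    print(" ".join(phonemes), prosody, tag)
```

Fitting a separate mixture per word type is what gives short and long words different label sets, which is the assumption the abstract states. In a fuller setup the derived tags would then be supplied to the TTS model as inputs, so that changing a word's tag changes its prosody at synthesis time.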

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques