Unsupervised word-level prosody tagging for controllable speech synthesis
Yiwei Guo, Chenpeng Du, Kai Yu

TL;DR
This paper introduces an unsupervised method for word-level prosody tagging in neural TTS. The derived tags allow speech prosody to be controlled manually without a reference signal, and a TTS model trained with them improves on FastSpeech2 in naturalness.
Contribution
It proposes a two-stage unsupervised approach to word-level prosody tagging in TTS: words are first grouped by phonetic content with a decision tree, and prosodies are then clustered with a GMM within each group.
Findings
TTS with word-level prosody tags outperforms FastSpeech2 in naturalness.
The method enables effective manipulation of word-level prosody.
The method improves the controllability of speech synthesis without reference signals.
Abstract
Although word-level prosody modeling in neural text-to-speech (TTS) has been investigated in recent research for diverse speech synthesis, it is still challenging to control speech synthesis manually without a specific reference. This is largely due to the lack of word-level prosody tags. In this work, we propose a novel approach for unsupervised word-level prosody tagging with two stages, where we first group the words into different types with a decision tree according to their phonetic content and then cluster the prosodies using GMM within each type of words separately. This design is based on the assumption that the prosodies of different types of words, such as long or short words, should be tagged with different label sets. Furthermore, a TTS system with the derived word-level prosody tags is trained for controllable speech synthesis. Experiments on LJSpeech show that the TTS model…
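The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: each word's prosody is reduced to a single hypothetical scalar, the decision tree is replaced by a one-question stand-in that splits words by phoneme count, and the GMM is a small 1-D EM routine written for this example. All function names (`word_type`, `fit_gmm_1d`, `tag_words`) and the toy data are illustrative.

```python
import math
from collections import defaultdict


def word_type(phonemes):
    # Assumption: a one-question stand-in for the decision tree,
    # splitting words into "short" and "long" by phoneme count.
    return "short" if len(phonemes) <= 3 else "long"


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(xs, k, iters=50, var_floor=1e-3):
    """Fit a k-component 1-D GMM with EM; returns (weight, mean, var) sorted by mean."""
    n = len(xs)
    ordered = sorted(xs)
    # Deterministic initialisation: means at evenly spaced quantiles.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(xs) / n
    init_var = max(sum((x - grand_mean) ** 2 for x in xs) / n, var_floor)
    variances = [init_var] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300  # guard against underflow to zero
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-8:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, var_floor)
    return sorted(zip(weights, means, variances), key=lambda c: c[1])


def tag_words(words, k=2):
    """words: list of (phonemes, prosody_value). Returns one tag string per word."""
    # Stage 1: group words by type.
    groups = defaultdict(list)
    for phonemes, value in words:
        groups[word_type(phonemes)].append(value)
    # Stage 2: fit a separate GMM per type, so each type has its own label set.
    models = {t: fit_gmm_1d(vals, k) for t, vals in groups.items()}
    tags = []
    for phonemes, value in words:
        t = word_type(phonemes)
        scores = [w * gaussian_pdf(value, m, v) for w, m, v in models[t]]
        tags.append(f"{t}_{scores.index(max(scores))}")
    return tags


# Toy data (hypothetical): phoneme lists paired with a made-up scalar prosody value.
toy_words = [
    (["DH", "AH"], 0.1), (["AH"], -0.2), (["T", "UW"], 0.0),
    (["K", "AE", "T"], 5.1), (["D", "AO", "G"], 4.9), (["G", "OW"], 5.0),
    (["T", "AE", "G", "IH", "NG"], 10.2), (["M", "AA", "D", "AH", "L"], 9.8),
    (["N", "UH", "R", "AH", "L"], 10.0), (["K", "AH", "N", "T", "R", "OW", "L"], 20.1),
    (["P", "R", "AA", "S", "AH", "D", "IY"], 19.9), (["S", "P", "IY", "CH", "IH", "Z"], 20.0),
]
tags = tag_words(toy_words)
print(tags)
```

Because each word type gets its own mixture, a tag such as `short_1` and a tag such as `long_1` are unrelated labels, which mirrors the abstract's assumption that different types of words should be tagged with different label sets. A real system would presumably cluster multi-dimensional prosody representations rather than a single scalar.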
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
