On granularity of prosodic representations in expressive text-to-speech
Mikolaj Babianski, Kamil Pokora, Raahil Shah, Rafal Sienkiewicz, Daniel Korzekwa, Viacheslav Klimkov

TL;DR
This paper compares different granularities of prosodic representations in expressive TTS, finding that word-level embeddings offer a good balance, significantly improving naturalness without losing intelligibility.
Contribution
It systematically evaluates prosodic embedding levels and demonstrates that word-level representations optimize naturalness and stability in expressive speech synthesis.
Findings
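Utterance-level embeddings have insufficient capacity.
Phoneme-level embeddings tend to introduce instabilities when predicted from text.
Word-level embeddings close 90% of the naturalness gap between synthetic speech and recordings on LibriTTS.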
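The 90% figure is a gap-closure ratio, not an absolute gain in naturalness. A minimal sketch of that arithmetic follows, assuming the gap is measured from a baseline system up to human recordings. The function name `gap_closed` and the scores are hypothetical illustrations, not values from the paper.

```python
def gap_closed(baseline: float, system: float, recordings: float) -> float:
    """Fraction of the baseline-to-recordings gap that `system` closes.

    All three arguments are naturalness scores on the same scale
    (for example, listening-test ratings). Assumed definition, not
    taken from the paper.
    """
    gap = recordings - baseline
    if gap <= 0:
        raise ValueError("recordings must score higher than the baseline")
    return (system - baseline) / gap


# Hypothetical scores, chosen only to illustrate the ratio.
ratio = gap_closed(baseline=60.0, system=78.0, recordings=80.0)
print(f"gap closed: {ratio:.0%}")
```

Under this reading, a system can close most of the gap while its absolute score rises by far less than 90%.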
Abstract
In expressive speech synthesis, latent prosody representations are widely used to deal with the variability of the data during training. The same text may correspond to various acoustic realizations, which is known as the one-to-many mapping problem in text-to-speech. Utterance-, word-, or phoneme-level representations are extracted from the target signal in an auto-encoding setup to complement the phonetic input and simplify that mapping. This paper compares prosodic embeddings at different levels of granularity and examines their prediction from text. We show that utterance-level embeddings have insufficient capacity and phoneme-level embeddings tend to introduce instabilities when predicted from text. Word-level representations strike a balance between capacity and predictability. As a result, we close the gap in naturalness by 90% between synthetic speech and recordings on the LibriTTS dataset, without…
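To make the three granularities concrete, the sketch below pools the same frame-level features into phoneme-, word-, and utterance-level vectors. It is only an illustration: mean pooling stands in for the learned reference encoder an auto-encoding setup would use, and the feature matrix and segment boundaries are made-up toy values, not data or alignments from the paper.

```python
import numpy as np


def pool_by_segments(frames: np.ndarray, boundaries: list) -> np.ndarray:
    """Average frame-level features inside each [start, end) segment.

    Returns one vector per segment, so the number of output rows is
    the number of segments at the chosen granularity.
    """
    return np.stack([frames[start:end].mean(axis=0) for start, end in boundaries])


# Toy input: 12 acoustic frames with 4 features each (hypothetical).
rng = np.random.default_rng(0)
frames = rng.normal(size=(12, 4))

# Hypothetical alignments: 5 phonemes grouped into 2 words in 1 utterance.
phoneme_bounds = [(0, 2), (2, 5), (5, 7), (7, 10), (10, 12)]
word_bounds = [(0, 5), (5, 12)]
utterance_bounds = [(0, 12)]

phoneme_emb = pool_by_segments(frames, phoneme_bounds)
word_emb = pool_by_segments(frames, word_bounds)
utterance_emb = pool_by_segments(frames, utterance_bounds)

print(phoneme_emb.shape, word_emb.shape, utterance_emb.shape)
```

The number of vectors shrinks as the unit grows. This mirrors the trade-off the abstract describes: finer units carry more detail about the target signal but leave more values to predict from text.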
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Music and Audio Processing
