On granularity of prosodic representations in expressive text-to-speech

Mikolaj Babianski; Kamil Pokora; Raahil Shah; Rafal Sienkiewicz; Daniel Korzekwa; Viacheslav Klimkov

arXiv:2301.11446 · eess.AS · February 13, 2023 · 1 cites

Open Access

TL;DR

This paper compares different granularities of prosodic representations in expressive TTS, finding that word-level embeddings offer a good balance, significantly improving naturalness without losing intelligibility.

Contribution

It systematically evaluates prosodic embedding levels and demonstrates that word-level representations optimize naturalness and stability in expressive speech synthesis.

Findings

01

Utterance-level embeddings lack capacity.

02

Phoneme-level embeddings cause instability.

03

Word-level embeddings close 90% of the naturalness gap between synthetic speech and recordings on LibriTTS.
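
The 90% figure describes how much of the naturalness gap between synthetic speech and recordings is closed, not a 90% rise in a raw score. A minimal sketch of that arithmetic, assuming the gap is measured from a baseline system up to recordings on a single listening-test scale; the function name `gap_closed` and all scores are hypothetical and not taken from the paper:

```python
def gap_closed(baseline: float, system: float, recordings: float) -> float:
    """Fraction of the baseline-to-recordings gap that the system closes."""
    gap = recordings - baseline
    if gap <= 0:
        raise ValueError("recordings must score higher than the baseline")
    return (system - baseline) / gap


# Hypothetical listening-test scores, chosen only to illustrate the formula.
fraction = gap_closed(baseline=60.0, system=69.0, recordings=70.0)
print(f"{fraction:.0%}")  # prints "90%"
```

With these made-up numbers the system gains 9 points out of a possible 10, so the raw score rises by only 15% while 90% of the gap is closed.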

Abstract

In expressive speech synthesis, latent prosody representations are widely used to deal with the variability of the data during training. The same text may correspond to various acoustic realizations, which is known as the one-to-many mapping problem in text-to-speech. Utterance-, word-, or phoneme-level representations are extracted from the target signal in an auto-encoding setup to complement the phonetic input and simplify that mapping. This paper compares prosodic embeddings at different levels of granularity and examines their prediction from text. We show that utterance-level embeddings have insufficient capacity and that phoneme-level embeddings tend to introduce instabilities when predicted from text. Word-level representations strike a balance between capacity and predictability. As a result, we close the gap in naturalness by 90% between synthetic speech and recordings on the LibriTTS dataset, without…
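
To make the three granularities concrete, the sketch below summarizes one frame-level prosodic feature at phoneme, word, and utterance level. It is an illustration only: the paper extracts learned embeddings in an auto-encoding setup, whereas this example substitutes plain mean pooling, and the pitch values, span boundaries, and the helper `pool` are all hypothetical.

```python
from statistics import fmean


def pool(frames, spans):
    """Average frame-level values over each (start, end) span, end exclusive."""
    return [fmean(frames[start:end]) for start, end in spans]


# Hypothetical frame-level pitch track (Hz) for a two-word utterance.
f0 = [100.0, 110.0, 120.0, 130.0, 200.0, 210.0, 220.0, 230.0]

# Hypothetical alignments: four phonemes, two words, one utterance.
phoneme_spans = [(0, 2), (2, 4), (4, 6), (6, 8)]
word_spans = [(0, 4), (4, 8)]
utterance_span = [(0, len(f0))]

phoneme_level = pool(f0, phoneme_spans)      # four values, finest detail
word_level = pool(f0, word_spans)            # two values
utterance_level = pool(f0, utterance_span)   # one value, coarsest summary

print(phoneme_level, word_level, utterance_level)
```

The coarser the level, the fewer values there are to predict from text and the less of the original contour they retain, which mirrors the trade-off between capacity and predictability described in the abstract.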

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Music and Audio Processing