Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis
Karolos Nikitaras, Konstantinos Klapsas, Nikolaos Ellinas, Georgia Maniati, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

TL;DR
This paper introduces a novel expressive speech synthesis model that leverages token-level latent prosodic variables to effectively capture and control both fine-grained and utterance-level speech attributes, improving diversity and disentanglement.
Contribution
It proposes a method to use token-level latent spaces combined with a prior network to better capture and control utterance-level prosody in speech synthesis.
Findings
The model effectively captures diverse prosodic features.
It improves control over utterance-level speech attributes.
Qualitative and quantitative evaluations validate the approach.
Abstract
This paper proposes an Expressive Speech Synthesis model that utilizes token-level latent prosodic variables in order to capture and control utterance-level attributes, such as character acting voice and speaking style. Current works aim to explicitly factorize such fine-grained and utterance-level speech attributes into different representations extracted by modules that operate at the corresponding level. We show that the fine-grained latent space also captures coarse-grained information, which becomes more evident as the dimension of the latent space increases in order to capture diverse prosodic representations. Therefore, a trade-off arises between the diversity of the token-level and utterance-level representations and their disentanglement. We alleviate this issue by first capturing rich speech attributes in a token-level latent space and then separately training a prior network that…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Phonetics and Phonology Research
