Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis

Karolos Nikitaras; Konstantinos Klapsas; Nikolaos Ellinas; Georgia Maniati; June Sig Sung; Inchul Hwang; Spyros Raptis; Aimilios Chalamandaris; Pirros Tsiakoulis

arXiv:2211.00523·cs.SD·November 2, 2022

Open Access

TL;DR

This paper introduces a novel expressive speech synthesis model that leverages token-level latent prosodic variables to effectively capture and control both fine-grained and utterance-level speech attributes, improving diversity and disentanglement.

Contribution

It proposes a method to use token-level latent spaces combined with a prior network to better capture and control utterance-level prosody in speech synthesis.

Findings

01. The model effectively captures diverse prosodic features.

02. It improves control over utterance-level speech attributes.

03. Qualitative and quantitative evaluations validate the approach.

Abstract

This paper proposes an Expressive Speech Synthesis model that utilizes token-level latent prosodic variables in order to capture and control utterance-level attributes, such as character acting voice and speaking style. Current works aim to explicitly factorize such fine-grained and utterance-level speech attributes into different representations extracted by modules that operate at the corresponding level. We show that the fine-grained latent space also captures coarse-grained information, which becomes more evident as the dimension of the latent space increases in order to capture diverse prosodic representations. Therefore, a trade-off arises between the diversity of the token-level and utterance-level representations and their disentanglement. We alleviate this issue by first capturing rich speech attributes into a token-level latent space and then separately training a prior network that…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Phonetics and Phonology Research