Daft-Exprt: Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis

Julian Zaïdi; Hugo Seuté; Benjamin van Niekerk; Marc-André Carbonneau

arXiv:2108.02271·cs.SD·July 13, 2023·1 cites

Open Access · 2 Repos

TL;DR

Daft-Exprt is a multi-speaker speech synthesis model that transfers expressive prosody across speakers and texts. It injects prosodic information through FiLM conditioning layers and uses adversarial training to disentangle speaker identity from prosody.

Contribution

The paper introduces a model that combines FiLM conditioning with adversarial training for cross-speaker prosody transfer on any text. It outperforms existing baselines on prosody transfer while reaching naturalness comparable to state-of-the-art expressive models.

Findings

01

Significantly outperforms baselines in prosody transfer accuracy.

02

Achieves naturalness comparable to state-of-the-art expressive models.

03

Successfully disentangles speaker identity from prosody representations.

Abstract

This paper presents Daft-Exprt, a multi-speaker acoustic model advancing the state of the art for cross-speaker prosody transfer on any text. This is one of the most challenging, and rarely directly addressed, tasks in speech synthesis, especially for highly expressive data. Daft-Exprt uses FiLM conditioning layers to strategically inject different prosodic information in all parts of the architecture. The model explicitly encodes traditional low-level prosody features such as pitch, loudness and duration, but also higher-level prosodic information that helps generate convincing voices in highly expressive styles. Speaker identity and prosodic information are disentangled through an adversarial training strategy that enables accurate prosody transfer across speakers. Experimental results show that Daft-Exprt significantly outperforms strong baselines on inter-text cross-speaker prosody…
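
The abstract names FiLM conditioning as the mechanism for injecting prosodic information. As a rough illustration of the general technique (feature-wise linear modulation: a conditioning vector predicts a per-channel scale and shift that are applied to hidden features), here is a minimal pure-Python sketch. It is not the authors' implementation: the names `linear` and `film`, the toy dimensions, and all weight values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Affine map: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(features: List[Vector], gamma: Vector, beta: Vector) -> List[Vector]:
    """Scale and shift each feature channel; the same gamma/beta apply at every time step."""
    return [[g * f + b for f, g, b in zip(frame, gamma, beta)] for frame in features]


# Hypothetical toy setup: 2-dim prosody embedding, 3 feature channels, 2 time steps.
prosody_embedding = [1.0, -1.0]

# Assumed (made-up) projection weights that map the embedding to gamma and beta.
w_gamma = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
b_gamma = [1.0, 1.0, 1.0]
w_beta = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
b_beta = [0.0, 0.0, 0.0]

gamma = linear(w_gamma, b_gamma, prosody_embedding)  # per-channel scale
beta = linear(w_beta, b_beta, prosody_embedding)     # per-channel shift

hidden = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # (time steps) x (channels)
modulated = film(hidden, gamma, beta)
```

In a trained model the projection weights would be learned, and a different reference utterance would yield a different embedding and therefore a different scale and shift for the same hidden features.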

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques