CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer

Sri Karlapati; Penny Karanasou; Mateusz Lajszczak; Ammar Abbas; Alexis Moinet; Peter Makarov; Ray Li; Arent van Korlaar; Simon Slangen; Thomas Drugman

arXiv:2206.13443 · eess.AS · June 28, 2022

Open Access

TL;DR

CopyCat2 is a versatile multi-speaker TTS model that synthesizes expressive speech with fine-grained prosody transfer, using a novel two-stage training process to improve naturalness and speaker similarity.

Contribution

It introduces a novel two-stage training approach enabling multi-speaker TTS with contextually appropriate prosody and fine-grained prosody transfer between speakers.

Findings

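01. Reduces the gap in naturalness between the baseline and copy-synthesized speech by 22.79%.

02. Achieves a 33.15% relative improvement in target speaker similarity in fine-grained prosody transfer evaluations.

03. Effective in both multi-speaker TTS and fine-grained prosody transfer tasks.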
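
The two percentages above are different kinds of measures: one is a reduction of a gap to a reference, the other a relative improvement over a baseline. A minimal sketch of how such figures are typically computed follows. It assumes the gap is measured against an upper-bound reference score (for example, recordings or copy-synthesized speech). The function names and all scores are hypothetical illustrations, not values or code from the paper.

```python
def gap_reduction_percent(baseline: float, model: float, reference: float) -> float:
    """Percentage of the baseline-to-reference gap that the model closes.

    All three arguments are mean listener scores on the same scale.
    """
    baseline_gap = reference - baseline
    if baseline_gap <= 0:
        raise ValueError("reference score must exceed the baseline score")
    return 100.0 * (model - baseline) / baseline_gap


def relative_improvement_percent(baseline: float, model: float) -> float:
    """Relative improvement of the model over the baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (model - baseline) / baseline


# Hypothetical scores, chosen only to make the arithmetic easy to follow.
example_gap_reduction = gap_reduction_percent(baseline=60.0, model=65.0, reference=80.0)
example_relative_gain = relative_improvement_percent(baseline=50.0, model=60.0)
```

With these made-up scores, the model closes 5 of the 20 points separating the baseline from the reference (25%), and a rise from 50 to 60 is a 20% relative improvement.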

Abstract

In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at a fine-grained level between any pair of seen speakers. We do this by activating distinct parts of the network for different tasks. We train our model using a novel approach to two-stage training. In Stage I, the model learns speaker-independent word-level prosody representations from speech, which it uses for many-to-many fine-grained prosody transfer. In Stage II, we learn to predict these prosody representations using the contextual information available in text, thereby enabling multi-speaker TTS with contextually appropriate prosody. We compare CC2 to two strong baselines, one in TTS with contextually appropriate prosody, and one in fine-grained prosody…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling