CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer
Sri Karlapati, Penny Karanasou, Mateusz Lajszczak, Ammar Abbas, Alexis Moinet, Peter Makarov, Ray Li, Arent van Korlaar, Simon Slangen, Thomas Drugman

TL;DR
CopyCat2 is a versatile multi-speaker TTS model that synthesizes expressive speech with fine-grained prosody transfer, using a novel two-stage training process to improve naturalness and speaker similarity.
Contribution
It introduces a novel two-stage training approach enabling multi-speaker TTS with contextually appropriate prosody and fine-grained prosody transfer between speakers.
Findings
Reduces the gap in naturalness between the baseline and copy-synthesised speech by 22.79%.
Achieves a 33.15% relative improvement in target speaker similarity in fine-grained prosody transfer.
Effective in both multi-speaker TTS and prosody transfer tasks.
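The two percentages above are different kinds of ratio, and a small worked example may help keep them apart. The sketch below assumes that the naturalness figure is the share of the distance between the baseline and some upper-bound reference that the model closes, and that the similarity figure is a plain relative change over the baseline. The function names and all scores are hypothetical, chosen only for illustration; they are not values from the paper.

```python
def gap_reduction(baseline: float, model: float, upper: float) -> float:
    """Percentage of the baseline-to-upper-bound gap closed by the model.

    Assumption: 'gap' means (upper - baseline), where `upper` is the score
    of an upper-bound reference system.
    """
    gap = upper - baseline
    if gap <= 0:
        raise ValueError("upper bound must be greater than the baseline")
    return 100.0 * (model - baseline) / gap


def relative_improvement(baseline: float, model: float) -> float:
    """Percentage change of the model score relative to the baseline score."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (model - baseline) / baseline


# Hypothetical listening-test scores, NOT taken from the paper.
closed = gap_reduction(baseline=60.0, model=65.0, upper=80.0)
gain = relative_improvement(baseline=50.0, model=60.0)
print(f"gap closed: {closed:.2f}%, relative improvement: {gain:.2f}%")
```

With these made-up numbers the model closes a quarter of a 20-point gap, while the second call reports a 10-point gain over a baseline of 50 as a relative change.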
Abstract
In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at a fine-grained level between any pair of seen speakers. We do this by activating distinct parts of the network for different tasks. We train our model using a novel approach to two-stage training. In Stage I, the model learns speaker-independent word-level prosody representations from speech, which it uses for many-to-many fine-grained prosody transfer. In Stage II, we learn to predict these prosody representations using the contextual information available in text, thereby enabling multi-speaker TTS with contextually appropriate prosody. We compare CC2 to two strong baselines, one in TTS with contextually appropriate prosody, and one in fine-grained prosody…
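The idea of "activating distinct parts of the network for different tasks" can be pictured as a routing decision over where the word-level prosody vectors come from. The sketch below is only an illustration under stated assumptions: `word_prosody`, `synthesize`, the stub encoder, predictor, and decoder are all hypothetical names, and the stubs return fixed placeholder values instead of running any neural network. It is not the authors' implementation.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def word_prosody(
    words: Sequence[str],
    reference_encoder: Callable[[Sequence[str], object], List[Vector]],
    text_predictor: Callable[[Sequence[str]], List[Vector]],
    reference_speech: Optional[object] = None,
) -> List[Vector]:
    """Return one prosody vector per word.

    With reference speech, use the speech-based encoder (the Stage I
    component, prosody-transfer path). Without it, fall back to the
    text-based predictor (the Stage II component, plain TTS path).
    """
    if reference_speech is not None:
        vectors = reference_encoder(words, reference_speech)
    else:
        vectors = text_predictor(words)
    if len(vectors) != len(words):
        raise ValueError("expected exactly one prosody vector per word")
    return vectors


def synthesize(words, speaker_id, decoder, reference_encoder,
               text_predictor, reference_speech=None):
    """Decode with the target speaker identity and the selected prosody."""
    prosody = word_prosody(words, reference_encoder, text_predictor,
                           reference_speech)
    return decoder(words, speaker_id, prosody)


# Hypothetical stand-ins for trained components (placeholders only).
def stub_encoder(words, speech):
    return [[1.0] for _ in words]   # "prosody taken from speech"


def stub_predictor(words):
    return [[0.0] for _ in words]   # "prosody predicted from text"


def stub_decoder(words, speaker_id, prosody):
    return {"speaker": speaker_id, "prosody": prosody}


text = ["hello", "world"]
tts_out = synthesize(text, "spk_B", stub_decoder, stub_encoder, stub_predictor)
transfer_out = synthesize(text, "spk_B", stub_decoder, stub_encoder,
                          stub_predictor, reference_speech="audio_from_spk_A")
```

In this toy setup the same decoder and target speaker are used in both calls; only the source of the prosody vectors changes, which is the point the abstract makes about sharing one model across the two tasks.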
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
