MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech
Taejun Bak, Youngsik Eom, SeungJae Choi, Young-Sun Joo

TL;DR
MultiVerse is a zero-shot multi-task TTS system that performs speech synthesis and speech style transfer with much less training data than data-driven systems, using source-filter disentanglement and combined prosody modeling to produce high-quality, cross-lingual, style-consistent speech.
Contribution
It introduces a novel zero-shot multi-task TTS framework that requires less data, employs source-filter disentanglement, and combines autoregressive and non-autoregressive prosody modeling for improved performance.
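The source-filter idea behind the disentanglement can be illustrated with a classical signal-processing sketch. This is not the MultiVerse architecture, which the summary does not detail; it only shows, under stated assumptions, how a voiced frame can be split into a filter-related part (a smooth spectral envelope) and a source-related part (the pitch) using the real cepstrum. The function names, the toy single-formant signal, and all parameter values are hypothetical choices made for this example.

```python
import numpy as np

def synth_voiced(f0=100.0, sr=16000, n=2048, formant=700.0, bandwidth=100.0):
    """Toy voiced frame: impulse train (source) through a two-pole resonator (filter)."""
    period = int(round(sr / f0))
    source = np.zeros(n)
    source[::period] = 1.0                      # glottal pulses every `period` samples
    r = np.exp(-np.pi * bandwidth / sr)         # pole radius from bandwidth
    theta = 2.0 * np.pi * formant / sr          # pole angle from formant frequency
    a1, a2 = -2.0 * r * np.cos(theta), r * r
    out = np.zeros(n)
    for i in range(n):                          # y[i] = x[i] - a1*y[i-1] - a2*y[i-2]
        out[i] = source[i]
        if i >= 1:
            out[i] -= a1 * out[i - 1]
        if i >= 2:
            out[i] -= a2 * out[i - 2]
    return out

def real_cepstrum(frame):
    """Inverse FFT of the log-magnitude spectrum of a Hann-windowed frame."""
    windowed = frame * np.hanning(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-8)
    return np.fft.irfft(log_mag)

def split_source_filter(frame, sr, cutoff_ms=2.0, f0_min=60.0, f0_max=400.0):
    """Return (log-magnitude envelope, f0 estimate in Hz) for one voiced frame."""
    cep = real_cepstrum(frame)
    n = len(cep)
    cutoff = int(sr * cutoff_ms / 1000)
    # Filter part: keep only low quefrencies (and their mirrored counterparts).
    lifter = np.zeros(n)
    lifter[:cutoff] = 1.0
    lifter[n - cutoff + 1:] = 1.0
    envelope = np.fft.rfft(cep * lifter).real
    # Source part: the strongest cepstral peak in the plausible pitch-period range.
    lo, hi = int(sr / f0_max), int(sr / f0_min)
    peak = lo + int(np.argmax(cep[lo:hi]))
    return envelope, sr / peak

frame = synth_voiced(f0=100.0, sr=16000)
envelope, f0_est = split_source_filter(frame, sr=16000)
print(f"estimated f0: {f0_est:.1f} Hz, envelope bins: {envelope.shape[0]}")
```

In this sketch the low-quefrency cepstral coefficients stand in for the filter (vocal tract) and the high-quefrency peak stands in for the source (excitation). A neural system would learn such representations rather than compute them analytically.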
Findings
Achieves zero-shot TTS performance comparable to data-driven systems while using much less training data.
Significantly outperforms other zero-shot TTS systems trained on the same small datasets.
Enhances prosody similarity through a novel combined prosody modeling approach.
Abstract
Text-to-speech (TTS) systems that scale up the amount of training data have achieved significant improvements in zero-shot speech synthesis. However, these systems have certain limitations: they require a large amount of training data, which increases costs, and often overlook prosody similarity. To address these issues, we propose MultiVerse, a zero-shot multi-task TTS system that is able to perform TTS or speech style transfer in zero-shot and cross-lingual conditions. MultiVerse requires much less training data than traditional data-driven approaches. To ensure zero-shot performance even with limited data, we leverage source-filter theory-based disentanglement, utilizing the prompt for modeling filter-related and source-related representations. Additionally, to further enhance prosody similarity, we adopt a prosody modeling approach combining prompt-based autoregressive and…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques
