MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech

Taejun Bak; Youngsik Eom; SeungJae Choi; Young-Sun Joo

arXiv:2410.03192·eess.AS·October 7, 2024

TL;DR

MultiVerse is a zero-shot multi-task TTS system that performs both text-to-speech and speech style transfer, including in cross-lingual conditions, with much less training data than data-driven approaches. It relies on source-filter disentanglement and combined prosody modeling to keep the synthesized speech close to the prompt in prosody.

Contribution

The paper introduces a zero-shot multi-task TTS framework that requires less training data than data-driven approaches, applies source-filter theory-based disentanglement, and combines autoregressive and non-autoregressive prosody modeling to improve prosody similarity.

Findings

01. MultiVerse achieves zero-shot TTS performance comparable to data-driven systems while using less training data.

02. It significantly outperforms other zero-shot TTS systems trained on the same small datasets.

03. Its combined prosody modeling approach improves prosody similarity.

Abstract

Text-to-speech (TTS) systems that scale up the amount of training data have achieved significant improvements in zero-shot speech synthesis. However, these systems have certain limitations: they require a large amount of training data, which increases costs, and often overlook prosody similarity. To address these issues, we propose MultiVerse, a zero-shot multi-task TTS system that is able to perform TTS or speech style transfer in zero-shot and cross-lingual conditions. MultiVerse requires much less training data than traditional data-driven approaches. To ensure zero-shot performance even with limited data, we leverage source-filter theory-based disentanglement, utilizing the prompt for modeling filter-related and source-related representations. Additionally, to further enhance prosody similarity, we adopt a prosody modeling approach combining prompt-based autoregressive and…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques