StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis

Zhiyong Chen; Xinnuo Li; Zhiqi Ai; Shugong Xu

arXiv:2409.15741 · eess.AS · September 25, 2024

Open Access

TL;DR

StyleFusion-TTS is a novel zero-shot TTS system that uses multimodal inputs and hierarchical conformer-based feature fusion to improve style and speaker control, enhancing naturalness and editability.

Contribution

It introduces a general front-end encoder for multimodal inputs and a hierarchical conformer structure for effective feature fusion in zero-shot TTS.

Findings

01. Promising subjective and objective evaluation results

02. Effective disentanglement of style and speaker embeddings

03. Enhanced naturalness and controllability in zero-shot TTS

Abstract

We introduce StyleFusion-TTS, a zero-shot text-to-speech (TTS) synthesis system with style and speaker control, driven by text prompts, audio references, or both, and designed to improve on the editability and naturalness of systems in the current research literature. We propose a general front-end encoder, a compact and effective module that takes multimodal inputs (text prompts, audio references, and speaker timbre references) in a fully zero-shot manner and produces disentangled style and speaker control embeddings. Our approach also uses a hierarchical conformer structure to fuse the style and speaker control embeddings, aiming for optimal feature fusion within a current advanced TTS architecture. StyleFusion-TTS is evaluated with multiple subjective and objective metrics. The system shows promising performance across our evaluations, suggesting its potential to contribute to…
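As a rough illustration of what fusing separate style and speaker control embeddings in stages can look like, the toy sketch below injects each embedding at a different layer of a small stack. Everything in it is an assumption made for illustration: the function names (`block`, `add_broadcast`, `fuse`), the additive injection, the style-then-speaker ordering, and the plain residual feed-forward layers standing in for conformer blocks. None of it is taken from the paper's implementation.

```python
import math
import random

def block(x, w):
    """Residual feed-forward layer; a stand-in for a conformer block (assumption)."""
    out = []
    for row in x:
        # Project each time step with the D x D weight matrix w.
        proj = [sum(r * w[i][j] for i, r in enumerate(row)) for j in range(len(w[0]))]
        out.append([r + math.tanh(p) for r, p in zip(row, proj)])
    return out

def add_broadcast(x, emb):
    """Add one utterance-level embedding to every time step."""
    return [[r + e for r, e in zip(row, emb)] for row in x]

def fuse(text_feats, style_emb, speaker_emb, w_style, w_speaker):
    # Stage 1 conditions on style; stage 2 conditions on speaker.
    # This ordering is hypothetical, not taken from the paper.
    h = block(add_broadcast(text_feats, style_emb), w_style)
    return block(add_broadcast(h, speaker_emb), w_speaker)

random.seed(0)
T, D = 5, 8  # time steps and feature width (arbitrary toy sizes)
rand_vec = lambda n: [random.gauss(0, 1) for _ in range(n)]
text_feats = [rand_vec(D) for _ in range(T)]
style_emb, speaker_emb = rand_vec(D), rand_vec(D)
w_style = [[0.1 * v for v in rand_vec(D)] for _ in range(D)]
w_speaker = [[0.1 * v for v in rand_vec(D)] for _ in range(D)]

fused = fuse(text_feats, style_emb, speaker_emb, w_style, w_speaker)
print(len(fused), len(fused[0]))  # 5 8
```

Because the two embeddings enter at separate stages, swapping only `speaker_emb` changes the output while the style conditioning stays fixed, which is the kind of independent control the abstract describes.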

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques