MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis
Wenhao Guan, Yishuang Li, Tao Li, Hukai Huang, Feng Wang, Jiayan Lin, Lingyan Huang, Lin Li, Qingyang Hong

TL;DR
This paper introduces MM-TTS, a flexible multi-modal prompt-based TTS system that uses various modalities like speech, images, and text to control speech style, supported by a new dataset and novel modeling techniques.
Contribution
The paper proposes a unified multi-modal style prompt encoder, Style Adaptive Convolutions, and a Rectified Flow based Refiner, enabling flexible style transfer in TTS from diverse input modalities.
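The summary above names a Rectified Flow based Refiner but does not describe it. As a rough illustration of the general rectified flow idea only (not the authors' implementation), the sketch below shows the straight-line interpolation between a noise sample and a data sample, the constant velocity target along that line, and Euler integration of a velocity field from t=0 to t=1. All function names, the toy vectors, and the step count are assumptions made for illustration; in a real system the velocity would come from a trained network conditioned on the style prompt.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Regression target for the velocity along the straight path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(
    x0: Vector,
    velocity: Callable[[Vector, float], Vector],
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-ins (hypothetical): a "noise" start point and a "data" end point.
noise = [0.0, 1.0, -1.0]
data = [2.0, 3.0, 0.5]

# A perfectly learned velocity for this single pair is the constant x1 - x0.
ideal = target_velocity(noise, data)
refined = euler_sample(noise, lambda x, t: ideal, num_steps=4)
midpoint = interpolate(noise, data, 0.5)
```

Because the ideal path is a straight line, a few Euler steps are enough to reach the end point in this toy case, which is the usual motivation for using rectified flow when fast sampling matters.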
Findings
Effective multi-modal style transfer demonstrated on MEAD-TTS dataset.
Supports arbitrary modality prompts for expressive speech synthesis.
Achieves higher fidelity and style control compared to previous methods.
Abstract
The style transfer task in Text-to-Speech refers to the process of transferring style information into text content to generate corresponding speech with a specific style. However, most existing style transfer approaches are based on either fixed emotional labels or reference speech clips, which cannot achieve flexible style transfer. Recently, some methods have adopted text descriptions to guide style transfer. In this paper, we propose a more flexible multi-modal and style-controllable TTS framework named MM-TTS. It can use any modality as the prompt in a unified multi-modal prompt space, including reference speech, emotional facial images, and text descriptions, to control the style of the generated speech within a single system. The challenges of modeling such a multi-modal style-controllable TTS mainly lie in two aspects: 1) aligning the multi-modal information into a unified style space to…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis
