MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis

Wenhao Guan; Yishuang Li; Tao Li; Hukai Huang; Feng Wang; Jiayan Lin; Lingyan Huang; Lin Li; Qingyang Hong

arXiv:2312.10687 · eess.AS · February 1, 2024 · 1 citation

TL;DR

This paper introduces MM-TTS, a flexible multi-modal prompt-based TTS system that uses various modalities like speech, images, and text to control speech style, supported by a new dataset and novel modeling techniques.

Contribution

The paper proposes a unified multi-modal style prompt encoder, Style Adaptive Convolutions, and a Rectified Flow based Refiner, enabling flexible style transfer in TTS from diverse input modalities.
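
This page does not describe how the Rectified Flow based Refiner is built, so the following is only a generic sketch of rectified flow itself, not the paper's implementation. It shows the two ingredients of the technique: a straight-line path between a source sample and a target sample, whose constant velocity is the regression target, and fixed-step Euler integration of a velocity function from t = 0 to t = 1. The function names, the four-value toy vectors, and the "oracle" velocity standing in for a trained network are all assumptions made for illustration.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for a velocity model: constant along the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy data (hypothetical): a random source vector and a fixed target vector.
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(4)]
target = [0.5, -1.0, 2.0, 0.0]


def oracle_velocity(x, t):
    """Assumption: stands in for a trained network by returning the exact target."""
    return target_velocity(source, target)


midpoint = interpolate(source, target, 0.5)
result = euler_sample(oracle_velocity, source, num_steps=4)
```

Because the oracle velocity is constant along the path, Euler integration lands on the target regardless of the step count. A learned velocity model would only approximate this, which is why the number of sampling steps becomes a quality and speed trade-off in practice.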

Findings

1. Effective multi-modal style transfer demonstrated on the MEAD-TTS dataset.
2. Supports arbitrary modality prompts for expressive speech synthesis.
3. Achieves higher fidelity and style control compared to previous methods.

Abstract

The style transfer task in Text-to-Speech refers to the process of transferring style information into text content to generate corresponding speech with a specific style. However, most existing style transfer approaches are either based on fixed emotional labels or reference speech clips, which cannot achieve flexible style transfer. Recently, some methods have adopted text descriptions to guide style transfer. In this paper, we propose a more flexible multi-modal and style controllable TTS framework named MM-TTS. It can utilize any modality as the prompt in unified multi-modal prompt space, including reference speech, emotional facial images, and text descriptions, to control the style of the generated speech in a system. The challenges of modeling such a multi-modal style controllable TTS mainly lie in two aspects: 1) aligning the multi-modal information into a unified style space to…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face Recognition and Analysis