Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech

Haibin Wu; Xiaofei Wang; Sefik Emre Eskimez; Manthan Thakker; Daniel Tompkins; Chung-Hsien Tsai; Canrun Li; Zhen Xiao; Sheng Zhao; Jinyu Li; Naoyuki Kanda

arXiv:2407.12229·eess.AS·September 18, 2024

TL;DR

This paper presents EmoCtrl-TTS, a zero-shot text-to-speech system that generates emotionally expressive speech, including nonverbal vocalizations such as laughter and crying, for any speaker. It is trained on more than 27,000 hours of expressive data curated with pseudo-labeling.

Contribution

The paper introduces EmoCtrl-TTS, a flow-matching-based zero-shot TTS model that controls time-varying emotional states and nonverbal vocalizations (NVs), trained on more than 27,000 hours of expressive data.

Findings

01. Outperforms baseline models in emotional mimicry

02. Capable of generating diverse NVs and emotion dynamics

03. Effective in speech-to-speech translation contexts

Abstract

People change their tones of voice, often accompanied by nonverbal vocalizations (NVs) such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) systems lack the capability to generate speech with rich emotions, including NVs. This paper introduces EmoCtrl-TTS, an emotion-controllable zero-shot TTS that can generate highly emotional speech with NVs for any speaker. EmoCtrl-TTS leverages arousal and valence values, as well as laughter embeddings, to condition the flow-matching-based zero-shot TTS. To achieve high-quality emotional speech generation, EmoCtrl-TTS is trained using more than 27,000 hours of expressive data curated based on pseudo-labeling. Comprehensive evaluations demonstrate that EmoCtrl-TTS excels in mimicking the emotions of audio prompts in speech-to-speech translation scenarios. We also show that EmoCtrl-TTS can capture emotion changes,…
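The conditioning scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, feature layout, and the value of `SIGMA_MIN` are assumptions made for illustration, and the interpolation follows the commonly used optimal-transport form of conditional flow matching rather than anything confirmed by this page. The sketch shows two things: per-frame arousal and valence values and a laughter embedding being concatenated with frame-aligned text features, and the noisy sample and target velocity a flow-matching model would be trained to predict given that condition.

```python
import random

SIGMA_MIN = 1e-5  # assumed small constant, typical of flow-matching setups


def build_condition(text_emb, arousal_valence, laughter_emb):
    """Concatenate per-frame conditioning features (hypothetical layout).

    Each argument holds one entry per frame:
      text_emb[i]        -> list of floats (frame-aligned text features)
      arousal_valence[i] -> (arousal, valence) pair for frame i
      laughter_emb[i]    -> list of floats (laughter/NV embedding)
    Because the values are per frame, the emotion can change over time.
    """
    if not (len(text_emb) == len(arousal_valence) == len(laughter_emb)):
        raise ValueError("all conditioning streams need the same number of frames")
    return [list(t) + list(av) + list(nv)
            for t, av, nv in zip(text_emb, arousal_valence, laughter_emb)]


def flow_matching_pair(x1, t, rng):
    """Return (x_t, target_velocity) for one training example.

    x1 is the clean target (frames x dims) and t lies in [0, 1].
    x_t      = (1 - (1 - SIGMA_MIN) * t) * noise + t * x1
    velocity = x1 - (1 - SIGMA_MIN) * noise
    """
    scale = 1.0 - (1.0 - SIGMA_MIN) * t
    x_t, target = [], []
    for frame in x1:
        noise = [rng.gauss(0.0, 1.0) for _ in frame]
        x_t.append([scale * n + t * v for n, v in zip(noise, frame)])
        target.append([v - (1.0 - SIGMA_MIN) * n for n, v in zip(noise, frame)])
    return x_t, target


if __name__ == "__main__":
    # Two frames: high arousal and positive valence with laughter first,
    # then low arousal and negative valence without laughter.
    cond = build_condition(
        text_emb=[[0.1, 0.2], [0.3, 0.4]],
        arousal_valence=[(0.9, 0.8), (0.1, -0.7)],
        laughter_emb=[[1.0], [0.0]],
    )
    x_t, u = flow_matching_pair([[0.5, -0.5], [0.2, 0.1]], t=0.3, rng=random.Random(0))
    print(len(cond), len(cond[0]), len(x_t), len(u))
```

In a full system, a neural network would take `x_t`, `t`, and the condition (plus a speaker prompt) and regress onto the target velocity; that part is omitted here because the page gives no architectural detail.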

Code & Models

Repositories

hbwu-ntu/emoctrltts-eval (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis