Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech
Haibin Wu, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Daniel, Tompkins, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Sheng Zhao, Jinyu Li,, Naoyuki Kanda

TL;DR
This paper presents EmoCtrl-TTS, a zero-shot text-to-speech system capable of generating emotionally expressive speech with nonverbal vocalizations, adapting to different speakers and emotional states using a large curated dataset.
Contribution
It introduces EmoCtrl-TTS, a novel flow-matching-based TTS model that controls time-varying emotions and NVs in zero-shot scenarios, trained on over 27,000 hours of expressive data.
Findings
Outperforms baseline models in emotional mimicry
Capable of generating diverse NVs and emotion dynamics
Effective in speech-to-speech translation contexts
Abstract
People change their tones of voice, often accompanied by nonverbal vocalizations (NVs) such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) systems lack the capability to generate speech with rich emotions, including NVs. This paper introduces EmoCtrl-TTS, an emotion-controllable zero-shot TTS that can generate highly emotional speech with NVs for any speaker. EmoCtrl-TTS leverages arousal and valence values, as well as laughter embeddings, to condition the flow-matching-based zero-shot TTS. To achieve high-quality emotional speech generation, EmoCtrl-TTS is trained using more than 27,000 hours of expressive data curated based on pseudo-labeling. Comprehensive evaluations demonstrate that EmoCtrl-TTS excels in mimicking the emotions of audio prompts in speech-to-speech translation scenarios. We also show that EmoCtrl-TTS can capture emotion changes,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis
