Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like
Naoyuki Kanda, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Yufei Xia, Jinzhu Li, Yanqing Liu, Sheng Zhao, Michael Zeng

TL;DR
ELaTE is a novel zero-shot TTS system that generates natural, controllable laughter in speech based on short audio prompts, enhancing expressiveness and user experience in speech synthesis.
Contribution
The paper introduces ELaTE, a zero-shot TTS model capable of generating realistic laughter with precise control, leveraging flow-matching and fine-tuning with laughter-specific data.
Findings
Higher quality laughter generation than baseline models
Enhanced controllability over laughter timing and expression
Effective fine-tuning with small laughter-conditioned datasets
Abstract
Laughter is one of the most expressive and natural aspects of human speech, conveying emotions, social cues, and humor. However, most text-to-speech (TTS) systems lack the ability to produce realistic and appropriate laughter sounds, limiting their applications and user experience. While there have been prior works to generate natural laughter, they fell short in terms of controlling the timing and variety of the laughter to be generated. In this work, we propose ELaTE, a zero-shot TTS that can generate natural laughing speech of any speaker based on a short audio prompt with precise control of laughter timing and expression. Specifically, ELaTE works on the audio prompt to mimic the voice characteristic, the text prompt to indicate the contents of the generated speech, and the input to control the laughter expression, which can be either the start and end times of laughter, or the…
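The abstract mentions that laughter can be controlled by giving its start and end times. As a rough illustration only, the sketch below turns such time spans into a frame-level 0/1 indicator that a model could consume alongside the text and audio prompts. The function name `laughter_frame_mask`, the frame rate, and the binary-mask representation are all assumptions made for this example; the summary above does not specify how ELaTE actually encodes the timing input.

```python
from typing import List, Tuple


def laughter_frame_mask(
    segments: List[Tuple[float, float]],
    total_duration_s: float,
    frame_rate_hz: float = 100.0,  # assumed frame rate, not taken from the paper
) -> List[int]:
    """Convert (start, end) laughter times in seconds into a per-frame 0/1 mask.

    Hypothetical helper: a frame is marked 1 if it falls inside any laughter span.
    """
    num_frames = int(round(total_duration_s * frame_rate_hz))
    mask = [0] * num_frames
    for start_s, end_s in segments:
        if end_s <= start_s:
            raise ValueError("each segment must satisfy start < end")
        # Clamp the span to the valid frame range of the utterance.
        first = max(0, int(round(start_s * frame_rate_hz)))
        last = min(num_frames, int(round(end_s * frame_rate_hz)))
        for i in range(first, last):
            mask[i] = 1
    return mask


if __name__ == "__main__":
    # A 3-second utterance with laughter from 1.0 s to 2.0 s, at 10 frames per second.
    mask = laughter_frame_mask([(1.0, 2.0)], total_duration_s=3.0, frame_rate_hz=10.0)
    print(len(mask), sum(mask))
```

A frame-level mask is only one plausible way to express timing; it is used here because it aligns naturally with frame-based acoustic features.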
Taxonomy
Topics: Subtitles and Audiovisual Media · Natural Language Processing Techniques
