CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech

Helin Wang; Jiarui Hai; Dading Chong; Karan Thakkar; Tiantian Feng; Dongchao Yang; Junhyeok Lee; Thomas Thebaud; Laureano Moro Velazquez; Jesus Villalba; Zengyi Qin; Shrikanth Narayanan; Mounya Elhilali; Najim Dehak

arXiv:2506.02863 · eess.AS · September 29, 2025

TL;DR

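CapSpeech is a benchmark for style-captioned text-to-speech synthesis (CapTTS) and four related downstream tasks: CapTTS with sound events (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS), and TTS for chat agents (AgentTTS). It comprises over 10 million machine-annotated and nearly 0.36 million human-annotated audio-caption pairs.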

Contribution

We present CapSpeech, the largest CapTTS dataset, with diverse annotations, together with new datasets for specific downstream tasks, to support research on style-captioned TTS applications.

Findings

01 High-fidelity, intelligible speech synthesis is achieved across styles

02 CapSpeech is the largest CapTTS dataset, with extensive annotations

03 Experiments reveal challenges and insights in CapTTS development

Abstract

Recent advancements in generative artificial intelligence have significantly transformed the field of style-captioned text-to-speech synthesis (CapTTS). However, adapting CapTTS to real-world applications remains challenging due to the lack of standardized, comprehensive datasets and limited research on downstream tasks built upon CapTTS. To address these gaps, we introduce CapSpeech, a new benchmark designed for a series of CapTTS-related tasks, including style-captioned text-to-speech synthesis with sound events (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS), and text-to-speech synthesis for chat agent (AgentTTS). CapSpeech comprises over 10 million machine-annotated audio-caption pairs and nearly 0.36 million human-annotated audio-caption pairs. In addition, we introduce two new datasets collected and recorded by a professional voice actor and…

Code & Models

Models

OpenSound/CapSpeech-models
