AffectSpeech: A Large-Scale Emotional Speech Dataset with Fine-Grained Textual Descriptions for Speech Emotion Captioning and Synthesis
Tianhua Qi, Wenming Zheng, Björn W. Schuller, Zhaojie Luo, Haizhou Li

TL;DR
AffectSpeech is a large-scale emotional speech dataset in which each utterance is annotated along six complementary dimensions through a human-LLM collaborative process, supporting speech emotion captioning and synthesis.
Contribution
This work introduces AffectSpeech, a comprehensive emotional speech dataset with fine-grained annotations and a novel human-LLM collaborative labeling pipeline.
Findings
Models trained on AffectSpeech outperform baselines in emotion captioning.
AffectSpeech enables more expressive and accurate speech emotion synthesis.
The linguistic diversity of the annotations improves model robustness.
Abstract
Emotion is essential in spoken communication, yet most existing frameworks in speech emotion modeling rely on predefined categories or low-dimensional continuous attributes, which offer limited expressive capacity. Recent advances in speech emotion captioning and synthesis have shown that textual descriptions provide a more flexible and interpretable alternative for representing affective characteristics in speech. However, progress in this direction is hindered by the lack of an emotional speech dataset aligned with reliable and fine-grained natural language annotations. To tackle this, we introduce AffectSpeech, a large-scale corpus of human-recorded speech enriched with structured descriptions for fine-grained emotion analysis and generation. Each utterance is characterized across six complementary dimensions, including sentiment polarity, open-vocabulary emotion captions, intensity…
