EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses
Shuhao Xu, Yifan Hu, Jingjing Wu, Zhihao Du, Zheng Lian, Rui Liu

TL;DR
This paper introduces EmoTransCap, a new dataset and pipeline for emotion transition-aware speech captioning that captures discourse-level emotional dynamics and enhances emotional expressiveness in speech synthesis.
Contribution
It presents the first large-scale dataset for discourse-level emotion transitions, a multi-task model for emotion transition recognition, and a controllable speech synthesis system incorporating emotional dynamics.
Findings
The dataset effectively captures emotion transitions at the discourse level.
The multi-task emotion transition recognition model (MTETR) accurately detects emotion transitions and performs diarization.
The speech synthesis system improves emotional expressiveness and control.
Abstract
Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, existing systems remain limited to static, single-emotion characterization within isolated sentences, neglecting dynamic emotional transitions at the discourse level. To address this gap, we propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To construct a dataset rich in emotion transitions while enabling scalable expansion, we design an automated pipeline for dataset creation. This is the first large-scale dataset explicitly designed to capture discourse-level emotion transitions. To generate semantically rich descriptions, we incorporate acoustic attributes and temporal cues…
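To make the contrast with static, single-emotion captioning concrete, the following is a minimal sketch of how discourse-level emotion transitions could be represented and turned into a short textual description. It is not the paper's pipeline or data format: the `EmotionSegment` schema, the `transitions` and `describe` helpers, the emotion labels, and the timestamps are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EmotionSegment:
    """One span of speech carrying a single emotion (hypothetical schema)."""
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    emotion: str  # illustrative label, e.g. "neutral" or "angry"


def transitions(segments: List[EmotionSegment]) -> List[Tuple[float, str, str]]:
    """Return (time, from_emotion, to_emotion) wherever the emotion changes."""
    ordered = sorted(segments, key=lambda s: s.start)
    changes = []
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.emotion != cur.emotion:
            changes.append((cur.start, prev.emotion, cur.emotion))
    return changes


def describe(segments: List[EmotionSegment]) -> str:
    """Build a simple transition-aware description from timed segments."""
    ordered = sorted(segments, key=lambda s: s.start)
    if not ordered:
        return "No speech segments."
    changes = transitions(ordered)
    if not changes:
        # A static, single-emotion description is the special case
        # in which no transition occurs.
        return f"The speaker stays {ordered[0].emotion} throughout."
    parts = [f"starts {ordered[0].emotion}"]
    parts += [f"shifts to {to} at {t:.1f}s" for t, _, to in changes]
    return "The speaker " + ", then ".join(parts) + "."


# Hypothetical three-segment discourse, for illustration only.
example = [
    EmotionSegment(0.0, 4.2, "neutral"),
    EmotionSegment(4.2, 9.0, "angry"),
    EmotionSegment(9.0, 12.5, "sad"),
]
print(describe(example))
```

In this sketch a single-emotion caption is just the case where `transitions` returns an empty list; the transition-aware view adds when the emotion changes and what it changes to.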
