EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses

Shuhao Xu; Yifan Hu; Jingjing Wu; Zhihao Du; Zheng Lian; Rui Liu

arXiv:2604.26417 · cs.CL · April 30, 2026

TL;DR

This paper introduces EmoTransCap, a new dataset and pipeline for emotion transition-aware speech captioning that captures discourse-level emotional dynamics and enhances emotional expressiveness in speech synthesis.

Contribution

It presents the first large-scale dataset for discourse-level emotion transitions, a multi-task model for emotion transition recognition, and a controllable speech synthesis system incorporating emotional dynamics.

Findings

01. The dataset captures emotion transitions at the discourse level.

02. The MTETR (multi-task emotion transition recognition) model accurately detects emotion transitions and performs diarization.

03. The speech synthesis system improves emotional expressiveness and control.

Abstract

Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, existing systems remain limited to static, single-emotion characterization within isolated sentences, neglecting dynamic emotional transitions at the discourse level. To address this gap, we propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To construct a dataset rich in emotion transitions while enabling scalable expansion, we design an automated pipeline for dataset creation. This is the first large-scale dataset explicitly designed to capture discourse-level emotion transitions. To generate semantically rich descriptions, we incorporate acoustic attributes and temporal cues…
