EmoSURA: Towards Accurate Evaluation of Detailed and Long-Context Emotional Speech Captions

Xin Jing; Andreas Triantafyllopoulos; Jiadong Wang; Shahin Amiriparian; Jun Luo; Björn Schuller

arXiv:2603.09820 · cs.SD · March 11, 2026

TL;DR

EmoSURA introduces an atomic verification framework for evaluating detailed emotional speech captions, improving correlation with human judgments over traditional metrics, especially for long-form descriptions.

Contribution

The paper presents EmoSURA, a novel evaluation method that decomposes captions into atomic units and verifies them against speech, along with SURABench, a new benchmark resource.
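The decompose-then-verify idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sentence-level splitter stands in for the paper's decomposition into atomic units, the `verify` callable stands in for its audio-grounded verifier, the tag set stands in for a real speech signal, and scoring a caption as the fraction of supported units is one plausible aggregation, not necessarily the one EmoSURA uses.

```python
import re
from typing import Callable, List


def split_into_units(caption: str) -> List[str]:
    """Naive stand-in for decomposition: one unit per sentence or clause.

    Hypothetical helper; the paper's actual decomposition is not shown here.
    """
    parts = re.split(r"[.;!?]+", caption)
    return [p.strip() for p in parts if p.strip()]


def atomic_score(
    caption: str,
    audio: object,
    verify: Callable[[str, object], bool],
) -> float:
    """Fraction of units that the verifier accepts for this audio.

    The aggregation (a simple ratio) is an assumption of this sketch.
    """
    units = split_into_units(caption)
    if not units:
        return 0.0
    supported = sum(1 for unit in units if verify(unit, audio))
    return supported / len(units)


def toy_verify(unit: str, audio: object) -> bool:
    """Toy verifier: 'audio' is just a set of attribute tags, not a waveform."""
    return any(tag in unit.lower() for tag in audio)


# Toy example: two of the three statements match the tag set.
audio_tags = {"angry", "fast"}
caption = "The speaker sounds angry. The pace is fast; the pitch is low."
score = atomic_score(caption, audio_tags, toy_verify)
print(round(score, 3))
```

Passing the verifier in as a callable keeps the scoring loop independent of whichever model actually checks a unit against the speech.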

Findings

01. EmoSURA correlates positively with human judgments.

02. Traditional metrics correlate negatively with human judgments on long captions.

03. SURABench provides a standardized evaluation dataset.
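Agreement between a metric and human ratings is commonly measured with a rank correlation. The sketch below computes Spearman's rho in plain Python; the choice of Spearman is an assumption, since this page does not state which correlation statistic the paper reports, and the function names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

A value near +1 means the metric orders captions the way human raters do; a negative value means it tends to rank them in the opposite order.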

Abstract

Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional N-gram metrics fail to capture semantic nuances, while LLM judges often suffer from reasoning inconsistency and context-collapse when processing long-form descriptions. In this work, we propose EmoSURA, a novel evaluation framework that shifts the paradigm from holistic scoring to atomic verification. EmoSURA decomposes complex captions into Atomic Perceptual Units, which are self-contained statements regarding vocal or emotional attributes, and employs an audio-grounded verification mechanism to validate each unit against the raw speech signal. Furthermore, we address the scarcity of standardized evaluation resources by introducing SURABench, a carefully balanced and…

Taxonomy

Topics: Emotion and Mood Recognition · Multimodal Machine Learning Applications · Sentiment Analysis and Opinion Mining