Stimulus Modality Matters: Impact of Perceptual Evaluations from Different Modalities on Speech Emotion Recognition System Performance

Huang-Cheng Chou; Haibin Wu; Hung-yi Lee; Chi-Chun Lee

arXiv:2409.10762 · eess.AS · October 15, 2025

TL;DR

This paper investigates how the stimulus modality used to elicit perceptual emotion labels affects the performance of speech emotion recognition (SER) systems, finding that labels elicited from voice-only stimuli yield better results.

Contribution

It provides a comprehensive comparison of emotion labels elicited from different modalities and introduces an all-inclusive label approach for training SER systems.

Findings

01 Voice-only elicited labels improve SER performance.

02 Multimodal labels do not outperform voice-only labels.

03 An all-inclusive label combining multiple modalities was evaluated.

Abstract

Speech Emotion Recognition (SER) systems rely on speech input and emotional labels annotated by humans. However, emotion databases collect perceptual evaluations in different ways. For instance, the IEMOCAP dataset presents annotators with video clips with sound when collecting their emotional perceptions, whereas the most significant English emotion dataset, MSP-PODCAST, provides raters with speech only. Nevertheless, using speech as input is the standard approach to training SER systems. The open question, therefore, is which stimulus scenario elicits the emotional labels that are most effective for training SER systems. We comprehensively compare the effectiveness of SER systems trained with labels elicited by different modality stimuli and evaluate the SER systems on various testing conditions. Also, we introduce an all-inclusive label that combines…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Emotion and Mood Recognition