EMOVOME: A Dataset for Emotion Recognition in Spontaneous Real-Life Speech
Lucía Gómez-Zaragozá, Rocío del Amor, María José Castro-Bleda, Valery Naranjo, Mariano Alcañiz Raya, Javier Marín-Morales

TL;DR
This paper introduces EMOVOME, a new dataset of spontaneous, real-life emotional speech from Spanish speakers, and evaluates state-of-the-art models on it, showing improvements over acoustic-feature baselines and highlighting the challenges of real-world emotion recognition.
Contribution
The paper presents EMOVOME, a publicly available dataset of spontaneous emotional speech from real-life conversations, and benchmarks Speech Emotion Recognition (SER) models on it, demonstrating the gap between controlled and real-world data.
Findings
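Pre-trained UniSpeech-SAT-Large achieved the highest results on EMOVOME: 61.64% Unweighted Accuracy (UA) for 3-class valence prediction and 55.57% UA for 3-class arousal prediction.
On EMOVOME, the pre-trained model outperformed the baseline models by approximately 10%.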
Combining expert and non-expert annotations improved fairness and results.
Abstract
Spontaneous datasets for Speech Emotion Recognition (SER) are scarce and frequently derived from laboratory environments or staged scenarios, such as TV shows, limiting their application in real-world contexts. We developed and publicly released the Emotional Voice Messages (EMOVOME) dataset, including 999 voice messages from real conversations of 100 Spanish speakers on a messaging app, labeled in continuous and discrete emotions by expert and non-expert annotators. We evaluated speaker-independent SER models using acoustic features as baseline and transformer-based models. We compared the results with reference datasets including acted and elicited speech, and analyzed the influence of annotators and gender fairness. The pre-trained UniSpeech-SAT-Large model achieved the highest results, 61.64% and 55.57% Unweighted Accuracy (UA) for 3-class valence and arousal prediction respectively…
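The abstract reports results as Unweighted Accuracy (UA). The summary does not spell out the formula, so the sketch below assumes the definition commonly used in SER work: the mean of the per-class recalls, so that each class counts equally regardless of how many samples it has. The function name and the toy labels are hypothetical and are not taken from the paper or the dataset.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (assumed definition of UA).

    Every class present in y_true contributes equally, no matter
    how many samples it has.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Made-up 3-class valence labels, for illustration only.
y_true = ["neg", "neg", "neg", "neg", "neu", "neu", "pos", "pos"]
y_pred = ["neg", "neg", "neg", "neg", "neu", "neg", "neg", "neg"]

ua = unweighted_accuracy(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UA = {ua:.3f}, plain accuracy = {plain_accuracy:.3f}")
```

In this toy case the classifier favors the majority class, so plain accuracy is higher than UA. This is one reason a class-balanced metric is often preferred when emotion classes are unevenly represented.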
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition
Methods: Sparse Evolutionary Training
