EMOVOME: A Dataset for Emotion Recognition in Spontaneous Real-Life Speech

Lucía Gómez-Zaragozá; Rocío del Amor; María José Castro-Bleda; Valery Naranjo; Mariano Alcañiz Raya; Javier Marín-Morales

arXiv:2403.02167 · eess.AS · December 5, 2024 · 2 cites

TL;DR

This paper introduces EMOVOME, a new dataset of spontaneous, real-life emotional speech from Spanish speakers. It evaluates state-of-the-art models on the dataset, showing significant improvements over baselines and highlighting the challenges of real-world emotion recognition.

Contribution

The paper presents EMOVOME, a publicly available dataset of spontaneous speech emotions in real-life conversations, and benchmarks SER models demonstrating the gap between controlled and real-world data.

Findings

01 Pre-trained UniSpeech-SAT-Large achieved 61.64% Unweighted Accuracy (UA) for 3-class valence prediction.

02 On EMOVOME, the evaluated models outperformed the baseline models by approximately 10%.

03 Combining expert and non-expert annotations improved fairness and results.

Abstract

Spontaneous datasets for Speech Emotion Recognition (SER) are scarce and frequently derived from laboratory environments or staged scenarios, such as TV shows, limiting their application in real-world contexts. We developed and publicly released the Emotional Voice Messages (EMOVOME) dataset, including 999 voice messages from real conversations of 100 Spanish speakers on a messaging app, labeled in continuous and discrete emotions by expert and non-expert annotators. We evaluated speaker-independent SER models using acoustic features as baseline and transformer-based models. We compared the results with reference datasets including acted and elicited speech, and analyzed the influence of annotators and gender fairness. The pre-trained UniSpeech-SAT-Large model achieved the highest results, 61.64% and 55.57% Unweighted Accuracy (UA) for 3-class valence and arousal prediction respectively…
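The Unweighted Accuracy (UA) reported in the abstract is conventionally the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. The sketch below illustrates that definition on toy data. The function name, the label strings, and the example labels are hypothetical and are not taken from the paper or its repository.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Return the mean of per-class recalls (each class weighted equally)."""
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # number of samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced 3-class valence labels (toy data, not from EMOVOME).
y_true = ["neg"] * 6 + ["neu", "pos"]
# A degenerate classifier that always predicts the majority class.
y_pred = ["neg"] * 8

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
ua = unweighted_accuracy(y_true, y_pred)

print(f"plain accuracy: {plain_accuracy:.3f}")
print(f"unweighted accuracy: {ua:.3f}")
```

In this toy case the always-majority classifier gets 6 of 8 samples right but recovers only one of the three classes, so UA is much lower than plain accuracy. This is why UA is a common choice for imbalanced emotion labels.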

Code & Models

Repositories

luciagomza/ser_emovome (Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition

Methods: Sparse Evolutionary Training