Emotional Voice Messages (EMOVOME) database: emotion recognition in spontaneous voice messages

Lucía Gómez Zaragozá (1), Rocío del Amor (1), Elena Parra Vargas (1), Valery Naranjo (1), Mariano Alcañiz Raya (1), Javier Marín-Morales (1) ((1) HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain)

arXiv:2402.17496 · cs.SD · June 14, 2024 · 1 citation

Open Access

TL;DR

The EMOVOME database is a naturalistic Spanish speech dataset of 999 voice messages from 100 speakers, labeled for emotion. It supports emotion recognition research on real-world voice messages and comes with baseline results.

Contribution

This paper introduces EMOVOME, a novel spontaneous speech dataset with emotion annotations, and establishes baseline emotion recognition models using speech and text.

Findings

01. Speech-based models (eGeMAPS features with support vector machines) reached 49.27% unweighted accuracy for valence and 44.71% for arousal.

02. Text-based models achieved over 61% accuracy for valence.

03. The dataset provides a resource for emotion recognition research on spontaneous, in-the-wild speech.

Abstract

Emotional Voice Messages (EMOVOME) is a spontaneous speech dataset containing 999 audio messages from real conversations on a messaging app, recorded by 100 Spanish speakers and gender-balanced. The voice messages were produced in in-the-wild conditions before participants were recruited, avoiding any conscious bias due to a laboratory environment. The audio messages were labeled in the valence and arousal dimensions by three non-experts and two experts, and these labels were then combined to obtain a final label per dimension. The experts also provided an extra label corresponding to seven emotion categories. To set a baseline for future investigations using EMOVOME, we implemented emotion recognition models using both speech and audio transcriptions. For speech, we used the standard eGeMAPS feature set and support vector machines, obtaining 49.27% and 44.71% unweighted accuracy for valence and arousal, respectively. For text, we…
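The abstract reports unweighted accuracy. As a minimal sketch, assuming the term is used in its common speech emotion recognition sense (the mean of per-class recalls, so each class counts equally regardless of size), the metric can be computed as below. The function names and the toy labels are hypothetical and are not taken from the paper or the dataset.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: every class contributes equally."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def weighted_accuracy(y_true, y_pred):
    """Plain accuracy: the fraction of all samples predicted correctly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical, imbalanced toy labels for a three-class valence task.
y_true = ["neg", "neg", "neg", "neg", "neu", "pos"]
# A degenerate classifier that always predicts the majority class.
y_pred = ["neg"] * 6

ua = unweighted_accuracy(y_true, y_pred)
wa = weighted_accuracy(y_true, y_pred)
print(f"unweighted accuracy: {ua:.4f}")
print(f"weighted accuracy:   {wa:.4f}")
```

With these toy labels the majority-class predictor gets every "neg" sample right and misses the other two classes entirely, so its unweighted accuracy is 1/3 while its plain accuracy is 4/6. Under this definition, chance level for a balanced three-class problem is about 33%, which is one way to put a figure such as 49.27% in context, assuming three classes.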

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Attention Is All You Need · Sparse Evolutionary Training · Softmax · WordPiece · Residual Connection · Linear Layer · Weight Decay · Dropout · Layer Normalization