Enhancing Speech Emotion Recognition with Graph-Based Multimodal Fusion and Prosodic Features for the Speech Emotion Recognition in Naturalistic Conditions Challenge at Interspeech 2025

Alef Iury Siqueira Ferreira; Lucas Rafael Gris; Alexandre Ferro Filho; Lucas Ólives; Daniel Ribeiro; Luiz Fernando; Fernanda Lustosa; Rodrigo Tanaka; Frederico Santos de Oliveira; Arlindo Galvão Filho

arXiv:2506.02088·cs.SD·June 4, 2025

TL;DR

This paper presents a multimodal speech emotion recognition system for naturalistic speech that combines audio and text models with prosodic and spectral features and graph-based fusion, reaching a Macro F1-score of 39.79% on the official test set.

Contribution

The paper presents a fusion approach based on Graph Attention Networks and integrates prosodic and spectral cues to improve emotion recognition in spontaneous speech.

Findings

1. Achieved a Macro F1-score of 39.79% on the test set.
2. Demonstrated effectiveness of graph-based fusion techniques.
3. Validated the benefit of prosodic and spectral features.
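
For readers unfamiliar with the metric reported in the findings above, Macro F1 is the unweighted mean of the per-class F1 scores, so rare emotion classes count as much as frequent ones. The following is a minimal sketch of that standard definition, not the challenge's official scoring script; the function name `macro_f1` and the toy labels in the usage note are illustrative assumptions.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred.

    Illustrative helper (hypothetical name), following the standard definition
    F1_c = 2*TP_c / (2*TP_c + FP_c + FN_c).
    """
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # predicted p, but it was wrong
            fn[t] += 1      # true class t was missed

    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # A class with no true positives, false positives, or false negatives scores 0.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Calling `macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])` averages the F1 of class `a` (2/3) and class `b` (4/5) with equal weight, regardless of how many samples each class has.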

Abstract

Training speech emotion recognition (SER) models on natural, spontaneous speech is especially challenging because emotions are expressed subtly and real-world audio is unpredictable. In this paper, we present a robust system for the INTERSPEECH 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge, focusing on categorical emotion recognition. Our method combines state-of-the-art audio models with text features enriched by prosodic and spectral cues. In particular, we investigate the effectiveness of Fundamental Frequency (F0) quantization and the use of a pretrained audio tagging model. We also employ an ensemble model to improve robustness. On the official test set, our system achieved a Macro F1-score of 39.79% (42.20% on validation). Our results underscore the potential of these methods, and analysis of fusion techniques confirmed the effectiveness of Graph Attention Networks. Our…
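
The abstract does not say how F0 is quantized, so the following is only a minimal sketch of one common approach, under stated assumptions: log-spaced bins between hypothetical bounds `f_min` and `f_max`, a hypothetical bin count `n_bins`, and token 0 reserved for unvoiced frames. The function name `quantize_f0` and all default values are illustrative and are not taken from the paper.

```python
import math


def quantize_f0(f0_hz, n_bins=32, f_min=50.0, f_max=500.0):
    """Map each frame's F0 (in Hz) to an integer token.

    Assumptions (not from the paper): log-spaced bins over [f_min, f_max],
    tokens 1..n_bins for voiced frames, token 0 for unvoiced frames
    (given here as None, NaN, or a non-positive value).
    """
    log_min, log_max = math.log(f_min), math.log(f_max)
    tokens = []
    for f in f0_hz:
        if f is None or f <= 0 or math.isnan(f):
            tokens.append(0)  # unvoiced frame
            continue
        f = min(max(f, f_min), f_max)  # clip to the supported pitch range
        pos = (math.log(f) - log_min) / (log_max - log_min)  # position in [0, 1]
        bin_idx = min(int(pos * n_bins), n_bins - 1)  # keep f_max in the last bin
        tokens.append(bin_idx + 1)
    return tokens
```

A log scale is used in this sketch because perceived pitch is roughly logarithmic in frequency, so equal-width bins in log-Hz correspond to equal musical intervals. The resulting token sequence could then be fed to a model alongside text or audio features, though how the authors actually consume the quantized contour is not stated in the abstract.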

Taxonomy

Topics: Emotion and Mood Recognition

Methods: Softmax · Attention Is All You Need