MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions

Georgios Chatzichristodoulou; Despoina Kosmopoulou; Antonios Kritikos; Anastasia Poulopoulou; Efthymios Georgiou; Athanasios Katsamanis; Vassilis Katsouros; Alexandros Potamianos

arXiv:2506.09556 · cs.CL · September 5, 2025

TL;DR

MEDUSA is a multimodal framework for speech emotion recognition that addresses class imbalance and emotion ambiguity with a four-stage training pipeline: the first two stages train an ensemble of classifiers built on a deep cross-modal fusion mechanism, and the last two optimize a trainable meta-classifier that combines the ensemble predictions.

Contribution

It introduces MEDUSA, a novel multi-stage training framework with a deep cross-modal transformer and soft target learning for naturalistic speech emotion recognition.
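
Soft target learning can be illustrated with a minimal sketch, assuming the soft target is the normalized per-class count of annotator votes and the loss is a cross-entropy against that distribution. The function names (`soft_targets`, `soft_cross_entropy`) and the four-class example are hypothetical and are not taken from the paper or its repository.

```python
import math


def soft_targets(votes):
    """Normalize per-class annotator vote counts into a probability distribution.

    Assumption: one integer count per emotion class (hypothetical input format).
    """
    total = sum(votes)
    if total == 0:
        raise ValueError("at least one annotator vote is required")
    return [v / total for v in votes]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target distribution and model logits."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))


# Five annotators: three vote for class 0, one each for classes 1 and 2.
target = soft_targets([3, 1, 1, 0])
loss = soft_cross_entropy([1.5, 0.2, -0.3, -1.0], target)
```

With a one-hot target this reduces to the usual cross-entropy, so the soft version only changes the loss on utterances where annotators disagree.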

Findings

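01. Ranked 1st in Task 1 (Categorical Emotion Recognition) of the Interspeech 2025 SER Challenge.

02. Addresses class imbalance and emotion ambiguity with balanced data sampling, soft targets derived from human annotation scores, and multitask learning.

03. Uses DeepSER, a deep cross-modal transformer fusion mechanism, within a four-stage training pipeline.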

Abstract

Speech emotion recognition (SER) is a challenging task due to the subjective nature of human emotions and their uneven representation under naturalistic conditions. We propose MEDUSA, a multimodal framework with a four-stage training pipeline, which effectively handles class imbalance and emotion ambiguity. The first two stages train an ensemble of classifiers that utilize DeepSER, a novel extension of a deep cross-modal transformer fusion mechanism from pretrained self-supervised acoustic and linguistic representations. Manifold MixUp is employed for further regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. Our training approach incorporates human annotation scores as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA ranked 1st in Task 1: Categorical Emotion Recognition in the Interspeech 2025: Speech Emotion Recognition…
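
The Manifold MixUp regularization mentioned in the abstract can be sketched as follows. Manifold MixUp interpolates hidden representations of two training examples, together with their targets, using a coefficient drawn from a Beta distribution. This is a simplified illustration on plain Python lists: the `alpha` value, the helper names, and the choice of which hidden layer to mix are assumptions, since the abstract does not state them.

```python
import random


def interpolate(a, b, lam):
    """Element-wise convex combination lam * a + (1 - lam) * b."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def manifold_mixup(hidden_a, target_a, hidden_b, target_b, alpha=2.0, rng=random):
    """Mix two hidden representations and their (soft) targets.

    Assumptions: `hidden_*` are activations taken at the same intermediate
    layer, `target_*` are probability vectors, and alpha=2.0 is a placeholder
    value rather than the setting used in the paper.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    mixed_hidden = interpolate(hidden_a, hidden_b, lam)
    mixed_target = interpolate(target_a, target_b, lam)
    return mixed_hidden, mixed_target, lam


rng = random.Random(0)  # seeded for reproducibility
mixed_h, mixed_t, lam = manifold_mixup(
    [1.0, 0.0], [1.0, 0.0],  # example A: hidden vector, one-hot target
    [0.0, 1.0], [0.0, 1.0],  # example B
    rng=rng,
)
```

Because the targets are mixed with the same coefficient as the hidden states, the mixed target remains a valid probability distribution and can be fed to a soft-target loss.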

Code & Models

Repositories

emopodntua/medusa (PyTorch, official)

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Mixup · Manifold Mixup