Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention

Cong Wang; Yizhong Geng; Yuhua Wen; Qifei Li; Yingming Gao; Ruimin Wang; Chunfeng Wang; Hao Li; Ya Li; Wei Chen

arXiv:2512.04551 · cs.SD · March 6, 2026

Open Access

TL;DR

This paper introduces a multi-loss learning framework for speech emotion recognition that combines energy-adaptive mixup, frame-level attention, and multiple loss functions to improve accuracy and robustness across several datasets.

Contribution

The paper proposes a novel multi-loss learning approach integrating energy-adaptive mixup and frame-level attention for enhanced speech emotion recognition performance.

Findings

01 Achieves state-of-the-art results on four SER datasets (IEMOCAP, MSP-IMPROV, RAVDESS, and SAVEE).

02 Effectively handles class imbalance and emotional variability.

03 Demonstrates robustness and improved feature discrimination.

Abstract

Speech emotion recognition (SER) is an important technology in human-computer interaction. However, achieving high performance is challenging due to emotional complexity and scarce annotated data. To tackle these challenges, we propose a multi-loss learning (MLL) framework integrating an energy-adaptive mixup (EAM) method and a frame-level attention module (FLAM). The EAM method leverages SNR-based augmentation to generate diverse speech samples capturing subtle emotional variations. FLAM enhances frame-level feature extraction for multi-frame emotional cues. Our MLL strategy combines Kullback-Leibler divergence, focal, center, and supervised contrastive losses to optimize learning, address class imbalance, and improve feature separability. We evaluate our method on four widely used SER datasets: IEMOCAP, MSP-IMPROV, RAVDESS, and SAVEE. The results demonstrate our method achieves…
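
The abstract does not give the EAM formulation, so the following is only a rough sketch of what SNR-based mixup could look like, not the authors' method. It assumes two equal-length waveforms are mixed at a chosen signal-to-noise ratio, that the soft label weight is the primary signal's share of the total energy, and that the Kullback-Leibler term compares that soft label with the model's predicted distribution. The names `mix_at_snr`, `soft_label`, and `kl_divergence` are hypothetical and do not come from the paper.

```python
import numpy as np


def mix_at_snr(primary, secondary, snr_db, eps=1e-12):
    """Mix two equal-length waveforms at a target primary-to-secondary SNR (dB).

    Assumption: the label weight `lam` is the primary signal's share of the
    total energy after scaling. The paper may define it differently.
    """
    primary = np.asarray(primary, dtype=np.float64)
    secondary = np.asarray(secondary, dtype=np.float64)
    energy_p = np.sum(primary ** 2)
    energy_s = np.sum(secondary ** 2)
    ratio = 10.0 ** (snr_db / 10.0)  # linear energy ratio
    # Scale the secondary so that energy_p / (gain^2 * energy_s) == ratio.
    gain = np.sqrt(energy_p / (energy_s * ratio + eps))
    mixed = primary + gain * secondary
    lam = ratio / (1.0 + ratio)
    return mixed, lam


def soft_label(label_a, label_b, lam, num_classes):
    """Build a mixup-style soft label from two class indices."""
    y = np.zeros(num_classes)
    y[label_a] += lam
    y[label_b] += 1.0 - lam
    return y


def kl_divergence(target, logits):
    """KL(target || softmax(logits)) for a single example."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - np.max(logits)  # numerically stable log-softmax
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    mask = target > 0  # terms with zero target mass contribute nothing
    return float(np.sum(target[mask] * (np.log(target[mask]) - log_probs[mask])))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    wave_a = rng.standard_normal(16000)  # stand-in for a one-second utterance
    wave_b = rng.standard_normal(16000)
    mixed, lam = mix_at_snr(wave_a, wave_b, snr_db=5.0)
    y = soft_label(0, 2, lam, num_classes=4)
    print(lam, y, kl_divergence(y, np.zeros(4)))
```

In a full multi-loss setup this KL term would be summed with focal, center, and supervised contrastive terms under some weighting. The abstract does not state those weights, so they are left out here.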

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Speech and Audio Processing