Lessons Learnt: Revisit Key Training Strategies for Effective Speech Emotion Recognition in the Wild

Jing-Tong Tzeng; Bo-Hao Su; Ya-Tse Wu; Hsing-Hang Chou; Chi-Chun Lee

arXiv:2508.07282·eess.AS·September 26, 2025·Interspeech

TL;DR

This paper revisits and optimizes key training strategies for speech emotion recognition in naturalistic conditions, demonstrating that simple modifications can significantly improve model robustness and performance.

Contribution

The study identifies training strategies (balancing strategies, activation functions, and fine-tuning techniques) that enhance SER performance without increasing model complexity.

Findings

01

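Achieved a valence concordance correlation coefficient (CCC) of 0.6953 with a multi-modal fusion model, the best valence score in Task 2: Emotional Attribute Regression.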

02

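Fine-tuning RoBERTa and WavLM separately in a single-modality setting, then fusing their features without training the backbone extractors, yields the highest valence performance.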

03

Focal loss and activation functions boost performance without added complexity.
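For readers unfamiliar with focal loss, the following is a minimal sketch of the standard multi-class formulation, which down-weights examples the model already classifies confidently. It is not taken from the paper: the function name `focal_loss`, the default `gamma=2.0`, and the toy logits are assumptions for illustration, and the authors' exact variant may differ.

```python
import numpy as np

def focal_loss(logits, targets, gamma=2.0):
    """Mean multi-class focal loss (hypothetical helper, not the paper's code).

    logits:  array of shape (batch, classes), unnormalized scores
    targets: array of shape (batch,), integer class indices
    gamma:   focusing parameter; gamma=0 reduces to plain cross-entropy
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=int)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Log-probability assigned to the true class of each example.
    log_pt = log_probs[np.arange(len(targets)), targets]
    pt = np.exp(log_pt)
    # The factor (1 - pt)^gamma shrinks the loss of well-classified examples.
    return float(np.mean(-((1.0 - pt) ** gamma) * log_pt))

# Toy batch (assumed values): two examples, three classes.
example_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
example_targets = [0, 2]
print(focal_loss(example_logits, example_targets, gamma=2.0))
print(focal_loss(example_logits, example_targets, gamma=0.0))  # cross-entropy
```

Setting `gamma=0` recovers ordinary cross-entropy, which makes it easy to compare the two losses on the same batch.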

Abstract

In this study, we revisit key training strategies in machine learning often overlooked in favor of deeper architectures. Specifically, we explore balancing strategies, activation functions, and fine-tuning techniques to enhance speech emotion recognition (SER) in naturalistic conditions. Our findings show that simple modifications improve generalization with minimal architectural changes. Our multi-modal fusion model, integrating these optimizations, achieves a valence CCC of 0.6953, the best valence score in Task 2: Emotional Attribute Regression. Notably, fine-tuning RoBERTa and WavLM separately in a single-modality setting, followed by feature fusion without training the backbone extractor, yields the highest valence performance. Additionally, focal loss and activation functions significantly enhance performance without increasing complexity. These results suggest that refining core…
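The valence score quoted above is a concordance correlation coefficient (CCC), which rewards predictions that both correlate with the labels and match their mean and scale. A minimal sketch of the standard definition follows; the function name `concordance_cc` and the toy values are assumptions for illustration, not the challenge's official scoring code.

```python
import numpy as np

def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two 1-D sequences.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean_pred - mean_gold)^2)
    Population (ddof=0) statistics are used throughout.
    """
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    mean_pred, mean_gold = pred.mean(), gold.mean()
    covariance = np.mean((pred - mean_pred) * (gold - mean_gold))
    denominator = pred.var() + gold.var() + (mean_pred - mean_gold) ** 2
    return float(2.0 * covariance / denominator)

# Toy valence ratings (assumed values).
gold = [1.0, 2.0, 3.0]
print(concordance_cc(gold, gold))             # perfect agreement
print(concordance_cc([2.0, 3.0, 4.0], gold))  # same trend, shifted mean
```

Unlike Pearson correlation, the second call is penalized for the constant offset, which is why CCC is a common choice for emotional attribute regression.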
