Developing a Top-tier Framework in Naturalistic Conditions Challenge for Categorized Emotion Prediction: From Speech Foundation Models and Learning Objective to Data Augmentation and Engineering Choices
Tiantian Feng, Thanathai Lertpetchpun, Dani Byrd, Shrikanth Narayanan

TL;DR
This paper presents the SAILER system for naturalistic speech emotion recognition, demonstrating that a simple, well-designed model can outperform most submissions in the INTERSPEECH 2025 challenge, with significant improvements from data and engineering choices.
Contribution
The paper introduces a robust, reproducible SER system tailored for natural emotional speech, emphasizing modeling, data augmentation, and engineering strategies that enhance performance.
Findings
A single system (without ensembling) outperforms more than 95% of submissions, with a Macro-F1 score above 0.4
Ensemble of three systems achieves top-3 ranking
Effective modeling and data choices improve emotion recognition in natural speech
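The ensemble finding above can be illustrated with a minimal sketch. The summary does not say how the three systems are combined, so averaging per-class probabilities is only an assumption here, and the function name `ensemble_predict`, the label set, and the probability values are all hypothetical.

```python
def ensemble_predict(prob_sets, labels):
    """Average class probabilities across systems, then pick the top class.

    prob_sets: one entry per system; each entry is a list of per-utterance
               probability vectors aligned with `labels`.
    NOTE: probability averaging is an assumed fusion rule, not necessarily
    the one used by the paper's authors.
    """
    n_systems = len(prob_sets)
    n_utts = len(prob_sets[0])
    n_classes = len(labels)
    preds = []
    for i in range(n_utts):
        # Mean probability of each class over all systems for utterance i.
        avg = [sum(ps[i][k] for ps in prob_sets) / n_systems
               for k in range(n_classes)]
        best = max(range(n_classes), key=avg.__getitem__)
        preds.append(labels[best])
    return preds


# Hypothetical outputs of three systems on two utterances.
labels = ["neutral", "happy", "angry"]
sys_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
sys_b = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
sys_c = [[0.2, 0.7, 0.1], [0.2, 0.2, 0.6]]
print(ensemble_predict([sys_a, sys_b, sys_c], labels))
```

In this toy input, two of the three systems prefer "neutral" for the first utterance, yet the averaged probabilities favor "happy". This shows how probability averaging can differ from majority voting.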
Abstract
Speech emotion recognition (SER), particularly for naturally expressed emotions, remains a challenging computational task. Key challenges include the inherent subjectivity in emotion annotation and the imbalanced distribution of emotion labels in datasets. This paper introduces the SAILER system developed for participation in the INTERSPEECH 2025 Emotion Recognition Challenge (Task 1). The challenge dataset, which contains natural emotional speech from podcasts, serves as a valuable resource for studying imbalanced and subjective emotion annotations. Our system is designed to be simple, reproducible, and effective, highlighting critical choices in modeling, learning objectives, data augmentation, and engineering. Results show that even a single system (without ensembling) can outperform more than 95% of the submissions, with a Macro-F1 score exceeding 0.4. Moreover, an…
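Because the abstract stresses imbalanced labels and reports Macro-F1, a small worked example may help show why that metric is informative. The sketch below computes Macro-F1 in plain Python as the unweighted mean of per-class F1 scores. The function name `macro_f1` and the toy labels are illustrative assumptions, not taken from the paper or the challenge data.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced set: a classifier that always answers "neutral".
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f}  macro_f1={macro_f1(y_true, y_pred):.3f}")
```

In this toy case the always-"neutral" predictor reaches 0.80 accuracy but only about 0.444 Macro-F1, since the missed minority class contributes an F1 of zero with equal weight.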
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis
