Developing a Top-tier Framework in Naturalistic Conditions Challenge for Categorized Emotion Prediction: From Speech Foundation Models and Learning Objective to Data Augmentation and Engineering Choices

Tiantian Feng; Thanathai Lertpetchpun; Dani Byrd; Shrikanth Narayanan

arXiv:2505.22133 · cs.SD · June 3, 2025

TL;DR

This paper presents the SAILER system for naturalistic speech emotion recognition, showing that a simple, well-designed single model can outperform more than 95% of submissions to the INTERSPEECH 2025 Emotion Recognition Challenge (Task 1), with modeling, data augmentation, and engineering choices all contributing to the result.

Contribution

The paper introduces a robust, reproducible SER system tailored for natural emotional speech, emphasizing modeling, data augmentation, and engineering strategies that enhance performance.

Findings

01

A single system (without ensembling) outperforms more than 95% of submissions, with a Macro-F1 score above 0.4

02

Ensemble of three systems achieves top-3 ranking

03

Effective modeling and data choices improve emotion recognition in natural speech

Abstract

Speech emotion recognition (SER), particularly for naturally expressed emotions, remains a challenging computational task. Key challenges include the inherent subjectivity in emotion annotation and the imbalanced distribution of emotion labels in datasets. This paper introduces the SAILER system developed for participation in the INTERSPEECH 2025 Emotion Recognition Challenge (Task 1). The challenge dataset, which contains natural emotional speech from podcasts, serves as a valuable resource for studying imbalanced and subjective emotion annotations. Our system is designed to be simple, reproducible, and effective, highlighting critical choices in modeling, learning objectives, data augmentation, and engineering choices. Results show that even a single system (without ensembling) can outperform more than 95% of the submissions, with a Macro-F1 score exceeding 0.4. Moreover, an…
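
The abstract reports results as Macro-F1 on a dataset with imbalanced emotion labels. As a minimal sketch of why that metric suits imbalanced data, the snippet below computes Macro-F1 from scratch in plain Python. The function name `macro_f1` and the toy labels are illustrative assumptions, not taken from the paper or its repository, and the challenge's official scoring script may differ in details.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    per_class = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        per_class.append(2 * tp[label] / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy imbalanced example (made-up labels): a model that always predicts the
# majority class reaches 80% accuracy but is penalized heavily by Macro-F1.
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, macro_f1={score:.3f}")
```

Because every class contributes equally to the average regardless of how many samples it has, a system cannot score well by favoring the most frequent emotions.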

Code & Models

Repositories

tiantiaf0627/vox-profile-release
PyTorch · Official

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis