TL;DR
Abhinaya is a multi-modal speech emotion recognition (SER) system that combines speech-based, text-based, and speech-text models. It fine-tunes self-supervised models, speech large language models, and text LLMs, and counters class imbalance with tailored loss functions and majority voting, reaching state-of-the-art results on naturalistic speech.
Contribution
This work introduces a novel multi-modal SER system integrating speech and text models with tailored training strategies for naturalistic data.
Findings
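Ranked 4th among 166 submissions in the Interspeech Naturalistic SER Challenge, despite one model not being fully trained
Achieved state-of-the-art performance once training was completed
Addressed class imbalance with tailored loss functions and majority voting over model decisions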
Abstract
Speech emotion recognition (SER) in naturalistic settings remains a challenge due to intrinsic variability, diverse recording conditions, and class imbalance. As participants in the Interspeech Naturalistic SER Challenge, which focused on these complexities, we present Abhinaya, a system integrating speech-based, text-based, and speech-text models. Our approach fine-tunes self-supervised models and speech large language models (SLLMs) for speech representations, leverages large language models (LLMs) for textual context, and employs speech-text modeling with an SLLM to capture nuanced emotional cues. To combat class imbalance, we apply tailored loss functions and generate categorical decisions through majority voting. Despite one model not being fully trained, the Abhinaya system ranked 4th among 166 submissions. Upon completion of training, it achieved state-of-the-art performance among…
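The abstract names two mechanisms without giving details: loss functions tailored to class imbalance, and majority voting over the categorical decisions of several models. The sketch below illustrates the general idea of each in plain Python. It is not the authors' implementation. The inverse-frequency class weighting, the tie-breaking rule (prefer the earliest-listed model), the emotion labels, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class weights n / (k * count), so rarer classes weigh more.

    Assumption: this is one common weighting scheme, not necessarily
    the one used in the paper.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, target, weights):
    """Class-weighted negative log-likelihood for a single example.

    probs maps each class to its predicted probability.
    """
    return -weights[target] * math.log(probs[target])


def majority_vote(predictions):
    """Return the label predicted by the most models.

    predictions holds one label per model. Ties are broken in favour
    of the earliest model in the list (a hypothetical rule).
    """
    counts = Counter(predictions)
    top = max(counts.values())
    tied = {label for label, cnt in counts.items() if cnt == top}
    for label in predictions:
        if label in tied:
            return label


# Hypothetical training labels with a rare "sad" class.
train_labels = ["neutral", "neutral", "neutral", "sad"]
weights = inverse_frequency_weights(train_labels)

# The same predicted probability costs more on the rare class.
loss_rare = weighted_cross_entropy({"neutral": 0.5, "sad": 0.5}, "sad", weights)
loss_common = weighted_cross_entropy({"neutral": 0.5, "sad": 0.5}, "neutral", weights)

# Hypothetical decisions from a speech, a text, and a speech-text model.
decision = majority_vote(["sad", "neutral", "sad"])
```

With these weights, a mistake on the rare class contributes three times as much loss as the same mistake on the common class, which pushes training away from always predicting the majority emotion. Voting over per-model labels then yields a single categorical decision even when the individual models disagree.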
