Supervised Contrastive Learning with Nearest Neighbor Search for Speech Emotion Recognition
Xuechen Wang, Shiwan Zhao, Yong Qin

TL;DR
This paper improves Speech Emotion Recognition (SER) by combining supervised contrastive learning with nearest neighbor search: a pre-trained wav2vec2.0 model is fine-tuned with a loss that sharpens the boundaries between emotion classes, and a k-nearest neighbors model is used alongside the model prediction at inference.
Contribution
It introduces a fine-tuning loss that combines cross-entropy with supervised contrastive learning, and an inference-time interpolation between the model prediction and the output of a k-nearest neighbors model, improving SER performance with limited data.
Findings
Outperforms state-of-the-art on IEMOCAP dataset
Improves inter-class separation and intra-class compactness
Enhances model robustness with limited data
Abstract
Speech Emotion Recognition (SER) is a challenging task due to limited data and blurred boundaries of certain emotions. In this paper, we present a comprehensive approach to improve the SER performance throughout the model lifecycle, including pre-training, fine-tuning, and inference stages. To address the data scarcity issue, we utilize a pre-trained model, wav2vec2.0. During fine-tuning, we propose a novel loss function that combines cross-entropy loss with supervised contrastive learning loss to improve the model's discriminative ability. This approach increases the inter-class distances and decreases the intra-class distances, mitigating the issue of blurred boundaries. Finally, to leverage the improved distances, we propose an interpolation method at the inference stage that combines the model prediction with the output from a k-nearest neighbors model. Our experiments on IEMOCAP…
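The two ideas in the abstract, the combined fine-tuning loss and the k-nearest neighbors interpolation at inference, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the weighting scheme `(1 - alpha) * CE + alpha * SCL`, the temperature, the cosine-similarity neighbor search, the uniform neighbor vote, and all function and parameter names (`supcon_loss`, `combined_loss`, `knn_probs`, `interpolate`, `alpha`, `lam`, `k`) are assumptions made for illustration. The contrastive term follows the commonly used supervised contrastive formulation, in which each sample is pulled toward same-label samples in the batch and pushed away from the rest.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cross_entropy(logits, labels):
    log_probs = np.log(softmax(logits))
    return -log_probs[np.arange(len(labels)), labels].mean()

def supcon_loss(embeddings, labels, temperature=0.1):
    # Supervised contrastive loss over one batch (assumed formulation).
    z = l2_normalize(embeddings)
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    # Scaled cosine similarities; a sample is never compared with itself.
    sim = np.where(self_mask, -np.inf, z @ z.T / temperature)
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - m - np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    # Positives: other samples in the batch with the same label.
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    n_pos = pos.sum(axis=1)
    valid = n_pos > 0  # anchors with no positive are skipped
    if not valid.any():
        return 0.0
    pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
    return -(pos_log_prob[valid] / n_pos[valid]).mean()

def combined_loss(logits, embeddings, labels, alpha=0.5):
    # Assumed weighting: alpha trades off the two terms.
    ce = cross_entropy(logits, labels)
    scl = supcon_loss(embeddings, labels)
    return (1.0 - alpha) * ce + alpha * scl

def knn_probs(queries, train_emb, train_labels, n_classes, k=5):
    # Class distribution among the k most similar training embeddings
    # (cosine similarity, uniform vote; both are assumptions).
    sim = l2_normalize(queries) @ l2_normalize(train_emb).T
    idx = np.argsort(-sim, axis=1)[:, :k]
    neighbor_labels = train_labels[idx]
    probs = np.zeros((len(queries), n_classes))
    for c in range(n_classes):
        probs[:, c] = (neighbor_labels == c).mean(axis=1)
    return probs

def interpolate(model_probs, knn_p, lam=0.5):
    # Convex combination of the model and k-NN distributions.
    return lam * knn_p + (1.0 - lam) * model_probs
```

In this sketch, `train_emb` would hold embeddings of the training utterances produced by the fine-tuned encoder, and `lam = 0` recovers the plain model prediction. Because the contrastive term tightens same-class clusters, the neighbor vote in `knn_probs` becomes more reliable, which is the motivation the abstract gives for combining the two stages.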
Taxonomy
Methods: Contrastive Learning
