Speech Emotion Recognition via Contrastive Loss under Siamese Networks
Zheng Lian, Ya Li, Jianhua Tao, Jian Huang

TL;DR
This paper introduces a contrastive loss function within Siamese networks for speech emotion recognition, improving discriminative feature learning and achieving higher accuracy on the IEMOCAP dataset.
Contribution
The study proposes using contrastive loss in Siamese networks for speech emotion recognition, enhancing feature discrimination over traditional cross-entropy methods.
Findings
Achieved 62.19% weighted accuracy on IEMOCAP.
Outperformed baseline systems by 1.14% in weighted accuracy.
Demonstrated the effectiveness of contrastive loss in emotion classification.
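The findings above are stated in terms of weighted accuracy. As a rough sketch, assuming the convention common in speech emotion recognition work (weighted accuracy is overall accuracy across all utterances, and its usual companion, unweighted accuracy, is the mean of per-class recalls), the two metrics could be computed as follows. The function names and the emotion labels in the usage note are illustrative and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)
```

With hypothetical labels `["ang", "ang", "ang", "hap"]` and predictions `["ang", "ang", "ang", "sad"]`, the first function gives 0.75 and the second gives 0.5, which shows how class imbalance can pull the two metrics apart.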
Abstract
Speech emotion recognition is an important aspect of human-computer interaction. Prior work proposes various end-to-end models to improve classification performance. However, most of them rely on the cross-entropy loss together with softmax as the supervision component, which does not explicitly encourage discriminative learning of features. In this paper, we introduce the contrastive loss function to encourage intra-class compactness and inter-class separability between learnable features. Furthermore, multiple feature selection methods and pairwise sample selection methods are evaluated. To verify the performance of the proposed system, we conduct experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, a common evaluation corpus. Experimental results reveal the advantages of the proposed method, which reaches 62.19% in the weighted accuracy and 63.21% in…
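To make the idea of intra-class compactness and inter-class separability concrete, here is a minimal sketch of the standard margin-based contrastive loss over pairs of embeddings. It assumes Euclidean distance and a margin of 1.0; the abstract does not give the exact formulation used in the paper, so the function name, the `margin` parameter, and the 0.5 scaling are assumptions rather than the authors' definition.

```python
import math


def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Average contrastive loss over a batch of embedding pairs.

    emb_a, emb_b: sequences of equal-length feature vectors, one pair
        per index (the two branch outputs of a Siamese network).
    same_class: sequence of booleans, True when a pair shares a label.
    margin: assumed hyperparameter; dissimilar pairs farther apart
        than this contribute no loss.
    """
    total = 0.0
    for a, b, same in zip(emb_a, emb_b, same_class):
        dist = math.dist(a, b)  # Euclidean distance between the pair
        if same:
            # Pull same-class pairs together (intra-class compactness).
            total += 0.5 * dist ** 2
        else:
            # Push different-class pairs apart until they clear the margin
            # (inter-class separability).
            total += 0.5 * max(margin - dist, 0.0) ** 2
    return total / len(same_class)
```

In this sketch, a same-class pair is penalized in proportion to its squared distance, while a different-class pair is penalized only when it falls inside the margin. In a Siamese setup both embeddings would come from the same network with shared weights.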
Taxonomy
Methods: Feature Selection · Softmax
