Exploring Self-Supervised Multi-view Contrastive Learning for Speech Emotion Recognition with Limited Annotations

Bulat Khaertdinov; Pedro Jeuris; Annanda Sousa; Enrique Hortal

arXiv:2406.07900·cs.CL·February 25, 2025

Open Access

TL;DR

This paper introduces a multi-view self-supervised learning approach for speech emotion recognition that enhances performance in low-annotation scenarios by leveraging multiple speech representations, including large speech models.

Contribution

It proposes a novel multi-view SSL pre-training framework applicable to various speech representations to improve SER with limited labeled data.
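To make the idea of multi-view contrastive pre-training concrete, here is a minimal sketch of a symmetric InfoNCE-style loss between two views of the same batch of utterances (for example, embeddings derived from wav2vec 2.0 features and from spectral features). This is an illustration under assumptions, not the authors' implementation: the function name `cross_view_info_nce`, the temperature value, and the choice of InfoNCE itself are hypothetical, since the summary above does not specify the exact objective.

```python
import numpy as np


def cross_view_info_nce(view_a, view_b, temperature=0.1):
    """Symmetric InfoNCE-style loss between two views (illustrative sketch).

    view_a, view_b: arrays of shape (batch, dim). Row i of each array is
    assumed to be an embedding of the same utterance i under a different
    speech representation. The temperature value is an arbitrary choice.
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)

    # logits[i, j] compares utterance i in view A with utterance j in view B.
    logits = (a @ b.T) / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The matching utterance (the diagonal) is the positive pair.
        return -np.mean(np.diag(log_probs))

    # Average the A-to-B and B-to-A directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

The loss is small when each utterance's embedding in one view is closest to its own embedding in the other view, and it grows when the pairing is scrambled. In a real pipeline the embeddings would come from trainable encoders and the loss would be minimised with an automatic differentiation framework; NumPy is used here only to keep the sketch self-contained.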

Findings

01 Boosts SER performance by up to 10% in unweighted average recall.

02 Effective with various speech representations including wav2vec 2.0 and paralinguistic features.

03 Demonstrates significant gains in low-annotation settings.

Abstract

Recent advancements in Deep and Self-Supervised Learning (SSL) have led to substantial improvements in Speech Emotion Recognition (SER) performance, reaching unprecedented levels. However, obtaining sufficient amounts of accurately labeled data for training or fine-tuning the models remains a costly and challenging task. In this paper, we propose a multi-view SSL pre-training technique that can be applied to various representations of speech, including the ones generated by large speech models, to improve SER performance in scenarios where annotations are limited. Our experiments, based on wav2vec 2.0, spectral and paralinguistic features, demonstrate that the proposed framework boosts the SER performance, by up to 10% in Unweighted Average Recall, in settings with extremely sparse data annotations.
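Unweighted Average Recall (UAR), the metric quoted above, is the mean of the per-class recalls, so every emotion class counts equally regardless of how many samples it has. The sketch below computes it with NumPy and contrasts it with plain accuracy; the function name and the toy labels are hypothetical and are not taken from the paper's experiments.

```python
import numpy as np


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, with each class weighted equally."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for label in np.unique(y_true):
        mask = y_true == label
        # Recall for this class: fraction of its samples predicted correctly.
        recalls.append(np.mean(y_pred[mask] == label))
    return float(np.mean(recalls))


# Toy, made-up example: an imbalanced set where the model always says "neutral".
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10

accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
print(accuracy)                                   # 0.8
print(unweighted_average_recall(y_true, y_pred))  # 0.5
```

The always-"neutral" predictor reaches 80% accuracy but only 50% UAR, because its recall on the minority class is zero. This is why UAR is a common choice for SER datasets with imbalanced emotion classes.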


Taxonomy

Topics: Speech and Audio Processing