On the use of Self-supervised Pre-trained Acoustic and Linguistic Features for Continuous Speech Emotion Recognition

Manon Macary; Marie Tahon; Yannick Estève; Anthony Rousseau

arXiv:2011.09212·cs.CL·November 19, 2020

TL;DR

This paper uses self-supervised pre-trained acoustic (wav2vec) and linguistic (camemBERT) features for continuous speech emotion recognition on the AlloSat and SEWA corpora. It reports that their joint use raises the concordance correlation coefficient (CCC) on the French AlloSat corpus to 0.825, in a task where labeled training data is usually scarce.

Contribution

To the authors' knowledge, it is the first study to show the effectiveness of jointly using wav2vec and BERT-like features for continuous speech emotion recognition.

Findings

01

Achieved a CCC of 0.825 on AlloSat, outperforming traditional MFCC and word2vec features.

02

To the authors' knowledge, the first demonstration of the joint use of wav2vec and camemBERT features for continuous SER.

03

Shows that pre-trained features are well suited to continuous SER, a task usually characterized by a small amount of labeled training data.

Abstract

Pre-training for feature extraction is an increasingly studied approach to get better continuous representations of audio and text content. In the present work, we use wav2vec and camemBERT as self-supervised learned models to represent our data in order to perform continuous emotion recognition from speech (SER) on AlloSat, a large French emotional database describing the satisfaction dimension, and on the state-of-the-art corpus SEWA, which focuses on the valence, arousal and liking dimensions. To the authors' knowledge, this paper presents the first study showing that the joint use of wav2vec and BERT-like pre-trained features is very relevant to the continuous SER task, which is usually characterized by a small amount of labeled training data. Evaluated by the well-known concordance correlation coefficient (CCC), our experiments show that we can reach a CCC value of 0.825 instead of 0.592 when…
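
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of the concordance correlation coefficient in its standard form (Lin's definition): twice the covariance of the two series, divided by the sum of their variances plus the squared difference of their means. The function name `concordance_cc` and the toy arrays are assumptions made for illustration; the paper's own evaluation code is not shown on this page and may differ, for example in how sequences are grouped before scoring.

```python
import numpy as np


def concordance_cc(y_true, y_pred):
    """Concordance correlation coefficient between two 1-D series.

    Hypothetical helper written for illustration; not the authors' code.
    Uses population (ddof=0) variance and covariance throughout.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))

    # The squared mean difference in the denominator penalizes a constant
    # offset, which plain Pearson correlation would ignore.
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


if __name__ == "__main__":
    reference = np.array([0.0, 1.0, 2.0])   # toy annotation values
    shifted = reference + 1.0               # perfectly correlated, but biased
    print(concordance_cc(reference, reference))
    print(concordance_cc(reference, shifted))
```

In this toy case the shifted prediction has a Pearson correlation of 1 with the reference, yet its CCC is below 1 because of the constant offset. This is why CCC is a common choice for continuous emotion annotation, where both the shape and the scale of the predicted curve matter.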
