Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings

Leonardo Pepino; Pablo Riera; Luciana Ferrer

arXiv:2104.03502 · cs.SD · April 9, 2021

TL;DR

This paper introduces a transfer learning approach for speech emotion recognition in which features from pre-trained wav2vec 2.0 models are fed to simple neural networks, outperforming previously published results on the IEMOCAP and RAVDESS datasets.

Contribution

It proposes combining the outputs of several wav2vec 2.0 layers using trainable weights that are learned jointly with the downstream model, and it compares two wav2vec 2.0 models, with and without finetuning for speech recognition, on the emotion recognition task.

Findings

01. Outperforms results reported in the literature on the IEMOCAP and RAVDESS datasets

02. Makes effective use of features from multiple wav2vec 2.0 layers

03. Demonstrates the benefits of transfer learning on small datasets

Abstract

Emotion recognition datasets are relatively small, making the use of the more sophisticated deep learning approaches challenging. In this work, we propose a transfer learning method for speech emotion recognition where features extracted from pre-trained wav2vec 2.0 models are modeled using simple neural networks. We propose to combine the output of several layers from the pre-trained model using trainable weights which are learned jointly with the downstream model. Further, we compare performance using two different wav2vec 2.0 models, with and without finetuning for speech recognition. We evaluate our proposed approaches on two standard emotion databases, IEMOCAP and RAVDESS, showing superior performance compared to results in the literature.
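
The layer combination described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The softmax normalization of the weights, the mean pooling over time, the shapes (13 layers, 50 frames, 768 dimensions), and the names `combine_layers` and `layer_logits` are all assumptions made for illustration. In a real system the logits would be trainable parameters updated by backpropagation together with the downstream classifier; here they are fixed values.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def combine_layers(layer_outputs, layer_logits):
    """Weighted average of per-layer features.

    layer_outputs: array of shape (num_layers, num_frames, dim)
    layer_logits:  array of shape (num_layers,), one scalar per layer
                   (hypothetical parameterization; assumed, not from the paper)
    Returns an array of shape (num_frames, dim).
    """
    weights = softmax(layer_logits)  # non-negative, sums to 1
    # Contract the layer axis: sum_l weights[l] * layer_outputs[l]
    return np.tensordot(weights, layer_outputs, axes=1)


# Stand-in for features extracted from a pre-trained model (random data).
rng = np.random.default_rng(0)
layer_outputs = rng.standard_normal((13, 50, 768))

# Zero logits give uniform weights, i.e. a plain average across layers.
layer_logits = np.zeros(13)

combined = combine_layers(layer_outputs, layer_logits)

# Assumed pooling step: average over time to get one vector per utterance,
# which a simple downstream network could then classify.
utterance_embedding = combined.mean(axis=0)
```

Learning one scalar per layer keeps the number of added parameters tiny, which suits the small datasets the abstract mentions, while still letting the downstream model favor whichever layers carry the most emotion-relevant information.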
