TL;DR
This paper introduces a transfer learning approach for speech emotion recognition: features extracted from pre-trained wav2vec 2.0 models are modeled with simple neural networks, and the authors report performance superior to previously published results on IEMOCAP and RAVDESS.
Contribution
It proposes combining the outputs of several wav2vec 2.0 layers using trainable weights that are learned jointly with the downstream model, and it compares two wav2vec 2.0 models, one finetuned for speech recognition and one not.
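The layer combination described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names (`softmax`, `combine_layers`), the softmax normalization of the weights, and the toy layer values are all assumptions made for this example. In practice the per-layer scores would be trainable parameters updated by gradient descent together with the downstream classifier; only the forward computation is shown here.

```python
import math


def softmax(scores):
    """Turn unconstrained per-layer scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(layer_outputs, layer_scores):
    """Weighted sum over layers (hypothetical helper).

    layer_outputs: list of L layers, each a list of T frames,
                   each frame a list of D floats.
    layer_scores:  list of L floats (these would be the trainable weights).
    Returns a list of T frames, each a list of D floats.
    """
    weights = softmax(layer_scores)  # assumption: softmax-normalized weights
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    combined = [[0.0] * dim for _ in range(num_frames)]
    for w, layer in zip(weights, layer_outputs):
        for t in range(num_frames):
            for d in range(dim):
                combined[t][d] += w * layer[t][d]
    return combined


# Toy input: 3 "layers", 2 frames, 2 feature dimensions (made-up values).
layers = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
scores = [0.0, 0.0, 0.0]  # equal scores give equal weights of 1/3 per layer
features = combine_layers(layers, scores)
```

With equal scores every layer contributes one third, so each combined frame is the plain average of the three layers. Raising one score shifts the combination toward that layer, which is how a jointly trained model could learn which layers carry the most emotion-relevant information.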
Findings
Superior performance on IEMOCAP and RAVDESS compared to results in the literature
Effective use of multi-layer wav2vec 2.0 features
Demonstrates the benefits of transfer learning for small datasets
Abstract
Emotion recognition datasets are relatively small, making the use of more sophisticated deep learning approaches challenging. In this work, we propose a transfer learning method for speech emotion recognition where features extracted from pre-trained wav2vec 2.0 models are modeled using simple neural networks. We propose to combine the output of several layers from the pre-trained model using trainable weights which are learned jointly with the downstream model. Further, we compare performance using two different wav2vec 2.0 models, with and without finetuning for speech recognition. We evaluate our proposed approaches on two standard emotion databases, IEMOCAP and RAVDESS, showing superior performance compared to results in the literature.
