Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation

Sarala Padi; Seyed Omid Sadjadi; Dinesh Manocha; Ram D. Sriram

arXiv:2108.02510 · cs.SD · August 17, 2021

TL;DR

This paper enhances speech emotion recognition by combining transfer learning from speaker recognition models with spectrogram augmentation, addressing data scarcity and improving accuracy on the IEMOCAP dataset.

Contribution

The paper introduces a transfer learning approach that reuses a ResNet with a statistics pooling layer, pre-trained for speaker recognition, and applies spectrogram augmentation to improve SER performance.

Findings

01. Transfer learning improves emotion classification accuracy.

02. Spectrogram augmentation enhances model generalization.

03. Combined methods achieve state-of-the-art results on IEMOCAP.
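
To make the spectrogram augmentation finding concrete, here is a minimal sketch of one common form of it: zeroing out a random band of frequency bins and a random span of time frames, in the style popularized by SpecAugment. This summary does not say which variant the authors use, so the masking scheme, the function name `mask_spectrogram`, and the parameters `max_freq_width` and `max_time_width` are all assumptions made for illustration. It uses plain Python lists rather than an array library to stay self-contained.

```python
import random

def mask_spectrogram(spec, max_freq_width=2, max_time_width=3, rng=None):
    """Return a masked copy of a non-empty spectrogram.

    spec is a list of frames, each a list of frequency-bin values.
    One random frequency band and one random time span are set to 0.0.
    All names and defaults here are illustrative assumptions.
    """
    rng = rng or random.Random()
    num_frames = len(spec)
    num_bins = len(spec[0])
    out = [list(frame) for frame in spec]  # copy so the input is untouched

    # Frequency mask: zero bins [f0, f0 + f_width) in every frame.
    f_width = rng.randint(0, min(max_freq_width, num_bins))
    f0 = rng.randint(0, num_bins - f_width)
    for frame in out:
        for b in range(f0, f0 + f_width):
            frame[b] = 0.0

    # Time mask: zero frames [t0, t0 + t_width) entirely.
    t_width = rng.randint(0, min(max_time_width, num_frames))
    t0 = rng.randint(0, num_frames - t_width)
    for t in range(t0, t0 + t_width):
        out[t] = [0.0] * num_bins

    return out

# Each call with a fresh random state yields a different masked view
# of the same utterance, which is what makes this useful as augmentation.
clean = [[1.0] * 8 for _ in range(10)]  # 10 frames x 8 bins, dummy values
augmented = mask_spectrogram(clean, rng=random.Random(0))
```

Because the masks are drawn fresh on every call, a training loop can apply the function on the fly and never show the model exactly the same input twice.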

Abstract

Automatic speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction. One of the main challenges in SER is data scarcity, i.e., insufficient amounts of carefully labeled data to build and fully explore complex deep learning models for emotion classification. This paper aims to address this challenge using a transfer learning strategy combined with spectrogram augmentation. Specifically, we propose a transfer learning approach that leverages a pre-trained residual network (ResNet) model including a statistics pooling layer from speaker recognition trained using large amounts of speaker-labeled data. The statistics pooling layer enables the model to efficiently process variable-length input, thereby eliminating the need for sequence truncation, which is commonly used in SER systems. In addition, we adopt a spectrogram augmentation…
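
The variable-length handling attributed to the statistics pooling layer can be illustrated with a small sketch. It assumes the usual formulation from speaker recognition, where frame-level features are summarized by their per-dimension mean and standard deviation over time and the two are concatenated. The function name `statistics_pooling` and the toy inputs are hypothetical, and the abstract does not confirm the exact statistics used in the paper's network.

```python
import math

def statistics_pooling(frames):
    """Collapse T frame-level vectors of size D into one vector of size 2*D.

    Output is [mean_1..mean_D, std_1..std_D], computed over the time axis.
    The population standard deviation is used here (an assumption).
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds

# Two utterances of different lengths (2 and 4 frames, 2 features each).
short_utt = [[1.0, 2.0], [3.0, 4.0]]
long_utt = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0], [6.0, 1.0]]

# Both pool to a vector of the same size, so no truncation or padding
# is needed before the layers that follow.
print(len(statistics_pooling(short_utt)), len(statistics_pooling(long_utt)))
```

The output size depends only on the feature dimension D, not on the number of frames T, which is why a classifier placed after this layer can accept utterances of any duration.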
