GMP-TL: Gender-augmented Multi-scale Pseudo-label Enhanced Transfer Learning for Speech Emotion Recognition

Yu Pan; Yuguang Yang; Heng Lu; Lei Ma; Jianjun Zhao

arXiv:2405.02151 · cs.SD · September 24, 2024

TL;DR

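GMP-TL is a speech emotion recognition (SER) framework that fine-tunes a pre-trained HuBERT model using frame-level gender-augmented multi-scale pseudo-labels together with utterance-level emotion labels, so that the model captures emotional variation within an utterance rather than relying on a single utterance-level label.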

Contribution

The paper introduces a two-stage fine-tuning approach that combines frame-level gender-augmented multi-scale pseudo-labels with utterance-level emotion labels to improve SER performance.

Findings

01 Achieves a weighted average recall (WAR) of 80.0% and an unweighted average recall (UAR) of 82.0% on IEMOCAP

02 Outperforms state-of-the-art unimodal SER methods

03 Performs comparably to multimodal SER approaches
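The two metrics reported above can be made concrete with a small sketch. It uses the standard definitions: WAR is overall accuracy (per-class recall weighted by class size), and UAR is the plain mean of per-class recalls. The function name `war_uar` and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def war_uar(y_true, y_pred):
    """Return (WAR, UAR) for two equal-length label sequences.

    WAR: fraction of all samples classified correctly.
    UAR: mean of per-class recalls, so every class counts equally.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly classified samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    war = sum(correct.values()) / len(y_true)
    uar = sum(correct[c] / total[c] for c in total) / len(total)
    return war, uar


# Toy, made-up labels: an imbalanced two-class case.
y_true = ["ang", "ang", "ang", "ang", "hap"]
y_pred = ["ang", "ang", "ang", "hap", "hap"]
war, uar = war_uar(y_true, y_pred)
```

In this toy case the minority class is recognized perfectly while the majority class is not, so UAR comes out higher than WAR. The same ordering appears in the reported IEMOCAP figures.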

Abstract

The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, current research typically relies on utterance-level emotion labels, inadequately capturing the complexity of emotions within a single utterance. In this paper, we introduce GMP-TL, a novel SER framework that employs gender-augmented multi-scale pseudo-label (GMP) based transfer learning to bridge this gap. Specifically, GMP-TL first uses the pre-trained HuBERT, applying multi-task learning and multi-scale k-means clustering to acquire frame-level GMPs. Subsequently, to fully leverage frame-level GMPs and utterance-level emotion labels, a two-stage model fine-tuning approach is presented to further optimize GMP-TL. Experiments on IEMOCAP show that GMP-TL attains a WAR of 80.0% and a UAR of 82.0%, achieving superior performance compared to state-of-the-art…
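The idea of multi-scale k-means pseudo-labelling can be illustrated with a minimal sketch. It assumes frame-level features are already available as plain lists of floats, clusters them once per scale (one value of k per scale), and gives each frame a tuple of cluster ids. The helper names, the scales `(2, 4)`, and the toy features are all hypothetical. HuBERT feature extraction, the multi-task training, and the gender augmentation described in the abstract are deliberately left out, so this is not the authors' implementation.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster id per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: sq_dist(p, centers[c]))
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def multi_scale_pseudo_labels(frames, scales=(2, 4), seed=0):
    """Cluster the same frames at several k; return a tuple of ids per frame."""
    per_scale = [kmeans(frames, k, seed=seed) for k in scales]
    return [tuple(ids) for ids in zip(*per_scale)]


# Toy 2-D "frame features" (made up); real features would be high-dimensional.
frames = [
    [0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
    [5.0, 5.1], [5.1, 4.9], [4.9, 5.0],
    [9.0, 0.1], [9.1, 0.0],
]
pseudo_labels = multi_scale_pseudo_labels(frames, scales=(2, 4))
```

Each entry of `pseudo_labels` holds one cluster id per scale, giving a coarse and a fine frame-level target that a fine-tuning stage could consume before training on utterance-level labels.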

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition

Methods: k-Means Clustering