ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets

Shahin Amiriparian, Filip Packań, Maurice Gerczuk, Björn W. Schuller

arXiv:2406.10275·cs.CL·June 18, 2024

TL;DR

This paper introduces ExHuBERT, an improved speech emotion recognition model built by extending and fine-tuning HuBERT on a large, multi-lingual emotion dataset, achieving state-of-the-art results across diverse datasets.

Contribution

We propose ExHuBERT, a novel model enhancement technique involving backbone extension and fine-tuning on a comprehensive multi-lingual emotion dataset, EmoSet++, to improve SER performance.

Findings

01. ExHuBERT outperforms previous models on unseen datasets.

02. The multi-lingual EmoSet++ dataset enhances model generalization.

03. Model architecture modifications effectively preserve and adapt pre-trained features.

Abstract

Foundation models have shown great promise in speech emotion recognition (SER) by leveraging their pre-trained representations to capture emotion patterns in speech signals. To further enhance SER performance across various languages and domains, we propose a novel twofold approach. First, we gather EmoSet++, a comprehensive multi-lingual, multi-cultural speech emotion corpus with 37 datasets, 150,907 samples, and a total duration of 119.5 hours. Second, we introduce ExHuBERT, an enhanced version of HuBERT achieved by backbone extension and fine-tuning on EmoSet++. We duplicate each encoder layer and its weights, then freeze the first duplicate, integrating an extra zero-initialized linear layer and skip connections to preserve functionality and ensure its adaptability for subsequent fine-tuning. Our evaluation on unseen datasets shows the efficacy of ExHuBERT, setting a new benchmark…
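The block extension described in the abstract can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: the class name `ExtendedBlock`, the use of `nn.TransformerEncoderLayer` as a stand-in for a HuBERT encoder layer, and the exact placement of the skip connection are assumptions made for illustration, since the abstract does not specify the wiring. The sketch shows only the property the abstract names: because the extra linear layer starts at zero, the extended block initially reproduces the output of the original layer.

```python
import copy

import torch
from torch import nn


class ExtendedBlock(nn.Module):
    """Hypothetical extended block: frozen layer + trainable copy + zero-init linear."""

    def __init__(self, layer: nn.Module, dim: int):
        super().__init__()
        # Duplicate the layer together with its weights.
        self.frozen = layer
        self.trainable = copy.deepcopy(layer)
        # Freeze the first duplicate; keep the second one trainable.
        self.frozen.requires_grad_(False)
        self.trainable.requires_grad_(True)
        # Extra linear layer, zero-initialized so it contributes nothing at first.
        self.zero_linear = nn.Linear(dim, dim)
        nn.init.zeros_(self.zero_linear.weight)
        nn.init.zeros_(self.zero_linear.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.frozen(x)
        # Skip connection (assumed wiring): frozen output plus the gated copy output.
        return h + self.zero_linear(self.trainable(h))


if __name__ == "__main__":
    torch.manual_seed(0)
    # Stand-in for one pre-trained encoder layer (sizes are arbitrary).
    layer = nn.TransformerEncoderLayer(
        d_model=16, nhead=4, dim_feedforward=32, batch_first=True
    ).eval()
    x = torch.randn(2, 5, 16)  # (batch, frames, features)
    with torch.no_grad():
        reference = layer(x)
        block = ExtendedBlock(layer, dim=16).eval()
        extended = block(x)
    # At initialization the extended block matches the original layer.
    print(torch.allclose(extended, reference))
```

Under this wiring, fine-tuning first moves the zero-initialized linear layer away from zero, after which gradients also reach the trainable copy, while the frozen duplicate keeps the pre-trained behavior intact.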

Taxonomy

Topics: Emotion and Mood Recognition