ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets
Shahin Amiriparian, Filip Packań, Maurice Gerczuk, Björn W. Schuller

TL;DR
This paper introduces ExHuBERT, an improved speech emotion recognition model built by extending and fine-tuning HuBERT on a large, multi-lingual emotion dataset, achieving state-of-the-art results across diverse datasets.
Contribution
We propose ExHuBERT, an enhanced version of HuBERT obtained through backbone extension and fine-tuning on EmoSet++, a comprehensive multi-lingual emotion corpus, to improve speech emotion recognition (SER) performance.
Findings
ExHuBERT outperforms previous models on unseen datasets.
The multi-lingual EmoSet++ dataset enhances model generalization.
Model architecture modifications effectively preserve and adapt pre-trained features.
Abstract
Foundation models have shown great promise in speech emotion recognition (SER) by leveraging their pre-trained representations to capture emotion patterns in speech signals. To further enhance SER performance across various languages and domains, we propose a novel twofold approach. First, we gather EmoSet++, a comprehensive multi-lingual, multi-cultural speech emotion corpus with 37 datasets, 150,907 samples, and a total duration of 119.5 hours. Second, we introduce ExHuBERT, an enhanced version of HuBERT achieved by backbone extension and fine-tuning on EmoSet++. We duplicate each encoder layer and its weights, then freeze the first duplicate, integrating an extra zero-initialized linear layer and skip connections to preserve functionality and ensure its adaptability for subsequent fine-tuning. Our evaluation on unseen datasets shows the efficacy of ExHuBERT, setting a new benchmark…
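The block-extension idea in the abstract can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the abstract does not spell out the exact wiring, so the arrangement below (frozen copy, then a trainable copy whose output passes through a zero-initialized linear layer and is added back via a skip connection) is an assumption. The names ToyLayer, ZeroLinear, ExtendedBlock, and extend_encoder are hypothetical, and ToyLayer is only a stand-in for a real HuBERT encoder layer.

```python
import copy
import math
import random


class ToyLayer:
    """Hypothetical stand-in for an encoder layer: y = tanh(W x)."""

    def __init__(self, dim, rng):
        self.weight = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
        self.trainable = True

    def __call__(self, x):
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in self.weight]


class ZeroLinear:
    """Linear layer whose weights and bias start at zero, so it outputs zeros."""

    def __init__(self, dim):
        self.weight = [[0.0] * dim for _ in range(dim)]
        self.bias = [0.0] * dim
        self.trainable = True

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]


class ExtendedBlock:
    """Assumed wiring: out = frozen(x) + zero_linear(trainable_copy(frozen(x)))."""

    def __init__(self, layer):
        self.frozen = copy.deepcopy(layer)   # first duplicate: weights kept fixed
        self.frozen.trainable = False
        self.adapt = copy.deepcopy(layer)    # second duplicate: free to fine-tune
        self.gate = ZeroLinear(len(layer.weight))

    def __call__(self, x):
        h = self.frozen(x)
        delta = self.gate(self.adapt(h))     # all zeros until the gate is trained
        return [a + b for a, b in zip(h, delta)]  # skip connection


def extend_encoder(layers):
    """Wrap every original layer in an ExtendedBlock."""
    return [ExtendedBlock(layer) for layer in layers]


def run(layers, x):
    """Apply a stack of layers in order."""
    for layer in layers:
        x = layer(x)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    original = [ToyLayer(4, rng) for _ in range(3)]
    extended = extend_encoder(original)
    x = [0.1, -0.2, 0.3, 0.4]
    # At initialization the extended stack reproduces the original outputs.
    print(run(extended, x) == run(original, x))
```

Under this wiring, the zero-initialized layer makes the added branch contribute nothing at first, so the extended model starts out computing the same function as the pre-trained one; fine-tuning can then move the gate and the trainable copy away from zero while the frozen copy keeps the original features intact.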
Taxonomy
Topics: Emotion and Mood Recognition
