Enhancing Speech Emotion Recognition with Multi-Task Learning and Dynamic Feature Fusion
Honghong Wang, Jing Deng, Fanqin Meng, Rong Zheng

TL;DR
This paper proposes a multi-task learning framework with a co-attention module and a novel loss function to improve speech emotion recognition by leveraging related tasks and addressing class imbalance.
Contribution
It introduces a co-attention based feature fusion mechanism and a Sample Weighted Focal Contrastive loss for enhanced SER performance.
Findings
Significant performance improvements on the Categorical Emotion Recognition task of the Speech Emotion Recognition in Naturalistic Conditions Challenge.
Effective handling of class imbalance and semantic confusion.
Enhanced feature interaction through the co-attention module.
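To make the idea of co-attention fusion concrete, here is a minimal NumPy sketch. It is not the paper's module: the function name `co_attention_fuse`, the scaled dot-product affinity, the mean pooling, and the absence of learned projections are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def co_attention_fuse(emotion_feats, aux_feats):
    """Hypothetical co-attention fusion (illustrative, not the authors' design).

    emotion_feats: (T_e, d) frame features from the emotion branch.
    aux_feats:     (T_a, d) frame features from one auxiliary task branch.
    Returns a fused utterance-level vector of shape (2 * d,).
    """
    d = emotion_feats.shape[1]
    # Affinity between every emotion frame and every auxiliary frame.
    affinity = emotion_feats @ aux_feats.T / np.sqrt(d)   # (T_e, T_a)
    # Emotion frames attend over auxiliary frames.
    aux_context = softmax(affinity, axis=1) @ aux_feats   # (T_e, d)
    # Auxiliary frames attend over emotion frames (the "co" direction).
    emo_context = softmax(affinity.T, axis=1) @ emotion_feats  # (T_a, d)
    # Assumption: mean-pool each attended sequence, then concatenate.
    return np.concatenate([aux_context.mean(axis=0), emo_context.mean(axis=0)])
```

In a real model the inputs would typically pass through learned query, key, and value projections first, and the fused vector would feed the emotion classifier; both are omitted here to keep the sketch short.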
Abstract
This study investigates fine-tuning self-supervised learning (SSL) models using multi-task learning (MTL) to enhance speech emotion recognition (SER). The framework simultaneously handles four related tasks: emotion recognition, gender recognition, speaker verification, and automatic speech recognition. An innovative co-attention module is introduced to dynamically capture the interactions between features from the primary emotion classification task and auxiliary tasks, enabling context-aware fusion. Moreover, we introduce the Sample Weighted Focal Contrastive (SWFC) loss function to address class imbalance and semantic confusion by adjusting sample weights for difficult and minority samples. The method is validated on the Categorical Emotion Recognition task of the Speech Emotion Recognition in Naturalistic Conditions Challenge, showing significant performance…
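The abstract does not give the SWFC formula, so the following is only a sketch of how a sample-weighted, focal-modulated supervised contrastive loss could look. The function name `swfc_loss_sketch`, the `(1 - p) ** gamma` focal term, the per-class weights, and the default `temperature` and `gamma` values are assumptions, not the authors' definition.

```python
import numpy as np

def swfc_loss_sketch(embeddings, labels, class_weights, temperature=0.1, gamma=2.0):
    """Hypothetical sample-weighted focal contrastive loss (not the paper's exact form).

    embeddings:    (N, d) array of utterance embeddings.
    labels:        (N,) integer emotion labels.
    class_weights: (C,) weights, e.g. larger for minority classes (assumption).
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)
    # Temperature-scaled cosine similarity, with self-pairs excluded.
    sim = np.where(not_self, z @ z.T / temperature, -np.inf)
    sim = sim - sim.max(axis=1, keepdims=True)
    exp = np.exp(sim)
    prob = exp / exp.sum(axis=1, keepdims=True)      # p_ij over other samples
    positives = (labels[:, None] == labels[None, :]) & not_self
    # Focal modulation: hard positives (low p_ij) get larger weight.
    per_pair = -((1.0 - prob) ** gamma) * np.log(np.clip(prob, 1e-12, None))
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    if not has_pos.any():
        return 0.0  # no anchor has a same-class partner in this batch
    per_anchor = np.where(positives, per_pair, 0.0).sum(axis=1) / np.maximum(pos_count, 1)
    # Sample weighting: each anchor is scaled by the weight of its class.
    w = class_weights[labels]
    return float((w[has_pos] * per_anchor[has_pos]).sum() / w[has_pos].sum())
```

One plausible choice for `class_weights` is inverse class frequency, so that anchors from rare emotions contribute more to the batch loss; the weighting scheme used in the paper may differ.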
