Emotion Recognition in Multi-Speaker Conversations through Speaker Identification, Knowledge Distillation, and Hierarchical Fusion
Xiao Li, Kotaro Funakoshi, and Manabu Okumura

TL;DR
This paper introduces a novel multi-modal framework for emotion recognition in multi-speaker conversations, utilizing speaker identification, knowledge distillation, and hierarchical fusion to improve accuracy and address class imbalance.
Contribution
It presents a new framework combining speaker identification, knowledge distillation, and hierarchical fusion, effectively tackling speaker ambiguity and class imbalance in emotion recognition.
Findings
Achieved weighted F1 scores of 67.75% on MELD and 72.44% on IEMOCAP.
Significant improvements on minority emotion classes.
Demonstrated effectiveness of the proposed methods through comprehensive evaluations.
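For readers unfamiliar with the metric, weighted F1 averages the per-class F1 scores, weighting each class by its number of true instances (its support). The sketch below illustrates the standard definition only; it is not the authors' evaluation code, and the function name and the toy labels are made up for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (standard definition)."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive because n > 0.
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1
    return score


# Toy example with hypothetical labels (not data from the paper).
y_true = ["neutral", "neutral", "joy", "anger"]
y_pred = ["neutral", "joy", "joy", "anger"]
print(weighted_f1(y_true, y_pred))
```

Because each class is weighted by its support, frequent classes dominate this score, which is why improvements on minority classes are reported separately.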
Abstract
Emotion recognition in multi-speaker conversations faces significant challenges due to speaker ambiguity and severe class imbalance. We propose a novel framework that addresses these issues through three key innovations: (1) a speaker identification module that leverages audio-visual synchronization to accurately identify the active speaker, (2) a knowledge distillation strategy that transfers superior textual emotion understanding to audio and visual modalities, and (3) hierarchical attention fusion with composite loss functions to handle class imbalance. Comprehensive evaluations on MELD and IEMOCAP datasets demonstrate superior performance, achieving 67.75% and 72.44% weighted F1 scores respectively, with particularly notable improvements on minority emotion classes.
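The abstract does not specify the form of the distillation objective. As a rough illustration of the general idea, the sketch below uses the common temperature-softened KL divergence between a teacher's and a student's class distributions, assuming the text model is the teacher and an audio or visual model is the student. The function names, the temperature value, and the example logits are all assumptions for illustration, not details from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    This is a generic soft-label distillation loss, assumed here for
    illustration; the paper's actual objective may differ.
    """
    p = softmax(teacher_logits, temperature)  # teacher (e.g. text) distribution
    q = softmax(student_logits, temperature)  # student (e.g. audio) distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits over three emotion classes.
text_logits = [2.0, 0.5, -1.0]
audio_logits = [0.8, 0.9, -0.2]
print(distillation_loss(text_logits, audio_logits))
```

In a setup like this, the loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it pulls the weaker modality toward the stronger one's predictions.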
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Speech and Audio Processing
