Emotion Recognition in Multi-Speaker Conversations through Speaker Identification, Knowledge Distillation, and Hierarchical Fusion

Xiao Li, Kotaro Funakoshi, and Manabu Okumura

arXiv:2511.13731 · cs.SD · November 19, 2025

TL;DR

This paper introduces a novel multi-modal framework for emotion recognition in multi-speaker conversations, utilizing speaker identification, knowledge distillation, and hierarchical fusion to improve accuracy and address class imbalance.

Contribution

It presents a new framework combining speaker identification, knowledge distillation, and hierarchical fusion, effectively tackling speaker ambiguity and class imbalance in emotion recognition.

Findings

01 Achieved weighted F1 scores of 67.75% on MELD and 72.44% on IEMOCAP.

02 Notable improvements on minority emotion classes.

03 Comprehensive evaluations on both datasets demonstrate the effectiveness of the proposed methods.
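
The headline numbers above are weighted F1 scores, a metric that averages per-class F1 weighted by how often each class occurs in the ground truth. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function name `weighted_f1` and the toy labels are illustrative assumptions.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (illustrative helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Each class contributes in proportion to its share of true labels.
        score += (n / total) * f1
    return score


# Toy labels (made up for illustration, not taken from MELD or IEMOCAP).
y_true = ["neutral", "neutral", "neutral", "joy", "fear"]
y_pred = ["neutral", "neutral", "joy", "joy", "neutral"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because frequent classes dominate the weighting, a high weighted F1 can coexist with poor minority-class performance, which is why the minority-class result is reported as a separate finding.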

Abstract

Emotion recognition in multi-speaker conversations faces significant challenges due to speaker ambiguity and severe class imbalance. We propose a novel framework that addresses these issues through three key innovations: (1) a speaker identification module that leverages audio-visual synchronization to accurately identify the active speaker, (2) a knowledge distillation strategy that transfers superior textual emotion understanding to audio and visual modalities, and (3) hierarchical attention fusion with composite loss functions to handle class imbalance. Comprehensive evaluations on MELD and IEMOCAP datasets demonstrate superior performance, achieving 67.75% and 72.44% weighted F1 scores respectively, with particularly notable improvements on minority emotion classes.
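
The abstract does not specify the form of the distillation objective. As a rough illustration of what transferring textual emotion understanding to another modality can look like, the sketch below uses the common temperature-scaled KL-divergence formulation of knowledge distillation. This is an assumption, not the paper's stated loss; the function names, the temperature value, and the example logits are all hypothetical.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_z for z in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumed setup: the teacher is a text-based emotion classifier and the
    student is an audio or visual classifier over the same emotion classes.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


# Hypothetical logits over three emotion classes for one utterance.
text_teacher = [2.0, 0.5, -1.0]
audio_student = [0.1, 0.2, 0.3]
print(distillation_loss(text_teacher, audio_student))
```

In a setup like this, the distillation term would typically be added to the student's ordinary classification loss, so the student learns from both the ground-truth labels and the teacher's softened predictions.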

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Speech and Audio Processing