TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation
Taeyang Yun, Hyunkuk Lim, Jeonghwan Lee, Min Song

TL;DR
This paper introduces TelME, a novel multimodal fusion network for emotion recognition in conversations that leverages teacher-student knowledge distillation and shifting fusion to improve the use of audio, visual, and text modalities.
Contribution
The paper proposes a teacher-leading multimodal fusion framework that combines cross-modal knowledge distillation with shifting fusion, and reports state-of-the-art results on ERC benchmarks such as MELD.
Findings
TelME outperforms existing models on the MELD dataset.
Knowledge distillation from the text teacher strengthens the contribution of the non-verbal (audio and visual) modalities.
Shifting fusion effectively combines multimodal features.
Abstract
Emotion Recognition in Conversation (ERC) plays a crucial role in enabling dialogue systems to effectively respond to user requests. The emotions in a conversation can be identified by the representations from various modalities, such as audio, visual, and text. However, due to the weak contribution of non-verbal modalities to recognize emotions, multimodal ERC has always been considered a challenging task. In this paper, we propose Teacher-leading Multimodal fusion network for ERC (TelME). TelME incorporates cross-modal knowledge distillation to transfer information from a language model acting as the teacher to the non-verbal students, thereby optimizing the efficacy of the weak modalities. We then combine multimodal features using a shifting fusion approach in which student networks support the teacher. TelME achieves state-of-the-art performance in MELD, a multi-speaker conversation…
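The teacher-to-student transfer described in the abstract can be illustrated with a generic logit-based distillation loss. This is only a minimal sketch under stated assumptions: the summary does not specify TelME's actual distillation objective, so the temperature-scaled KL divergence below is the standard response-based formulation, not necessarily the authors' method. The function names, the temperature value, and the example logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: a generic response-based distillation loss, used here
    for illustration only; it is not taken from the TelME paper.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


# Hypothetical emotion logits for one utterance (three classes).
text_teacher_logits = [2.0, 0.5, -1.0]   # confident text-based teacher
audio_student_logits = [0.3, 0.2, 0.1]   # weaker audio student

loss = distillation_loss(text_teacher_logits, audio_student_logits)
print(f"distillation loss: {loss:.4f}")
```

In a setup like this, the loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge, so minimizing it pulls the audio or visual student toward the text teacher's predictions.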
Taxonomy
Topics: Speech and dialogue systems · Emotion and Mood Recognition · Speech Recognition and Synthesis
Methods: Knowledge Distillation
