A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations
Soumya Dutta, Smruthi Balaji, Sriram Ganapathy

TL;DR
This paper introduces MiSTER-E, a modular Mixture-of-Experts framework that effectively integrates speech and text modalities for emotion recognition in conversations, achieving state-of-the-art results on benchmark datasets.
Contribution
The paper presents a novel MoE-based model that decouples modality-specific context modeling from multimodal fusion, leveraging large language models and a learned gating mechanism for improved ERC.
Findings
Achieves high weighted F1-scores on IEMOCAP, MELD, and MOSI datasets.
Outperforms baseline speech-text ERC systems.
Demonstrates the effectiveness of modality-specific experts and learned gating in multimodal emotion recognition.
Abstract
Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we…
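As an illustration of how a gating mechanism can weigh the outputs of three experts, the following is a minimal sketch for a single utterance. It is not the authors' implementation: the function names, the example logits, and the fixed gate scores are all hypothetical, and the assumption that the gate is a softmax over per-expert scores mixing the experts' class probabilities is ours. In a trained model the gate scores would come from a small learned network rather than being hard-coded.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_mixture(expert_logits, gate_scores):
    """Mix per-expert class probabilities with softmax gate weights.

    expert_logits: one list of class logits per expert
                   (assumed order: speech-only, text-only, cross-modal).
    gate_scores:   one unnormalized score per expert (hypothetical; a
                   learned gate network would normally produce these).
    Returns (mixed class probabilities, gate weights).
    """
    gate = softmax(gate_scores)
    expert_probs = [softmax(logits) for logits in expert_logits]
    n_classes = len(expert_logits[0])
    mixed = [
        sum(g * probs[c] for g, probs in zip(gate, expert_probs))
        for c in range(n_classes)
    ]
    return mixed, gate


# Made-up logits for one utterance over four emotion classes.
expert_logits = [
    [2.0, 0.1, -1.0, 0.3],   # speech-only expert
    [0.2, 1.5, 0.0, -0.5],   # text-only expert
    [1.0, 1.0, 0.5, -2.0],   # cross-modal expert
]
gate_scores = [0.5, 1.2, -0.3]  # hypothetical gate outputs

mixed, gate = gated_mixture(expert_logits, gate_scores)
predicted_class = max(range(len(mixed)), key=mixed.__getitem__)
print("gate weights:", gate)
print("mixed probabilities:", mixed)
print("predicted class index:", predicted_class)
```

Because the gate weights sum to one and each expert outputs a probability distribution, the mixture is itself a valid distribution. Mixing probabilities rather than raw logits is one possible design choice; the abstract does not specify which the paper uses.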
Taxonomy
Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Speech and dialogue systems
