Cross-Modal Fusion Techniques for Utterance-Level Emotion Recognition from Text and Speech
Jiachen Luo, Huy Phan, Joshua Reiss

TL;DR
This paper introduces a novel cross-modal fusion model, CM-RoBERTa, that effectively captures inter- and intra-modal interactions for utterance-level emotion recognition from speech and text, achieving state-of-the-art results on MELD.
Contribution
The paper proposes a new cross-modal attention-based model, CM-RoBERTa, with mid-level fusion and residual modules for improved multimodal emotion recognition.
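As a rough illustration of the idea named above (self- and cross-attention computed in parallel and merged through a residual connection), the following is a minimal NumPy sketch. It is not the authors' implementation: the function names, the single-head attention without learned projections, the additive merge, and the toy dimensions are all assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query, key, value):
    """Single-head scaled dot-product attention without learned projections."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)   # (n_query, n_key)
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ value, weights


def parallel_self_cross_block(text, audio):
    """Hypothetical fusion block (an assumption, not the paper's exact design).

    Text features attend to themselves (intra-modal) and to the audio
    features (inter-modal) in parallel; both outputs are added back to
    the text input as a residual connection.
    """
    self_out, _ = attention(text, text, text)
    cross_out, cross_weights = attention(text, audio, audio)
    fused = text + self_out + cross_out
    return fused, cross_weights


# Toy inputs: 5 text tokens and 12 audio frames, both with feature size 8.
rng = np.random.default_rng(0)
text = rng.standard_normal((5, 8))
audio = rng.standard_normal((12, 8))

fused, cross_weights = parallel_self_cross_block(text, audio)
print(fused.shape)          # one fused vector per text token
print(cross_weights.shape)  # attention of each text token over audio frames
```

A full model would typically add learned query/key/value projections, multiple heads, dropout, and layer normalization around such a block; those are left out here to keep the sketch short.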
Findings
Achieves state-of-the-art performance on the MELD dataset.
Effectively models long-term contextual dependencies.
Demonstrates superior capture of inter- and intra-modal interactions.
Abstract
Multimodal emotion recognition (MER) is a fundamental and complex research problem due to the uncertainty of human emotional expression and the heterogeneity gap between different modalities. Audio and text modalities are particularly important for a human participant in understanding emotions. Although many successful attempts have been made to design multimodal representations for MER, multiple challenges remain to be addressed: 1) bridging the heterogeneity gap between multimodal features and modelling the inter- and intra-modal interactions of multiple modalities; 2) effectively and efficiently modelling the contextual dynamics in the conversation sequence. In this paper, we propose the Cross-Modal RoBERTa (CM-RoBERTa) model for emotion detection from spoken audio and corresponding transcripts. As the core unit of CM-RoBERTa, parallel self- and cross-attention is designed to dynamically…
Taxonomy
Topics: Emotion and Mood Recognition · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Residual Connection · Weight Decay · Dropout · Dense Connections · Attention Dropout · Linear Layer · Layer Normalization
