Cross-modal fusion techniques for utterance-level emotion recognition from text and speech

Jiachen Luo; Huy Phan; Joshua Reiss

arXiv:2302.02447 · eess.AS · February 7, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a novel cross-modal fusion model, CM-RoBERTa, that effectively captures inter- and intra-modal interactions for utterance-level emotion recognition from speech and text, achieving state-of-the-art results on MELD.

Contribution

The paper proposes a new cross-modal attention-based model, CM-RoBERTa, with mid-level fusion and residual modules for improved multimodal emotion recognition.

Findings

01 Achieves state-of-the-art performance on the MELD dataset.

02 Effectively models long-term contextual dependencies.

03 Demonstrates superior capture of inter- and intra-modal interactions.

Abstract

Multimodal emotion recognition (MER) is a fundamental and complex research problem due to the uncertainty of human emotional expression and the heterogeneity gap between different modalities. Audio and text modalities are particularly important for a human participant in understanding emotions. Although many successful attempts have been made to design multimodal representations for MER, multiple challenges remain to be addressed: 1) bridging the heterogeneity gap between multimodal features and modelling the inter- and intra-modal interactions of multiple modalities; 2) effectively and efficiently modelling the contextual dynamics in the conversation sequence. In this paper, we propose the Cross-Modal RoBERTa (CM-RoBERTa) model for emotion detection from spoken audio and corresponding transcripts. As the core unit of CM-RoBERTa, parallel self- and cross-attention is designed to dynamically…
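The abstract names parallel self- and cross-attention as the core unit but is cut off before describing it. The following is only a minimal sketch of the general idea, not the authors' implementation: one modality attends to itself and to the other modality in parallel, and the two outputs are added back to the input. The function names (`attention`, `parallel_self_cross`), the absence of learned projections and multiple heads, and the choice to combine the branches by a residual sum are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain lists of vectors.

    Hypothetical helper: no learned projections, single head.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def parallel_self_cross(x, other):
    """Run self-attention (x -> x) and cross-attention (x -> other) in parallel.

    Assumption: the two branches are fused by adding both to the input
    (a residual sum); the source excerpt does not state the actual rule.
    """
    self_out = attention(x, x, x)           # intra-modal interactions
    cross_out = attention(x, other, other)  # inter-modal interactions
    return [
        [xi + si + ci for xi, si, ci in zip(xr, sr, cr)]
        for xr, sr, cr in zip(x, self_out, cross_out)
    ]


# Toy features: 3 text tokens and 2 audio frames, both of dimension 2.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, -0.5], [0.2, 0.8]]
fused_text = parallel_self_cross(text, audio)
```

The fused output keeps the shape of the querying modality (here, one vector per text token), so the same function can be applied in the other direction with `parallel_self_cross(audio, text)`.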

Taxonomy

Topics: Emotion and Mood Recognition · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Residual Connection · Weight Decay · Dropout · Dense Connections · Attention Dropout · Linear Layer · Layer Normalization