WavFusion: Towards wav2vec 2.0 Multimodal Speech Emotion Recognition

Feng Li; Jiusong Luo; Wanjun Xia

arXiv:2412.05558·cs.SD·December 10, 2024


Open Access

TL;DR

WavFusion is a multimodal speech emotion recognition framework that combines a gated cross-modal attention mechanism with multimodal homogeneous feature discrepancy learning, and reports improved performance over existing state-of-the-art methods on benchmark datasets.

Contribution

The paper presents WavFusion, a new multimodal SER model that effectively captures cross-modal interactions and learns discriminative features, outperforming prior approaches.

Findings

01. WavFusion achieves higher accuracy on the IEMOCAP and MELD datasets.

02. The proposed model outperforms existing state-of-the-art methods.

03. Effective multimodal fusion improves emotion recognition performance.

Abstract

Speech emotion recognition (SER) remains a challenging yet crucial task due to the inherent complexity and diversity of human emotions. To address this problem, researchers attempt to fuse information from other modalities via multimodal learning. However, existing multimodal fusion techniques often overlook the intricacies of cross-modal interactions, resulting in suboptimal feature representations. In this paper, we propose WavFusion, a multimodal speech emotion recognition framework that addresses critical research problems in effective multimodal fusion, heterogeneity among modalities, and discriminative representation learning. By leveraging a gated cross-modal attention mechanism and multimodal homogeneous feature discrepancy learning, WavFusion demonstrates improved performance over existing state-of-the-art methods on benchmark datasets. Our work highlights the importance of…
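To make the idea of gated cross-modal attention concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the gate, so the form used here (a sigmoid over the concatenated audio and attended features, blending the two) is an assumption, as are the names `gated_cross_modal_attention`, `w_gate`, and `b_gate`. Learned query, key, and value projections are omitted for brevity.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_cross_modal_attention(audio, other, w_gate, b_gate):
    """Fuse audio frames with another modality (hypothetical gate form).

    audio:  T_a x d list of rows, used as queries
    other:  T_o x d list of rows, used as keys and values
    w_gate: d x 2d gate weights (assumed shape), b_gate: d gate biases
    Returns a T_a x d list of fused feature rows.
    """
    d = len(audio[0])
    other_t = [list(col) for col in zip(*other)]           # d x T_o
    scores = matmul(audio, other_t)                        # T_a x T_o
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, other)                      # T_a x d

    fused = []
    for a_row, c_row in zip(audio, attended):
        concat = a_row + c_row                             # length 2d
        gate = [
            sigmoid(sum(w * x for w, x in zip(w_row, concat)) + b)
            for w_row, b in zip(w_gate, b_gate)
        ]
        # Gate near 1 keeps the cross-modal context; near 0 keeps the audio.
        fused.append([g * c + (1.0 - g) * a for g, c, a in zip(gate, c_row, a_row)])
    return fused
```

With zero gate weights and biases the gate is 0.5 everywhere, so each output row is the average of the audio row and its attended context; a strongly negative bias makes the output fall back to the audio features alone. In a trained model the gate parameters would be learned, which lets the network suppress a modality when it is uninformative.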

Taxonomy

Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Speech and Audio Processing

Methods: Softmax · Attention Is All You Need