An Empirical Study and Improvement for Speech Emotion Recognition
Zhen Wu, Yizhe Lu, Xinyu Dai

TL;DR
This paper investigates the impact of fusion strategies in multimodal speech emotion recognition and proposes an improved model with perspective loss, achieving state-of-the-art results on the IEMOCAP dataset.
Contribution
It introduces a novel fusion approach and a perspective loss to enhance multimodal speech emotion recognition performance.
Findings
Achieved new state-of-the-art results on the IEMOCAP dataset
Demonstrated the effectiveness of the proposed fusion strategy
Provided analysis explaining the performance improvements
Abstract
Multimodal speech emotion recognition aims to detect speakers' emotions from audio and text. Prior works mainly focus on exploiting advanced networks to model and fuse information from the different modalities to improve performance, while neglecting the effect of different fusion strategies on emotion recognition. In this work, we consider a simple yet important question: how should audio and text modality information be fused to best help this multimodal task? Further, we propose a multimodal emotion recognition model improved by perspective loss. Empirical results show that our method obtains new state-of-the-art results on the IEMOCAP dataset. In-depth analysis explains why the improved model achieves these gains and outperforms the baselines.
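To make the phrase "fusion strategies" concrete, here is a minimal sketch of three common ways to combine an audio representation with a text representation. It is an illustration only: the abstract does not say which strategies the paper compares, and the function names (concat_fusion, weighted_sum_fusion, late_fusion) and the alpha parameter are hypothetical, not taken from the paper. The perspective loss is not sketched because the abstract does not define it.

```python
from typing import List

Vector = List[float]


def concat_fusion(audio: Vector, text: Vector) -> Vector:
    """Feature-level fusion: join the two feature vectors end to end."""
    return list(audio) + list(text)


def weighted_sum_fusion(audio: Vector, text: Vector, alpha: float = 0.5) -> Vector:
    """Feature-level fusion: blend equal-length vectors.

    alpha is the weight on the audio features (an assumed hyperparameter).
    """
    if len(audio) != len(text):
        raise ValueError("weighted-sum fusion needs vectors of equal length")
    return [alpha * a + (1.0 - alpha) * t for a, t in zip(audio, text)]


def late_fusion(audio_probs: Vector, text_probs: Vector) -> Vector:
    """Decision-level fusion: average per-class probabilities
    produced by two separate single-modality classifiers."""
    if len(audio_probs) != len(text_probs):
        raise ValueError("both classifiers must score the same classes")
    return [(a + t) / 2.0 for a, t in zip(audio_probs, text_probs)]


if __name__ == "__main__":
    # Toy feature vectors standing in for encoder outputs (made-up values).
    audio_features = [1.0, 3.0]
    text_features = [3.0, 1.0]
    print(concat_fusion(audio_features, text_features))
    print(weighted_sum_fusion(audio_features, text_features))
    print(late_fusion([0.5, 0.5], [1.0, 0.0]))
```

Concatenation keeps every feature from both modalities and leaves the weighting to later layers, while the weighted sum and the late-fusion average fix the balance between audio and text up front; which choice helps most is the kind of question the study poses.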
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Speech and Audio Processing
