Multimodal Emotion Recognition with High-level Speech and Text Features
Mariana Rodrigues Makiuchi, Kuniaki Uto, Koichi Shinoda

TL;DR
This paper introduces a multimodal emotion recognition system that combines high-level speech features from wav2vec 2.0 with text features from Transformer-based models, and reports higher accuracy than speech-only and text-only methods on the IEMOCAP dataset.
Contribution
It proposes a novel cross-representation speech model, inspired by disentanglement representation learning, and combines its predictions with those of a CNN-based text emotion recognizer through score fusion.
Findings
Outperforms existing speech-only and text-only emotion recognition methods.
Effective fusion of speech and text features enhances classification accuracy.
Evaluated on the IEMOCAP dataset in a 4-class emotion classification setting.
Abstract
Automatic emotion recognition is one of the central concerns of the Human-Computer Interaction field as it can bridge the gap between humans and machines. Current works train deep learning models on low-level data representations to solve the emotion recognition task. Since emotion datasets often have a limited amount of data, these approaches may suffer from overfitting, and they may learn based on superficial cues. To address these issues, we propose a novel cross-representation speech model, inspired by disentanglement representation learning, to perform emotion recognition on wav2vec 2.0 speech features. We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models. We further combine the speech-based and text-based results with a score fusion approach. Our method is evaluated on the IEMOCAP dataset in a 4-class classification…
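The abstract says the speech-based and text-based results are combined with a score fusion approach but, as quoted here, does not give the fusion rule. The sketch below shows one common form of late fusion, a weighted average of per-class probabilities, purely as an illustration. The class list, the function names, the speech_weight parameter, and the example logits are all assumptions and are not taken from the paper.

```python
import math

# The four emotion classes commonly used for IEMOCAP 4-class experiments
# (an assumption here; the summary above does not list them).
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(speech_logits, text_logits, speech_weight=0.5):
    """Weighted average of per-class probabilities from the two models.

    speech_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    if len(speech_logits) != len(text_logits):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= speech_weight <= 1.0:
        raise ValueError("speech_weight must lie in [0, 1]")
    p_speech = softmax(speech_logits)
    p_text = softmax(text_logits)
    return [
        speech_weight * ps + (1.0 - speech_weight) * pt
        for ps, pt in zip(p_speech, p_text)
    ]


def predict(speech_logits, text_logits, speech_weight=0.5):
    """Return the label with the highest fused score."""
    fused = fuse_scores(speech_logits, text_logits, speech_weight)
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best]


# Made-up logits for one utterance: the speech model leans "angry",
# the text model leans "sad" more strongly.
speech_example = [2.0, 0.0, 0.0, 1.0]
text_example = [0.0, 0.0, 0.0, 3.0]
fused_example = fuse_scores(speech_example, text_example)
label_example = predict(speech_example, text_example)
```

With equal weights, the more confident text prediction decides the fused label in this toy example. Setting speech_weight to 1.0 reduces the fusion to the speech model alone. In practice such a weight would typically be tuned on held-out data.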
Taxonomy
Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining
