Multimodal Speech Emotion Recognition Using Modality-specific Self-Supervised Frameworks
Rutherford Agbeshi Patamia, Paulo E. Santos, Kingsley Nketia Acheampong, Favour Ekong, Kwabena Sarpong, She Kun

TL;DR
This paper introduces a modality-specific self-supervised transformer framework for speech and text emotion recognition, leveraging data-efficient learning and multimodal fusion to outperform existing methods on the IEMOCAP dataset.
Contribution
It presents a novel self-supervised learning approach using transformer models for emotion recognition from speech and text, with effective multimodal fusion.
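To make the idea of multimodal fusion concrete, the following is a minimal sketch of one common strategy: concatenating a speech embedding and a text embedding and passing the result through a linear classifier. It is not taken from the paper and may differ from the authors' fusion method. The embedding sizes, the four-class label set, the random untrained weights, and all function names are assumptions chosen only for illustration.

```python
import math
import random

# Assumed label set, for illustration only.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def fuse_by_concatenation(speech_emb, text_emb):
    """Join the two modality embeddings into one feature vector."""
    return list(speech_emb) + list(text_emb)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, biases):
    """Apply one linear layer (one weight row per class), then softmax."""
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Toy stand-ins for pooled utterance embeddings; a real system would take
# these from pre-trained speech and text encoders (hypothetical sizes 6 and 4).
rng = random.Random(0)
speech_emb = [rng.uniform(-1, 1) for _ in range(6)]
text_emb = [rng.uniform(-1, 1) for _ in range(4)]

fused = fuse_by_concatenation(speech_emb, text_emb)

# Untrained random weights: the prediction is meaningless, only the data flow matters.
weights = [[rng.uniform(-0.5, 0.5) for _ in fused] for _ in EMOTIONS]
biases = [0.0] * len(EMOTIONS)

probs = classify(fused, weights, biases)
predicted = EMOTIONS[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

In this sketch each modality keeps its own encoder, and only the final embeddings are combined, which is what makes the encoders "modality-specific". In practice the classifier weights would be learned on labeled data such as IEMOCAP rather than drawn at random.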
Findings
Achieved 77.58% accuracy on the IEMOCAP dataset
Outperformed state-of-the-art emotion recognition methods
Demonstrated effectiveness of modality-specific pre-trained transformers
Abstract
Emotion recognition is a topic of significant interest in assistive robotics due to the need to equip robots with the ability to comprehend human behavior, facilitating their effective interaction in our society. Consequently, efficient and dependable emotion recognition systems supporting optimal human-machine communication are required. Multi-modality (including speech, audio, text, images, and videos) is typically exploited in emotion recognition tasks. Much relevant research is based on merging multiple data modalities and training deep learning models utilizing low-level data representations. However, most existing emotion databases are not large (or complex) enough to allow machine learning approaches to learn detailed representations. This paper explores modality-specific pre-trained transformer frameworks for self-supervised learning of speech and text representations for…
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Social Robot Interaction and HRI
