Multimodal Speech Emotion Recognition Using Modality-specific Self-Supervised Frameworks

Rutherford Agbeshi Patamia; Paulo E. Santos; Kingsley Nketia Acheampong; Favour Ekong; Kwabena Sarpong; She Kun

arXiv:2312.01568·cs.HC·December 5, 2023·1 cites

TL;DR

This paper introduces a modality-specific self-supervised transformer framework for speech and text emotion recognition, leveraging data-efficient learning and multimodal fusion to outperform existing methods on the IEMOCAP dataset.

Contribution

It presents a novel self-supervised learning approach using transformer models for emotion recognition from speech and text, with effective multimodal fusion.
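As a rough illustration of what multimodal fusion can look like, the sketch below concatenates a speech embedding and a text embedding and passes the result through a linear softmax classifier. This page does not say which fusion strategy the authors use, so concatenation is a stand-in baseline and not the paper's method. The embeddings, weights, dimensions, label set, and the function name `fuse_and_classify` are all hypothetical; in practice the embeddings would come from the modality-specific pre-trained transformers and the weights would be learned.

```python
import math
import random

# Assumed four-class label set, for illustration only.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(speech_emb, text_emb, weights, bias):
    """Concatenate two modality embeddings and apply a linear softmax layer.

    weights has one row per emotion class; each row must be as long as
    the concatenated (fused) embedding.
    """
    fused = list(speech_emb) + list(text_emb)
    if any(len(row) != len(fused) for row in weights):
        raise ValueError("each weight row must match the fused embedding size")
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Random placeholders standing in for real encoder outputs and trained weights.
rng = random.Random(0)
speech_emb = [rng.gauss(0, 1) for _ in range(8)]  # hypothetical speech embedding
text_emb = [rng.gauss(0, 1) for _ in range(6)]    # hypothetical text embedding
weights = [[rng.gauss(0, 0.1) for _ in range(14)] for _ in EMOTIONS]
bias = [0.0] * len(EMOTIONS)

probs = fuse_and_classify(speech_emb, text_emb, weights, bias)
predicted = EMOTIONS[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

Because the weights here are random, the predicted label carries no meaning; the sketch only shows the data flow from two per-modality vectors to one probability distribution over emotion classes.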

Findings

01 Achieved 77.58% accuracy on the IEMOCAP dataset

02 Outperformed state-of-the-art emotion recognition methods

03 Demonstrated the effectiveness of modality-specific pre-trained transformers

Abstract

Emotion recognition is a topic of significant interest in assistive robotics due to the need to equip robots with the ability to comprehend human behavior, facilitating their effective interaction in our society. Consequently, efficient and dependable emotion recognition systems supporting optimal human-machine communication are required. Multi-modality (including speech, audio, text, images, and videos) is typically exploited in emotion recognition tasks. Much relevant research is based on merging multiple data modalities and training deep learning models utilizing low-level data representations. However, most existing emotion databases are not large (or complex) enough to allow machine learning approaches to learn detailed representations. This paper explores modality-specific pre-trained transformer frameworks for self-supervised learning of speech and text representations for…

Code & Models

Repositories

Ruddy202/multimodal-SEmoR
tf · Official

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Social Robot Interaction and HRI