EmoTech: A Multi-modal Speech Emotion Recognition Using Multi-source Low-level Information with Hybrid Recurrent Network

Shamin Bin Habib Avro; Taieba Taher; Nursadul Mamun

arXiv:2501.12674 · eess.AS · January 23, 2025 · 3 cites

Open Access

TL;DR

EmoTech introduces a multi-modal speech emotion recognition system that combines audio and text low-level features using hybrid neural networks, achieving 84% accuracy and outperforming previous methods.

Contribution

This paper presents a novel multi-source low-level feature fusion approach with hybrid CNN and BiLSTM networks for improved emotion recognition.

Findings

01 Achieved 84% overall accuracy in emotion recognition.

02 Outperformed previous approaches on the same dataset and modalities.

03 Effectively combines audio and text features for robust emotion detection.

Abstract

Emotion recognition is a critical task in human-computer interaction, enabling more intuitive and responsive systems. This study presents a multimodal emotion recognition system that combines low-level information from audio and text, leveraging both Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory Networks (BiLSTMs). The proposed system consists of two parallel networks: an Audio Block and a Text Block. Mel Frequency Cepstral Coefficients (MFCCs) are extracted and processed by a BiLSTM network and a 2D convolutional network to capture low-level intrinsic and extrinsic features from speech. Simultaneously, a combined BiLSTM-CNN network extracts the low-level sequential nature of text from word embeddings corresponding to the available audio. This low-level information from speech and text is then concatenated and processed by several fully connected layers…
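The fusion step described in the abstract (concatenating the audio and text representations and passing them through fully connected layers) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the feature vectors stand in for the outputs of the Audio Block and Text Block, and the dimensions, number of emotion classes, ReLU activation, softmax output, random weights, and all function names are assumptions made for the example.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise ReLU (an assumed choice of hidden activation)."""
    return [max(0.0, v) for v in x]


def softmax(x):
    """Numerically stable softmax over a list of scores."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random, untrained weights and zero biases for a dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_and_classify(audio_feat, text_feat, hidden, output):
    """Concatenate the two modality vectors, then apply two dense layers."""
    fused = audio_feat + text_feat          # list concatenation = feature fusion
    h = relu(dense(fused, *hidden))
    return softmax(dense(h, *output))


# Hypothetical sizes; the abstract does not state the actual dimensions.
AUDIO_DIM, TEXT_DIM, HIDDEN_DIM, NUM_CLASSES = 8, 6, 5, 4

rng = random.Random(0)
hidden = init_layer(rng, AUDIO_DIM + TEXT_DIM, HIDDEN_DIM)
output = init_layer(rng, HIDDEN_DIM, NUM_CLASSES)

# Placeholders for the Audio Block and Text Block outputs.
audio_feat = [rng.random() for _ in range(AUDIO_DIM)]
text_feat = [rng.random() for _ in range(TEXT_DIM)]

probs = fuse_and_classify(audio_feat, text_feat, hidden, output)
print(len(probs), round(sum(probs), 6))
```

Because the weights are untrained, the resulting probabilities carry no meaning; the sketch only shows how the two modality vectors are joined before classification.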

Taxonomy

Topics: Emotion and Mood Recognition · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Bidirectional LSTM