EmoAugNet: A Signal-Augmented Hybrid CNN-LSTM Framework for Speech Emotion Recognition

Durjoy Chandra Paul, Gaurob Saha, and Md Amjad Hossain

arXiv:2508.06321 · cs.SD · August 11, 2025

TL;DR

EmoAugNet is a hybrid 1D-CNN and LSTM framework that combines noise addition, pitch shifting, time stretching, and a combination-based augmentation pipeline to improve speech emotion recognition accuracy across multiple datasets.

Contribution

The paper introduces EmoAugNet, a novel hybrid deep learning model with a comprehensive data augmentation pipeline for enhanced speech emotion recognition.

Findings

01

Achieved over 95% weighted accuracy on the IEMOCAP dataset.

02

Demonstrated robustness and improved generalization with combined augmentation methods.

03

Outperformed existing SER models on multiple benchmark datasets.

Abstract

Recognizing emotional signals in speech has a significant impact on enhancing the effectiveness of human-computer interaction (HCI). This study introduces EmoAugNet, a hybrid deep learning framework that combines Long Short-Term Memory (LSTM) layers with one-dimensional Convolutional Neural Networks (1D-CNN) to enable reliable Speech Emotion Recognition (SER). The quality and variety of the features extracted from speech signals have a significant impact on how well SER systems perform. A comprehensive speech data augmentation strategy combined traditional methods, such as noise addition, pitch shifting, and time stretching, with a novel combination-based augmentation pipeline to enhance generalization and reduce overfitting. Each audio sample was transformed into a high-dimensional feature vector using root mean square energy (RMSE), Mel-frequency Cepstral…
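The augmentation and feature steps named in the abstract can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration rather than the authors' pipeline: the function names (`add_noise`, `change_speed`, `rms_energy`, `augment`) are hypothetical, the parameter values (noise level 0.005, rate 1.1, frame length 256, hop 128) are not taken from the paper, and `change_speed` is a simple resampling stand-in that alters tempo and pitch together, whereas true time stretching and pitch shifting change one without the other.

```python
import math
import random


def add_noise(signal, noise_level=0.005, seed=0):
    """Add zero-mean Gaussian noise scaled by the signal's peak amplitude."""
    rng = random.Random(seed)
    peak = max((abs(x) for x in signal), default=0.0)
    return [x + rng.gauss(0.0, noise_level * peak) for x in signal]


def change_speed(signal, rate):
    """Resample by linear interpolation; rate > 1 shortens the signal.

    Simplified stand-in: this changes tempo and pitch together.
    """
    n_out = max(1, int(len(signal) / rate))
    last = len(signal) - 1
    out = []
    for i in range(n_out):
        pos = i * rate
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return out


def rms_energy(signal, frame_length=256, hop_length=128):
    """Frame-wise root mean square energy over full-length frames."""
    values = []
    stop = max(1, len(signal) - frame_length + 1)
    for start in range(0, stop, hop_length):
        frame = signal[start:start + frame_length]
        values.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return values


def augment(signal):
    """Return the original plus single and combined augmented variants."""
    return {
        "original": list(signal),
        "noise": add_noise(signal),
        "fast": change_speed(signal, 1.1),
        # Combination-based variant: chain two augmentations.
        "noise+fast": change_speed(add_noise(signal), 1.1),
    }


# Hypothetical input: 0.1 s of a 440 Hz tone sampled at 16 kHz.
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * t / sr) for t in range(1600)]
variants = augment(tone)
features = {name: rms_energy(x) for name, x in variants.items()}
```

Each entry in `features` is one RMS-energy sequence per augmented variant. A fuller feature vector would concatenate further descriptors, and a real pipeline would more likely rely on an audio library than on pure Python lists.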

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis