EmoAugNet: A Signal-Augmented Hybrid CNN-LSTM Framework for Speech Emotion Recognition
Durjoy Chandra Paul, Gaurob Saha, and Md Amjad Hossain

TL;DR
EmoAugNet is a hybrid CNN-LSTM framework that uses advanced data augmentation techniques to significantly improve speech emotion recognition accuracy across multiple datasets.
Contribution
The paper introduces EmoAugNet, a novel hybrid deep learning model with a comprehensive data augmentation pipeline for enhanced speech emotion recognition.
Findings
Achieved over 95% weighted accuracy on the IEMOCAP dataset.
Demonstrated robustness and improved generalization with combined augmentation methods.
Outperformed existing SER models on multiple benchmark datasets.
Abstract
Recognizing emotional signals in speech plays a significant role in enhancing the effectiveness of human-computer interaction (HCI). This study introduces EmoAugNet, a hybrid deep learning framework that combines Long Short-Term Memory (LSTM) layers with one-dimensional Convolutional Neural Networks (1D-CNN) to enable reliable Speech Emotion Recognition (SER). The performance of SER systems depends heavily on the quality and variety of the features extracted from speech signals. A comprehensive speech data augmentation strategy combined traditional methods, such as noise addition, pitch shifting, and time stretching, with a novel combination-based augmentation pipeline to enhance generalization and reduce overfitting. Each audio sample was transformed into a high-dimensional feature vector using root mean square energy (RMSE), Mel-frequency Cepstral…
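As a rough illustration of two ingredients the abstract names, noise addition and root mean square energy, the sketch below adds white Gaussian noise to a signal and computes frame-wise RMS energy using only NumPy. It is not the authors' implementation: the function names `add_noise` and `frame_rms`, the SNR-based noise scaling, the 20 dB setting, the frame and hop lengths, and the synthetic tone standing in for a speech clip are all assumptions made for this example.

```python
import numpy as np


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled to an approximate target SNR in dB.

    Hypothetical helper: the source does not say how noise is scaled.
    """
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def frame_rms(signal, frame_length=2048, hop_length=512):
    """Root mean square energy per frame.

    No padding is applied, so trailing samples that do not fill a
    whole frame are dropped. Frame and hop sizes are assumed values.
    """
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    rms = np.empty(n_frames)
    for i in range(n_frames):
        start = i * hop_length
        frame = signal[start:start + frame_length]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return rms


# Synthetic one-second 220 Hz tone used in place of a real speech clip.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = 0.5 * np.sin(2.0 * np.pi * 220.0 * t)

rng = np.random.default_rng(0)
noisy = add_noise(clean, snr_db=20.0, rng=rng)

clean_rms = frame_rms(clean)
noisy_rms = frame_rms(noisy)

# Check that the injected noise sits close to the requested SNR.
measured_snr_db = 10.0 * np.log10(
    np.mean(clean ** 2) / np.mean((noisy - clean) ** 2)
)
```

In a fuller pipeline the same feature function would presumably be applied to both the original and the augmented copies, so each clip contributes several training examples. Pitch shifting and time stretching are omitted here because they need a resampling or phase-vocoder step that is beyond a short NumPy-only sketch.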
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis
