Toward Efficient Speech Emotion Recognition via Spectral Learning and Attention
HyeYoung Lee, Muhammad Nadeem

TL;DR
This paper introduces a spectral learning and attention-based 1D-CNN framework for speech emotion recognition, achieving high accuracy and setting new benchmarks across multiple datasets.
Contribution
It proposes a novel SER method using MFCC spectral features combined with attention-enhanced 1D-CNNs and data augmentation, improving subtle emotion detection and cross-dataset generalization.
Findings
Achieved state-of-the-art accuracy on multiple SER datasets.
Demonstrated robustness and high precision in emotion recognition.
Enhanced model generalization across diverse speech datasets.
Abstract
Speech Emotion Recognition (SER) traditionally relies on auditory data analysis for emotion classification. Although many SER methods have been proposed, they often struggle to capture subtle emotional variations and to generalize across diverse datasets. In this article, we use Mel-Frequency Cepstral Coefficients (MFCCs) as spectral features to bridge the gap between computational emotion processing and human auditory perception. To further improve robustness and feature diversity, we propose a novel 1D-CNN-based SER framework that integrates data augmentation techniques. MFCC features extracted from the augmented data are processed using a 1D Convolutional Neural Network (CNN) architecture enhanced with channel and spatial attention mechanisms. These attention modules allow the model to highlight key emotional patterns, enhancing its ability to capture…
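The abstract excerpt does not specify how the channel and spatial attention modules are built, so the following is only a minimal sketch of one common pattern (CBAM-style attention adapted to a 1D feature map), not the authors' implementation. It assumes a feature map of shape (channels, frames), such as the output of a 1D convolution over MFCC frames. The function names, the reduction ratio, the kernel width, and the random weights are all hypothetical placeholders; in a trained model the weights would be learned.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x, w1, w2):
    """Rescale each channel of x (channels, frames) by a weight in (0, 1).

    w1: (hidden, channels) and w2: (channels, hidden) form a small MLP
    shared by the average-pooled and max-pooled channel descriptors.
    """
    avg = x.mean(axis=1)  # (channels,) average over time
    mx = x.max(axis=1)    # (channels,) maximum over time

    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)  # linear -> ReLU -> linear

    weights = sigmoid(mlp(avg) + mlp(mx))  # (channels,)
    return x * weights[:, None]


def spatial_attention(x, kernel):
    """Rescale each frame of x (channels, frames) by a weight in (0, 1).

    kernel: (2, k) with odd k, applied as a zero-padded 1D convolution
    over the channel-wise average and maximum at each frame.
    """
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])  # (2, frames)
    k = kernel.shape[1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad)))  # zero padding in time
    frames = x.shape[1]
    scores = np.array(
        [np.sum(padded[:, t:t + k] * kernel) for t in range(frames)]
    )
    return x * sigmoid(scores)[None, :]


# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
channels, frames, reduction, width = 8, 20, 2, 7
features = rng.standard_normal((channels, frames))
w1 = rng.standard_normal((channels // reduction, channels)) * 0.1
w2 = rng.standard_normal((channels, channels // reduction)) * 0.1
kernel = rng.standard_normal((2, width)) * 0.1

out = spatial_attention(channel_attention(features, w1, w2), kernel)
print(out.shape)
```

Because both modules multiply by sigmoid gates, the output keeps the input shape and each value is only ever scaled down, which is how such gates emphasize some channels and frames relative to others.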
Taxonomy
Topics: Emotion and Mood Recognition · Music and Audio Processing · Speech Recognition and Synthesis
