Speech Emotion Detection Based on MFCC and CNN-LSTM Architecture
Qianhe Ouyang

TL;DR
This paper presents a CNN-LSTM model that uses MFCC features for speech emotion detection. It reaches 61.07% overall accuracy on a combined SAVEE and RAVDESS dataset covering seven emotions, and its per-emotion results indicate that classification performance varies with how distinctive each emotion is.
Contribution
It introduces a hybrid CNN-LSTM architecture that takes MFCC features as input for speech emotion recognition and evaluates it on a seven-emotion dataset.
Findings
Achieved 61.07% overall accuracy.
Anger and neutral emotions detected with over 75% accuracy.
Emotion classification accuracy varies with emotion distinctiveness.
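The gap between the overall figure and the per-emotion figures above comes from how each is computed. The sketch below illustrates the arithmetic on a small confusion matrix. The matrix values, the three-label subset, and the function names are hypothetical and chosen only for illustration; they are not the paper's results. It also assumes that "accuracy" for a single emotion means the fraction of that emotion's samples predicted correctly (per-class recall), which the summary does not state explicitly.

```python
# Hypothetical toy confusion matrix (NOT the paper's data).
# Rows are true emotions, columns are predicted emotions.
LABELS = ["anger", "neutral", "sad"]
CONFUSION = [
    [8, 1, 1],  # true anger
    [1, 8, 1],  # true neutral
    [3, 3, 4],  # true sad
]


def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_accuracy(matrix, labels):
    """Fraction of each true class predicted correctly (per-class recall)."""
    return {
        label: row[i] / sum(row)
        for i, (label, row) in enumerate(zip(labels, matrix))
    }


if __name__ == "__main__":
    print(f"overall: {overall_accuracy(CONFUSION):.2%}")
    for label, acc in per_class_accuracy(CONFUSION, LABELS).items():
        print(f"{label}: {acc:.2%}")
```

In this toy case two classes sit well above the overall figure while a less separable class pulls it down, which is the same pattern the findings describe for anger and neutral against the 61.07% overall accuracy.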
Abstract
Emotion detection techniques have been applied in many settings, mainly using facial image features and vocal audio features. The latter remains contested, owing both to the complexity of speech audio processing and to the difficulty of extracting appropriate features. Parts of the SAVEE and RAVDESS datasets are selected and combined into a single dataset containing thousands of samples across seven common emotions (happy, neutral, sad, anger, disgust, fear, and surprise). Using the Librosa package, this paper converts the raw audio input into waveplots and spectra for analysis and focuses on multiple features, including MFCC, as targets for feature extraction. A hybrid CNN-LSTM architecture is adopted for its strong capability to handle sequential data and time series; it mainly consists of four convolutional layers and three long…
