FSER: Deep Convolutional Neural Networks for Speech Emotion Recognition
Bonaventure F. P. Dossou, Yeno K. S. Gbenou

TL;DR
This paper introduces FSER, a deep convolutional neural network that uses mel-spectrograms to accurately classify speech emotions across multiple datasets, outperforming previous models with 95.05% accuracy.
Contribution
The paper presents FSER, a novel CNN-based speech emotion recognition model trained on multiple datasets, achieving state-of-the-art accuracy and robustness across languages and demographics.
Findings
FSER achieves 95.05% accuracy on emotion classification.
FSER outperforms previous models on all benchmark datasets.
FSER remains reliable regardless of language, sex identity, or other external factors.
Abstract
Using mel-spectrograms rather than conventional MFCC features, we assess the ability of convolutional neural networks to accurately recognize and classify emotions from speech data. We introduce FSER, a speech emotion recognition model trained on four valid speech databases, achieving a high classification accuracy of 95.05% over 8 different emotion classes: anger, anxiety, calm, disgust, happiness, neutral, sadness, and surprise. On each benchmark dataset, FSER outperforms the best models introduced so far, achieving state-of-the-art performance. We show that FSER stays reliable independently of language, sex identity, and any other external factor. Additionally, we describe how FSER could potentially be used to improve mental and emotional health care, and how our analysis and findings serve as guidelines and benchmarks for further work in the same direction.
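To make the input representation concrete, the following is a minimal sketch of how a log-mel-spectrogram can be computed from a waveform. It is not the authors' pipeline: the abstract gives no preprocessing details, so the function names, the frame size (256), hop (128), number of mel bands (20), and the synthetic 1 kHz test tone are all illustrative assumptions. The mel conversion uses the common 2595 * log10(1 + f / 700) formula, and a naive DFT stands in for the FFT that an audio library would normally provide.

```python
import cmath
import math


def hz_to_mel(f):
    # Common (HTK-style) Hz-to-mel mapping; an assumption, not taken from the paper.
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sr / 2)
    # n_mels + 2 edge points: each filter spans three consecutive points.
    hz_points = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_freqs = [k * sr / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        row = []
        for f in bin_freqs:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def power_spectrum(frame):
    """Hann-windowed power spectrum via a naive DFT (slow, but dependency-free)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_spectrogram(signal, sr, n_fft=256, hop=128, n_mels=20):
    """Return a list of frames, each holding n_mels log-mel energies in dB."""
    bank = mel_filterbank(n_mels, n_fft, sr)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        power = power_spectrum(signal[start:start + n_fft])
        energies = [sum(w * p for w, p in zip(row, power)) for row in bank]
        frames.append([10.0 * math.log10(e + 1e-10) for e in energies])
    return frames


# Hypothetical input: a 1 kHz sine sampled at 8 kHz, standing in for speech.
sr = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(1024)]
spec = mel_spectrogram(tone, sr)
print(len(spec), "frames x", len(spec[0]), "mel bands")
```

The resulting two-dimensional array (time frames by mel bands) is the kind of image-like input a convolutional network can consume. MFCCs would apply a further discrete cosine transform to each frame and keep only the first few coefficients, which discards some of the spectral detail retained here.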
