Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study
Lucky Onyekwelu-Udoka, Md Shafiqul Islam, Md Shahedul Hasan

TL;DR
This study compares two lightweight transformer models, DistilHuBERT and PaSST, against a CNN-LSTM baseline for speech emotion recognition on CREMA-D. DistilHuBERT achieves the highest accuracy with the smallest model size, which makes it a candidate for real-time applications on edge devices.
Contribution
The paper presents a comparative analysis of lightweight transformer models for speech emotion detection, together with an ablation of PaSST classification heads, and highlights DistilHuBERT's superior accuracy and small model size.
Findings
DistilHuBERT achieves 70.64% accuracy and 70.36% F1 score.
PaSST with an MLP head performs best among its three head variants, but still trails DistilHuBERT.
Anger is detected most accurately; disgust is the most challenging emotion to detect.
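
To make the reported metrics concrete, here is a minimal sketch of how accuracy and F1 can be computed for a multi-class emotion classifier. The summary does not say which F1 averaging the authors used, so macro-averaging is an assumption here, and the toy labels are invented for illustration and are not results from the paper.

```python
from collections import Counter

# The six CREMA-D emotion categories (label spellings are illustrative).
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad"]


def accuracy(y_true, y_pred):
    """Fraction of clips whose predicted label matches the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred, labels):
    """F1 per class, using F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # A class with no true or predicted samples scores 0.0 by convention.
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 (ASSUMPTION: macro averaging)."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Toy data, invented for illustration only.
toy_true = ["angry", "angry", "disgust", "disgust", "happy", "sad"]
toy_pred = ["angry", "angry", "sad", "disgust", "happy", "sad"]
toy_labels = ["angry", "disgust", "happy", "sad"]

print("accuracy:", accuracy(toy_true, toy_pred))
print("per-class F1:", per_class_f1(toy_true, toy_pred, toy_labels))
print("macro F1:", macro_f1(toy_true, toy_pred, toy_labels))
```

The per-class scores are what a finding such as "anger is easiest, disgust is hardest" would be read from: a class that is often confused with others accumulates false negatives and its F1 drops, even when overall accuracy stays high.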
Abstract
Emotion recognition from speech plays a vital role in the development of empathetic human-computer interaction systems. This paper presents a comparative analysis of lightweight transformer-based models, DistilHuBERT and PaSST, by classifying six core emotions from the CREMA-D dataset. We benchmark their performance against a traditional CNN-LSTM baseline model using MFCC features. DistilHuBERT demonstrates superior accuracy (70.64%) and F1 score (70.36%) while maintaining an exceptionally small model size (0.02 MB), outperforming both PaSST and the baseline. Furthermore, we conducted an ablation study on three PaSST variants (Linear, MLP, and Attentive Pooling heads) to understand the effect of classification head architecture on model performance. Our results indicate that PaSST with an MLP head yields the best performance among its variants but still falls short of…
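
The three head types named in the ablation can be sketched as follows. This is an illustrative NumPy forward pass, not the authors' implementation: it assumes the encoder yields a sequence of frame-level embeddings of shape (T, D), that the Linear and MLP heads operate on a mean-pooled vector, and that attentive pooling uses a single learned scoring vector. The dimensions and all weight names are hypothetical.

```python
import numpy as np

NUM_CLASSES = 6   # six emotions, as in the abstract
EMBED_DIM = 16    # toy embedding size (assumption, not from the paper)
HIDDEN_DIM = 8    # toy MLP hidden size (assumption)


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def linear_head(frames, W, b):
    """Mean-pool frames over time, then apply one linear layer."""
    pooled = frames.mean(axis=0)          # (D,)
    return pooled @ W + b                 # (NUM_CLASSES,)


def mlp_head(frames, W1, b1, W2, b2):
    """Mean-pool frames, then a two-layer MLP with ReLU in between."""
    pooled = frames.mean(axis=0)                   # (D,)
    hidden = np.maximum(pooled @ W1 + b1, 0.0)     # (HIDDEN_DIM,)
    return hidden @ W2 + b2                        # (NUM_CLASSES,)


def attentive_pooling_head(frames, v, W, b):
    """Weight each frame by a learned relevance score before pooling."""
    weights = softmax(frames @ v)         # (T,), sums to 1
    pooled = weights @ frames             # (D,), weighted average of frames
    return pooled @ W + b, weights


# Random toy inputs and weights, for shape checking only.
rng = np.random.default_rng(0)
frames = rng.normal(size=(50, EMBED_DIM))          # 50 frames of one clip
W = rng.normal(size=(EMBED_DIM, NUM_CLASSES))
b = np.zeros(NUM_CLASSES)
W1 = rng.normal(size=(EMBED_DIM, HIDDEN_DIM))
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.normal(size=(HIDDEN_DIM, NUM_CLASSES))
b2 = np.zeros(NUM_CLASSES)
v = rng.normal(size=EMBED_DIM)

logits_linear = linear_head(frames, W, b)
logits_mlp = mlp_head(frames, W1, b1, W2, b2)
logits_attn, attn_weights = attentive_pooling_head(frames, v, W, b)
print(logits_linear.shape, logits_mlp.shape, logits_attn.shape)
```

Under these assumptions the heads differ only in how much capacity sits between the encoder output and the class logits. With a zero scoring vector, attentive pooling assigns equal weight to every frame and reduces to the mean-pooled linear head.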
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Sentiment Analysis and Opinion Mining
