Distilled HuBERT for Mobile Speech Emotion Recognition: A Cross-Corpus Validation Study
Saifelden M. Ismail

TL;DR
This study presents a compact, 8-bit quantized transformer-based speech emotion recognition system for mobile devices. It reports competitive accuracy under speaker-independent cross-validation on IEMOCAP and improved generalization from cross-corpus training on CREMA-D.
Contribution
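We build a mobile-efficient SER system on DistilHuBERT, a distilled and 8-bit quantized transformer that cuts parameters by approximately 92% relative to full-scale Wav2Vec 2.0 models while maintaining competitive accuracy, and we validate it with 5-fold Leave-One-Session-Out cross-validation on IEMOCAP combined with cross-corpus training on CREMA-D.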
Findings
Approximately 92% parameter reduction relative to full-scale Wav2Vec 2.0 models, with competitive accuracy
Cross-corpus training with CREMA-D improves Weighted Accuracy by 1.2% and Macro F1-score by 1.4%
The model achieves 61.4% Unweighted Accuracy with a 23 MB footprint
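The findings quote both Weighted and Unweighted Accuracy. As a minimal sketch, assuming the usual SER convention (Weighted Accuracy is overall accuracy across all utterances; Unweighted Accuracy is the mean of per-class recalls), the two can be computed as below. The function names and toy labels are illustrative assumptions, not the paper's code or data.

```python
from collections import defaultdict

def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: fraction of all utterances classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so each emotion class counts equally."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / total[c] for c in total) / len(total)

# Toy, imbalanced labels (illustrative only, not results from the paper).
y_true = ["neu"] * 6 + ["ang"] * 2 + ["sad"] * 2
y_pred = ["neu"] * 6 + ["ang", "neu"] + ["neu", "neu"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA={wa:.2f} UA={ua:.2f}")
```

On imbalanced data the two metrics diverge: a model that favors the majority class keeps a high Weighted Accuracy while its Unweighted Accuracy drops, which is why SER papers commonly report both.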
Abstract
Speech Emotion Recognition (SER) has significant potential for mobile applications, yet deployment remains constrained by the computational demands of state-of-the-art transformer architectures. This paper presents a mobile-efficient SER system based on DistilHuBERT, a distilled and 8-bit quantized transformer that achieves approximately 92% parameter reduction compared to full-scale Wav2Vec 2.0 models while maintaining competitive accuracy. We conduct a rigorous 5-fold Leave-One-Session-Out (LOSO) cross-validation on the IEMOCAP dataset to ensure speaker independence, augmented with cross-corpus training on CREMA-D to enhance generalization. Cross-corpus training with CREMA-D yields a 1.2% improvement in Weighted Accuracy, a 1.4% gain in Macro F1-score, and a 32% reduction in cross-fold variance, with the Neutral class showing the most substantial benefit at 5.4% F1-score improvement.…
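The Leave-One-Session-Out protocol described in the abstract can be sketched as follows. IEMOCAP is recorded in five sessions with a different speaker pair in each, so holding out one session keeps test speakers unseen during training. This is a hedged illustration: the record layout, the `loso_folds` helper, and the choice to append the auxiliary corpus to the training split only are assumptions, not details confirmed by the paper.

```python
def loso_folds(utterances, extra_train=None):
    """Yield (held_out_session, train, test) with one session held out per fold.

    extra_train: optional auxiliary-corpus records (e.g. CREMA-D) added to the
    training split only. Assumption: the test split stays pure IEMOCAP.
    """
    extra_train = list(extra_train or [])
    sessions = sorted({u["session"] for u in utterances})
    for held_out in sessions:
        train = [u for u in utterances if u["session"] != held_out]
        test = [u for u in utterances if u["session"] == held_out]
        yield held_out, train + extra_train, test

# Hypothetical records: 5 sessions with 3 utterances each.
iemocap = [
    {"id": f"Ses0{s}_utt{i}", "session": s, "label": "neu"}
    for s in range(1, 6)
    for i in range(3)
]
# Hypothetical auxiliary records; session=None marks them as non-IEMOCAP.
aux = [{"id": f"aux_{i}", "session": None, "label": "neu"} for i in range(4)]

folds = list(loso_folds(iemocap, extra_train=aux))
for held_out, train, test in folds:
    print(f"held-out session {held_out}: {len(train)} train, {len(test)} test")
```

Averaging a metric over the five folds, and reporting its spread, gives the kind of cross-fold variance figure the abstract refers to.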
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Music and Audio Processing
