Distilled HuBERT for Mobile Speech Emotion Recognition: A Cross-Corpus Validation Study

Saifelden M. Ismail

arXiv:2512.23435 · cs.SD · January 1, 2026

TL;DR

This study presents a compact, 8-bit quantized speech emotion recognition system built on DistilHuBERT for mobile devices. It reports competitive accuracy under speaker-independent cross-validation on IEMOCAP and improved robustness from cross-corpus training with CREMA-D.

Contribution

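We developed a mobile-efficient speech emotion recognition system based on DistilHuBERT, a distilled and 8-bit quantized transformer that is far smaller than full-scale Wav2Vec 2.0 models while maintaining competitive accuracy, and validated it with Leave-One-Session-Out cross-validation on IEMOCAP and cross-corpus training on CREMA-D.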

Findings

01

Approximately 92% parameter reduction compared to full-scale Wav2Vec 2.0 models, with competitive accuracy maintained

02

Cross-corpus training with CREMA-D improves Weighted Accuracy by 1.2% and Macro F1-score by 1.4%

03

Model achieves 61.4% Unweighted Accuracy with a 23 MB footprint

Abstract

Speech Emotion Recognition (SER) has significant potential for mobile applications, yet deployment remains constrained by the computational demands of state-of-the-art transformer architectures. This paper presents a mobile-efficient SER system based on DistilHuBERT, a distilled and 8-bit quantized transformer that achieves approximately 92% parameter reduction compared to full-scale Wav2Vec 2.0 models while maintaining competitive accuracy. We conduct a rigorous 5-fold Leave-One-Session-Out (LOSO) cross-validation on the IEMOCAP dataset to ensure speaker independence, augmented with cross-corpus training on CREMA-D to enhance generalization. Cross-corpus training with CREMA-D yields a 1.2% improvement in Weighted Accuracy, a 1.4% gain in Macro F1-score, and a 32% reduction in cross-fold variance, with the Neutral class showing the most substantial benefit at 5.4% F1-score improvement.…
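The evaluation protocol named in the abstract can be illustrated with a minimal sketch. It assumes the common SER convention that Weighted Accuracy is overall utterance-level accuracy and Unweighted Accuracy is the mean of per-class recalls; the paper's exact definitions are not given here. The function names (`loso_folds`, `weighted_accuracy`, `unweighted_accuracy`) and the toy labels are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def loso_folds(session_ids):
    """Yield (held_out_session, train_indices, test_indices) per session.

    Each distinct session is held out once, so a corpus with five
    sessions (as in IEMOCAP) produces five folds.
    """
    for held_out in sorted(set(session_ids)):
        train = [i for i, s in enumerate(session_ids) if s != held_out]
        test = [i for i, s in enumerate(session_ids) if s == held_out]
        yield held_out, train, test


def weighted_accuracy(y_true, y_pred):
    """Overall fraction of correct predictions (every utterance counts equally)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (every class counts equally)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy data (hypothetical): five utterances from three sessions.
toy_sessions = [1, 1, 2, 2, 3]
toy_folds = list(loso_folds(toy_sessions))

# Toy predictions (hypothetical): a classifier that always answers "neu".
toy_true = ["neu", "neu", "neu", "ang"]
toy_pred = ["neu", "neu", "neu", "neu"]
toy_wa = weighted_accuracy(toy_true, toy_pred)
toy_ua = unweighted_accuracy(toy_true, toy_pred)
```

On the toy labels, the majority-class predictor scores higher on Weighted Accuracy than on Unweighted Accuracy, which is why class-imbalanced emotion corpora are usually reported with both metrics.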

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Music and Audio Processing