Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations
Debaditya Shome, Ali Etemad

TL;DR
EmoDistill is a speech emotion recognition framework that uses cross-modal knowledge distillation to learn robust linguistic and prosodic features from speech, achieving state-of-the-art accuracy without requiring transcription during inference.
Contribution
The paper introduces EmoDistill, a novel distillation-based approach that enhances speech emotion recognition by leveraging pre-trained prosodic and linguistic teachers during training.
Findings
Outperforms existing unimodal and multimodal methods on IEMOCAP
Achieves state-of-the-art performance of 77.49% unweighted accuracy and 78.91% weighted accuracy
Reduces computation by using only speech signals during inference
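The paper reports both an unweighted and a weighted accuracy. The sketch below shows how the two differ, assuming the usual SER convention: weighted accuracy (WA) is the overall fraction of correct predictions, and unweighted accuracy (UA) is the mean of per-class recalls. The function names and toy labels are illustrative and do not come from the paper.

```python
def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: each utterance counts equally."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: each emotion class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Predictions for the utterances whose true label is `cls`.
        preds_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
        hits = sum(1 for p in preds_for_cls if p == cls)
        recalls.append(hits / len(preds_for_cls))
    return sum(recalls) / len(recalls)


# Toy, imbalanced example: six utterances of class 0, two of class 1.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0]

wa = weighted_accuracy(y_true, y_pred)    # 7 of 8 correct -> 0.875
ua = unweighted_accuracy(y_true, y_pred)  # (1.0 + 0.5) / 2 -> 0.75
```

On imbalanced data the two metrics diverge, because UA penalizes mistakes on rare classes more heavily than WA does.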
Abstract
We propose EmoDistill, a novel speech emotion recognition (SER) framework that leverages cross-modal knowledge distillation during training to learn strong linguistic and prosodic representations of emotion from speech. During inference, our method only uses a stream of speech signals to perform unimodal SER, thus reducing computation overhead and avoiding run-time transcription and prosodic feature extraction errors. During training, our method distills information at both embedding and logit levels from a pair of pre-trained Prosodic and Linguistic teachers that are fine-tuned for SER. Experiments on the IEMOCAP benchmark demonstrate that our method outperforms other unimodal and multimodal techniques by a considerable margin, and achieves state-of-the-art performance of 77.49% unweighted accuracy and 78.91% weighted accuracy. Detailed ablation studies demonstrate the impact of each…
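The abstract describes distillation at both the logit and embedding levels from two teachers. The sketch below is one minimal way such a combined training loss could look. It is not the paper's formulation. The assumptions are: temperature-scaled KL divergence for the logit term (the standard knowledge-distillation choice), cosine distance for the embedding term, and hypothetical weights `alpha` and `beta`. It uses plain Python to stay dependency-free; a real training loop would use a tensor library.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions (assumed loss)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return kl * temperature ** 2  # conventional T^2 gradient rescaling


def embedding_kd_loss(student_emb, teacher_emb):
    """Cosine distance between student and teacher embeddings (assumed loss)."""
    dot = sum(a * b for a, b in zip(student_emb, teacher_emb))
    norm_s = math.sqrt(sum(a * a for a in student_emb))
    norm_t = math.sqrt(sum(b * b for b in teacher_emb))
    return 1.0 - dot / (norm_s * norm_t)


def cross_entropy(logits, label):
    """Supervised loss against the ground-truth emotion label."""
    return -math.log(softmax(logits)[label])


def total_loss(student_logits, label, teachers, alpha=0.5, beta=0.5):
    """Supervised loss plus logit- and embedding-level terms per teacher.

    `teachers` is a list of (teacher_logits, teacher_emb, student_emb) tuples,
    one per teacher (e.g. prosodic and linguistic). `alpha` and `beta` are
    hypothetical weights, not values from the paper.
    """
    loss = cross_entropy(student_logits, label)
    for teacher_logits, teacher_emb, student_emb in teachers:
        loss += alpha * logit_kd_loss(student_logits, teacher_logits)
        loss += beta * embedding_kd_loss(student_emb, teacher_emb)
    return loss


# Toy values for a 4-class problem; all numbers are made up for illustration.
student_logits = [2.0, 0.5, -1.0, 0.1]
teachers = [
    ([2.5, 0.2, -0.8, 0.0], [0.9, 0.1, 0.3], [0.8, 0.2, 0.4]),  # "prosodic"
    ([1.8, 0.9, -1.2, 0.3], [0.2, 0.7, 0.5], [0.1, 0.6, 0.6]),  # "linguistic"
]
loss = total_loss(student_logits, label=0, teachers=teachers)
```

Because the teachers appear only inside the loss, none of them is needed at inference time, which matches the abstract's claim that the deployed model consumes speech alone.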
Taxonomy
Topics: Emotion and Mood Recognition · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Knowledge Distillation
