Speech Emotion: Investigating Model Representations, Multi-Task Learning and Knowledge Distillation
Vikramjit Mitra, Hsiang-Yun Sherry Chien, Vasudha Kowtha, Joseph Yitan Cheng, Erdrin Azemi

TL;DR
This paper explores using pre-trained model representations and knowledge distillation to enhance speech emotion recognition, achieving state-of-the-art results in valence estimation and emotion modeling across multiple datasets.
Contribution
It introduces a fusion approach of pre-trained embeddings and a distillation method to improve emotion recognition accuracy from speech signals.
Findings
Fusion of pre-trained embeddings improves valence CCC by 79%.
Knowledge distillation yields a 12% relative improvement.
Achieved new state-of-the-art results on MSP-Podcast datasets.
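The findings above are reported in terms of CCC, the concordance correlation coefficient, which this page does not define. As a minimal sketch, assuming the standard definition (twice the covariance divided by the sum of the two variances plus the squared difference of the means), it can be computed as follows. The function name `concordance_cc` and the sample values are hypothetical and are not taken from the paper.

```python
from statistics import fmean


def concordance_cc(predictions, labels):
    """Concordance correlation coefficient between two equal-length sequences.

    Illustrative helper (hypothetical name), using population statistics:
        CCC = 2 * cov(p, l) / (var(p) + var(l) + (mean(p) - mean(l)) ** 2)
    """
    if len(predictions) != len(labels) or len(predictions) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    mean_p = fmean(predictions)
    mean_l = fmean(labels)

    # Population variance and covariance (divide by N, not N - 1).
    var_p = fmean([(p - mean_p) ** 2 for p in predictions])
    var_l = fmean([(l - mean_l) ** 2 for l in labels])
    cov = fmean([(p - mean_p) * (l - mean_l)
                 for p, l in zip(predictions, labels)])

    denom = var_p + var_l + (mean_p - mean_l) ** 2
    if denom == 0:
        # Both sequences are constant and equal, so the ratio is 0/0.
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * cov / denom


# Made-up valence scores, for illustration only.
perfect = concordance_cc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shifted = concordance_cc([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
```

Unlike Pearson correlation, CCC penalizes a constant offset between predictions and labels: the `shifted` example is perfectly correlated with its labels yet scores below 1 because the means differ.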
Abstract
Estimating dimensional emotions, such as activation, valence and dominance, from acoustic speech signals has been widely explored over the past few years. While accurate estimation of activation and dominance from speech seems to be possible, the same for valence remains challenging. Previous research has shown that the use of lexical information can improve valence estimation performance. Lexical information can be obtained from pre-trained acoustic models, where the learned representations can improve valence estimation from speech. We investigate the use of pre-trained model representations to improve valence estimation from acoustic speech signals. We also explore fusion of representations to improve emotion estimation across all three emotion dimensions: activation, valence and dominance. Additionally, we investigate if representations from pre-trained models can be distilled into…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
