Representation learning through cross-modal conditional teacher-student training for speech emotion recognition
Sundararajan Srinivasan, Zhaocheng Huang, Katrin Kirchhoff

TL;DR
This paper introduces a cross-modal teacher-student training approach that enhances speech emotion recognition by integrating lexical information and estimating prediction quality, achieving state-of-the-art results on benchmark datasets.
Contribution
It proposes a novel method for improving speech emotion recognition by combining multimodal representations with a quality-conditioned teacher-student training framework.
Findings
Achieved new state-of-the-art concordance correlation coefficient (CCC) values on MSP-Podcast for activation, valence, and dominance.
Outperformed previous models in valence prediction by incorporating lexical information.
Demonstrated the effectiveness of quality estimation in teacher-student training for emotion recognition.
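For readers unfamiliar with the metric cited in the findings, the following is a minimal sketch of how a concordance correlation coefficient (CCC) is commonly computed between predicted and annotated values of one emotion dimension. It uses the standard definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), with population statistics. The function name `concordance_cc` and the example values are illustrative assumptions and do not come from the paper or its evaluation code.

```python
from statistics import fmean


def concordance_cc(predictions, labels):
    """Concordance correlation coefficient between two equal-length sequences.

    Illustrative helper (name and interface are assumptions, not the paper's code).
    Uses population (divide-by-n) variance and covariance.
    """
    if len(predictions) != len(labels) or len(predictions) < 2:
        raise ValueError("inputs must have the same length, at least 2")

    n = len(predictions)
    mean_p = fmean(predictions)
    mean_l = fmean(labels)

    # Population variances and covariance.
    var_p = sum((p - mean_p) ** 2 for p in predictions) / n
    var_l = sum((l - mean_l) ** 2 for l in labels) / n
    cov = sum((p - mean_p) * (l - mean_l) for p, l in zip(predictions, labels)) / n

    # The squared mean difference penalizes systematic bias,
    # which plain Pearson correlation would ignore.
    denom = var_p + var_l + (mean_p - mean_l) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both inputs are constant and equal")
    return 2 * cov / denom


if __name__ == "__main__":
    # Hypothetical valence scores: predictions are perfectly correlated
    # with the labels but shifted by a constant offset of 1.
    labels = [1.0, 2.0, 3.0]
    predictions = [2.0, 3.0, 4.0]
    print(concordance_cc(predictions, labels))
```

In this toy example the Pearson correlation would be 1, but the constant offset lowers the CCC to 4/7 (about 0.571), which is why CCC is a stricter measure of agreement for dimensional emotion prediction.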
Abstract
Generic pre-trained speech and text representations promise to reduce the need for large labeled datasets on specific speech and language tasks. However, it is not clear how to effectively adapt these representations for speech emotion recognition. Recent public benchmarks show the efficacy of several popular self-supervised speech representations for emotion classification. In this study, we show that the primary difference between the top-performing representations is in predicting valence while the differences in predicting activation and dominance dimensions are less pronounced. However, we show that even the best-performing HuBERT representation underperforms on valence prediction compared to a multimodal model that also incorporates text representation. We address this shortcoming by injecting lexical information into the speech representation using the multimodal model as a…
