Sentiment-Aware Automatic Speech Recognition pre-training for enhanced Speech Emotion Recognition
Ayoub Ghriss, Bo Yang, Viktor Rozgic, Elizabeth Shriberg, Chao Wang

TL;DR
This paper introduces a multi-task pre-training approach that combines automatic speech recognition and sentiment classification to improve speech emotion recognition accuracy, achieving state-of-the-art results on MSP-Podcast.
Contribution
It presents a novel multi-task pre-training method that enhances speech emotion recognition by making ASR models sentiment-aware through joint training.
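One common way to train on two tasks jointly is to add the two losses with a weighting factor. The sketch below illustrates that idea only; this summary does not state how the paper combines its ASR and sentiment losses. The function names, the `sentiment_weight` parameter, and the three-class sentiment setup are assumptions made for illustration, and the ASR loss is taken as a precomputed number.

```python
import math
from typing import Sequence


def softmax_cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross-entropy of one example, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def joint_pretraining_loss(
    asr_loss: float,
    sentiment_logits: Sequence[float],
    sentiment_target: int,
    sentiment_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Weighted sum of an ASR loss and a sentiment classification loss.

    asr_loss is assumed to be computed elsewhere (for example by a CTC or
    sequence-to-sequence objective). sentiment_target is a class index,
    such as a pseudo-label produced by a text-to-sentiment model.
    """
    sentiment_loss = softmax_cross_entropy(sentiment_logits, sentiment_target)
    return asr_loss + sentiment_weight * sentiment_loss


# Example: three assumed sentiment classes (negative, neutral, positive).
loss = joint_pretraining_loss(2.0, [0.0, 0.0, 0.0], sentiment_target=1,
                              sentiment_weight=0.5)
```

With uniform logits over three classes the sentiment term equals log 3, so the example value is 2.0 plus half of log 3.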
Findings
Achieved a CCC of 0.41 for valence prediction on MSP-Podcast
Demonstrated improved emotion recognition performance over baseline models
Proposed a sentiment-aware pre-training framework for speech models
Abstract
We propose a novel multi-task pre-training method for Speech Emotion Recognition (SER). We pre-train the SER model simultaneously on Automatic Speech Recognition (ASR) and sentiment classification tasks to make the acoustic ASR model more "emotion aware". We generate targets for the sentiment classification using a text-to-sentiment model trained on publicly available data. Finally, we fine-tune the acoustic ASR model on emotion-annotated speech data. We evaluated the proposed approach on the MSP-Podcast dataset, where we achieved the best reported concordance correlation coefficient (CCC) of 0.41 for valence prediction.
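For readers unfamiliar with the metric, the concordance correlation coefficient measures agreement between predicted and reference values, penalizing differences in mean and scale as well as weak correlation. Below is a minimal sketch using the standard definition with population (divide-by-n) statistics. The function name is an assumption, and the paper's own evaluation code may differ in details.

```python
from typing import Sequence


def concordance_cc(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Concordance correlation coefficient between two equal-length sequences.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean_pred - mean_gold)^2)
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        # Both sequences are constant and equal, so the ratio is undefined.
        raise ValueError("CCC is undefined for identical constant inputs")
    return 2 * cov / denom


# Predictions that track the reference perfectly but are offset by 1.
shifted = concordance_cc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Unlike Pearson correlation, which would be 1 for the shifted example, CCC drops to 4/7 because of the constant offset between predictions and references.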
Taxonomy
Topics: Speech Recognition and Synthesis · Sentiment Analysis and Opinion Mining · Speech and Audio Processing
