Adapting WavLM for Speech Emotion Recognition
Daria Diatlova, Anton Udalov, Vitalii Shutov, Egor Spirin

TL;DR
This paper investigates fine-tuning WavLM Large, a speech self-supervised model, for speech emotion recognition, exploring strategies involving gender and semantic information to improve performance on the MSP Podcast Corpus.
Contribution
It presents a systematic analysis of fine-tuning strategies for WavLM Large in speech emotion recognition, including the use of gender and semantic cues, and applies these to a challenge dataset.
Findings
Fine-tuning strategies significantly impact emotion recognition accuracy.
Incorporating gender and semantic information improves model performance.
The final model was submitted to the Speech Emotion Recognition Challenge 2024.
Abstract
Recently, the use of speech self-supervised learning (SSL) models for downstream tasks has drawn a lot of attention. While large pre-trained models commonly outperform smaller models trained from scratch, questions about the optimal fine-tuning strategies remain open. In this paper, we explore fine-tuning strategies for the WavLM Large model on the speech emotion recognition task using the MSP Podcast Corpus. More specifically, we perform a series of experiments focused on using gender and semantic information from utterances. We then summarize our findings and describe the final model we submitted to the Speech Emotion Recognition Challenge 2024.
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems
