LanSER: Language-Model Supported Speech Emotion Recognition
Taesik Gong, Josh Belanich, Krishna Somandepalli, Arsha Nagrani, Brian Eoff, Brendan Jou

TL;DR
LanSER introduces a weakly-supervised approach for speech emotion recognition that leverages large language models to infer emotion labels from speech transcripts, reducing reliance on costly labeled data.
Contribution
It proposes a novel method using pre-trained language models and textual entailment to generate weak labels for SER, enhancing scalability and label efficiency.
Findings
Models pre-trained on large datasets with this weak supervision outperform baseline models on standard SER datasets when fine-tuned.
The approach improves label efficiency in speech emotion recognition.
Representations capture prosodic speech content despite text-based label derivation.
Abstract
Speech emotion recognition (SER) models typically rely on costly human-labeled data for training, making scaling methods to large speech datasets and nuanced emotion taxonomies difficult. We present LanSER, a method that enables the use of unlabeled data by inferring weak emotion labels via pre-trained large language models through weakly-supervised learning. For inferring weak labels constrained to a taxonomy, we use a textual entailment approach that selects an emotion label with the highest entailment score for a speech transcript extracted via automatic speech recognition. Our experimental results show that models pre-trained on large datasets with this weak supervision outperform other baseline models on standard SER datasets when fine-tuned, and show improved label efficiency. Despite being pre-trained on labels derived only from text, we show that the resulting representations…
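The label-selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the hypothesis template, the function names, and the three-label taxonomy are assumptions made for the example, and `toy_entailment_score` is a keyword-overlap placeholder standing in for a pre-trained language model that would score how strongly a transcript entails each hypothesis.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical prompt template (an assumption; the exact prompt is not given here).
HYPOTHESIS_TEMPLATE = "The speaker feels {label}."


def infer_weak_label(
    transcript: str,
    taxonomy: Sequence[str],
    entailment_score: Callable[[str, str], float],
) -> Tuple[str, Dict[str, float]]:
    """Pick the taxonomy label whose hypothesis scores highest for the transcript."""
    if not taxonomy:
        raise ValueError("taxonomy must contain at least one label")
    # Score one hypothesis per candidate label against the transcript (the premise).
    scores = {
        label: entailment_score(transcript, HYPOTHESIS_TEMPLATE.format(label=label))
        for label in taxonomy
    }
    # The weak label is the argmax over entailment scores.
    best = max(scores, key=scores.get)
    return best, scores


# Toy cue words, invented for this example only.
TOY_CUES = {
    "joy": {"great", "love", "wonderful"},
    "anger": {"hate", "furious", "unacceptable"},
    "sadness": {"miss", "lost", "alone"},
}


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Placeholder scorer, NOT a language model: fraction of cue words found."""
    words = set(premise.lower().replace(".", " ").replace(",", " ").split())
    for label, cues in TOY_CUES.items():
        if label in hypothesis:
            return len(words & cues) / len(cues)
    return 0.0


label, scores = infer_weak_label(
    "I love this, it is wonderful.",
    ["joy", "anger", "sadness"],
    toy_entailment_score,
)
print(label, scores)
```

In practice the `entailment_score` argument would wrap a real entailment model applied to ASR transcripts; keeping it as a plain callable lets the selection logic stay independent of whichever model is used.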
