Pre-Finetuning for Few-Shot Emotional Speech Recognition

Maximillian Chen; Zhou Yu

arXiv:2302.12921 · cs.CL · November 8, 2024 · 1 citation

TL;DR

This paper introduces a pre-finetuning approach for speech models to improve few-shot emotional speech recognition, enhancing generalization across speakers and domains.

Contribution

It proposes a novel pre-finetuning method for speech models, inspired by NLP transfer learning, to better handle few-shot emotional speech classification tasks.

Findings

1. Pre-finetuning on diverse corpora improves few-shot recognition accuracy.
2. The approach reduces speaker overfitting in emotional speech tasks.
3. Experimental results show significant gains over baseline models.

Abstract

Speech models have long been known to overfit individual speakers for many classification tasks. This leads to poor generalization in settings where the speakers are out-of-domain or out-of-distribution, as is common in production environments. We view speaker adaptation as a few-shot learning problem and propose investigating transfer learning approaches inspired by recent success with pre-trained models in natural language tasks. We propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives. We pre-finetune Wav2Vec2.0 on every permutation of four multiclass emotional speech recognition corpora and evaluate our pre-finetuned models through 33,600 few-shot fine-tuning trials on the Emotional Speech Dataset.
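
To make the scale of "every permutation of four corpora" concrete, the sketch below enumerates candidate pre-finetuning orders. It is illustrative only and is not taken from the authors' code: the corpus names are placeholders (the abstract does not name the four corpora), and the `include_subsets` flag is an assumption, since the abstract does not say whether orderings of fewer than four corpora are also counted.

```python
from itertools import permutations

# Placeholder names (assumption): the abstract does not name the four corpora.
CORPORA = ["corpus_a", "corpus_b", "corpus_c", "corpus_d"]


def pre_finetuning_orders(corpora, include_subsets=False):
    """Return every ordered sequence of corpora to pre-finetune on.

    include_subsets=False: only orderings that use all corpora.
    include_subsets=True:  also orderings of 1, 2, ... corpora (an assumption
                           about what "every permutation" could cover).
    """
    if include_subsets:
        lengths = range(1, len(corpora) + 1)
    else:
        lengths = [len(corpora)]
    return [order for k in lengths for order in permutations(corpora, k)]


full_orders = pre_finetuning_orders(CORPORA)
all_orders = pre_finetuning_orders(CORPORA, include_subsets=True)

print(len(full_orders))  # 24 orderings of all four corpora (4!)
print(len(all_orders))   # 64 orderings of any non-empty subset (4 + 12 + 24 + 24)
```

Under either reading, each order would correspond to one pre-finetuned checkpoint that is then evaluated with repeated few-shot fine-tuning runs. How the 33,600 trials are divided across checkpoints is not stated in the abstract.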

Code & Models

Repositories

maxlchen/Speech-PreFinetuning
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing