Active Learning Based Fine-Tuning Framework for Speech Emotion Recognition
Dongyuan Li, Yusong Wang, Kotaro Funakoshi, Manabu Okumura

TL;DR
This paper introduces an active learning-based fine-tuning framework for speech emotion recognition (SER). It uses task adaptation pre-training to narrow the information gap between the pre-training task and the downstream SER task, then fine-tunes on a selectively chosen subset of informative samples, improving accuracy and reducing training time.
Contribution
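It combines task adaptation pre-training with active learning to enhance SER performance and efficiency, addressing two limitations of existing methods: the information gap between the speech recognition pre-training task and SER, and the time needed to fine-tune on each dataset.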
Findings
Fine-tuning on only 20% of the samples improves accuracy by 8.45%
Reduces fine-tuning time by 79%
Effective in large-scale noisy data scenarios
Abstract
Speech emotion recognition (SER) has drawn increasing attention for its applications in human-machine interaction. However, existing SER methods ignore the information gap between the pre-training speech recognition task and the downstream SER task, leading to sub-optimal performance. Moreover, they require much time to fine-tune on each specific speech dataset, restricting their effectiveness in real-world scenes with large-scale noisy data. To address these issues, we propose an active learning (AL)-based fine-tuning framework for SER that leverages task adaptation pre-training (TAPT) and AL methods to enhance performance and efficiency. Specifically, we first use TAPT to minimize the information gap between the pre-training and the downstream task. Then, AL methods are used to iteratively select a subset of the most informative and diverse samples for fine-tuning, reducing time…
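The abstract's selection step (picking informative and diverse samples each round) can be illustrated with a small sketch. This is not the authors' acquisition strategy, which the text above does not specify. It assumes prediction entropy as the measure of informativeness and greedy farthest-point selection over feature vectors as the measure of diversity. The names `select_batch`, `pool_factor`, `probs`, and `feats` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_batch(probs, feats, k, pool_factor=3):
    """Pick up to k unlabeled sample indices for the next fine-tuning round.

    probs: per-sample class probabilities from the current model (assumed input).
    feats: per-sample feature vectors, e.g. pooled embeddings (assumed input).
    """
    if k <= 0 or not probs:
        return []
    # Step 1 (informativeness): keep the pool_factor * k most uncertain samples.
    ranked = sorted(range(len(probs)), key=lambda i: entropy(probs[i]), reverse=True)
    candidates = ranked[: pool_factor * k]
    # Step 2 (diversity): start from the most uncertain candidate, then
    # repeatedly add the candidate farthest from everything chosen so far.
    chosen = [candidates[0]]
    while len(chosen) < min(k, len(candidates)):
        best = max(
            (i for i in candidates if i not in chosen),
            key=lambda i: min(sq_dist(feats[i], feats[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen
```

In an iterative loop, the selected indices would be labeled, added to the training subset, and the model fine-tuned again before the next call, so that `probs` reflects the updated model. The two-stage design filters by uncertainty first so that the diversity step does not pick samples that are far apart but already easy for the model.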
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques
