Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models
Reem Gody, David Harwath

TL;DR
This paper explores unsupervised data selection methods for fine-tuning self-supervised speech models like HuBERT, emphasizing diversity and novel selection techniques to improve ASR performance with limited transcribed data.
Contribution
It introduces two novel unsupervised data selection techniques, one based on pre-training loss and one based on the perplexity of byte pair encoded clustered units (PBPE), and analyzes their impact on ASR performance and data diversity.
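As a rough illustration of perplexity-based selection under a transcription budget, the sketch below scores each utterance and greedily fills a duration budget. It is not the authors' implementation. The add-one-smoothed unigram model stands in for whatever language model the paper trains over BPE-merged cluster units. The `tokens` and `duration` fields, the function names, and the `highest_first` switch are all hypothetical. This summary does not say whether high- or low-perplexity utterances are preferred, so the direction is left as a parameter.

```python
import math
from collections import Counter


def unigram_perplexity(tokens, counts, total, vocab_size):
    """Perplexity of a non-empty token sequence under an add-one-smoothed unigram model."""
    log_prob = 0.0
    for tok in tokens:
        p = (counts.get(tok, 0) + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))


def select_under_budget(utterances, budget_seconds, highest_first=True):
    """Greedily pick utterance ids by perplexity until the duration budget is full.

    utterances: list of dicts with hypothetical keys 'id', 'tokens', 'duration'.
    The unigram model is estimated on the same unlabeled pool it scores,
    which is an assumption of this sketch.
    """
    counts = Counter(tok for u in utterances for tok in u["tokens"])
    total = sum(counts.values())
    vocab_size = len(counts)

    ranked = sorted(
        utterances,
        key=lambda u: unigram_perplexity(u["tokens"], counts, total, vocab_size),
        reverse=highest_first,
    )

    selected, used = [], 0.0
    for u in ranked:
        # Skip any utterance that would overshoot the transcription budget.
        if used + u["duration"] <= budget_seconds:
            selected.append(u["id"])
            used += u["duration"]
    return selected
```

The same greedy loop would work with any per-utterance score, for example a pre-training loss value, by swapping the sort key.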
Findings
Greater token, speaker, and topic diversity in the selected data is associated with lower word error rate (WER).
Proposed selection methods outperform random selection.
Correlations between data characteristics and WER are identified.
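One way such an analysis could be set up is sketched below: compute a diversity measure for each candidate subset, then correlate it with the WER obtained after fine-tuning on that subset. Both helpers are illustrative assumptions rather than the paper's procedure. Shannon entropy is used here as one possible proxy for token diversity, and Pearson correlation as one possible correlation measure.

```python
import math
from collections import Counter


def token_entropy(tokens):
    """Shannon entropy (in bits) of the token distribution in a subset."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers.

    Assumes neither sequence is constant, since that would make the
    denominator zero.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Given one entropy value and one measured WER per fine-tuning subset, `pearson(entropies, wers)` would return a negative value if more diverse subsets tend to yield lower WER.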
Abstract
Self-supervised learning (SSL) has been able to leverage unlabeled data to boost the performance of automatic speech recognition (ASR) models when we have access to only a small amount of transcribed speech data. However, this raises the question of which subset of the available unlabeled data should be selected for transcription. Our work investigates different unsupervised data selection techniques for fine-tuning the HuBERT model under a limited transcription budget. We investigate the impact of speaker diversity, gender bias, and topic diversity on the downstream ASR performance. We also devise two novel techniques for unsupervised data selection: pre-training-loss-based data selection and the perplexity of byte pair encoded clustered units (PBPE), and we show how these techniques compare to pure random data selection. Finally, we analyze the correlations between the inherent…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
