Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models

Reem Gody; David Harwath

arXiv:2212.01661·eess.AS·December 6, 2022

TL;DR

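This paper studies unsupervised methods for choosing which unlabeled speech to transcribe when fine-tuning a self-supervised speech model (HuBERT) for ASR under a limited transcription budget. It examines the effect of speaker diversity, gender bias, and topic diversity on downstream ASR performance, and compares two new selection techniques against random selection.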

Contribution

The paper introduces two unsupervised data selection techniques, one based on pre-training loss and one based on the perplexity of byte pair encoded clustered units (PBPE), and analyzes their impact on ASR performance and data diversity.

Findings

01. Token, speaker, and topic diversity improve WER.

02. Proposed selection methods outperform random selection.

03. Correlations between data characteristics and WER are identified.

Abstract

Self-supervised learning (SSL) has been able to leverage unlabeled data to boost the performance of automatic speech recognition (ASR) models when we have access to only a small amount of transcribed speech data. However, this raises the question of which subset of the available unlabeled data should be selected for transcription. Our work investigates different unsupervised data selection techniques for fine-tuning the HuBERT model under a limited transcription budget. We investigate the impact of speaker diversity, gender bias, and topic diversity on the downstream ASR performance. We also devise two novel techniques for unsupervised data selection: pre-training loss based data selection and the perplexity of byte pair encoded clustered units (PBPE), and we show how these techniques compare to pure random data selection. Finally, we analyze the correlations between the inherent…
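
To make the idea of perplexity-based selection concrete, here is a minimal sketch of ranking unlabeled utterances by the perplexity of their discrete token sequences. It is not the authors' implementation. It assumes each utterance has already been converted to a sequence of integer token IDs (for example, clustered units after byte pair encoding), it swaps in an add-one-smoothed unigram model as the language model, and it measures the budget as a number of utterances rather than hours of audio. The function names and the `highest` flag are hypothetical; whether high- or low-perplexity utterances should be preferred is left to the caller.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def unigram_perplexities(utterances: Dict[str, Sequence[int]]) -> Dict[str, float]:
    """Score each utterance by its perplexity under a unigram model.

    Assumption: the model is an add-one-smoothed unigram distribution
    estimated from the whole pool. This is a stand-in, not the paper's model.
    """
    counts = Counter(tok for seq in utterances.values() for tok in seq)
    total = sum(counts.values())
    vocab = len(counts)
    scores: Dict[str, float] = {}
    for utt_id, seq in utterances.items():
        if not seq:
            continue  # an empty sequence has no defined perplexity
        log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in seq)
        scores[utt_id] = math.exp(-log_prob / len(seq))
    return scores


def select_by_perplexity(
    utterances: Dict[str, Sequence[int]], budget: int, highest: bool = True
) -> List[str]:
    """Return up to `budget` utterance IDs ranked by perplexity.

    Assumption: the budget is a count of utterances, a simplification of a
    transcription budget measured in hours.
    """
    scores = unigram_perplexities(utterances)
    # Ties are broken by utterance ID so the ranking is deterministic.
    ranked = sorted(scores, key=lambda u: (scores[u], u), reverse=highest)
    return ranked[:budget]


if __name__ == "__main__":
    pool = {"a": [1, 1, 1, 1], "b": [1, 2, 3, 4], "c": [1, 1, 2, 2]}
    print(select_by_perplexity(pool, budget=2))
```

In this toy pool, utterance "b" uses the rarest tokens and therefore ranks first when `highest=True`. A stronger language model or a duration-based budget can replace the corresponding pieces without changing the ranking logic.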

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems