Speech Corpora Divergence Based Unsupervised Data Selection for ASR
Changfeng Gao, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan

TL;DR
This paper introduces an unsupervised speech corpora divergence method for selecting training data that closely matches target speech characteristics, improving ASR performance without requiring labeled data.
Contribution
It proposes a novel unsupervised data selection approach based on speech corpora divergence, computed from the output of a self-supervised model, that attends to finer acoustic detail and preserves the diversity of the selected set.
Findings
Achieves 14.8% relative improvement over random selection
Performs comparably or better than supervised selection methods
Effective across different accents in Common Voice dataset
Abstract
Selecting data that matches the application scenario is important for automatic speech recognition (ASR) training, but it is difficult to measure how well a training corpus matches. This study proposes an unsupervised target-aware data selection method based on speech corpora divergence (SCD), which measures the similarity between two speech corpora. We first use the self-supervised HuBERT model to discretize the speech corpora into label sequences and calculate the N-gram probability distributions. Then we calculate the Kullback-Leibler divergence between the N-gram distributions as the SCD. Finally, we choose the subset with the minimum SCD to the target corpus for annotation and training. Compared to previous data selection methods, the SCD data selection method can focus on more acoustic details and guarantee the diversity of the selected set. We evaluate our method on different accents…
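The pipeline described in the abstract (discrete labels, N-gram distributions, KL divergence, minimum-divergence subset) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the HuBERT discretization has already been done upstream, so each utterance arrives as a list of integer labels. The function names (`ngram_counts`, `kl_divergence`, `scd`, `select_subset`), the additive smoothing constant, the direction of the divergence (target relative to candidate), and the choice among a few predefined candidate subsets are all illustrative assumptions; the abstract does not specify how the paper handles any of these.

```python
from collections import Counter
from math import log


def ngram_counts(sequences, n):
    """Count N-grams over a corpus given as lists of discrete unit labels."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts


def kl_divergence(target_counts, candidate_counts, smoothing=1e-3):
    """KL(target || candidate) over the union of observed N-grams.

    Additive smoothing is an assumption made here so that N-grams missing
    from one corpus do not produce a division by zero.
    """
    vocab = set(target_counts) | set(candidate_counts)
    t_total = sum(target_counts.values()) + smoothing * len(vocab)
    c_total = sum(candidate_counts.values()) + smoothing * len(vocab)
    kl = 0.0
    for gram in vocab:
        p = (target_counts[gram] + smoothing) / t_total
        q = (candidate_counts[gram] + smoothing) / c_total
        kl += p * log(p / q)
    return kl


def scd(target_seqs, candidate_seqs, n=2):
    """Sketch of a speech corpora divergence: KL between N-gram distributions."""
    return kl_divergence(ngram_counts(target_seqs, n),
                         ngram_counts(candidate_seqs, n))


def select_subset(target_seqs, candidate_subsets, n=2):
    """Return the name of the candidate subset with the smallest divergence."""
    return min(candidate_subsets,
               key=lambda name: scd(target_seqs, candidate_subsets[name], n))


# Toy label sequences standing in for discretized speech (made-up data).
target = [[1, 2, 3, 1, 2, 3], [1, 2, 3, 3]]
pool = {
    "a": [[1, 2, 3, 1, 2], [2, 3, 1, 2, 3]],  # shares bigrams with the target
    "b": [[4, 5, 4, 5, 4], [5, 5, 4, 4]],     # disjoint label inventory
}
best = select_subset(target, pool)
```

In this toy setup, subset `a` shares most of its bigrams with the target while `b` shares none, so `a` has the lower divergence. Scoring whole predefined subsets is the simplest possible search; a practical system would need a procedure for building the subset from a large unlabeled pool.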
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
