Unsupervised data selection for Speech Recognition with contrastive loss ratios

Chanho Park; Rehan Ahmad; Thomas Hain

arXiv:2207.12028 · eess.AS · July 26, 2022


TL;DR

This paper introduces an unsupervised data selection method for speech recognition based on contrastive loss ratios and a submodular function. Models trained on the selected data achieve a lower word error rate (WER) than models trained on data selected with GMM-HMM log-likelihoods.

Contribution

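It presents a novel unsupervised data selection technique that scores training data by the ratio of frame-level contrastive losses under models trained on the target and training sets, then uses submodular optimization to select a training set that matches the target data, so that a given amount of selected data yields a more accurate recognizer.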

Findings

1. Outperforms GMM-HMM log-likelihood-based selection in WER
2. Lowers WER by 20.23% relative to likelihood-based selection on Tedtalks when 10 hours of data are selected
3. Decreases WER by 6.26% on WSJCAM0 while using less data
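Per the abstract, the Tedtalks figure is a relative difference, not a drop of 20.23 absolute WER points. As a worked illustration (the function name and the WER values below are hypothetical, not taken from the paper), a relative reduction divides the absolute drop by the baseline WER:

```python
def relative_wer_reduction(baseline_wer: float, proposed_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - proposed) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


# Hypothetical WERs, not from the paper: going from 20.0% to 16.0% is a
# 4-point absolute drop but a 20% relative reduction.
print(relative_wer_reduction(20.0, 16.0))  # → 20.0
```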

Abstract

This paper proposes an unsupervised data selection method that uses a submodular function based on contrastive loss ratios of target and training data sets. A model with a contrastive loss function is trained on each set. The ratio of the two models' frame-level losses is then used by a submodular function, which selects a training set for automatic speech recognition that matches the target data set. Experiments show that models trained on data sets selected by the proposed method outperform models trained on data selected using log-likelihoods produced by GMM-HMM models, in terms of word error rate (WER). When selecting a fixed amount of data, e.g. 10 hours, the difference between the results of the two methods on Tedtalks was 20.23% WER relative. The method can also be used to select data with the aim of minimising negative transfer, while maintaining or improving…
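The pipeline in the abstract scores each training utterance by the ratio of frame-level contrastive losses under a target-trained and a pool-trained model, then picks a subset with a submodular function. It can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the authors' implementation. The per-frame discrete unit ids, the choice to accumulate positive log-ratios log(L_pool / L_target) per unit, the concave-over-modular function f(S) = Σ_u √m_u(S), and the cost-normalised greedy under a duration budget are all illustrative choices. The names `utterance_features`, `objective`, and `greedy_select` are hypothetical. Losses are assumed strictly positive so the logarithm is defined.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

# One frame: (hypothetical discrete unit id, loss under the target-trained
# model, loss under the pool-trained model). Losses must be > 0.
Frame = Tuple[int, float, float]


def utterance_features(frames: Sequence[Frame]) -> Dict[int, float]:
    """Map an utterance to nonnegative per-unit weights.

    Assumption: a frame is target-like when the pool model's loss exceeds
    the target model's loss, i.e. log(L_pool / L_target) > 0. Only these
    positive log-ratios are accumulated, per unit id.
    """
    feats: Dict[int, float] = defaultdict(float)
    for unit, loss_target, loss_pool in frames:
        log_ratio = math.log(loss_pool / loss_target)
        if log_ratio > 0.0:
            feats[unit] += log_ratio
    return dict(feats)


def objective(totals: Dict[int, float]) -> float:
    """Feature-based submodular function f(S) = sum_u sqrt(m_u(S))."""
    return sum(math.sqrt(v) for v in totals.values())


def greedy_select(utts: List[Sequence[Frame]], costs: List[float],
                  budget: float) -> List[int]:
    """Budgeted greedy: repeatedly add the utterance with the largest
    marginal gain per unit cost; stop when no remaining utterance fits
    the budget and adds positive gain. Costs must be > 0."""
    feats = [utterance_features(u) for u in utts]
    totals: Dict[int, float] = defaultdict(float)
    selected: List[int] = []
    remaining = set(range(len(utts)))
    spent = 0.0
    while True:
        base = objective(totals)
        best: Optional[int] = None
        best_score = 0.0
        for i in remaining:
            if spent + costs[i] > budget:
                continue  # would exceed the budget
            trial = dict(totals)
            for unit, w in feats[i].items():
                trial[unit] = trial.get(unit, 0.0) + w
            score = (objective(trial) - base) / costs[i]
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break
        selected.append(best)
        remaining.discard(best)
        spent += costs[best]
        for unit, w in feats[best].items():
            totals[unit] += w
    return selected


# Toy pool of three utterances (made-up numbers, for illustration only).
pool = [
    [(0, 1.0, 2.0), (1, 1.0, 2.0)],  # target-like on units 0 and 1
    [(0, 1.0, 2.0), (0, 1.0, 2.0)],  # target-like, but only on unit 0
    [(2, 2.0, 1.0)],                 # pool-like: contributes nothing
]
print(greedy_select(pool, costs=[2.0, 2.0, 1.0], budget=4.0))  # → [0, 1]
```

Frames that the pool model explains better than the target model add nothing. The square root gives diminishing returns, so once a unit is well covered, the greedy step favours utterances that add weight to other units. For large pools, a lazy-greedy priority queue is a standard way to avoid re-scoring every candidate each round.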
