Predicting positive transfer for improved low-resource speech recognition using acoustic pseudo-tokens

Nay San; Georgios Paraskevopoulos; Aryaman Arora; Xiluo He; Prabhjot Kaur; Oliver Adams; Dan Jurafsky

arXiv:2402.02302·eess.AS·February 6, 2024·2 cites

TL;DR

This paper shows that when a low-resource language has too little untranscribed speech for continued pre-training of a multilingual speech model, supplementing it with data from a similar, higher-resource donor language improves ASR performance. It also introduces a similarity metric to guide donor language selection.

Contribution

The paper proposes ATDS (Acoustic Token Distribution Similarity), a new similarity metric for predicting which donor languages will improve low-resource speech recognition.

Findings

01

Supplementing low-resource languages with similar donor data improves ASR.

02

ATDS accurately predicts the effectiveness of donor languages.

03

Using ATDS, donor selection can be optimized for better ASR outcomes.

Abstract

While massively multilingual speech models like wav2vec 2.0 XLSR-128 can be directly fine-tuned for automatic speech recognition (ASR), downstream performance can still be relatively poor on languages that are under-represented in the pre-training data. Continued pre-training on 70-200 hours of untranscribed speech in these languages can help, but what about languages without that much recorded data? For such cases, we show that supplementing the target language with data from a similar, higher-resource 'donor' language can help. For example, continued pre-training on only 10 hours of low-resource Punjabi supplemented with 60 hours of donor Hindi is almost as good as continued pre-training on 70 hours of Punjabi. By contrast, sourcing data from less similar donors like Bengali does not improve ASR performance. To inform donor language selection, we propose a novel similarity metric…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Sparse Evolutionary Training