Voice Conversion Can Improve ASR in Very Low-Resource Settings

Matthew Baas; Herman Kamper

arXiv:2111.02674·eess.AS·June 22, 2022

TL;DR

This paper shows that a voice conversion (VC) system trained on English can be applied cross-lingually to augment training data for low-resource languages, improving speech recognition accuracy and outperforming SpecAugment in those settings.

Contribution

The paper designs a practical VC system trained on English and shows that using it cross-lingually to augment training data improves speech recognition in low-resource languages.

Findings

01. VC augmentation improves recognition in all tested low-resource languages.

02. VC outperforms SpecAugment in low-resource settings.

03. Cross-lingual VC is feasible and beneficial for low-resource ASR.

Abstract

Voice conversion (VC) could be used to improve speech recognition systems in low-resource languages by using it to augment limited training data. However, VC has not been widely used for this purpose because of practical issues such as compute speed and limitations when converting to and from unseen speakers. Moreover, it is still unclear whether a VC model trained on one well-resourced language can be applied to speech from another low-resource language for the aim of data augmentation. In this work we assess whether a VC system can be used cross-lingually to improve low-resource speech recognition. We combine several recent techniques to design and train a practical VC system in English, and then use this system to augment data for training speech recognition models in several low-resource languages. When using a sensible amount of VC augmented data, speech recognition performance is…
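
The augmentation idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Utterance` type, the `augment_with_vc` helper, and the `convert(audio, target_speaker)` interface are all hypothetical names introduced here, and `identity_convert` is only a stand-in for a real VC model. The sketch assumes each converted copy keeps the original transcript and is added alongside the original utterances.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Utterance:
    audio: tuple        # placeholder for waveform samples (assumption)
    transcript: str
    speaker: str


def augment_with_vc(
    dataset: Sequence[Utterance],
    convert: Callable[[tuple, str], tuple],  # hypothetical VC model interface
    target_speakers: Sequence[str],
    copies_per_utterance: int = 1,
    seed: int = 0,
) -> List[Utterance]:
    """Return the original utterances plus voice-converted copies.

    Each copy reuses the source transcript, since voice conversion is
    meant to change the speaker identity but not the spoken content.
    """
    rng = random.Random(seed)
    augmented = list(dataset)  # keep all original data
    for utt in dataset:
        # Prefer target voices that differ from the source speaker.
        candidates = [s for s in target_speakers if s != utt.speaker]
        if not candidates:
            candidates = list(target_speakers)
        for _ in range(copies_per_utterance):
            target = rng.choice(candidates)
            new_audio = convert(utt.audio, target)
            augmented.append(Utterance(new_audio, utt.transcript, target))
    return augmented


def identity_convert(audio: tuple, target_speaker: str) -> tuple:
    """Stand-in for a real VC model: returns the audio unchanged."""
    return audio


# Toy corpus standing in for a small low-resource training set.
corpus = [
    Utterance((0.0, 0.1), "hello", "spk_a"),
    Utterance((0.2, 0.3), "world", "spk_b"),
]
train_set = augment_with_vc(
    corpus, identity_convert, ["spk_x", "spk_y"], copies_per_utterance=2
)
```

Here `copies_per_utterance` controls how much converted data is mixed in. The abstract refers to using "a sensible amount" of VC augmented data, so in practice this ratio would presumably be tuned rather than fixed.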

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
