Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models

Asad Ullah; Alessandro Ragano; Andrew Hines

arXiv:2309.12763·eess.AS·July 2, 2024

Open Access

TL;DR

This paper compares audio augmentation techniques for pre-training self-supervised speech models in low-resource settings and finds that combined synthetic augmentation (noise and pitch) outperforms accent and language knowledge transfer on phoneme recognition.

Contribution

It proposes a combined synthetic augmentation strategy (noise and pitch) for pre-training self-supervised representation learning (SSRL) models in low-resource settings and shows that it outperforms accent and language knowledge transfer.

Findings

01. Combined noise and pitch augmentation outperforms the other augmentation strategies compared (accented speech and other-language speech).

02. Scaling up the amount of augmented data can match the performance of pre-training on target-domain speech.

03. Synthetic augmentations are a viable alternative for resource-constrained languages.

Abstract

Self-supervised representation learning (SSRL) has outperformed supervised models on tasks including phoneme recognition. Training SSRL models is challenging for low-resource languages, where sufficient pre-training data may not be available. A common approach is cross-lingual pre-training. Instead, we propose to use audio augmentation techniques, namely pitch variation, noise addition, accented target-language speech, and other-language speech, to pre-train SSRL models in a low-resource condition, and we evaluate them on phoneme recognition. Our comparisons found that a combined synthetic augmentation strategy (noise and pitch) outperformed accent and language knowledge transfer. Furthermore, we examined the scaling factor of augmented data needed to achieve performance equivalent to a model pre-trained with target-domain speech. Our findings suggest that for resource-constrained…
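
As a rough illustration of the two synthetic augmentations named in the abstract (noise addition and pitch variation), the NumPy sketch below perturbs a mono waveform. The abstract does not state the noise type, the SNR levels, or the pitch-shifting method, so everything specific here is an assumption: white Gaussian noise, the `snr_db` and `semitones` parameters, and a naive resampling-based pitch change that also alters duration. The function names are hypothetical and are not taken from the paper.

```python
import numpy as np


def add_noise(speech, snr_db, rng):
    """Add white Gaussian noise at an (assumed) target SNR in dB."""
    signal_power = np.mean(speech ** 2)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=speech.shape)
    return speech + noise


def shift_pitch_by_resampling(speech, semitones):
    """Naive pitch change: resample by linear interpolation and keep the
    original sample rate on playback. Assumption: this also changes the
    duration, unlike a duration-preserving pitch shifter."""
    factor = 2 ** (semitones / 12)          # frequency ratio for the shift
    n_out = int(round(len(speech) / factor))
    old_idx = np.arange(len(speech))
    new_idx = np.linspace(0, len(speech) - 1, n_out)
    return np.interp(new_idx, old_idx, speech)


def augment(speech, semitones, snr_db, rng):
    """Combined noise/pitch perturbation: pitch first, then noise."""
    return add_noise(shift_pitch_by_resampling(speech, semitones), snr_db, rng)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample_rate = 16000                      # assumed sample rate
    t = np.arange(sample_rate) / sample_rate
    clean = np.sin(2 * np.pi * 220 * t)      # stand-in for a speech clip
    perturbed = augment(clean, semitones=2, snr_db=10.0, rng=rng)
    print(len(clean), len(perturbed))
```

Applying the pitch change before the noise keeps the added noise white at the requested SNR; the order, like the parameter values, is a design choice of this sketch and not something the abstract specifies.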

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Music and Audio Processing