Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models
Asad Ullah, Alessandro Ragano, Andrew Hines

TL;DR
This paper compares audio augmentation techniques for pre-training self-supervised speech models in low-resource settings and finds that combined synthetic augmentation (noise and pitch) outperforms accent and cross-lingual transfer on phoneme recognition.
Contribution
It introduces a combined synthetic augmentation strategy for pre-training self-supervised representation learning (SSRL) models in low-resource settings, which outperforms accent and language transfer methods.
Findings
Combined noise and pitch augmentation outperforms other augmentation strategies.
Scaling augmented data can match the performance of target domain pre-training.
Synthetic augmentations are a viable alternative for resource-constrained languages.
Abstract
Self-supervised representation learning (SSRL) has demonstrated better performance than supervised models on tasks including phoneme recognition. Training SSRL models poses a challenge for low-resource languages, where sufficient pre-training data may not be available. A common approach is cross-lingual pre-training. Instead, we propose to use audio augmentation techniques, namely pitch variation, noise addition, accented target-language speech, and other-language speech, to pre-train SSRL models in a low-resource condition and evaluate phoneme recognition. Our comparisons found that a combined synthetic augmentation strategy (noise and pitch) outperformed accent and language knowledge transfer. Furthermore, we examined the scaling factor of augmented data needed to achieve performance equivalent to a model pre-trained with target-domain speech. Our findings suggest that for resource-constrained…
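To make the two synthetic augmentations named in the abstract concrete, the sketch below shows one simple way to add noise at a chosen signal-to-noise ratio and to shift pitch by resampling. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the white Gaussian noise, the SNR and semitone values, and the resampling-based pitch shift (which also changes duration) are all hypothetical choices, and the paper's actual tools and parameters are not given in this summary.

```python
import numpy as np


def add_noise(wave, snr_db, rng):
    """Mix white Gaussian noise into `wave` at roughly `snr_db` dB SNR.

    Assumption: white noise is used purely for illustration; the paper may
    use a different noise source.
    """
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise


def naive_pitch_shift(wave, semitones):
    """Shift pitch by resampling with linear interpolation.

    This is a deliberately simple stand-in: it raises or lowers pitch but
    also shortens or lengthens the signal, unlike a phase-vocoder shift.
    """
    factor = 2 ** (semitones / 12)          # frequency ratio for the shift
    n_out = int(round(len(wave) / factor))  # resampled length
    positions = np.arange(n_out) * factor   # read positions in the input
    return np.interp(positions, np.arange(len(wave)), wave)


def augment(wave, snr_db, semitones, rng):
    """Combined augmentation: pitch shift first, then add noise."""
    return add_noise(naive_pitch_shift(wave, semitones), snr_db, rng)


if __name__ == "__main__":
    sample_rate = 16000  # hypothetical sample rate
    t = np.arange(sample_rate) / sample_rate
    clean = np.sin(2 * np.pi * 220.0 * t)   # 1 s test tone standing in for speech
    rng = np.random.default_rng(0)
    augmented = augment(clean, snr_db=10.0, semitones=2.0, rng=rng)
    print(len(clean), len(augmented))
```

In a pre-training setup, a function like `augment` would be applied to each utterance of the small target-language set to produce additional perturbed copies; how many copies are needed is the scaling question the abstract raises.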
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Music and Audio Processing
