Methods to Increase the Amount of Data for Speech Recognition for Low Resource Languages

Alexan Ayrapetyan; Sofia Kostandian; Ara Yeroyan; Mher Yerznkanyan; Nikolay Karpov; Nune Tadevosyan; Vitaly Lavrukhin; Boris Ginsburg

arXiv:2501.14788·cs.SD·February 10, 2025

TL;DR

This paper investigates cost-effective methods for expanding training data for low-resource speech recognition, finds that paid crowdsourcing offers the best balance between cost and quality, and provides improved models for Armenian and Georgian.

Contribution

It provides practical strategies for dataset expansion in low-resource languages, highlights the effectiveness of paid crowdsourcing, and open-sources models for Armenian and Georgian.

Findings

01. Paid crowdsourcing outperforms other data collection methods.

02. Expanded datasets improve ASR performance significantly.

03. Open-sourced models facilitate further research.

Abstract

This study explores methods to increase data volume for low-resource languages using techniques such as crowdsourcing, pseudo-labeling, and advanced data preprocessing, together with various permissive data sources such as audiobooks, Common Voice, and YouTube. While these methods are well explored for high-resource languages, their application to low-resource languages remains underexplored. Using Armenian and Georgian as case studies, we demonstrate how linguistic and resource-specific characteristics influence the success of these methods. This work provides practical guidance for researchers to choose cost-effective and quality-driven dataset extension strategies for low-resource languages. The key takeaway from the various data extension approaches is that paid crowdsourcing offers the best balance between cost and quality, outperforming volunteer crowdsourcing, open-source audiobooks, and unlabeled…

Taxonomy

Topics: Speech Recognition and Synthesis · Advanced Data Compression Techniques · Speech and Audio Processing