Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach
Ara Yeroyan (Data Science Department, American University of Armenia), Nikolay Karpov (Nvidia, NeMo Conversational AI team)

TL;DR
This paper presents a new pipeline for creating ASR training datasets from audiobooks, enabling better speech recognition in low-resource languages by effectively segmenting long audio into suitable training units.
Contribution
It introduces a novel, portable method for aligning and segmenting audiobook audio for low-resource language ASR dataset creation, improving data availability and model performance.
Findings
Effective audio-text alignment and segmentation method
Application demonstrated on Armenian language
Enhanced ASR performance for low-resource languages
Abstract
In recent years, automatic speech recognition (ASR) systems have improved significantly, especially for languages with large amounts of transcribed speech data. However, ASR systems tend to perform poorly for low-resource languages, such as minority and regional languages, where little transcribed speech is available. This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks, which typically pair a single transcript with hours-long audio recordings. This structure poses a challenge because the recordings are far too long to train on directly, whereas optimal ASR training requires segments ranging from 4 to 15 seconds. To address this, we propose a method for effectively aligning audio with its corresponding text and segmenting it into lengths suitable for ASR training. Our approach simplifies data preparation for ASR systems in low-resource languages…
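To make the 4 to 15 second constraint concrete, here is a minimal sketch of how word-level alignments could be grouped into training segments. It is not the authors' pipeline: the function name `segment_words`, the `(token, start_sec, end_sec)` input format, and the rule of cutting at sentence-final punctuation are all assumptions made for illustration, and the alignment itself is assumed to come from some external forced aligner.

```python
from typing import List, Tuple

# Hypothetical input format: one (token, start_sec, end_sec) tuple per aligned word.
Word = Tuple[str, float, float]
# Output: (text, start_sec, end_sec) per segment.
Segment = Tuple[str, float, float]

# Assumed sentence-final marks; "։" is the Armenian full stop.
SENTENCE_END = (".", "?", "!", "։")


def segment_words(
    words: List[Word], min_len: float = 4.0, max_len: float = 15.0
) -> List[Segment]:
    """Greedily group aligned words into segments of about min_len..max_len seconds."""
    groups: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        # Close the open segment if adding this word would exceed max_len.
        if current and word[2] - current[0][1] > max_len:
            groups.append(current)
            current = []
        current.append(word)
        # Once min_len is reached, prefer to cut at a sentence boundary.
        long_enough = word[2] - current[0][1] >= min_len
        if long_enough and word[0].endswith(SENTENCE_END):
            groups.append(current)
            current = []
    if current:
        # The trailing segment may be shorter than min_len; callers may drop it.
        groups.append(current)
    return [(" ".join(w[0] for w in g), g[0][1], g[-1][2]) for g in groups]


if __name__ == "__main__":
    # Synthetic alignment: 20 one-second word slots, sentence end after the 6th word.
    demo = [(f"w{i}", float(i), i + 0.8) for i in range(20)]
    demo[5] = ("w5.", 5.0, 5.8)
    for text, start, end in segment_words(demo):
        print(f"{start:6.2f} {end:6.2f}  {text}")
```

Each returned tuple gives the segment text and its start and end times, which could then be used to cut the audio file and write a training manifest. A greedy pass is the simplest choice here; a real pipeline might instead cut at the longest silences or filter segments by alignment confidence.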
Taxonomy
Topics: Network Packet Processing and Optimization · Service-Oriented Architecture and Web Services · Fault Detection and Control Systems
