Pashto Common Voice: Building the First Open Speech Corpus for a 60-Million-Speaker Low-Resource Language
Hanif Rahman, Shafeeq ur Rehman

TL;DR
This paper introduces the Pashto Common Voice corpus, the first large-scale, openly licensed speech dataset for Pashto, a low-resource language with over 60 million native speakers.
Contribution
It documents how a community effort built the Pashto speech corpus, covering the methodology, dataset statistics, and initial speech recognition results.
Findings
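The corpus grew from 1.5 hours and 5 contributors to 147 hours and 1,483 unique speakers across ten Mozilla Common Voice releases (CV14-CV23).
Speaker participation increased approximately 108-fold between CV17 and CV18, coinciding with a VOA Pashto broadcast campaign.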
Fine-tuning Whisper on the corpus achieved 13.4% WER, a substantial improvement over zero-shot performance.
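For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. The sketch below is a minimal illustration only: the function name `word_error_rate` is hypothetical, and whitespace tokenisation with no text normalisation is an assumption, since this summary does not say how the authors normalised Pashto text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length.

    Assumption: words are separated by whitespace and no text
    normalisation is applied; the paper's own scoring may differ.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 13.4% corresponds to roughly 13 word-level errors (substitutions, deletions, or insertions) per 100 reference words.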
Abstract
We present the Pashto Common Voice corpus -- the first large-scale, openly licensed speech resource for Pashto, a language with over 60 million native speakers largely absent from open speech technology. Through a community effort spanning 2022-2025, the corpus grew from 1.5 hours and 5 contributors to 147 total hours and 1,483 unique speakers across ten Mozilla Common Voice releases (CV14-CV23). Speaker participation increased approximately 108-fold between CV17 and CV18, coinciding with a VOA Pashto broadcast campaign. We describe the full methodology: interface localisation, Wikipedia-based sentence extraction with automated filtering, phonemically targeted contributions for the four most frequently dropped Pashto characters, and multi-channel community outreach. MCV23 contains 107,781 clips (60,337 validated; 82.33 validated hours) across 13 content domains. Fine-tuning Whisper Base…
