Pashto Common Voice: Building the First Open Speech Corpus for a 60-Million-Speaker Low-Resource Language

Hanif Rahman; Shafeeq ur Rehman

arXiv:2603.27021·cs.CL·March 31, 2026

TL;DR

This paper introduces the Pashto Common Voice corpus, the first large-scale, openly licensed speech dataset for Pashto, with 147 hours of speech from 1,483 speakers collected across Mozilla Common Voice releases CV14-CV23.

Contribution

The paper documents how the corpus was built through community effort, covering the collection methodology, dataset statistics, and initial speech recognition results.

Findings

01

The corpus grew from 1.5 hours and 5 contributors to 147 hours and 1,483 unique speakers between 2022 and 2025.

02

Speaker participation increased approximately 108-fold between CV17 and CV18, coinciding with a VOA Pashto broadcast campaign.

03

Fine-tuning Whisper on the corpus achieved 13.4% WER, a substantial improvement over zero-shot performance.
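
For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below shows that standard definition only. It is not the paper's evaluation code, the function name `word_error_rate` is hypothetical, and it assumes whitespace tokenisation with no text normalisation, which may differ from what the authors applied to Pashto text.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are split on whitespace and compared exactly;
    no punctuation or script normalisation is applied.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous_row[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(
                min(
                    previous_row[j] + 1,                      # deletion
                    current_row[j - 1] + 1,                   # insertion
                    previous_row[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous_row = current_row

    return previous_row[-1] / len(ref_words)
```

Under this definition, a WER of 13.4% corresponds to roughly 13 word-level errors per 100 reference words. Because insertions count as errors, WER can exceed 100% for a poorly performing model.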

Abstract

We present the Pashto Common Voice corpus -- the first large-scale, openly licensed speech resource for Pashto, a language with over 60 million native speakers largely absent from open speech technology. Through a community effort spanning 2022-2025, the corpus grew from 1.5 hours and 5 contributors to 147 total hours and 1,483 unique speakers across ten Mozilla Common Voice releases (CV14-CV23). Speaker participation increased approximately 108-fold between CV17 and CV18, coinciding with a VOA Pashto broadcast campaign. We describe the full methodology: interface localisation, Wikipedia-based sentence extraction with automated filtering, phonemically targeted contributions for the four most frequently dropped Pashto characters, and multi-channel community outreach. MCV23 contains 107,781 clips (60,337 validated; 82.33 validated hours) across 13 content domains. Fine-tuning Whisper Base…
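
A few ratios follow directly from the figures quoted in the abstract, and the short sketch below works them out. Only the constants come from the abstract; the names (`corpus_ratios` and the dictionary keys) are hypothetical, and the derived values are simple arithmetic rather than statistics reported by the authors. The 108-fold speaker increase between CV17 and CV18 is not recomputed here because the abstract does not give the per-release speaker counts.

```python
# Figures quoted in the abstract.
TOTAL_CLIPS = 107_781
VALIDATED_CLIPS = 60_337
VALIDATED_HOURS = 82.33
INITIAL_HOURS = 1.5
FINAL_HOURS = 147
INITIAL_SPEAKERS = 5
FINAL_SPEAKERS = 1_483


def corpus_ratios() -> dict[str, float]:
    """Derive simple ratios from the abstract's headline numbers."""
    return {
        # Share of recorded clips that passed community validation.
        "validated_fraction": VALIDATED_CLIPS / TOTAL_CLIPS,
        # Average length of a validated clip, in seconds.
        "mean_validated_clip_seconds": VALIDATED_HOURS * 3600 / VALIDATED_CLIPS,
        # Overall growth in total hours across the ten releases.
        "hours_growth_factor": FINAL_HOURS / INITIAL_HOURS,
        # Overall growth in unique speakers across the ten releases.
        "speaker_growth_factor": FINAL_SPEAKERS / INITIAL_SPEAKERS,
    }


if __name__ == "__main__":
    for name, value in corpus_ratios().items():
        print(f"{name}: {value:.3f}")
```

On these numbers, about 56% of clips are validated, the average validated clip runs a little under 5 seconds, total hours grew 98-fold, and the speaker count grew roughly 297-fold over the full period.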
