From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset

Jandad Jahani; Mursal Dawodi; Jawid Ahmad Baktash

arXiv:2602.14062 · cs.CL · February 17, 2026

TL;DR

This paper analyzes the growth, demographic distribution, and structural characteristics of the Pashto speech dataset in Mozilla Common Voice, highlighting challenges and priorities for dataset maturity in low-resource language ASR development.

Contribution

The paper provides a release-level analysis of the Pashto Common Voice data, covering scale, contributor participation, demographic bias, and data structure, to guide future improvements to the dataset.

Findings

01 Rapid growth from 1.49 recorded hours in mid-2023 to 2,768.7 total hours in 2025.

02 High contributor participation inequality (Gini = 0.941).

03 Skewed demographic representation with limited gender metadata.
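
For readers unfamiliar with the inequality measure cited in the findings, the following is a minimal sketch of how a Gini coefficient can be computed from per-contributor clip counts. It uses the standard sorted-rank formula and is not taken from the paper; the function name `gini` and the sample counts are hypothetical, chosen only for illustration.

```python
def gini(counts):
    """Gini coefficient of non-negative counts.

    0 means every contributor recorded the same amount; values close to 1
    mean a few contributors account for almost all recordings.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero count")
    # Sorted-rank formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    # where x_i are the counts in ascending order and i runs from 1 to n.
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical data: many occasional contributors and a few very active ones.
clips_per_contributor = [3] * 95 + [4000] * 5
print(f"Gini = {gini(clips_per_contributor):.3f}")
```

A value such as the reported 0.941 sits near the top of the scale, which is consistent with a small group of contributors supplying most of the recordings.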

Abstract

Large, openly licensed speech datasets are essential for building automatic speech recognition (ASR) systems, yet many widely spoken languages remain underrepresented in public resources. Pashto, spoken by more than 60 million people, has historically lacked large-scale openly licensed speech data suitable for modern ASR development. This paper presents a release-level analysis of the Pashto component of the Mozilla Common Voice corpus, focusing on version 24.0 (December 2025) and contextualizing trends across major releases. We document rapid growth from 1.49 recorded hours in mid-2023 to 2,768.7 total hours in 2025, including 975.89 validated hours available for supervised ASR training. Beyond scale, we analyze validation throughput, contributor participation inequality, demographic metadata completeness, and sentence-level concentration in the validated subset. We find that…
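
The headline figures quoted in the abstract imply a few simple ratios. The sketch below derives them from the three numbers stated there; the variable names are illustrative, and treating the mid-2023 and 2025 totals as directly comparable is an assumption.

```python
# Figures as stated in the abstract.
first_release_hours = 1.49     # recorded hours, mid-2023
total_hours_v24 = 2768.7       # total hours, version 24.0
validated_hours_v24 = 975.89   # validated hours, version 24.0

# Ratios derived from those figures (assumes the two totals are comparable).
growth_factor = total_hours_v24 / first_release_hours
validated_share = validated_hours_v24 / total_hours_v24
hours_outside_validated = total_hours_v24 - validated_hours_v24

print(f"growth factor: {growth_factor:,.0f}x")
print(f"validated share of total hours: {validated_share:.1%}")
print(f"hours outside the validated subset: {hours_outside_validated:,.2f}")
```

Under these assumptions, roughly a third of the recorded audio falls in the validated subset, which is why validation throughput matters alongside raw scale.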

Taxonomy

Topics: Speech Recognition and Synthesis · ICT in Developing Communities · Face recognition and analysis