Benchmarking Multilingual Speech Models on Pashto: Zero-Shot ASR, Script Failure, and Cross-Domain Evaluation
Hanif Rahman

TL;DR
This paper evaluates multilingual speech models on Pashto, highlighting zero-shot performance, script fidelity issues, and cross-domain robustness, providing the first public benchmarks for Pashto ASR.
Contribution
It offers the first reproducible multi-model evaluation on public Pashto data, covering zero-shot ASR, script-level failure, and cross-domain evaluation of fine-tuned models.
Findings
SeamlessM4T achieves 39.7% zero-shot WER on a filtered Common Voice 24 Pashto subset.
Zero-shot Whisper WER ranges from 90% to 297%; the medium model collapses to 461% on Common Voice 24, consistent with decoder looping.
Pashto script fidelity exceeds 93% in some models, but WER alone masks script failure.
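To illustrate why WER alone can mask script failure, here is a minimal sketch of a character-level script check. It is an assumption for illustration, not the paper's audit (the abstract describes a language-identification audit whose details are not shown here). The function name `arabic_script_ratio` and the chosen Unicode ranges are hypothetical choices. Note that this proxy cannot tell Pashto apart from Arabic, Persian, or Urdu, which share the same script.

```python
# Hypothetical proxy for "script fidelity": the share of letters in a
# transcript that fall inside Unicode's Arabic-script blocks.
# This is NOT the paper's metric; it is an illustrative assumption.

ARABIC_SCRIPT_RANGES = (
    (0x0600, 0x06FF),  # Arabic (includes Pashto-specific letters)
    (0x0750, 0x077F),  # Arabic Supplement
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)


def is_arabic_script(ch: str) -> bool:
    """Return True if the character's code point is in an Arabic-script block."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ARABIC_SCRIPT_RANGES)


def arabic_script_ratio(text: str) -> float:
    """Fraction of alphabetic characters written in Arabic script.

    Digits, punctuation, whitespace, and combining marks are ignored.
    Returns 0.0 when the text contains no letters at all.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_arabic_script(ch) for ch in letters) / len(letters)


if __name__ == "__main__":
    pashto_word = "\u067e\u069a\u062a\u0648"  # the word "Pashto" in Pashto script
    print(arabic_script_ratio(pashto_word))            # 1.0
    print(arabic_script_ratio("pashto"))               # 0.0
    print(arabic_script_ratio(pashto_word + " abcd"))  # 0.5
```

A romanized or wrong-script transcript scores the same on WER as any other fully incorrect transcript, so a check of this kind exposes a failure mode that WER by itself does not distinguish.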
Abstract
Pashto is spoken by approximately 60–80 million people but has no published benchmarks for multilingual automatic speech recognition (ASR) on any shared public test set. This paper reports the first reproducible multi-model evaluation on public Pashto data, covering zero-shot ASR, script-level failure, and cross-domain evaluation of fine-tuned models. For zero-shot ASR, ten models (all seven Whisper sizes, MMS-1B, SeamlessM4T-v2-large, and OmniASR-CTC-300M) are evaluated on the FLEURS Pashto test set and a filtered Common Voice 24 subset; zero-shot Whisper WER ranges from 90% to 297%, with the medium model collapsing to 461% on Common Voice 24, consistent with decoder looping. SeamlessM4T achieves 39.7% WER on Common Voice 24 (the best zero-shot result reported to date, as of submission); MMS-1B achieves 43.8% on FLEURS. For script failure, a language-identification audit shows that no…
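WER values above 100%, such as those reported for zero-shot Whisper, are possible because inserted words count as errors while the denominator is the reference length only. The sketch below uses the standard word-level edit-distance definition of WER; the function name `word_error_rate` and the toy strings are illustrative assumptions, and the paper's exact text normalization is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words.

    Computed with a word-level Levenshtein distance. Tokenization is plain
    whitespace splitting, which is an assumption of this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "salaam world"
    # A looping decoder repeats the last word: four extra insertions.
    looping = "salaam world world world world world"
    print(word_error_rate(reference, reference))  # 0.0
    print(word_error_rate(reference, looping))    # 2.0, i.e. 200% WER
```

In the toy case, a two-word reference with four spurious repetitions yields 200% WER, which is the kind of inflation a looping decoder produces.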
