TL;DR
This paper compares fine-tuning strategies for adapting Whisper speech recognition models to Pashto, a language absent from Whisper's pre-training corpus, and analyzes the performance and limitations of each.
Contribution
It systematically compares four fine-tuning approaches for Whisper on Pashto (vanilla full fine-tuning, LoRA, frozen-encoder, and Urdu-to-Pashto transfer), identifies vanilla full fine-tuning as the most effective, and examines where the alternatives fall short.
Findings
Vanilla full fine-tuning of whisper-base outperforms LoRA, frozen-encoder, and Urdu-to-Pashto transfer on CommonVoice Pashto v20.
Whisper-small achieves a WER of 24.89% on Pashto with 113 hours of data.
Online augmentation improves WER by 7.25 percentage points.
Abstract
Pashto is absent from Whisper's pre-training corpus despite being one of CommonVoice's largest language collections, leaving off-the-shelf models unusable: all Whisper sizes output Arabic, Dari, or Urdu script on Pashto audio, achieving word error rates above 100%. We compare four fine-tuning strategies for whisper-base on CommonVoice Pashto v20: vanilla full fine-tuning, LoRA (rank 64), frozen-encoder (2/6 layers), and multistage Urdu-to-Pashto transfer. We extend vanilla fine-tuning to whisper-small and whisper-large-v3-turbo on CommonVoice Pashto v24 (113 hours). Vanilla fine-tuning achieves WER 21.22% on CV20, outperforming LoRA by 33.36 pp, frozen-encoder by 14.76 pp, and Urdu transfer by 44.56 pp. Frozen-encoder fine-tuning degrades performance on whisper-base (6 encoder layers): layer-function separation does not hold at this depth, and freezing removes a third of trainable…
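A word error rate above 100% can look contradictory, so a minimal sketch of the standard WER definition may help: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. Because insertions count as errors, a hypothesis that is longer than the reference and shares no words with it, as happens when a model transcribes in the wrong script, pushes WER past 1.0. The function name `word_error_rate` and the placeholder tokens are illustrative assumptions, not the paper's evaluation code or data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not real Pashto data: a 2-word reference against a
# 4-word hypothesis with no words in common needs 2 substitutions and
# 2 insertions, giving 4 / 2 = 2.0 (a WER of 200%).
print(word_error_rate("a b", "x y z w"))
```

Under this definition, the only upper bound on WER is set by the length of the hypothesis, which is consistent with the above-100% figures reported for off-the-shelf models.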
