Zero-Shot Context-Aware ASR for Diverse Arabic Varieties
Bashar Talafha, Amin Abu Alhassan, Muhammad Abdul-Mageed

TL;DR
This paper introduces a context-aware decoding approach for zero-shot Arabic speech recognition, improving accuracy across diverse dialects and accents by leveraging external information without retraining models.
Contribution
The paper proposes novel prompt-based and proxy-guided methods for context-aware inference that apply to multiple ASR architectures and improve zero-shot recognition of dialectal Arabic.
Findings
Average WER reductions of over 20% on MSA and accented Arabic.
Proxy-guided selection improves WER by 15.6% on MSA.
Context-aware decoding generalizes beyond encoder-decoder models.
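For readers unfamiliar with the metric behind these findings, the following is a minimal sketch of how word error rate (WER) and a relative WER reduction are typically computed. The summary does not state whether the reported reductions are relative or absolute, so the `relative_reduction` helper is an assumption for illustration; the function names and the English sample strings are hypothetical and not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction (assumed reading of 'WER reduction' in the summary)."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Placeholder strings; a real evaluation would use normalized Arabic transcripts.
    print(wer("a b c d", "a x c"))
    print(relative_reduction(0.5, 0.4))
```

Under this reading, a system that moves from 50% to 40% WER has achieved a 20% relative reduction, although the absolute drop is 10 points.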
Abstract
Zero-shot ASR for Arabic remains challenging: while multilingual models perform well on Modern Standard Arabic (MSA), error rates rise sharply on dialectal and accented speech due to linguistic mismatch and scarce labeled data. We study context-aware decoding as a lightweight test-time adaptation paradigm that conditions inference on external side information without parameter updates. For promptable encoder-decoder ASR (e.g., Whisper), we incorporate context through (i) decoder prompting with first-pass hypotheses and (ii) encoder/decoder prefixing with retrieved speech-text exemplars, complemented by simple prompt reordering and optional speaker-matched synthetic exemplars to improve robustness in informal and multi-speaker settings. To extend contextual adaptation beyond promptable architectures, we introduce proxy-guided n-best selection for CTC ASR: given one or more external proxy…
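The abstract is cut off before it explains how the external proxies are used, so the selection rule below is an assumption, not the authors' method. As one plausible reading of proxy-guided n-best selection, this sketch picks the CTC candidate whose average normalized word-level edit distance to the proxy transcripts is lowest. The names `select_by_proxy` and `proxy_score` and the sample strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def proxy_score(candidate, proxies):
    """Mean length-normalized edit distance from a candidate to each proxy transcript.

    Assumption: agreement with the proxies is measured by word-level edit distance.
    """
    cand_tokens = candidate.split()
    total = 0.0
    for proxy in proxies:
        proxy_tokens = proxy.split()
        total += edit_distance(proxy_tokens, cand_tokens) / max(len(proxy_tokens), 1)
    return total / len(proxies)


def select_by_proxy(nbest, proxies):
    """Return the n-best hypothesis that agrees most with the proxies.

    Ties go to the earlier entry, i.e. the one the ASR model ranked higher.
    """
    if not nbest or not proxies:
        raise ValueError("need at least one candidate and one proxy")
    return min(nbest, key=lambda cand: proxy_score(cand, proxies))


if __name__ == "__main__":
    # Placeholder English strings stand in for Arabic hypotheses.
    nbest = ["the cat sad", "the cat sat", "a cat sat"]
    proxies = ["the cat sat down", "the cat sat"]
    print(select_by_proxy(nbest, proxies))
```

Because the rule only rescores existing hypotheses, it needs no access to model parameters, which is consistent with the abstract's framing of test-time adaptation without parameter updates.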
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing
