Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
Sajjad Abdoli, Ghassan Al-Sumaidaee, Clayton W. Taylor, Ahmad ElShiekh, Ahmed Rashad

TL;DR
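This paper introduces a benchmark that evaluates five commercial ASR providers on code-switching speech across four language pairs (Egyptian Arabic-English, Saudi Arabic-English, Persian-English, and German-English), highlighting performance differences between providers and identifying evaluation metrics that are more reliable than WER alone.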
Contribution
It provides a new benchmark dataset and evaluation pipeline for multilingual code-switching ASR, including a cost-effective scoring method and analysis of semantic similarity metrics.
Findings
ElevenLabs Scribe v2 achieved the lowest WER (13.2%) across all language pairs.
BERTScore proved more reliable than WER for Arabic and Persian code-switching evaluation.
Difficulty-stratified analysis revealed performance gaps hidden in aggregate metrics.
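To make the last finding concrete, here is a minimal sketch of how a pooled WER can hide a gap that a per-difficulty breakdown exposes. It is not the paper's scoring code: the `difficulty` labels, the field names, and the two toy samples are assumptions made up for illustration, and no text normalization is applied before scoring.

```python
from collections import defaultdict


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(pairs):
    """Corpus-level WER: total word edits divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


def wer_by_stratum(samples):
    """Group samples by a (hypothetical) difficulty label and score each group."""
    groups = defaultdict(list)
    for s in samples:
        groups[s["difficulty"]].append((s["ref"], s["hyp"]))
    return {label: wer(pairs) for label, pairs in groups.items()}


# Toy data, invented for illustration only.
samples = [
    {"difficulty": "easy", "ref": "please open the file",
     "hyp": "please open the file"},
    {"difficulty": "hard", "ref": "ich habe das meeting verschoben",
     "hyp": "ich habe das verschoben"},
]

overall = wer([(s["ref"], s["hyp"]) for s in samples])
per_stratum = wer_by_stratum(samples)
print(f"overall WER: {overall:.3f}")
print({k: round(v, 3) for k, v in per_stratum.items()})
```

On this toy input the pooled figure sits between the two strata, so a reader who sees only the aggregate would not know that all of the errors come from the hard group.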
Abstract
Code-switching -- the natural alternation between two languages within a single utterance -- represents one of the most challenging and under-studied conditions for automatic speech recognition (ASR). Existing commercial ASR benchmarks predominantly evaluate clean, monolingual audio and report a single Word Error Rate (WER) figure that tells practitioners little about real-world multilingual performance. We present a benchmark evaluating five commercial ASR providers across four language pairs: Egyptian Arabic--English, Saudi Arabic (Najdi/Hijazi)--English, Persian (Farsi)--English, and German--English. Each dataset comprises 300 samples selected by a two-stage pipeline: a heuristic filter scoring transcripts on five structural code-switching signals, followed by a GPT-4o and Gemini 1.5 Pro ensemble scoring candidates across six linguistic dimensions. This pipeline reduces LLM scoring…
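The two-stage selection described in the abstract can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' pipeline: the abstract does not list the five structural signals, so the two signals used here (script switch points and minority-script share) are hypothetical stand-ins, and the LLM ensemble is represented by a caller-supplied `llm_score` function rather than any real API call. The script-based signals also only apply to Arabic-script/Latin-script pairs; German-English would need a different cue.

```python
import re

LATIN = re.compile(r"[A-Za-z]")
ARABIC_SCRIPT = re.compile(r"[\u0600-\u06FF]")  # covers Arabic and Persian letters


def token_script(token):
    """Classify a token by script; returns None for digits, punctuation, etc."""
    if ARABIC_SCRIPT.search(token):
        return "arabic"
    if LATIN.search(token):
        return "latin"
    return None


def heuristic_score(transcript):
    """Cheap stage-one score built from two HYPOTHETICAL signals."""
    scripts = [s for s in map(token_script, transcript.split()) if s]
    if not scripts:
        return 0.0
    # Signal 1: number of adjacent token pairs where the script changes.
    switches = sum(a != b for a, b in zip(scripts, scripts[1:]))
    # Signal 2: share of tokens in the less frequent script.
    minority = min(scripts.count("arabic"), scripts.count("latin")) / len(scripts)
    return switches + minority


def two_stage_select(transcripts, shortlist_size, final_size, llm_score):
    """Stage 1 shortlists by heuristic; stage 2 re-ranks only the shortlist.

    `llm_score` is a placeholder for the expensive scorer and is called
    once per shortlisted transcript, never on the full pool.
    """
    shortlist = sorted(transcripts, key=heuristic_score, reverse=True)[:shortlist_size]
    return sorted(shortlist, key=llm_score, reverse=True)[:final_size]
```

The point of the design is that the expensive scorer runs on `shortlist_size` items instead of the whole candidate pool, so the heuristic stage controls how many LLM calls are made.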
