ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models
Anas Alhumud, Abdulaziz Alhammadi, Muhammad Badruddin Khan

TL;DR
ArabicNumBench is a comprehensive benchmark assessing large language models' ability to read Arabic numbers across various contexts, revealing significant performance variation and highlighting the importance of output structure and instruction-following capabilities.
Contribution
This work introduces ArabicNumBench, the first extensive benchmark for Arabic number reading in LLMs, evaluating 71 models with diverse prompting strategies across multiple contextual categories.
Findings
Few-shot Chain-of-Thought prompting significantly improves accuracy.
High-accuracy models often produce unstructured outputs without explicit markers.
Only a few models consistently generate structured, reliable outputs.
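The distinction between structured and unstructured outputs can be illustrated with a minimal sketch of extraction tracking. The `<answer>` tag, the `extract_answer` function, and the `"marker"`/`"fallback"` labels are all assumptions made for illustration; the benchmark's actual marker format and extraction methods are not described in this summary.

```python
import re

# Hypothetical marker format (an assumption, not the benchmark's own).
MARKER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(output: str) -> tuple[str, str]:
    """Return (answer, method), where method records how the answer was found."""
    match = MARKER.search(output)
    if match:
        # Structured output: the model wrapped its answer in an explicit marker.
        return match.group(1).strip(), "marker"
    # Unstructured output: fall back to the last non-empty line of the response.
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return (lines[-1] if lines else ""), "fallback"
```

Counting how often each method fires per model would give one rough measure of structured output generation, separate from whether the extracted answer is correct.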
Abstract
We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9). We evaluate 71 models from 10 providers using four prompting strategies (zero-shot, zero-shot CoT, few-shot, few-shot CoT) on 210 number reading tasks spanning six contextual categories: pure numerals, addresses, dates, quantities, and prices. Our evaluation comprises 59,010 individual test cases and tracks extraction methods to measure structured output generation. Evaluation reveals substantial performance variation, with accuracy ranging from 14.29% to 99.05% across models and strategies. Few-shot Chain-of-Thought prompting achieves 2.8x higher accuracy than zero-shot approaches (80.06% vs 28.76%). A striking finding emerges: models achieving elite accuracy (98-99%)…
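As a small illustration of the two numeral systems named in the abstract, the sketch below maps Eastern Arabic-Indic digits (Unicode U+0660 to U+0669) to their Western counterparts and checks the reported 2.8x ratio from the two accuracy figures. The names `EASTERN_TO_WESTERN` and `to_western_digits` are hypothetical helpers, not part of the benchmark.

```python
# Build the Eastern Arabic-Indic digits from their code points (U+0660..U+0669)
# so the mapping does not depend on how bidirectional text is rendered.
EASTERN_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
EASTERN_TO_WESTERN = str.maketrans(EASTERN_DIGITS, "0123456789")


def to_western_digits(text: str) -> str:
    """Replace Eastern Arabic-Indic digits with Western digits; leave other text as-is."""
    return text.translate(EASTERN_TO_WESTERN)


# Accuracy figures quoted in the abstract.
few_shot_cot_accuracy = 80.06
zero_shot_accuracy = 28.76
ratio = few_shot_cot_accuracy / zero_shot_accuracy  # about 2.78, reported as 2.8x
```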
Taxonomy
Topics: Cognitive and developmental aspects of mathematical skills · Handwritten Text Recognition Techniques · Topic Modeling
