The Qiyas Benchmark: Measuring ChatGPT Mathematical and Language Understanding in Arabic
Shahad Al-Khalifa, Hend Al-Khalifa

TL;DR
This paper introduces two new benchmarks based on the Qiyas exam to evaluate language models' mathematical reasoning and language understanding in Arabic, revealing the limitations of current models and guiding future improvements.
Contribution
It presents the first Arabic-specific benchmarks for mathematical and language understanding, derived from a standardized exam, and evaluates ChatGPT models' performance on them.
Findings
ChatGPT-4 achieved an overall average accuracy of 64% on the benchmarks.
ChatGPT-3.5-turbo achieved an overall accuracy of 49%.
The benchmarks are challenging for both models, leaving clear room for improvement.
Abstract
Despite the growing importance of Arabic as a global language, there is a notable lack of language models pre-trained exclusively on Arabic data. This shortage has left few benchmarks available for assessing language model performance in Arabic. To address this gap, we introduce two novel benchmarks designed to evaluate models' mathematical reasoning and language understanding abilities in Arabic. These benchmarks are derived from a General Aptitude Test (GAT) called the Qiyas exam, a standardized test widely used for university admissions in Saudi Arabia. For validation purposes, we assess the performance of ChatGPT-3.5-turbo and ChatGPT-4 on our benchmarks. Our findings reveal that these benchmarks pose a significant challenge, with ChatGPT-4 achieving an overall average accuracy of 64% and ChatGPT-3.5-turbo an overall accuracy of 49% across the various question types in…
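The abstract reports accuracy averaged across question types. As a rough illustration of how such a figure could be computed, here is a minimal sketch. It assumes each graded answer is stored as a (question type, predicted choice, correct choice) triple and that the overall figure is an unweighted mean of per-type accuracies. Neither assumption is confirmed by the paper, and the function names and the "quantitative"/"verbal" labels are hypothetical.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Return {question_type: accuracy} for (question_type, predicted, gold) triples.

    Hypothetical helper: the record layout is an assumption, not the paper's format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for question_type, predicted, gold in records:
        total[question_type] += 1
        if predicted == gold:
            correct[question_type] += 1
    return {q: correct[q] / total[q] for q in total}


def overall_average(per_type):
    """Unweighted mean of per-type accuracies (assumed aggregation)."""
    if not per_type:
        raise ValueError("no question types to average")
    return sum(per_type.values()) / len(per_type)


if __name__ == "__main__":
    # Toy records for illustration only; these are not results from the paper.
    toy_records = [
        ("quantitative", "A", "A"),
        ("quantitative", "B", "C"),
        ("verbal", "D", "D"),
        ("verbal", "A", "A"),
    ]
    per_type = accuracy_by_type(toy_records)
    print(per_type, overall_average(per_type))
```

An unweighted mean treats every question type equally regardless of how many questions it contains. Weighting by question count would give a different figure whenever the types differ in size.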
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Text Readability and Simplification · Topic Modeling
