The Qiyas Benchmark: Measuring ChatGPT Mathematical and Language Understanding in Arabic

Shahad Al-Khalifa; Hend Al-Khalifa

arXiv:2407.00146·cs.CL·July 2, 2024

Open Access

TL;DR

This paper introduces two new benchmarks, derived from the Qiyas exam, that evaluate language models' mathematical reasoning and language understanding in Arabic, revealing current models' limitations and guiding future improvements.

Contribution

It presents the first Arabic-specific benchmarks for mathematical and language understanding, derived from a standardized exam, and evaluates ChatGPT models' performance on them.

Findings

01

ChatGPT-4 achieved an overall average accuracy of 64% on the benchmarks.

02

ChatGPT-3.5-turbo achieved an overall accuracy of 49%.

03

Both benchmarks are challenging for the tested models, leaving clear room for improvement.

Abstract

Despite the growing importance of Arabic as a global language, there is a notable lack of language models pre-trained exclusively on Arabic data. This shortage has led to limited benchmarks available for assessing language model performance in Arabic. To address this gap, we introduce two novel benchmarks designed to evaluate models' mathematical reasoning and language understanding abilities in Arabic. These benchmarks are derived from a General Aptitude Test (GAT) called the Qiyas exam, a standardized test widely used for university admissions in Saudi Arabia. For validation purposes, we assess the performance of ChatGPT-3.5-turbo and ChatGPT-4 on our benchmarks. Our findings reveal that these benchmarks pose a significant challenge, with ChatGPT-4 achieving an overall average accuracy of 64%, while ChatGPT-3.5-turbo achieved an overall accuracy of 49% across the various question types in…
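The "overall average accuracy across question types" reported in the abstract can be illustrated with a minimal sketch. It assumes each benchmark item is a multiple-choice question tagged with a question type, and that the overall figure is the unweighted mean of per-type accuracies; the abstract does not state the exact averaging scheme. The function names, the type labels `math` and `language`, and the toy records are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Return accuracy per question type.

    records: iterable of (question_type, predicted_choice, gold_choice).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for question_type, predicted, gold in records:
        total[question_type] += 1
        if predicted == gold:
            correct[question_type] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


def overall_average(per_type):
    """Unweighted mean of per-type accuracies (an assumed averaging scheme)."""
    return sum(per_type.values()) / len(per_type)


# Made-up records for illustration only; these are not results from the paper.
toy_records = [
    ("math", "A", "A"),
    ("math", "B", "C"),
    ("language", "D", "D"),
    ("language", "A", "A"),
]

per_type = accuracy_by_type(toy_records)
overall = overall_average(per_type)
```

Averaging per type rather than per question keeps a question type with many items from dominating the overall score; if the types differ in size, the per-question accuracy would generally give a different number.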

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Text Readability and Simplification · Topic Modeling