How well can LLMs Grade Essays in Arabic?
Rayed Ghazawi, Edwin Simpson

TL;DR
This study evaluates the performance of various large language models in grading Arabic essays, highlighting the impact of prompt strategies, model differences, and language-specific challenges, with ACEGPT and a BERT-based model showing notable results.
Contribution
First empirical assessment of multiple LLMs on Arabic essay grading using authentic student data, exploring prompt engineering and language-specific challenges.
Findings
ACEGPT achieved a QWK of 0.67
A smaller BERT-based model outperformed ACEGPT with a QWK of 0.88
Prompt engineering improved model performance
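For readers unfamiliar with the metric reported above, the following is a minimal sketch of how Quadratic Weighted Kappa (QWK) is commonly computed between human and model scores. It is not taken from the paper. The function name, its arguments, and the assumption that scores are integers in a known range are illustrative choices.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_score, max_score):
    """Agreement between two integer score lists, penalising large
    disagreements quadratically. 1.0 means perfect agreement, 0.0 means
    chance-level agreement. (Illustrative helper, not the paper's code.)"""
    n = max_score - min_score + 1
    if n < 2 or len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need matching, non-empty lists and >= 2 score levels")
    total = len(y_true)

    # Observed confusion matrix: rows = human score, columns = model score.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_score][p - min_score] += 1

    # Marginal histograms, used to build the chance-expected matrix.
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            weight = ((i - j) ** 2) / ((n - 1) ** 2)
            expected = hist_true[i] * hist_pred[j] / total
            num += weight * observed[i][j]
            den += weight * expected

    if den == 0.0:
        # Both raters gave every essay the same single score.
        return 1.0
    return 1.0 - num / den
```

Because the weights grow with the squared distance between scores, a model that is off by one mark is penalised far less than one that is off by three, which is why QWK is a common choice for essay scoring.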
Abstract
This research assesses the effectiveness of state-of-the-art large language models (LLMs), including ChatGPT, Llama, Aya, Jais, and ACEGPT, in the task of Arabic automated essay scoring (AES) using the AR-AES dataset. It explores various evaluation methodologies, including zero-shot, few-shot in-context learning, and fine-tuning, and examines the influence of instruction-following capabilities through the inclusion of marking guidelines within the prompts. A mixed-language prompting strategy, integrating English prompts with Arabic content, was implemented to improve model comprehension and performance. Among the models tested, ACEGPT demonstrated the strongest performance across the dataset, achieving a Quadratic Weighted Kappa (QWK) of 0.67, but was outperformed by a smaller BERT-based model with a QWK of 0.88. The study identifies challenges faced by LLMs in processing Arabic,…
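The mixed-language prompting idea described in the abstract (English instructions wrapped around Arabic essay text, optionally with marking guidelines and few-shot examples) can be sketched as below. This is a hypothetical illustration only: the prompt wording, the function name `build_grading_prompt`, and the 0-to-`max_score` scale are assumptions, not the prompts or scoring scheme used in the study.

```python
def build_grading_prompt(essay_ar, guidelines_en, max_score, examples=None):
    """Assemble an English-instruction prompt around an Arabic essay.

    examples: optional list of (arabic_essay, score) pairs for few-shot
    in-context learning; leave as None for a zero-shot prompt.
    All wording here is illustrative, not the paper's actual prompt.
    """
    parts = [
        "You are grading a student essay written in Arabic.",
        f"Assign an integer score from 0 to {max_score}.",
        "Marking guidelines:",
        guidelines_en.strip(),
    ]
    # Few-shot demonstrations keep the essay text in Arabic, unchanged.
    for example_essay, example_score in (examples or []):
        parts += ["Example essay:", example_essay.strip(), f"Score: {example_score}"]
    # The essay to grade goes last so the model completes the final score.
    parts += ["Essay to grade:", essay_ar.strip(), "Score:"]
    return "\n".join(parts)


zero_shot = build_grading_prompt(
    "هذا مقال قصير.",
    "Award marks for relevance, organisation, and grammar.",
    max_score=5,
)
few_shot = build_grading_prompt(
    "هذا مقال قصير.",
    "Award marks for relevance, organisation, and grammar.",
    max_score=5,
    examples=[("مثال على مقال.", 3)],
)
```

The zero-shot and few-shot variants differ only in whether scored examples precede the target essay, which makes it easy to compare the two settings with the same guidelines.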
Taxonomy
Topics: Natural Language Processing Techniques · Emergency Medicine Education and Research · Language, Linguistics, Cultural Analysis
