How well can LLMs Grade Essays in Arabic?

Rayed Ghazawi; Edwin Simpson

arXiv:2501.16516 · cs.CL · January 29, 2025

Open Access

TL;DR

This study evaluates how well several large language models grade Arabic essays, examining the effect of prompting strategies, differences between models, and language-specific challenges. ACEGPT was the strongest LLM tested (QWK 0.67), but a smaller BERT-based model scored higher (QWK 0.88).

Contribution

First empirical assessment of multiple LLMs on Arabic essay grading using authentic student data, exploring prompt engineering and language-specific challenges.

Findings

01

ACEGPT was the strongest LLM tested, with a Quadratic Weighted Kappa (QWK) of 0.67

02

A smaller BERT-based model outperformed ACEGPT with a QWK of 0.88

03

Prompt engineering improved model performance
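
For readers unfamiliar with the metric behind the 0.67 and 0.88 figures, the following is a minimal sketch of Quadratic Weighted Kappa (QWK), the standard agreement measure between two sets of integer scores. It is not the paper's evaluation code; the function name, argument names, and the handling of the degenerate case are assumptions made for illustration.

```python
def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Agreement between two integer score lists, penalising large gaps quadratically.

    Illustrative sketch only; names and conventions are assumptions.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    k = max_score - min_score + 1  # number of possible score values
    if k < 2:
        raise ValueError("need at least two possible score values")
    n = len(human)

    # Observed confusion matrix: rows = human score, columns = model score.
    observed = [[0] * k for _ in range(k)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    # Marginal histograms give the agreement expected by chance.
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted chance disagreement
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            num += weight * observed[i][j]
            den += weight * hist_h[i] * hist_m[j] / n

    if den == 0:
        # Both raters gave one identical constant score; treating this
        # as perfect agreement is a convention chosen for this sketch.
        return 1.0
    return 1.0 - num / den
```

A value of 1 means the two raters always agree, 0 means agreement no better than chance, and negative values mean systematic disagreement.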

Abstract

This research assesses the effectiveness of state-of-the-art large language models (LLMs), including ChatGPT, Llama, Aya, Jais, and ACEGPT, in the task of Arabic automated essay scoring (AES) using the AR-AES dataset. It explores various evaluation methodologies, including zero-shot, few-shot in-context learning, and fine-tuning, and examines the influence of instruction-following capabilities through the inclusion of marking guidelines within the prompts. A mixed-language prompting strategy, integrating English prompts with Arabic content, was implemented to improve model comprehension and performance. Among the models tested, ACEGPT demonstrated the strongest performance across the dataset, achieving a Quadratic Weighted Kappa (QWK) of 0.67, but was outperformed by a smaller BERT-based model with a QWK of 0.88. The study identifies challenges faced by LLMs in processing Arabic,…
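
The mixed-language prompting described in the abstract (English instructions and marking guidelines wrapped around Arabic content) can be sketched roughly as below. This is not the authors' prompt: the function `build_grading_prompt`, its parameters, the prompt wording, and the 0-to-`max_score` scale are all hypothetical and serve only to show the structure. Passing no examples gives a zero-shot prompt; passing scored examples gives a few-shot one.

```python
def build_grading_prompt(question_ar, essay_ar, guidelines_en, max_score, examples=()):
    """Assemble an English-instruction prompt around Arabic essay text.

    Hypothetical sketch; the wording and score scale are assumptions,
    not the prompt used in the paper.
    """
    parts = [
        "You are grading a student essay written in Arabic.",
        f"Assign an integer score from 0 to {max_score}.",
        "Marking guidelines:",
        guidelines_en.strip(),
    ]
    # Optional few-shot examples: (Arabic essay, score) pairs.
    for example_essay, example_score in examples:
        parts += ["Example essay:", example_essay.strip(), f"Score: {example_score}"]
    # The Arabic question and essay are passed through untranslated.
    parts += [
        "Essay question:",
        question_ar.strip(),
        "Essay to grade:",
        essay_ar.strip(),
        "Score:",
    ]
    return "\n".join(parts)
```

Ending the prompt with `Score:` is one simple way to nudge a model toward answering with a number that can then be compared against human marks.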

Taxonomy

Topics: Natural Language Processing Techniques · Emergency Medicine Education and Research · Language, Linguistics, Cultural Analysis