AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP

Ahmed Hasanaath; Aisha Alansari; Ahmed Ashraf; Chafik Salmane; Hamzah Luqman; Saad Ezzini

arXiv:2506.08768 · cs.CL · December 16, 2025


TL;DR

This paper evaluates reasoning-based large language models on Arabic NLP tasks, highlighting the effectiveness of few-shot prompting, DeepSeek architectures, and fine-tuning strategies in improving performance on complex linguistic tasks.

Contribution

It systematically benchmarks reasoning-focused LLMs, with particular emphasis on the DeepSeek models, across multiple Arabic NLP tasks using zero-shot prompting, few-shot prompting, and fine-tuning.

Findings

01

Few-shot prompting significantly boosts classification accuracy.

02

DeepSeek models outperform baseline GPT models on inference tasks.

03

LoRA fine-tuning further enhances model performance.

Abstract

Large language models (LLMs) have shown remarkable progress in reasoning abilities and general natural language processing (NLP) tasks, yet their performance on Arabic data, characterized by rich morphology, diverse dialects, and complex script, remains underexplored. This paper presents a comprehensive benchmarking study of multiple reasoning-focused LLMs, with a special emphasis on the newly introduced DeepSeek models, across a suite of fifteen Arabic NLP tasks. We experiment with various strategies, including zero-shot, few-shot, and fine-tuning. This allows us to systematically evaluate performance on datasets covering a range of applications to examine their capacity for linguistic reasoning under different levels of complexity. Our experiments reveal several key findings. First, carefully selecting just three in-context examples delivers an average uplift of over 13 F1 points on…
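The few-shot setup described in the abstract (three in-context examples, scored in F1) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the visible part of the abstract does not say how examples are selected or which F1 variant is reported, so the token-overlap heuristic in `select_examples`, the prompt layout in `build_prompt`, and the macro-averaged `macro_f1` are hypothetical stand-ins, not the authors' method.

```python
def select_examples(query, pool, k=3):
    """Pick the k labeled examples sharing the most tokens with the query.

    Assumed heuristic: the paper's actual selection strategy is not given here.
    """
    query_tokens = set(query.split())
    return sorted(
        pool,
        key=lambda ex: len(query_tokens & set(ex["text"].split())),
        reverse=True,
    )[:k]


def build_prompt(query, examples, instruction):
    """Assemble instruction, labeled demonstrations, then the unlabeled query."""
    parts = [instruction]
    for ex in examples:
        parts.append(f"Text: {ex['text']}\nLabel: {ex['label']}")
    parts.append(f"Text: {query}\nLabel:")
    return "\n\n".join(parts)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Under these assumptions, an uplift "in F1 points" would be the difference between `macro_f1` computed on few-shot predictions and on zero-shot predictions for the same test set, multiplied by 100.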

Taxonomy

Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Artificial Intelligence in Healthcare and Education

Methods: Cosine Annealing · Layer Normalization · Linear Warmup With Cosine Annealing · Attention Dropout · Discriminative Fine-Tuning · Byte Pair Encoding · Softmax · Linear Layer · Dropout