State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation

Navan Preet Singh; Anurag Garikipati; Ahmed Abulkhair; Jyani Akshay Jagdishbhai; Atul Yaduvanshi; Amarendra Chaudhary; Madalina Ciobanu; Qingqing Mao; Ritankar Das

arXiv:2604.06421 · cs.CL · April 9, 2026

TL;DR

This paper presents Arabic-DeepSeek-R1, an open-source Arabic LLM built on a sparse mixture-of-experts (MoE) backbone and trained with chain-of-thought (CoT) distillation. It achieves state-of-the-art (SOTA) results across multiple benchmarks, demonstrating the effectiveness of culturally-informed, parameter-efficient adaptation.

Contribution

It introduces a novel Arabic-specific training scheme with linguistic and ethical checks, and the resulting open-source model sets new SOTA results for Arabic language modeling.

Findings

01. Arabic-DeepSeek-R1 surpasses GPT-5.1 on several benchmarks.

02. The model achieves SOTA or near-SOTA results across seven benchmarks.

03. Culturally-informed CoT distillation improves Arabic LLM performance.

Abstract

This paper introduces Arabic-DeepSeek-R1, an application-driven open-source Arabic LLM that leverages a sparse MoE backbone to address the digital equity gap for under-represented languages, and establishes a new SOTA across the entire Open Arabic LLM Leaderboard (OALL). Our four-phase CoT distillation scheme integrates Arabic-specific linguistic verification and regional ethical norms into a 372M-token, contamination-controlled 80/20 Arabic-English training mixture. Arabic-DeepSeek-R1 achieves the highest average score across the seven-benchmark OALL suite, with SOTA or near-SOTA results on individual benchmarks, including dominant results on grammar-focused MadinahQA (surpassing both GPT-5.1 and the OALL leader by substantial margins), safety-oriented AraTrust, multi-ability AlGhafa, and retrieval-augmented ALRAGE. Our results indicate that the combination of sparse MoE architecture, culturally-informed…
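
As a rough illustration of the 80/20 Arabic-English mixture mentioned in the abstract, the sketch below splits the stated 372M-token budget by language. It rests on two assumptions: that the 80/20 ratio is measured in tokens (the abstract does not say whether it refers to tokens, documents, or examples), and that 372M is a rounded figure. The function name `split_token_budget` is hypothetical and does not come from the paper.

```python
def split_token_budget(total_tokens: int, arabic_percent: int) -> dict[str, int]:
    """Split a token budget between Arabic and English by an integer percentage.

    Assumption: the mixture ratio is defined over tokens, not documents.
    """
    if not 0 <= arabic_percent <= 100:
        raise ValueError("arabic_percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding on large counts.
    arabic = total_tokens * arabic_percent // 100
    return {"arabic": arabic, "english": total_tokens - arabic}


# 372M tokens and the 80/20 split are taken from the abstract; both are approximate.
budget = split_token_budget(372_000_000, 80)
print(budget)
```

Under these assumptions the mixture would contain roughly 297.6M Arabic tokens and 74.4M English tokens.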
