GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data
Abderrahman Skiredj, Ferdaous Azhari, Houdaifa Atou, Nouamane Tazi, Ismail Berrada

TL;DR
This paper introduces GemMaroc, a minimal-data approach for adapting large language models to Moroccan Arabic (Darija). It reaches competitive Darija benchmark scores with about 48 GPU hours of training and no English regression.
Contribution
It presents a quality-over-quantity alignment strategy and a low-resource fine-tuning method to enhance Darija proficiency in LLMs while preserving reasoning skills.
Findings
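GemMaroc-4B (a LoRA-tuned Gemma 3-4B) improves DarijaMMLU from 32.8% to 42.7%, rising to 47.5% once the TULU portion is added
Scaling the same recipe to Gemma 3-27B yields GemMaroc-27B, which matches or exceeds existing models on Darija benchmarks, tying Atlas-Chat on DarijaMMLU at 61.6%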
Model training requires only 48 GPU hours, demonstrating efficiency and sustainability
Abstract
Open-source large language models (LLMs) still marginalise Moroccan Arabic (Darija), forcing practitioners either to bolt on heavyweight Arabic adapters or to sacrifice the very reasoning skills that make LLMs useful. We show that a rigorously quality-over-quantity alignment strategy can surface fluent Darija while safeguarding the backbone's cross-lingual reasoning at a sliver of the usual compute. We translate three compact instruction suites (LIMA 1K, DEITA 6K and TULU 50K) into Darija, preserve 20% of the English originals, and add mathematics, coding and scientific prompts. A LoRA-tuned Gemma 3-4B trained on 5K mixed instructions lifts DarijaMMLU from 32.8% to 42.7%; adding the reasoning-dense TULU portion pushes it to 47.5% with no English regression. Scaling the identical recipe to Gemma 3-27B produces GemMaroc-27B, which matches Atlas-Chat on DarijaMMLU (61.6%) and leaps ahead on…
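The data mixture described in the abstract (translate instruction pairs into Darija while keeping a share of the English originals) can be illustrated with a minimal sketch. It is not the authors' pipeline. The function name `build_mixture`, the `instruction`/`response` field names, the `lang` tag, and the placeholder `translate` callable are all hypothetical. It also assumes that a random 20% of examples stay in English and the rest are replaced by their translations, which the abstract does not spell out.

```python
import random
from typing import Callable, Dict, List

Example = Dict[str, str]


def build_mixture(
    examples: List[Example],
    translate: Callable[[str], str],
    english_fraction: float = 0.2,
    seed: int = 0,
) -> List[Example]:
    """Keep a random share of examples in English and translate the rest.

    `translate` is a stand-in for whatever English-to-Darija translation
    step is used; this sketch does not implement one.
    """
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)

    # Assumption: the preserved share is drawn uniformly at random.
    n_english = round(english_fraction * len(examples))
    keep_english = set(indices[:n_english])

    mixed: List[Example] = []
    for i, ex in enumerate(examples):
        if i in keep_english:
            # Preserved English original, tagged for later inspection.
            mixed.append({**ex, "lang": "en"})
        else:
            # "ary" is the ISO 639-3 code for Moroccan Arabic.
            mixed.append(
                {
                    "instruction": translate(ex["instruction"]),
                    "response": translate(ex["response"]),
                    "lang": "ary",
                }
            )
    return mixed
```

Calling `build_mixture(pairs, my_translator)` on a list of instruction pairs returns a list of the same length and order, with each record tagged `en` or `ary`. The fixed seed keeps the English subset reproducible across runs.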
Code & Models
- 🤗 GemMaroc/GemMaroc-27b-it (model): 3 downloads, 1 like
- 🤗 AbderrahmanSkiredj1/GemMaroc-27b-it (model): 37 downloads, 1 like
- 🤗 GemMaroc/GemMaroc-4b-tulu (model): 4 downloads, 1 like
- 🤗 AbderrahmanSkiredj1/GemMaroc-4b-tulu-Q4_K_M-GGUF (model): 19 downloads
- 🤗 AbderrahmanSkiredj1/GemMaroc-27b-it-GGUF (model): 24 downloads
- 🤗 GemMaroc/Qwen2.5-7B-Instruct-darija (model): 24 downloads
- 🤗 GemMaroc/Qwen2.5-14B-Instruct-darija (model): 4 downloads, 2 likes
- 🤗 GemMaroc/Qwen2.5-32B-Instruct-darija (model): 6 downloads
- 🤗 MathematicianNLPer/GemMaroc-4b-tulu-Q4_K_M-GGUF (model): 3 downloads
- 🤗 MathematicianNLPer/GemMaroc-27b-it-GGUF (model): 2 downloads
Taxonomy
Topics: Natural Language Processing Techniques
