Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation
Hassan Barmandah

TL;DR
This paper LoRA-fine-tunes ALLaM-7B-Instruct-preview, a Saudi-developed language model, to improve generation in the Hijazi and Najdi dialects of Arabic. It reports high dialect control and fidelity. The dataset and model weights are not released.
Contribution
It presents the first Saudi dialect-specific fine-tuning of a foundation model, built on a new instruction dataset, and compares two dialect-control methods (Dialect-Token and No-Token training), reporting better performance than generic instruction models.
Findings
Dialect-Token training significantly improves dialect control.
LoRA-tuned models outperform baseline instruction models in dialect accuracy.
Fidelity metrics show notable improvements with dialect-specific fine-tuning.
Abstract
Large language models (LLMs) for Arabic are still dominated by Modern Standard Arabic (MSA), with limited support for Saudi dialects such as Najdi and Hijazi. This underrepresentation hinders their ability to capture authentic dialectal variation. Using a privately curated Saudi Dialect Instruction dataset (Hijazi and Najdi; 5,466 synthetic instruction-response pairs; 50/50 split), we LoRA-tune ALLaM-7B-Instruct-preview, the first foundation model developed in Saudi Arabia, for Saudi dialect generation. We investigate two variants: (i) Dialect-Token training, which prepends an explicit dialect tag to the instruction, and (ii) No-Token training, which omits the tag at formatting time. Evaluation on a held-out test set combines an external dialect classifier with text fidelity metrics (chrF++ and BERTScore) and diversity measures. The Dialect-Token model achieves the best control, raising…
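The difference between the two training variants comes down to how each instruction is formatted before fine-tuning. The following is a minimal sketch of that step, not the paper's actual preprocessing: the tag strings (`<HIJAZI>`, `<NAJDI>`), the `DIALECT_TAGS` mapping, and the `format_instruction` helper are all hypothetical names chosen for illustration, since the abstract does not state the tag format.

```python
from typing import Optional

# Hypothetical tag strings: the abstract does not specify the real tag format.
DIALECT_TAGS = {"hijazi": "<HIJAZI>", "najdi": "<NAJDI>"}


def format_instruction(instruction: str, dialect: Optional[str] = None) -> str:
    """Format one instruction for fine-tuning.

    With a dialect, the tag is prepended (Dialect-Token variant).
    With dialect=None, the instruction is returned unchanged (No-Token variant).
    """
    if dialect is None:
        return instruction
    tag = DIALECT_TAGS[dialect.lower()]  # raises KeyError for unknown dialects
    return f"{tag} {instruction}"


# Dialect-Token formatting versus No-Token formatting of the same instruction.
with_token = format_instruction("Describe your favorite meal.", "Najdi")
without_token = format_instruction("Describe your favorite meal.")
```

Under these assumptions, the same instruction-response pairs can feed both variants, and only the formatting call differs.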
