# Fine-Tuning Arabic Large Language Models for improved multi-turn dialogue: A blueprint for synthetic data generation and benchmarking

**Authors:** Ahmed Mahmoud Misbah, Mohamed Farouk, Mustafa AbdulAzim, Helen Howard, Mohammad Salah Hassan

PMC · DOI: 10.1371/journal.pone.0341905 · PLOS One · 2026-02-12

## TL;DR

This paper introduces a reproducible method for generating synthetic multi-turn dialogue data for Arabic conversational AI, and shows that Arabic language models fine-tuned on this data perform well even in low-resource settings.

## Contribution

A novel blueprint for generating and benchmarking synthetic data for Arabic multi-turn dialogue systems using large language models.

## Key findings

- Fine-tuned ArabianGPT-08B-V2 achieved the highest RAVEN score (0.823) in cross-model comparisons.
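- Human evaluation by two independent raters showed acceptable inter-rater reliability (Cohen’s κ = 0.229–0.739) and positive rank correlations (Spearman ρ = 0.424–0.759), with overall quality scores of 4.04–4.34 on a five-point scale.
- High-quality, LLM-generated synthetic data effectively improved the performance of Arabic conversational models.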

## Abstract

The rapid evolution of Large Language Models (LLMs) has fueled increasing interest in developing Arabic conversational systems capable of sustaining coherent multi-turn dialogues. However, progress remains constrained by the scarcity of large-scale, diverse, and high-quality datasets specifically designed for Arabic multi-turn interaction. This study presents a reproducible methodology for constructing such a dataset through structured prompting of an instruction-tuned Arabic LLM (Jais-13b-chat), yielding 43,316 multi-turn conversations across 93 topics and 151 countries. Two pre-trained Arabic language models (ArabianGPT-08B-V2 and AraGPT2-mega) were fine-tuned on this synthetic data and benchmarked against multilingual instruction-tuned baselines using a comprehensive evaluation framework combining automatic metrics (Perplexity and RAVEN) with structured human evaluation. Fine-tuned ArabianGPT-08B-V2 achieved the highest RAVEN score (0.823) for cross-model comparison, outperforming both fine-tuned AraGPT2-mega and instruction-tuned baselines while maintaining strong within-model perplexity (9.4). Human evaluation by two independent raters demonstrated acceptable inter-rater reliability (Cohen’s κ = 0.229–0.739) with positive rank correlations (Spearman ρ = 0.424–0.759), yielding overall quality scores of 4.04–4.34 on a five-point scale. These findings demonstrate that high-quality, LLM-generated synthetic data effectively improves Arabic conversational models, providing a scalable, resource-efficient blueprint for dialogue systems in low-resource and culturally specific settings.
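The inter-rater statistics reported in the abstract can be illustrated with a minimal sketch. It assumes unweighted Cohen’s κ and tie-aware Spearman ρ computed over two raters' five-point scores; the abstract does not say whether the authors used a weighted variant, and the function names and sample ratings below are hypothetical, not taken from the paper.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two raters' categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(a, b):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    if va == 0 or vb == 0:
        raise ValueError("rho is undefined when a rater gives constant scores")
    return cov / sqrt(va * vb)


# Hypothetical five-point scores from two raters on the same six responses.
rater_1 = [5, 4, 4, 3, 5, 2]
rater_2 = [5, 4, 3, 3, 4, 2]
kappa = cohen_kappa(rater_1, rater_2)
rho = spearman_rho(rater_1, rater_2)
```

κ measures exact label agreement corrected for chance, while ρ only asks whether the two raters order the responses similarly, so a low κ can coexist with a moderately high ρ when raters disagree by one scale point but rank responses alike.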

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12900375/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12900375/full.md

## References

57 references — full list in the complete paper: https://tomesphere.com/paper/PMC12900375/full.md

---
Source: https://tomesphere.com/paper/PMC12900375