GemmAr: Enhancing LLMs Through Arabic Instruction-Tuning

Hasna Chouikhi; Manel Aloui; Cyrine Ben Hammou; Ghaith Chaabane; Haithem Kchaou; Chehir Dhaouadi

arXiv:2407.02147·cs.CL·July 10, 2024

Open Access

TL;DR

This paper introduces InstAr-500k, a new Arabic instruction dataset, and demonstrates how fine-tuning an open-source LLM with this dataset significantly improves Arabic NLP task performance.

Contribution

The paper presents a novel Arabic instruction dataset and fine-tuning approach that enhances LLM capabilities for Arabic NLP tasks, addressing resource scarcity.

Findings

1. Fine-tuned model achieves state-of-the-art results on Arabic benchmarks.

2. The dataset effectively bridges the performance gap between English and Arabic models.

3. Enhanced Arabic NLP capabilities demonstrated through multiple downstream tasks.

Abstract

Large language models (LLMs) have greatly impacted the natural language processing (NLP) field, particularly for the English language. These models have demonstrated capabilities in understanding and generating human-like text. The success of language models largely depends on the availability of high-quality instruction datasets, which consist of detailed task descriptions and corresponding responses that are essential for training the models to address a variety of prompts accurately. However, the availability and quality of these resources vary by language. While models perform well in English, they often need help with languages like Arabic, due to the lack of datasets for fine-tuning Arabic-specific tasks. To address this issue, we introduce InstAr-500k, a new Arabic instruction dataset created by generating and collecting content that covers several domains and instruction types.…

Taxonomy

Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing · Text Readability and Simplification