AceGPT, Localizing Large Language Models in Arabic
Huang Huang, Fei Yu, Jianqing Zhu, Xuening Sun, Hao Cheng, Dingjie Song, Zhihong Chen, Abdulmohsen Alharthi, Bang An, Juncai He, Ziche Liu, Zhiyi Zhang, Junying Chen, Jianquan Li, Benyou Wang, Lian Zhang, Ruoyu Sun, Xiang Wan, Haizhou Li, Jinchao Xu

TL;DR
AceGPT is a culturally sensitive Arabic language model developed through targeted pre-training, fine-tuning, and reinforcement learning, achieving state-of-the-art performance on Arabic benchmarks.
Contribution
This work introduces AceGPT, a novel Arabic LLM that incorporates cultural and value alignment through specialized training and reinforcement learning techniques.
Findings
Achieves state-of-the-art results on Arabic benchmarks
Demonstrates improved cultural sensitivity and value alignment
Provides open-source code, data, and models
Abstract
This paper is devoted to the development of a localized Large Language Model (LLM) specifically for Arabic, a language imbued with unique cultural characteristics inadequately addressed by current mainstream models. Significant concerns emerge when addressing cultural sensitivity and local values. To address this, the paper proposes a comprehensive solution that includes further pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions and GPT-4 responses in Arabic, alongside Reinforcement Learning with AI Feedback (RLAIF) employing a reward model attuned to local culture and values. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities. Comprehensive evaluations reveal that the resulting model, dubbed 'AceGPT', sets the…
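The abstract mentions a reward model used for RLAIF, but this summary does not say how that model is trained. As a rough illustration only, the sketch below shows the pairwise (Bradley-Terry style) preference loss that is commonly used to train reward models. Whether AceGPT uses this exact objective is an assumption, and the function name and the example scores are hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    Hypothetical helper: a generic reward-model objective, not taken
    from the AceGPT paper. The loss is small when the preferred
    response receives the higher scalar reward.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of
    # the margin so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative scores only: a culturally appropriate answer ranked
# above a less appropriate one, and the reverse ordering.
loss_correct_order = pairwise_preference_loss(2.0, 0.0)
loss_wrong_order = pairwise_preference_loss(0.0, 2.0)
print(loss_correct_order < loss_wrong_order)
```

With equal rewards the loss is ln 2, and it shrinks toward zero as the preferred response is scored increasingly higher than the rejected one.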
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
* Pre-training with Arabic data seems to improve performance on the MMLU and ACVA benchmarks, proving the utility of native-language data.
* An interesting analysis of the preference for certain cultural contexts in language model responses. A dataset for studying this cultural alignment has been created, which is a novel and useful contribution.
* Better performance than JAIS, comparable to ChatGPT.
* Ablation studies show the utility of pre-training with Arabic data and RLAIF (which improves bot
* While the paper makes an interesting contribution to an improved Arabic LLM, it does little to advance the study of building/adapting LLMs for non-English languages. Most of the methods are well known. A few studies can help draw broader lessons on the localization of LLMs:
  * How much pre-training data is required? What is the best data balance between English and other languages?
  * Does the English performance get impacted due to the finetuning?
  * How objective is the ACVA bench
1. This paper is well written and easy to follow.
2. Localization to a specific culture is an important topic in LLMs.
3. This paper mentions Arabic-related datasets and models, which can be useful for people in related fields.
1. The theoretical and technical contributions are poor. This paper is more like an engineering report introducing how to localize a public LLM for Arabic, illustrating the operations and datasets used during the pre-training, instruction tuning, and RLHF stages. All the methods are well known. The findings are intuitive: using localized data for pre-training, instruction tuning, and RLHF training can be helpful for better localization. It seems more suitable for an empirical NLP conference rather than a learning
The authors organize the paper clearly and describe in detail how to tune an LLM to work better on a specific language, Arabic. According to the evaluation, tuning improves metrics in both human and automatic evaluations.
I think the main weakness is novelty. Though a novel application should also be considered a contribution, I do not think the paper provides many insights into LLMs in Arabic. It seems to just follow the common techniques: continued training, SFT, and RLAIF.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Byte Pair Encoding · Softmax · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Residual Connection · Adam
