Creating Arabic LLM Prompts at Scale
Abdelrahman El-Sheikh, Ahmed Elmogtaba, Kareem Darwish, Muhammad Elmallah, Ashraf Elneima, Hassan Sawaf

TL;DR
This paper presents scalable methods for creating large Arabic prompt datasets, enabling fine-tuned LLMs to outperform larger models in Arabic instruction tasks.
Contribution
Introduces two efficient methods for generating extensive Arabic prompts from existing datasets and translations, significantly enhancing Arabic instruction-following capabilities.
Findings
Created over 67.4 million Arabic prompts across various tasks.
Fine-tuned a 7B LLM to outperform a 70B instruction-tuned model in Arabic.
Demonstrated the effectiveness of prompt creation methods for Arabic NLP.
Abstract
The debut of ChatGPT and Bard has popularized instruction-following text generation using LLMs, where a user can interrogate an LLM using natural language requests and obtain natural language answers that match their requests. Training LLMs to respond in this manner requires a large number of worked-out examples of user requests (aka prompts) with corresponding gold responses. In this paper, we introduce two methods for creating such prompts for Arabic cheaply and quickly. The first method entails automatically translating existing prompt datasets from English, such as PromptSource and Super-NaturalInstructions, and then using machine translation quality estimation to retain only high-quality translations. The second method involves creating natural language prompts on top of existing Arabic NLP datasets. Using these two methods we were able to create more than 67.4 million Arabic…
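The two methods described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `TranslatedPrompt` fields, the 0.8 threshold, the assumption that quality-estimation scores lie in [0, 1], and the Arabic templates are all hypothetical choices made for illustration. The quality-estimation scores are assumed to come from some external model and are simply stored here.

```python
from dataclasses import dataclass


@dataclass
class TranslatedPrompt:
    source_en: str   # original English prompt
    target_ar: str   # machine-translated Arabic prompt
    qe_score: float  # hypothetical quality-estimation score, assumed in [0, 1]


def keep_high_quality(items, threshold=0.8):
    """Method 1 sketch: keep translations whose QE score meets the threshold.

    The threshold value is an assumption, not a figure from the paper.
    """
    return [item for item in items if item.qe_score >= threshold]


def render_prompts(records, templates):
    """Method 2 sketch: wrap each labeled record in every template."""
    prompts = []
    for record in records:
        for template in templates:
            prompts.append({
                "prompt": template.format(text=record["text"]),
                "response": record["label"],
            })
    return prompts


# Hypothetical translated prompts with made-up QE scores.
translated = [
    TranslatedPrompt("Summarize the text.", "لخص النص.", 0.91),
    TranslatedPrompt("Classify the sentiment.", "صنف", 0.42),
]
kept = keep_high_quality(translated)

# Hypothetical sentiment record and templates layered on top of it.
records = [{"text": "الخدمة ممتازة", "label": "إيجابي"}]
templates = [
    "صنف مشاعر النص التالي: {text}",
    "هل النص التالي إيجابي أم سلبي؟ {text}",
]
prompts = render_prompts(records, templates)
```

In this sketch, each labeled example produces one prompt per template, so a single dataset can be expanded into several prompt variants. The filtering step trades dataset size for translation quality through the choice of threshold.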
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Balanced Selection
