NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities
Abdellah El Mekki, Houdaifa Atou, Omer Nacar, Shady Shehata, Muhammad Abdul-Mageed

TL;DR
This paper presents NileChat, a culturally aware LLM for the Egyptian and Moroccan Arabic dialects. It is trained on community-specific synthetic and retrieval-based data and outperforms similar-sized models in understanding, translation, and cultural alignment.
Contribution
The paper introduces a methodology for building culturally tailored pre-training data and uses it to develop NileChat, an LLM that incorporates the language, heritage, and values of low-resource communities.
Findings
NileChat outperforms existing Arabic-aware LLMs of similar size.
NileChat performs on par with larger models in various benchmarks.
The methodology enhances cultural and values alignment in LLMs.
Abstract
Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research directions predominantly rely on synthetic data generated by translating English corpora, which, while demonstrating promising linguistic understanding and translation abilities, often results in models aligned with source language culture. These models frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate our methodology using Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and current underrepresentation in LLMs. As a proof-of-concept, we…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods
