Does Generative AI speak Nigerian-Pidgin?: Issues about Representativeness and Bias for Multilingualism in LLMs
David Ifeoluwa Adelani, A. Seza Doğruöz, Iyanuoluwa Shode, Anuoluwapo Aremu

TL;DR
This paper investigates the linguistic differences between two Nigerian pidgin varieties, Naija and West African Pidgin English (WAPE), and how each is represented in large language models. It finds that Naija is underrepresented and that the models are biased towards WAPE.
Contribution
It provides a comparative analysis of Naija and WAPE, highlighting linguistic differences and demonstrating that current LLMs mainly operate on WAPE, thus underrepresenting Naija.
Findings
Naija and WAPE have significant linguistic differences.
LLMs predominantly operate on WAPE, underrepresenting Naija.
Historical and interview data provide context for language representation issues.
Abstract
Nigeria is a multilingual country with 500+ languages. Naija is a Nigerian Pidgin spoken by approximately 120M speakers, and it is a mixed language (drawing on, e.g., English, Portuguese, Yoruba, Hausa and Igbo). Although it has mainly been a spoken language until recently, some online platforms (e.g., Wikipedia) now publish in written Naija as well. West African Pidgin English (WAPE) is also spoken in Nigeria, and it is used by the BBC to broadcast news on the internet to a wider audience, not only in Nigeria but also in other West African countries (e.g., Cameroon and Ghana). Through statistical analyses and Machine Translation experiments, our paper shows that these two pidgin varieties do not represent each other (i.e., there are linguistic differences in word order and vocabulary) and that Generative AI operates only based on WAPE. In other words, Naija is underrepresented in Generative AI, and…
Taxonomy
Topics: Language and cultural evolution
