Language Diversity: Evaluating Language Usage and AI Performance on African Languages in Digital Spaces
Edward Ajayi, Eudoxie Umwari, Mawuli Deku, Prosper Singadi, Jules Udahemuka, Bekalu Tadele, Chukuemeka Edeh

TL;DR
This paper evaluates the digital presence of African languages and finds that curated news data improves AI language detection accuracy more than conversational data, highlighting challenges in processing code-switched and sparse online language use.
Contribution
It demonstrates that curated news data is more effective than social media data for training language detection models on African languages, and it emphasizes the need for models that handle code-switching.
Findings
News media provides more reliable data than social media for African languages.
Language detection models perform well on clean news data but struggle with code-switched social media posts.
Future models should better handle code-switching and sparse conversational data.
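To make the code-switching finding concrete, here is a minimal sketch of how a post might be screened for English mixing before it is used to train or evaluate a language detector. It is not the authors' method. The word list `ENGLISH_HINTS`, the function names, and the `low` and `high` thresholds are all hypothetical and chosen only for illustration; a real pipeline would likely rely on a proper lexicon or a token-level language identifier.

```python
import string

# Hypothetical, deliberately tiny list of common English function words.
# A real study would use a much larger lexicon; this is an assumption.
ENGLISH_HINTS = frozenset({
    "the", "is", "and", "to", "of", "in", "for", "you",
    "this", "that", "but", "with", "was", "are", "my",
})


def english_share(text):
    """Return the fraction of tokens that appear in ENGLISH_HINTS."""
    # Split on whitespace and strip ASCII punctuation from token edges,
    # so diacritics and non-Latin scripts inside tokens stay intact.
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in ENGLISH_HINTS)
    return hits / len(tokens)


def label_post(text, low=0.1, high=0.5):
    """Coarsely label a post by its share of English hint words.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    share = english_share(text)
    if share >= high:
        return "mostly_english"
    if share > low:
        return "code_switched"
    return "mostly_non_english"


if __name__ == "__main__":
    # Illustrative strings only; they are not taken from the study's data.
    for post in [
        "the game was good but the ref was bad",
        "Muraho, this is nziza cyane",
        "Amakuru meza cyane",
    ]:
        print(label_post(post), "|", post)
```

A heuristic like this only separates posts into rough buckets. It says nothing about which African language is present, which is the harder problem the findings above point to.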
Abstract
This study examines the digital representation of African languages and the challenges this presents for current language detection tools. We evaluate their performance on Yoruba, Kinyarwanda, and Amharic. While these languages are spoken by millions, their online usage on conversational platforms is often sparse, heavily influenced by English, and not representative of the authentic, monolingual conversations prevalent among native speakers. This lack of readily available authentic data online leaves little conversational data for training language models. To investigate this, data was collected from subreddits and local news sources for each language. The analysis showed a stark contrast between the two sources. Reddit data was minimal and characterized by heavy code-switching. Conversely, local news media offered a robust source of clean, monolingual language…
Taxonomy
Topics: ICT in Developing Communities · Authorship Attribution and Profiling · Computational and Text Analysis Methods
