How Well Do LLMs Understand Tunisian Arabic?
Mohamed Mahdi

TL;DR
This paper evaluates how well large language models understand Tunisian Arabic, highlighting gaps and emphasizing the need for inclusive AI that supports low-resource languages.
Contribution
Introduces a new dataset of parallel Tunizi, standard Tunisian Arabic, and English translations with sentiment labels, and benchmarks LLMs on transliteration, translation, and sentiment analysis tasks.
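As a rough illustration of how benchmark tasks like these can be scored, here is a minimal Python sketch. It assumes character error rate for transliteration output and plain accuracy for sentiment labels; the summary does not state which metrics the paper uses, so both metric choices and all function names below are hypothetical.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance normalized by reference length (assumed metric, not the paper's)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(hypothesis, reference) / len(reference)


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted labels that match the gold labels."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

A model's transliteration of each sentence would be compared against the dataset's reference with `character_error_rate`, and its predicted sentiment labels against the gold labels with `accuracy`. Translation quality would typically need a separate metric, which is left out of this sketch.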
Findings
Model performance varies significantly across tasks.
LLMs show clear limitations in understanding Tunisian dialects.
Supporting low-resource languages is important for inclusive AI.
Abstract
Large Language Models (LLMs) are the engines driving today's AI agents. The better these models understand human languages, the more natural and user-friendly the interaction with AI becomes, from everyday devices like computers and smartwatches to any tool that can act intelligently. Yet, the ability of industrial-scale LLMs to comprehend low-resource languages, such as Tunisian Arabic (Tunizi), is often overlooked. This neglect risks excluding millions of Tunisians from fully interacting with AI in their own language, pushing them toward French or English. Such a shift not only threatens the preservation of the Tunisian dialect but may also create challenges for literacy and influence younger generations to favor foreign languages. In this study, we introduce a novel dataset containing parallel Tunizi, standard Tunisian Arabic, and English translations, along with sentiment labels. We…
Taxonomy
Topics: Natural Language Processing Techniques · ICT in Developing Communities · Big Data and Digital Economy
