Benchmarking Sociolinguistic Diversity in Swahili NLP: A Taxonomy-Guided Approach
Kezia Oketch, John P. Lalor, Ahmed Abbasi

TL;DR
This paper presents a taxonomy-guided evaluation framework for Swahili NLP that centers sociolinguistic diversity and its effect on model performance, built on a new dataset of 2,170 free-text responses from Kenyan speakers.
Contribution
The paper introduces the first sociolinguistic taxonomy for Swahili NLP evaluation and uses it to analyze model prediction errors across diverse language varieties.
Findings
Sociolinguistic variation significantly affects model accuracy.
The taxonomy reveals specific error patterns linked to linguistic features.
Models struggle with code-mixing and regional dialects.
Abstract
We introduce the first taxonomy-guided evaluation of Swahili NLP, addressing gaps in sociolinguistic diversity. Drawing on health-related psychometric tasks, we collect a dataset of 2,170 free-text responses from Kenyan speakers. The data exhibits tribal influences, urban vernacular, code-mixing, and loanwords. We develop a structured taxonomy and use it as a lens for examining model prediction errors across pre-trained and instruction-tuned language models. Our findings advance culturally grounded evaluation frameworks and highlight the role of sociolinguistic variation in shaping model performance.
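One way to picture using a taxonomy "as a lens for examining model prediction errors" is to tag each response with the phenomena it exhibits and then compare error rates per tag. The sketch below is only an illustration under stated assumptions: the tag names are borrowed from the phenomena listed in the abstract, not from the paper's actual taxonomy categories; the record fields (`tags`, `gold`, `pred`) and the function `error_rate_by_tag` are hypothetical; and the four records are made-up toy data, not results from the paper.

```python
from collections import defaultdict


def error_rate_by_tag(examples):
    """Return {tag: share of examples carrying that tag that were mispredicted}.

    Each example is a dict with hypothetical fields:
      "tags": list of taxonomy labels attached to the response
      "gold": reference label, "pred": model prediction
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ex in examples:
        wrong = ex["gold"] != ex["pred"]
        # Responses with no taxonomy label are pooled under "untagged".
        for tag in ex["tags"] or ["untagged"]:
            totals[tag] += 1
            errors[tag] += wrong  # True counts as 1, False as 0
    return {tag: errors[tag] / totals[tag] for tag in totals}


# Made-up toy records, for illustration only.
examples = [
    {"tags": ["code-mixing"], "gold": 1, "pred": 0},
    {"tags": ["code-mixing", "loanword"], "gold": 1, "pred": 1},
    {"tags": ["urban vernacular"], "gold": 0, "pred": 0},
    {"tags": [], "gold": 0, "pred": 1},
]

rates = error_rate_by_tag(examples)
for tag, rate in sorted(rates.items()):
    print(f"{tag}: {rate:.2f}")
```

Because a response can carry several tags, it counts toward each of them, so the per-tag rates are not mutually exclusive slices of the data.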
Taxonomy
Topics: Multilingual Education and Policy
