Benchmarking Sociolinguistic Diversity in Swahili NLP: A Taxonomy-Guided Approach

Kezia Oketch; John P. Lalor; Ahmed Abbasi

arXiv:2508.14051 · cs.CL · August 21, 2025

Open Access

TL;DR

This paper presents a taxonomy-guided evaluation framework for Swahili NLP that foregrounds sociolinguistic diversity and its effect on model performance, built on a new dataset of 2,170 free-text responses from Kenyan speakers.

Contribution

The paper introduces the first sociolinguistic taxonomy for Swahili NLP evaluation and demonstrates its effectiveness in analyzing model errors across diverse language varieties.

Findings

01. Sociolinguistic variation significantly affects model accuracy.

02. The taxonomy reveals specific error patterns linked to linguistic features.

03. Models struggle with code-mixing and regional dialects.

Abstract

We introduce the first taxonomy-guided evaluation of Swahili NLP, addressing gaps in sociolinguistic diversity. Drawing on health-related psychometric tasks, we collect a dataset of 2,170 free-text responses from Kenyan speakers. The data exhibits tribal influences, urban vernacular, code-mixing, and loanwords. We develop a structured taxonomy and use it as a lens for examining model prediction errors across pre-trained and instruction-tuned language models. Our findings advance culturally grounded evaluation frameworks and highlight the role of sociolinguistic variation in shaping model performance.
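
One way to picture the "taxonomy as a lens" idea is to group prediction errors by the sociolinguistic tags attached to each response. The sketch below is illustrative only and is not the authors' method: the tag names are borrowed from the varieties listed in the abstract, while the record layout, the function name `error_rate_by_tag`, the `untagged` baseline bucket, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict


def error_rate_by_tag(records):
    """Return {tag: (n_errors, n_total, error_rate)} for annotated records.

    Assumed record layout (hypothetical, not from the paper):
      'tags' - set of taxonomy tags assigned to the response
      'gold' - reference label
      'pred' - model prediction
    """
    errors = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        wrong = rec["pred"] != rec["gold"]
        # Responses with no tags go into a baseline bucket for comparison.
        for tag in (rec["tags"] or {"untagged"}):
            totals[tag] += 1
            errors[tag] += int(wrong)
    return {t: (errors[t], totals[t], errors[t] / totals[t]) for t in totals}


# Toy records invented for illustration; not drawn from the paper's dataset.
records = [
    {"tags": {"code_mixing"}, "gold": 1, "pred": 0},
    {"tags": {"code_mixing", "loanword"}, "gold": 0, "pred": 0},
    {"tags": set(), "gold": 1, "pred": 1},
    {"tags": {"urban_vernacular"}, "gold": 0, "pred": 1},
]

result = error_rate_by_tag(records)
for tag in sorted(result):
    n_err, n_total, rate = result[tag]
    print(f"{tag}: {n_err}/{n_total} errors ({rate:.2f})")
```

A response carrying several tags is counted once under each of them, so the per-tag totals can sum to more than the number of responses. Comparing each tag's rate against the untagged bucket gives a rough sense of which varieties are associated with more errors.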

Taxonomy

Topics: Multilingual Education and Policy