SwaQuAD-24: QA Benchmark Dataset in Swahili
Alfred Malengo Kondoro

TL;DR
This paper proposes SwaQuAD-24, a Swahili question answering (QA) benchmark dataset inspired by established benchmarks such as SQuAD, GLUE, KenSwQuAD, and KLUE, and intended to advance NLP research and applications for Swahili, a low-resource language.
Contribution
The paper proposes creating a high-quality, annotated Swahili QA dataset to address the language's underrepresentation in NLP and to support a range of downstream tasks.
Findings
The proposed dataset centers on diverse, annotated question-answer pairs
Designed to support applications such as machine translation, information retrieval, and healthcare chatbots
Aims to foster NLP innovation in East Africa
Abstract
This paper proposes the creation of a Swahili Question Answering (QA) benchmark dataset, aimed at addressing the underrepresentation of Swahili in natural language processing (NLP). Drawing from established benchmarks like SQuAD, GLUE, KenSwQuAD, and KLUE, the dataset will focus on providing high-quality, annotated question-answer pairs that capture the linguistic diversity and complexity of Swahili. The dataset is designed to support a variety of applications, including machine translation, information retrieval, and social services like healthcare chatbots. Ethical considerations, such as data privacy, bias mitigation, and inclusivity, are central to the dataset development. Additionally, the paper outlines future expansion plans to include domain-specific content, multimodal integration, and broader crowdsourcing efforts. The Swahili QA dataset aims to foster technological innovation…
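The abstract names SQuAD as a model but does not specify a record format. As an illustration only, the sketch below assumes a SQuAD-style extractive layout, in which each annotated pair stores a context passage, a question, the answer text, and the character offset where the answer begins. The class name `QAExample`, its field names, and the Swahili sentence are hypothetical and are not taken from SwaQuAD-24.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAExample:
    """Hypothetical SQuAD-style record; not the actual SwaQuAD-24 schema."""
    context: str        # passage the answer is drawn from
    question: str       # question written by an annotator
    answer_text: str    # answer span, copied verbatim from the context
    answer_start: int   # character offset of the span within the context

    def is_consistent(self) -> bool:
        # An annotation is usable only if the stored offset really
        # points at the stored answer text inside the context.
        end = self.answer_start + len(self.answer_text)
        return self.context[self.answer_start:end] == self.answer_text


# Invented example for illustration (not a dataset entry):
# "Nairobi is the capital city of Kenya." / "Which is the capital city of Kenya?"
example = QAExample(
    context="Nairobi ni mji mkuu wa Kenya.",
    question="Mji mkuu wa Kenya ni upi?",
    answer_text="Nairobi",
    answer_start=0,
)
```

A check such as `is_consistent` is one simple way an annotation pipeline might catch misaligned answer spans before release; whether the authors plan such a check is not stated in the abstract.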
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
