KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language
Barack W. Wanjawa (1), Lilian D.A. Wanzare (2), Florence Indede (2), Owen McOnyango (2), Lawrence Muchemi (1), Edward Ombui (3) ((1) University of Nairobi, Kenya, (2) Maseno University, Kenya, (3) Africa Nazarene University, Kenya)

TL;DR
This paper introduces KenSwQuAD, a Swahili question answering dataset annotated from story texts collected by the Kencorpus project. It supports machine comprehension tasks and adds to the NLP resources available for Swahili, a low-resource language.
Contribution
The paper presents the creation and validation of KenSwQuAD, the first large-scale Swahili QA dataset, with 7,526 QA pairs from 1,445 texts, supporting NLP research in low-resource languages.
Findings
The dataset contains 7,526 QA pairs from 1,445 texts.
QA pairs were validated with a 12.5% quality assurance set.
A proof of concept shows the dataset's usability for QA tasks.
Abstract
The need for Question Answering datasets in low resource languages is the motivation of this research, leading to the development of the Kencorpus Swahili Question Answering Dataset, KenSwQuAD. This dataset is annotated from raw story texts in Swahili, a low resource language that is predominantly spoken in Eastern Africa and in other parts of the world. Question Answering (QA) datasets are important for machine comprehension of natural language for tasks such as internet search and dialog systems. Machine learning systems need training data such as the gold standard Question Answering set developed in this research. The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus. The project annotated 1,445 texts from the total 2,585 texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA…
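The figures quoted in the abstract can be sanity-checked with simple arithmetic. The Python sketch below is illustrative only: the constant and function names are hypothetical and not part of the KenSwQuAD release, and it uses only the three counts stated above (2,585 collected texts, 1,445 annotated texts, 7,526 QA pairs) plus the stated minimum of 5 QA pairs per text.

```python
# Counts quoted in the abstract; the names below are hypothetical,
# chosen for this illustration only.
TOTAL_TEXTS = 2585        # Swahili texts collected by Kencorpus
ANNOTATED_TEXTS = 1445    # texts that received QA annotations
QA_PAIRS = 7526           # QA pairs in the final dataset
MIN_PAIRS_PER_TEXT = 5    # stated minimum number of QA pairs per text


def dataset_stats(total_texts: int, annotated_texts: int, qa_pairs: int) -> dict:
    """Return the annotated share of the corpus and the mean QA pairs per text."""
    return {
        "coverage": annotated_texts / total_texts,
        "mean_pairs_per_text": qa_pairs / annotated_texts,
    }


stats = dataset_stats(TOTAL_TEXTS, ANNOTATED_TEXTS, QA_PAIRS)

# The total must be at least the per-text minimum times the number of texts.
assert QA_PAIRS >= MIN_PAIRS_PER_TEXT * ANNOTATED_TEXTS

print(f"Annotated share of corpus: {stats['coverage']:.1%}")
print(f"Mean QA pairs per annotated text: {stats['mean_pairs_per_text']:.2f}")
```

Under these counts, roughly 56% of the collected texts were annotated, with a little over 5 QA pairs per annotated text on average, which is consistent with the stated minimum of 5 pairs per text.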
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
