KazQAD: Kazakh Open-Domain Question Answering Dataset
Rustem Yeshpanov, Pavel Efimov, Leonid Boytsov, Ardak Shalkarbayuli, Pavel Braslavski

TL;DR
KazQAD is a new Kazakh open-domain question answering dataset with nearly 6,000 questions. It supports reading comprehension and information retrieval experiments and provides a benchmark for future NLP research on the Kazakh language.
Contribution
It introduces the first large-scale Kazakh ODQA dataset with diverse sources and baseline models, filling a critical resource gap for Kazakh NLP research.
Findings
Baseline models achieve moderate retrieval and comprehension scores.
Current models score lower on KazQAD than models do on English QA benchmarks.
ChatGPT v3.5 struggles with Kazakh QA in a closed-book setting.
Abstract
We introduce KazQAD -- a Kazakh open-domain question answering (ODQA) dataset -- that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements. We use a combination of machine translation, Wikipedia search, and in-house manual annotation to ensure annotation efficiency and data quality. The questions come from two sources: translated items from the Natural Questions (NQ) dataset (only for training) and the original Kazakh Unified National Testing (UNT) exam (for development and testing). The accompanying text corpus contains more than 800,000 passages from the Kazakh Wikipedia. As a supplementary dataset, we release around 61,000 question-passage-answer triples from the NQ dataset that have been…
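To make the two kinds of annotation in the abstract concrete (short answers and passage-level relevance judgements), here is a minimal Python sketch of how one question record might be represented and scored. The field names, the toy question, and the passage IDs are all hypothetical and are not taken from the released files. Exact match and recall@k are common choices for reading comprehension and retrieval evaluation; this summary does not state which metrics the authors report.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    # Hypothetical schema for illustration; the released KazQAD files
    # may use different field names and formats.
    qid: str
    text: str
    answers: list[str]                                         # gold short answers
    relevant_passages: set[str] = field(default_factory=set)  # judged-relevant passage IDs


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(s.lower().split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the prediction equals any gold answer after normalization."""
    return normalize(prediction) in {normalize(a) for a in gold_answers}


def recall_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """Fraction of judged-relevant passages found in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant)
    return hits / len(relevant)


# Toy, made-up example: one question, its gold answer, and two relevant passages.
q = Question(
    qid="q1",
    text="Қазақстанның астанасы қай қала?",
    answers=["Астана"],
    relevant_passages={"p2", "p7"},
)
ranking = ["p5", "p2", "p9", "p7"]  # hypothetical retriever output, best first

print(recall_at_k(ranking, q.relevant_passages, k=2))
print(exact_match("  астана ", q.answers))
```

In a full ODQA setting the two functions would be chained: a retriever ranks passages from the Wikipedia corpus, and a reader extracts a short answer from the top-ranked passages, which is then compared against the gold answers.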
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
