KazQAD: Kazakh Open-Domain Question Answering Dataset

Rustem Yeshpanov; Pavel Efimov; Leonid Boytsov; Ardak Shalkarbayuli; Pavel Braslavski

arXiv:2404.04487 · cs.CL · April 9, 2024 · 1 citation

TL;DR

KazQAD is a new Kazakh open-domain question answering dataset with nearly 6,000 questions. It supports reading comprehension and information retrieval experiments and provides a benchmark for future NLP research on the Kazakh language.

Contribution

The paper introduces the first large-scale Kazakh ODQA dataset, drawn from diverse sources and accompanied by baseline models, filling a critical resource gap for Kazakh NLP research.

Findings

01 Baseline models achieve moderate retrieval and reading comprehension scores.

02 Current models underperform relative to results on English QA benchmarks.

03 ChatGPT v3.5 struggles with Kazakh QA in the closed-book setting.

Abstract

We introduce KazQAD -- a Kazakh open-domain question answering (ODQA) dataset -- that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements. We use a combination of machine translation, Wikipedia search, and in-house manual annotation to ensure annotation efficiency and data quality. The questions come from two sources: translated items from the Natural Questions (NQ) dataset (only for training) and the original Kazakh Unified National Testing (UNT) exam (for development and testing). The accompanying text corpus contains more than 800,000 passages from the Kazakh Wikipedia. As a supplementary dataset, we release around 61,000 question-passage-answer triples from the NQ dataset that have been…
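Because each KazQAD question comes with an extracted short answer, a reader's prediction can be compared to the gold answer string directly. The sketch below shows the exact match and token-level F1 scores commonly used for extractive QA. It is illustrative only: the function names and the normalization (lowercasing, stripping ASCII punctuation, whitespace tokenization) are assumptions, not the authors' evaluation script, and the sample strings are made up rather than taken from the dataset.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and split on whitespace.

    This normalization is an assumption for illustration; the official
    KazQAD evaluation may normalize answers differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def exact_match(prediction, gold):
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall against the gold answer."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up example strings (not drawn from KazQAD).
gold_answer = "Абай Құнанбайұлы"
print(exact_match("Абай Құнанбайұлы.", gold_answer))
print(token_f1("Абай", gold_answer))
```

Token-level F1 gives partial credit when a predicted span overlaps the gold answer without matching it exactly, which is why the two scores are usually reported together.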

Code & Models

Repositories

is2ai/kazqad (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques