BanglaQuAD: A Bengali Open-domain Question Answering Dataset

Md Rashad Al Hasan Rony; Sudipto Kumar Shaha; Rakib Al Hasan; Sumon Kanti Dey; Amzad Hossain Rafi; Ashraf Hasan Sirajee; Jens Lehmann

arXiv:2410.10229·cs.CL·October 15, 2024

TL;DR

BanglaQuAD is a new Bengali question answering dataset of 30,808 question-answer pairs, built by native speakers from Bengali Wikipedia articles to support NLP research in this low-resource language.

Contribution

The paper introduces BanglaQuAD, the first large-scale Bengali QA dataset, and an annotation tool for dataset creation, addressing language resource scarcity.

Findings

1. The dataset contains 30,808 question-answer pairs.
2. Qualitative analysis confirms dataset quality.
3. The dataset supports NLP research in the Bengali language.

Abstract

Bengali is the seventh most spoken language on earth, yet it is considered a low-resource language in the field of natural language processing (NLP). Question answering over unstructured text is a challenging NLP task, as it requires understanding both the question and the passage. Very few researchers have attempted question answering over Bengali (natively pronounced as Bangla) text. Existing approaches typically construct datasets by directly translating English datasets into Bengali, which produces noisy and improper sentence structures. Furthermore, these datasets lack topics and terminologies related to the Bengali language and people. This paper introduces BanglaQuAD, a Bengali question answering dataset containing 30,808 question-answer pairs constructed from Bengali Wikipedia articles by native speakers. Additionally, we propose an annotation tool that facilitates question-answering dataset…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications