BanglaQuAD: A Bengali Open-domain Question Answering Dataset
Md Rashad Al Hasan Rony, Sudipto Kumar Shaha, Rakib Al Hasan, Sumon, Kanti Dey, Amzad Hossain Rafi, Amzad Hossain Rafi, Ashraf Hasan Sirajee, Jens, Lehmann

TL;DR
BanglaQuAD is a newly created Bengali question answering dataset with over 30,000 pairs, developed from Wikipedia articles to support NLP research in this low-resource language.
Contribution
The paper introduces BanglaQuAD, the first large-scale Bengali QA dataset, and an annotation tool for dataset creation, addressing language resource scarcity.
Findings
Dataset contains 30,808 question-answer pairs.
Qualitative analysis confirms dataset quality.
Supports NLP research in Bengali language.
Abstract
Bengali is the seventh most spoken language on earth, yet considered a low-resource language in the field of natural language processing (NLP). Question answering over unstructured text is a challenging NLP task as it requires understanding both question and passage. Very few researchers attempted to perform question answering over Bengali (natively pronounced as Bangla) text. Typically, existing approaches construct the dataset by directly translating them from English to Bengali, which produces noisy and improper sentence structures. Furthermore, they lack topics and terminologies related to the Bengali language and people. This paper introduces BanglaQuAD, a Bengali question answering dataset, containing 30,808 question-answer pairs constructed from Bengali Wikipedia articles by native speakers. Additionally, we propose an annotation tool that facilitates question-answering dataset…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
