MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering
Ruturaj Ghatage, Aditya Kulkarni, Rajlaxmi Patil, Sharvi Endait, Raviraj Joshi

TL;DR
This paper introduces MahaSQuAD, a comprehensive Marathi question-answering dataset derived from translating English SQuAD, along with a robust translation approach to support low-resource language QA systems.
Contribution
It presents the first Marathi SQuAD dataset and a scalable translation method for low-resource languages, addressing linguistic nuances and context preservation.
Findings
MahaSQuAD contains 118,516 training, 11,873 validation, and 11,803 test samples, plus a gold test set of 500 manually verified examples.
A novel translation and span-mapping approach was developed.
Datasets and models are publicly available.
Abstract
Question-answering systems have revolutionized information retrieval, but linguistic and cultural boundaries limit their widespread accessibility. This research endeavors to bridge the gap of the absence of efficient QnA datasets in low-resource languages by translating the English Question Answering Dataset (SQuAD) using a robust data curation approach. We introduce MahaSQuAD, the first-ever full SQuAD dataset for the Indic language Marathi, consisting of 118,516 training, 11,873 validation, and 11,803 test samples. We also present a gold test set of manually verified 500 examples. Challenges in maintaining context and handling linguistic nuances are addressed, ensuring accurate translations. Moreover, as a QnA dataset cannot be simply converted into any low-resource language using translation, we need a robust method to map the answer translation to its span in the translated passage.…
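The span-mapping problem named in the abstract can be illustrated with a small sketch. The abstract is truncated before it describes the authors' method, so the code below is not their approach; it is a generic fuzzy-matching baseline, assuming the passage and the answer were translated separately and the translated answer may not appear verbatim in the translated passage. The function name `find_answer_span`, the `max_extra_tokens` parameter, and the toy English strings are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def find_answer_span(context: str, answer: str, max_extra_tokens: int = 2):
    """Return (char_start, span_text, score) for the span of `context`
    that is most similar to `answer`.

    Illustrative baseline only, not the method used in the paper.
    """
    # Easy case: the translated answer occurs verbatim in the passage.
    idx = context.find(answer)
    if idx != -1:
        return idx, answer, 1.0

    # Whitespace tokens with their character offsets in the passage.
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", context)]
    n_answer = max(1, len(answer.split()))

    best = (-1, "", 0.0)
    # Try windows slightly shorter and longer than the answer, since a
    # translation can add or drop words.
    min_width = max(1, n_answer - max_extra_tokens)
    max_width = n_answer + max_extra_tokens
    for width in range(min_width, max_width + 1):
        for i in range(len(tokens) - width + 1):
            start, end = tokens[i][0], tokens[i + width - 1][1]
            candidate = context[start:end]
            score = SequenceMatcher(None, candidate, answer).ratio()
            if score > best[2]:
                best = (start, candidate, score)
    return best


if __name__ == "__main__":
    # Toy strings standing in for a translated passage and answer.
    passage = "Pune is the cultural capital of Maharashtra state"
    translated_answer = "cultural capitol"  # differs from the passage wording
    print(find_answer_span(passage, translated_answer))
```

The returned character offset is what a SQuAD-style record stores as `answer_start`. A real pipeline would probably also set a minimum similarity score and discard examples that fall below it, rather than accept the best match unconditionally.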
Code & Models
- 🤗 l3cube-pune/marathi-question-answering-squad-bert (model) · 13 dl · ♡ 2
- 🤗 l3cube-pune/gujarati-question-answering-squad-bert (model) · 12 dl · ♡ 1
- 🤗 l3cube-pune/hindi-question-answering-squad-bert (model) · 104 dl
- 🤗 l3cube-pune/kannada-question-answering-squad-bert (model) · 24 dl
- 🤗 l3cube-pune/punjabi-question-answering-squad-bert (model)
- 🤗 l3cube-pune/tamil-question-answering-squad-bert (model) · 26 dl
- 🤗 l3cube-pune/bengali-question-answering-squad-bert (model) · 1 dl
- 🤗 l3cube-pune/malayalam-question-answering-squad-bert (model) · 14 dl
- 🤗 l3cube-pune/oriya-question-answering-squad-bert (model) · 7 dl
- 🤗 l3cube-pune/telugu-question-answering-squad-bert (model) · 4 dl
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training
