Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi
Maithili Sabane, Onkar Litake, Aman Chadha

TL;DR
This paper introduces the largest question answering datasets for Hindi and Marathi, created by translating SQuAD 2.0, and provides benchmark models to advance NLP research in these low-resource languages.
Contribution
The paper presents a novel translation-based approach to develop large QA datasets for Hindi and Marathi, along with benchmark models and resources for further research.
Findings
Created QA datasets of 28,000 samples each for Hindi and Marathi
Evaluated multiple architectures and released top models
Facilitated NLP research in low-resource languages
Abstract
Recent advances in deep learning have led to the development of highly sophisticated systems with an unquenchable appetite for data. Building good deep learning models for low-resource languages, however, remains a challenging task. This paper focuses on developing a Question Answering dataset for two such languages: Hindi and Marathi. Although Hindi is the 3rd most spoken language worldwide, with 345 million speakers, and Marathi is the 11th most spoken language globally, with 83.2 million speakers, both languages have limited resources for building efficient Question Answering systems. To tackle the challenge of data scarcity, we have developed a novel approach for translating the SQuAD 2.0 dataset into Hindi and Marathi. We release the largest Question Answering dataset available for these languages, with each dataset containing 28,000 samples. We evaluate the…
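To illustrate why translating an extractive QA dataset such as SQuAD 2.0 is not a plain text-translation job, here is a minimal sketch in Python. It is not the authors' method, which the abstract does not detail. The flat record layout (context, question, answers, is_impossible) is a simplification of the SQuAD 2.0 JSON, and the `translate` callable, `relocate_answer`, and `build_translated_record` are hypothetical names introduced only for this example. The point it shows: after translation, each answer's character offset has to be recomputed, and samples whose translated answer no longer appears in the translated context have to be handled somehow (dropped, in this sketch).

```python
from typing import Callable, Optional


def relocate_answer(context: str, answer: str) -> Optional[int]:
    """Return the character offset of `answer` in `context`, or None if absent."""
    idx = context.find(answer)
    return idx if idx >= 0 else None


def build_translated_record(
    record: dict, translate: Callable[[str], str]
) -> Optional[dict]:
    """Translate one simplified SQuAD-2.0-style record (hypothetical helper).

    `translate` is an assumed callable standing in for any translation system.
    Returns None when no translated answer can be found in the translated context.
    """
    context = translate(record["context"])
    question = translate(record["question"])

    # Unanswerable questions (SQuAD 2.0) carry no span, so nothing to realign.
    if record["is_impossible"]:
        return {"context": context, "question": question,
                "answers": [], "is_impossible": True}

    answers = []
    for ans in record["answers"]:
        text = translate(ans["text"])
        start = relocate_answer(context, text)
        if start is not None:  # keep only answers that still match exactly
            answers.append({"text": text, "answer_start": start})

    if not answers:
        return None  # this sketch simply drops the sample
    return {"context": context, "question": question,
            "answers": answers, "is_impossible": False}


# Toy usage: upper-casing stands in for a real translation system.
sample = {
    "context": "Pune is in Maharashtra.",
    "question": "Where is Pune?",
    "answers": [{"text": "Maharashtra", "answer_start": 11}],
    "is_impossible": False,
}
translated = build_translated_record(sample, str.upper)
```

With a real translation system, the answer translated in isolation often differs from its rendering inside the translated context (inflection, word order), so exact matching would drop many samples. A practical pipeline needs some alignment strategy beyond this sketch.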
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
