Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi

Maithili Sabane; Onkar Litake; Aman Chadha

arXiv:2308.09862 · cs.CL · February 20, 2024

TL;DR

This paper introduces the largest question answering datasets for Hindi and Marathi, created by translating SQuAD 2.0, and provides benchmark models to advance NLP research in these low-resource languages.

Contribution

The paper presents a novel translation-based approach to develop large QA datasets for Hindi and Marathi, along with benchmark models and resources for further research.
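One practical difficulty with any translation-based extractive QA dataset is that SQuAD-style answers are stored as character offsets into the context, and those offsets stop being valid once the text is translated. The sketch below illustrates that general problem with a simple exact-match realignment; it is not the authors' method, which this summary does not describe. The flattened record layout, the `translate` callable, and the function name `translate_squad_record` are all assumptions made for illustration.

```python
from typing import Callable, Optional


def translate_squad_record(
    record: dict, translate: Callable[[str], str]
) -> Optional[dict]:
    """Translate one flattened SQuAD 2.0-style record (hypothetical layout).

    `translate` is a placeholder for any text-to-text translation function.
    Returns None when no translated answer can be located in the translated
    context, so the caller can drop the sample.
    """
    context = translate(record["context"])
    out = {
        "id": record["id"],
        "context": context,
        "question": translate(record["question"]),
        "is_impossible": record.get("is_impossible", False),
        "answers": [],
    }
    if out["is_impossible"]:
        # Unanswerable questions carry no span, so nothing needs realigning.
        return out
    for answer in record["answers"]:
        text = translate(answer["text"])
        # The original offset is meaningless after translation;
        # recompute it with an exact substring search.
        start = context.find(text)
        if start != -1:
            out["answers"].append({"text": text, "answer_start": start})
    return out if out["answers"] else None
```

Exact matching is the simplest possible alignment strategy and will discard samples where the answer is translated differently in isolation than inside the context; a real pipeline would likely need something more tolerant.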

Findings

01. Created question answering datasets of 28,000 samples each for Hindi and Marathi

02. Evaluated multiple architectures and released the top models

03. Facilitated NLP research in low-resource languages

Abstract

Recent advances in deep learning have led to the development of highly sophisticated systems with an unquenchable appetite for data. On the other hand, building good deep-learning models for low-resource languages remains a challenging task. This paper focuses on developing a Question Answering dataset for two such languages: Hindi and Marathi. Despite Hindi being the 3rd most spoken language worldwide, with 345 million speakers, and Marathi being the 11th most spoken language globally, with 83.2 million speakers, both languages face limited resources for building efficient Question Answering systems. To tackle the challenge of data scarcity, we have developed a novel approach for translating the SQuAD 2.0 dataset into Hindi and Marathi. We release the largest Question-Answering dataset available for these languages, with each dataset containing 28,000 samples. We evaluate the…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques