MahaParaphrase: A Marathi Paraphrase Detection Corpus and BERT-based Models
Suramya Jadhav, Abhay Shanbhag, Amogh Thakurdesai, Ridhima Sinare, Ananya Joshi, Raviraj Joshi

TL;DR
This paper introduces MahaParaphrase, a high-quality Marathi paraphrase dataset, and evaluates BERT-based models on it, addressing the scarcity of NLP resources for low-resource Indic languages.
Contribution
The work provides a human-annotated Marathi paraphrase corpus of 8,000 sentence pairs and benchmarks BERT-based models on it, advancing NLP research for low-resource Indic languages.
Findings
BERT models achieve promising paraphrase detection accuracy.
The dataset facilitates future NLP research in Marathi.
Publicly available dataset and models support further development.
Abstract
Paraphrases are a vital tool to assist language understanding tasks such as question answering, style transfer, semantic parsing, and data augmentation. Indic languages are challenging for natural language processing (NLP) due to their rich morphological and syntactic variations, diverse scripts, and limited availability of annotated data. In this work, we present the L3Cube-MahaParaphrase Dataset, a high-quality paraphrase corpus for Marathi, a low-resource Indic language, consisting of 8,000 sentence pairs, each annotated by human experts as either Paraphrase (P) or Non-paraphrase (NP). We also present the results of standard transformer-based BERT models on this dataset. The dataset and models are publicly shared at https://github.com/l3cube-pune/MarathiNLP
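As an illustration of the task setup described in the abstract, the sketch below shows one way a sentence pair labeled P or NP could be represented, and how predictions could be scored against the human annotations. It is a minimal sketch, not the authors' code: the class name SentencePair, its field names, and the helper functions are hypothetical, the choice of accuracy and F1 as metrics is an assumption, and the example labels are made-up placeholders rather than data from the corpus.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SentencePair:
    """One annotated example (hypothetical layout, not the released file format)."""
    first: str
    second: str
    label: str  # "P" (paraphrase) or "NP" (non-paraphrase)


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of pairs whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty set of pairs")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def f1_for_label(gold: Sequence[str], predicted: Sequence[str],
                 positive: str = "P") -> float:
    """F1 score treating `positive` as the target class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Placeholder labels for four pairs, for illustration only.
gold_labels = ["P", "NP", "P", "NP"]
predicted_labels = ["P", "P", "NP", "NP"]
print(accuracy(gold_labels, predicted_labels))
print(f1_for_label(gold_labels, predicted_labels))
```

In practice the predicted labels would come from a classifier, such as a BERT-based model that takes both sentences as a single paired input; the scoring functions above are independent of how the predictions are produced.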
