MahaParaphrase: A Marathi Paraphrase Detection Corpus and BERT-based Models

Suramya Jadhav; Abhay Shanbhag; Amogh Thakurdesai; Ridhima Sinare; Ananya Joshi; Raviraj Joshi

arXiv:2508.17444·cs.CL·August 26, 2025

TL;DR

This paper introduces MahaParaphrase, a high-quality Marathi paraphrase dataset of 8,000 human-annotated sentence pairs, and evaluates BERT-based models on it, addressing the scarcity of NLP resources for low-resource Indic languages.

Contribution

The work provides a human-annotated Marathi paraphrase corpus of 8,000 sentence pairs and benchmarks BERT-based models on it, supporting NLP research for low-resource Indic languages.

Findings

01 BERT models achieve promising paraphrase detection accuracy.

02 The dataset facilitates future NLP research in Marathi.

03 Publicly available dataset and models support further development.

Abstract

Paraphrases are a vital tool for language understanding tasks such as question answering, style transfer, semantic parsing, and data augmentation. Indic languages are challenging for natural language processing (NLP) because of their rich morphological and syntactic variation, diverse scripts, and limited availability of annotated data. In this work, we present the L3Cube-MahaParaphrase Dataset, a high-quality paraphrase corpus for Marathi, a low-resource Indic language, consisting of 8,000 sentence pairs, each annotated by human experts as either Paraphrase (P) or Non-paraphrase (NP). We also present the results of standard transformer-based BERT models on this dataset. The dataset and model are publicly shared at https://github.com/l3cube-pune/MarathiNLP
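To make the P/NP labeling scheme concrete, the following is a minimal sketch of how a labeled sentence pair might be represented and how predictions could be scored. The `SentencePair` class, its field names, and the helper functions are hypothetical and are not taken from the paper or its repository. The abstract does not say which metrics the authors report, so accuracy and F1 for the P class are assumptions here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SentencePair:
    """Hypothetical record layout; the released files may use other field names."""
    sentence_a: str
    sentence_b: str
    label: str  # "P" (paraphrase) or "NP" (non-paraphrase)


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of pairs whose predicted label matches the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_for_label(gold: Sequence[str], pred: Sequence[str], positive: str = "P") -> float:
    """F1 score treating `positive` as the target class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels for illustration only; these are not results from the paper.
gold_labels = ["P", "NP", "P", "NP"]
pred_labels = ["P", "P", "P", "NP"]
toy_accuracy = accuracy(gold_labels, pred_labels)
toy_f1 = f1_for_label(gold_labels, pred_labels)
```

Reporting a per-class score alongside accuracy is a common choice for binary tasks, since accuracy alone can hide a model that favors one label.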

Code & Models

Models

l3cube-pune/marathi-paraphrase-detection-bert
model · 1 download

Datasets

l3cube-pune/MahaParaphrase
dataset · 8 downloads
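As a hedged illustration of how the listed model might be used, the sketch below scores one sentence pair with the Hugging Face `transformers` library. It assumes the checkpoint loads as a sequence-classification model and that its config carries a meaningful `id2label` mapping; neither is confirmed by this page. The function names `argmax` and `classify_pair` are hypothetical.

```python
from typing import Sequence

# Model id as it appears in the listing above.
MODEL_ID = "l3cube-pune/marathi-paraphrase-detection-bert"


def argmax(scores: Sequence[float]) -> int:
    """Index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def classify_pair(sentence_a: str, sentence_b: str, model_id: str = MODEL_ID) -> str:
    """Return the label name the model assigns to a sentence pair.

    Assumption: the checkpoint is compatible with
    AutoModelForSequenceClassification. Calling this downloads the model.
    """
    # Imported lazily so the helper above works without the heavy dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # Encode both sentences as a single paired input.
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()

    # Assumption: id2label maps class indices to the P / NP label names.
    return model.config.id2label[argmax(logits)]
```

Reading the label name from `model.config.id2label` avoids hard-coding which class index means Paraphrase, since that ordering is not stated here.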
