BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset

Ajwad Akil; Najrin Sultana; Abhik Bhattacharjee; Rifat Shahriyar

arXiv:2210.05109 · cs.CL · October 12, 2022 · 1 citation

TL;DR

BanglaParaphrase is a high-quality synthetic dataset for Bangla paraphrasing, designed to improve NLP resources for the low-resource Bangla language by ensuring semantic accuracy and diversity.

Contribution

The paper introduces a novel filtering pipeline to create a high-quality Bangla paraphrase dataset, addressing resource scarcity in Bangla NLP.

Findings

01 The dataset improves model performance on Bangla NLP tasks.

02 The quality of the synthetic data is validated through comparative analysis.

03 Models trained on BanglaParaphrase outperform models trained on existing datasets.

Abstract

In this work, we present BanglaParaphrase, a high-quality synthetic Bangla paraphrase dataset curated by a novel filtering pipeline. We aim to take a step towards alleviating the low-resource status of the Bangla language in the NLP domain through the introduction of BanglaParaphrase, which ensures quality by preserving both semantics and diversity, making it particularly useful for enhancing other Bangla datasets. We present a detailed comparative analysis of our dataset, and of models trained on it, against existing work to establish the viability of our synthetic paraphrase data generation pipeline. We are making the dataset and models publicly available at https://github.com/csebuetnlp/banglaparaphrase to further the state of Bangla NLP.
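
The abstract says the pipeline keeps paraphrases that preserve semantics while remaining diverse, but this page does not say which metrics or thresholds are used. The following is only a minimal sketch of that idea under stated assumptions: `lexical_diversity` is a PINC-style n-gram novelty score chosen here for illustration, `similarity` is a hypothetical caller-supplied semantic scorer returning a value in [0, 1], and both threshold defaults are made up.

```python
from typing import Callable, Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def lexical_diversity(source: str, candidate: str, max_n: int = 4) -> float:
    """PINC-style score: mean fraction of candidate n-grams absent from the source.

    0.0 means the candidate reuses only source n-grams; 1.0 means no overlap.
    Whitespace tokenization is an assumption made for this sketch.
    """
    src, cand = source.split(), candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            continue  # candidate is too short to contain any n-gram of this size
        overlap = len(cand_ngrams & ngrams(src, n))
        scores.append(1.0 - overlap / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    similarity: Callable[[str, str], float],  # hypothetical semantic scorer
    min_similarity: float = 0.9,              # illustrative threshold, not from the paper
    min_diversity: float = 0.5,               # illustrative threshold, not from the paper
) -> List[Tuple[str, str]]:
    """Keep pairs that are semantically close AND lexically different."""
    kept = []
    for source, candidate in pairs:
        if similarity(source, candidate) < min_similarity:
            continue  # meaning drifted too far from the source
        if lexical_diversity(source, candidate) < min_diversity:
            continue  # near copy of the source, adds little diversity
        kept.append((source, candidate))
    return kept
```

In practice `similarity` would wrap a sentence-embedding or learned semantic metric. Requiring both checks reflects the trade-off the abstract names: a verbatim copy passes the semantic check but fails the diversity check, while an unrelated sentence does the opposite.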

Code & Models

Repositories

csebuetnlp/banglaparaphrase (PyTorch, official)

Models

csebuetnlp/banglat5_banglaparaphrase (Hugging Face model, 108 downloads)

Datasets

csebuetnlp/BanglaParaphrase (dataset, 62 downloads)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques