# BanglaRegionalTextCorpus: A curated dataset for four regional Bangla dialects with standard Bangla and English translation

**Authors:** Md. Tofael Ahmed, Zannatul Mawa Koli, Azmain Mahtab Rahat, Taslima Akhter, Umme Ayman

PMC · DOI: 10.1016/j.dib.2026.112585 · Data in Brief · 2026-02-11

## TL;DR

The paper introduces a dataset for four Bangla dialects with translations, aiding NLP research and dialect preservation.

## Contribution

A curated dataset of four regional Bangla dialects with standard Bangla and English translations is presented.

## Key findings

- The dataset contains 4653 manually validated sentences from multiple sources.
- It supports dialect identification, translation, and sociolinguistic research.
- The corpus enables inclusive NLP models for low-resourced languages.

## Abstract

The BanglaRegionalTextCorpus is introduced as a curated dataset documenting four regional Bangla dialects (Rangpur, Barisal, Narail, and Khulna), along with their corresponding Standard Bangla and English translations. The corpus contains 4653 manually validated sentences, collected from community interactions, field recordings, and publicly available digital sources. Rigorous pre-processing steps, including duplicate removal, normalization, and linguistic validation by native speakers, were employed to ensure data accuracy and consistency. This dataset serves as a comprehensive resource for dialect identification, machine translation, and text classification, as well as for research in sociolinguistics and regional language variation. By capturing phonetic, lexical, and syntactic distinctions across four dialects, it enables the development of inclusive and context-aware NLP models for low-resourced languages. Furthermore, the dataset supports comparative linguistic studies between regional and standardized Bangla, contributing to the preservation and computational representation of dialectal diversity. The BanglaRegionalTextCorpus provides a benchmark resource for future research in Bangla NLP, promoting collaboration, cultural preservation, and equitable language technology development across diverse linguistic communities.
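
The abstract names normalization and duplicate removal as pre-processing steps but does not describe how they were carried out. The following is a minimal sketch of what such steps could look like, not the authors' pipeline. It assumes Unicode NFC normalization plus whitespace collapsing as the normalization step, and treats two rows as duplicates when they share a dialect and a normalized sentence. The function names and the `(dialect, sentence)` row layout are hypothetical, and the sample sentences are placeholders rather than corpus data.

```python
import unicodedata


def normalize_sentence(text: str) -> str:
    """Apply Unicode NFC normalization and collapse whitespace runs.

    Assumption: the paper does not state which normalization it used.
    """
    composed = unicodedata.normalize("NFC", text)
    # str.split() with no arguments splits on any whitespace run and
    # discards leading and trailing whitespace.
    return " ".join(composed.split())


def deduplicate(rows):
    """Keep the first occurrence of each (dialect, normalized sentence) pair.

    `rows` is an iterable of (dialect, sentence) tuples (hypothetical layout).
    Returns a list of (dialect, normalized sentence) tuples in input order.
    """
    seen = set()
    unique = []
    for dialect, sentence in rows:
        key = (dialect, normalize_sentence(sentence))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Placeholder rows, not taken from the corpus.
sample = [
    ("Rangpur", "example  sentence one"),
    ("Rangpur", "example sentence one "),
    ("Khulna", "example sentence one"),
]
cleaned = deduplicate(sample)
```

In this sketch the same sentence is kept once per dialect, since identical surface forms in two dialects may both be valid entries. Linguistic validation by native speakers, the third step the abstract mentions, is a manual process and is not modeled here.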

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12934226/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12934226/full.md

## References

6 references — full list in the complete paper: https://tomesphere.com/paper/PMC12934226/full.md

---
Source: https://tomesphere.com/paper/PMC12934226