# Dataset on multiregional variations of Bangla language (BD-Dialect)

**Authors:** Anika Rahman, Nafesha Hasan Muna, Masuma Saba Prity

PMC · DOI: 10.1016/j.dib.2026.112654 · Data in Brief · 2026-03-03

## TL;DR

The BD-Dialect dataset provides aligned translations of Bangla and its regional dialects to support linguistic and NLP research.

## Contribution

The paper introduces BD-Dialect, a multiregional Bangla dialect dataset that aligns Standard Bangla with five regional dialects and English, validated by three native speakers per dialect.

## Key findings

- The dataset includes Standard Bangla and five dialects with English translations for cross-linguistic comparison.
- Two CSV files (words and clauses), each with 950 aligned entries in seven columns, were validated by three native speakers per dialect for linguistic accuracy.
- The dataset is publicly available for dialect recognition and translation system development.
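
For dialect recognition, the aligned (wide) layout would typically be reshaped into one labeled example per cell. The following is a minimal sketch of that reshaping, not the authors' preprocessing script: the function name `to_labeled_examples` is hypothetical, the column names are taken from the abstract and may be spelled differently in the released files, and the row content is placeholder text rather than dataset entries.

```python
# Column names follow the abstract; exact header spelling is an assumption.
DIALECT_COLUMNS = ["Noakhali", "Sylhet", "Chittagong", "Rajshahi", "Mymensingh"]


def to_labeled_examples(rows, include_standard=True):
    """Turn aligned rows (dicts) into (text, variety_label) pairs."""
    columns = (["Standard Bangla"] if include_standard else []) + DIALECT_COLUMNS
    examples = []
    for row in rows:
        for col in columns:
            text = (row.get(col) or "").strip()
            if text:  # skip empty cells instead of emitting blank examples
                examples.append((text, col))
    return examples


# Placeholder row only; these strings are not taken from the dataset.
demo_rows = [{
    "Standard Bangla": "std_1", "Noakhali": "noa_1", "Sylhet": "syl_1",
    "Chittagong": "chi_1", "Rajshahi": "raj_1", "Mymensingh": "",
    "English Translation": "eng_1",
}]
examples = to_labeled_examples(demo_rows)
```

Each pair can then feed a text classifier. The English column is left out here because it is a translation aid rather than a dialect label.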

## Abstract

The BD-Dialect dataset presents a comprehensive multiregional linguistic resource for Bangla and its major regional dialects, designed to support research in computational linguistics, dialectology, and natural language processing (NLP). The dataset includes aligned translations across Standard Bangla and five major dialects—Noakhali, Sylhet, Chittagong, Rajshahi, and Mymensingh—alongside English translations to facilitate cross-linguistic comparison. Data were collected from two sources: native speaker interviews and regional literature, ensuring both lexical richness and regional authenticity. The final dataset consists of two CSV files (words and clauses), each containing 950 aligned entries structured under seven columns: Standard Bangla, Noakhali, Sylhet, Chittagong, Rajshahi, Mymensingh, and English Translation. Preprocessing and formatting were conducted using Python in Google Colab, followed by validation by three native speakers per dialect to ensure linguistic accuracy and consistency. The dataset and preprocessing scripts are publicly available in Mendeley Data under DOI: 10.17632/k769s4vk5z.2, providing an open-access resource for developing dialect recognition models, translation systems, and comparative linguistic research in Bangla.
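
The seven-column layout described above lends itself to a quick structural check after download. The sketch below uses only Python's standard library and makes several assumptions: the header strings match the column names in the abstract exactly, the helper names `load_aligned_rows` and `incomplete_rows` are hypothetical, and the sample rows are placeholders, not dataset entries.

```python
import csv
import io

# Column names as listed in the abstract; the exact header spelling in the
# released CSV files is an assumption and may differ.
EXPECTED_COLUMNS = [
    "Standard Bangla", "Noakhali", "Sylhet", "Chittagong",
    "Rajshahi", "Mymensingh", "English Translation",
]


def load_aligned_rows(handle):
    """Read aligned entries from an open CSV handle and check the header."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return list(reader)


def incomplete_rows(rows):
    """Return the indices of rows that have at least one empty cell."""
    return [
        i for i, row in enumerate(rows)
        if any(not (row[col] or "").strip() for col in EXPECTED_COLUMNS)
    ]


# Placeholder content only; these are not entries from the dataset.
sample = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "std_1,noa_1,syl_1,chi_1,raj_1,mym_1,eng_1\n"
    "std_2,noa_2,,chi_2,raj_2,mym_2,eng_2\n"
)
rows = load_aligned_rows(sample)
missing = incomplete_rows(rows)
print(len(rows), missing)
```

For the released files, the same functions could be called on a handle opened with `encoding="utf-8"` and `newline=""`. If the files match the description in the abstract, each should yield 950 rows.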

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12996979/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12996979/full.md

## References

11 references — full list in the complete paper: https://tomesphere.com/paper/PMC12996979/full.md

---
Source: https://tomesphere.com/paper/PMC12996979