# A dataset for translating local Bangla (Sylheti) dialects into standard Bangla

**Authors:** Tabia Tanzin Prama, Mangsura Kabir Oni

PMC · DOI: 10.1016/j.dib.2026.112576 · 2026-02-13

## TL;DR

This paper introduces a parallel dataset of Sylheti and Standard Bangla sentences for training translation systems, supporting language preservation and digital use.

## Contribution

A new dataset of 5002 Sylheti-Standard Bangla sentence pairs is introduced for neural machine translation (NMT) and other NLP tasks.

## Key findings

- The dataset includes 21,132 unique words (9729 Sylheti and 11,403 Standard Bangla) and 10,340 clauses (5069 Sylheti and 5271 Standard Bangla).
- It supports NMT and other NLP tasks like text classification and sentiment analysis.
- The dataset was collected between December 2023 and March 2025 from Bangladeshi newspapers, social media platforms, and contributions from native Sylheti speakers.

## Abstract

Sylheti is spoken by about 11 million people worldwide, mostly in northeastern Bangladesh and southern Assam, India, and by people living in other countries who originally came from these regions. Translating Sylheti dialects into Standard Bangla is essential for effective communication across the country and internationally. This article introduces a collection of paired sentences, one in the Sylheti dialect and the other in Standard Bangla, created to enhance Neural Machine Translation (NMT) between the two. Sylheti has a rich cultural heritage and is known for its unique vocabulary, music, and folklore. However, it has been largely absent from formal written materials and digital resources, leaving a gap in its linguistic representation. To bridge this gap, 5002 sentence pairs were carefully collected from various sources, such as Bangladeshi newspapers, social media platforms, and voluntary comments and contributions from native Sylheti speakers. The dataset, collected between December 2023 and March 2025, contains diverse linguistic elements: 21,132 unique words (9729 Sylheti and 11,403 Standard Bangla), 10,340 clauses (5069 Sylheti and 5271 Standard Bangla), and 10,004 sentences. The collection is valuable not only for machine translation but also for other natural language processing tasks, such as text classification, named entity recognition, and sentiment analysis. It also enables the development of further technologies for Sylheti, such as text-to-speech systems and language models. This resource is a significant step towards better understanding and using the Sylheti language in the digital world.
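
The reported totals are sums of per-language counts. As a minimal sketch, the snippet below checks that arithmetic and shows one way such statistics could be computed from a list of sentence pairs. The function name `corpus_stats`, the whitespace tokenization, and the toy pairs are assumptions made for illustration; the summary does not say how the authors counted words, and the placeholder tokens are not real Sylheti or Bangla text.

```python
def corpus_stats(pairs):
    """Summarize a list of (source, target) sentence pairs.

    Assumption: words are separated by whitespace. The paper's own
    counting method may differ.
    """
    src_vocab = set()
    tgt_vocab = set()
    for src, tgt in pairs:
        src_vocab.update(src.split())
        tgt_vocab.update(tgt.split())
    return {
        "pairs": len(pairs),
        "sentences": 2 * len(pairs),  # one sentence per side of each pair
        "src_unique": len(src_vocab),
        "tgt_unique": len(tgt_vocab),
        "total_unique": len(src_vocab) + len(tgt_vocab),
    }


# Consistency check on the figures reported in the abstract.
assert 9729 + 11403 == 21132   # unique words: Sylheti + Standard Bangla
assert 5069 + 5271 == 10340    # clauses: Sylheti + Standard Bangla
assert 2 * 5002 == 10004       # sentences: two per pair

# Placeholder tokens only, standing in for real sentence pairs.
toy_pairs = [
    ("sa sb sc", "ba bb bc bd"),
    ("sa sd", "ba be"),
]
stats = corpus_stats(toy_pairs)
print(stats)
```

Note that a total computed this way counts a word form twice if it appears in both languages, which matches how the reported 21,132 is the sum of the two per-language figures.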

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12966678/full.md

---
Source: https://tomesphere.com/paper/PMC12966678