A dataset for translating local Bangla (Sylheti) dialects into standard Bangla
Tabia Tanzin Prama, Mangsura Kabir Oni

TL;DR
This paper introduces a dataset to translate Sylheti dialects into standard Bangla, supporting language preservation and digital use.
Contribution
A new dataset of 5002 Sylheti-Standard Bangla sentence pairs is introduced for NMT and NLP tasks.
Findings
The dataset includes 21,132 unique words and 10,340 clauses across both languages.
It supports NMT and other NLP tasks like text classification and sentiment analysis.
The dataset was collected from diverse sources including newspapers and social media.
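As a rough illustration of how counts such as sentence pairs and unique words could be computed from a parallel corpus of this kind, the sketch below reads tab-separated pairs and tallies whitespace-separated tokens. The two-column TSV layout, the function names `load_pairs` and `vocabulary`, and the placeholder rows are all assumptions made for the example; they are not the dataset's published format or the authors' method, and the paper's own word counts may rely on a different tokenization.

```python
import csv
import io


def load_pairs(handle):
    """Read (sylheti, standard_bangla) pairs from a two-column TSV stream.

    The column order and the TSV layout are assumed for illustration only.
    """
    reader = csv.reader(handle, delimiter="\t")
    return [(row[0].strip(), row[1].strip()) for row in reader if len(row) >= 2]


def vocabulary(sentences):
    """Return the set of unique whitespace-separated tokens."""
    return {token for sentence in sentences for token in sentence.split()}


# Placeholder rows standing in for real data (not taken from the dataset).
sample = io.StringIO("src_a src_b\ttgt_a tgt_b\nsrc_a src_c\ttgt_a tgt_c\n")

pairs = load_pairs(sample)
source_vocab = vocabulary(src for src, _ in pairs)
target_vocab = vocabulary(tgt for _, tgt in pairs)

stats = {
    "pairs": len(pairs),
    "unique_words": len(source_vocab | target_vocab),
}
print(stats)
```

To run the same tally on a real file, the `io.StringIO` sample can be swapped for a file opened with `encoding="utf-8"`, which matters for Bangla-script text.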
Abstract
Sylheti is a language spoken by about 11 million people worldwide. It's mostly spoken in northeastern Bangladesh and southern Assam, India, and by people living in other countries who originally came from these regions. Translating Sylheti dialects into Standard Bangla is essential to ensure effective communication across the country and internationally. This article introduces a collection of paired sentences, one in the Sylheti dialect and the other in Standard Bangla. It was created to enhance Neural Machine Translation (NMT) between the two languages. Sylheti is a language with a rich cultural heritage, known for its unique vocabulary, music, and folklore. However, it has been largely absent from formal written materials and digital resources, leaving a gap in its linguistic representation. To bridge this gap, 5002 sentence pairs were carefully collected from various sources, such…
Taxonomy
Topics: Natural Language Processing Techniques · South Asian Studies and Conflicts · Authorship Attribution and Profiling
