Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, A Low-Resource Language

Md Obyedullahil Mamun; Md Adyelullahil Mamun; Arif Ahmad; Md. Imran Hossain Emu

arXiv:2507.18448 · cs.CL · January 13, 2026

TL;DR

This paper applies a transformer model, XLM-RoBERTa-large, to restoring punctuation in unpunctuated Bangla text. The best model reaches 97.1% accuracy on the News test set, 91.2% on the Reference set, and 90.2% on the ASR set, indicating strong generalization in low-resource, noisy scenarios. The datasets and code are publicly available.

Contribution

The paper applies XLM-RoBERTa-large to Bangla punctuation restoration and contributes a large, varied training corpus together with data augmentation techniques aimed at low-resource language processing.

Findings

01. Achieved 97.1% accuracy on News test set

02. Model generalizes well to reference and ASR transcripts

03. Provides publicly available datasets and code for future research

Abstract

Punctuation restoration enhances the readability of text and is critical for post-processing tasks in Automatic Speech Recognition (ASR), especially for low-resource languages like Bangla. In this study, we explore the application of transformer-based models, specifically XLM-RoBERTa-large, to automatically restore punctuation in unpunctuated Bangla text. We focus on predicting four punctuation marks: period, comma, question mark, and exclamation mark across diverse text domains. To address the scarcity of annotated resources, we constructed a large, varied training corpus and applied data augmentation techniques. Our best-performing model, trained with an augmentation factor of alpha = 0.20%, achieves an accuracy of 97.1% on the News test set, 91.2% on the Reference set, and 90.2% on the ASR set. Results show strong generalization to reference and ASR transcripts, demonstrating the…
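Punctuation restoration of this kind is commonly framed as per-token labeling: each word is tagged with the punctuation mark that follows it, or with a "no punctuation" tag. The sketch below illustrates that framing for the four marks named in the abstract. It is an assumption about the setup, not the authors' code: the label names, the `to_tokens_and_labels` and `restore` helpers, and the choice to map both the Bangla danda (।) and the ASCII full stop to one period class are all hypothetical.

```python
# Hypothetical label scheme: the abstract names four marks but not the tag names.
PUNCT_TO_LABEL = {
    "\u0964": "PERIOD",      # Bangla danda (।), the sentence-final mark
    ".": "PERIOD",           # ASCII full stop, mapped to the same class (assumption)
    ",": "COMMA",
    "?": "QUESTION",
    "!": "EXCLAMATION",
}
LABEL_TO_PUNCT = {"PERIOD": "\u0964", "COMMA": ",", "QUESTION": "?", "EXCLAMATION": "!"}
PUNCT_CHARS = "".join(PUNCT_TO_LABEL)


def to_tokens_and_labels(text):
    """Split punctuated text into words and the label of the mark after each word."""
    tokens, labels = [], []
    for raw in text.split():
        word = raw.rstrip(PUNCT_CHARS)
        trailing = raw[len(word):]
        if not word:
            # Standalone punctuation: attach it to the previous unlabeled word.
            if tokens and labels[-1] == "O":
                labels[-1] = PUNCT_TO_LABEL[trailing[0]]
            continue
        tokens.append(word)
        # "O" means no punctuation follows this word.
        labels.append(PUNCT_TO_LABEL[trailing[0]] if trailing else "O")
    return tokens, labels


def restore(tokens, labels):
    """Rebuild punctuated text from words and (predicted) labels."""
    return " ".join(tok + LABEL_TO_PUNCT.get(lab, "") for tok, lab in zip(tokens, labels))


if __name__ == "__main__":
    sentence = "তুমি কেমন আছো? আমি ভালো আছি।"
    toks, labs = to_tokens_and_labels(sentence)
    print(list(zip(toks, labs)))
    print(restore(toks, labs))
```

In such a setup, a model like XLM-RoBERTa-large would take the unpunctuated tokens as input and predict the label sequence, and `restore` would then reinsert the marks. A real pipeline would also need to align word-level labels with subword tokens, which this sketch leaves out.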
