Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, A Low-Resource Language
Md Obyedullahil Mamun, Md Adyelullahil Mamun, Arif Ahmad, Md. Imran Hossain Emu

TL;DR
This paper applies XLM-RoBERTa-large to restore punctuation in unpunctuated Bangla text. The best model reaches 97.1% accuracy on the News test set and generalizes to reference and ASR transcripts (91.2% and 90.2% accuracy); the datasets and code are publicly available.
Contribution
It introduces a transformer-based model for Bangla punctuation restoration, together with a large, varied training corpus and data augmentation techniques that address the scarcity of annotated Bangla resources.
Findings
Achieves 97.1% accuracy on the News test set
Generalizes to reference and ASR transcripts, with 91.2% accuracy on the Reference set and 90.2% on the ASR set
Provides publicly available datasets and code for future research
Abstract
Punctuation restoration enhances the readability of text and is critical for post-processing tasks in Automatic Speech Recognition (ASR), especially for low-resource languages like Bangla. In this study, we explore the application of transformer-based models, specifically XLM-RoBERTa-large, to automatically restore punctuation in unpunctuated Bangla text. We focus on predicting four punctuation marks: period, comma, question mark, and exclamation mark across diverse text domains. To address the scarcity of annotated resources, we constructed a large, varied training corpus and applied data augmentation techniques. Our best-performing model, trained with an augmentation factor of alpha = 0.20%, achieves an accuracy of 97.1% on the News test set, 91.2% on the Reference set, and 90.2% on the ASR set. Results show strong generalization to reference and ASR transcripts, demonstrating the…
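Punctuation restoration of the kind the abstract describes is commonly framed as per-word classification: each word receives a label naming the punctuation mark that follows it, or a "no punctuation" label. The sketch below illustrates that framing only; it is not the authors' code. The label names, the helper functions to_words_and_labels and restore, and the choice to map both the Bangla dari (।) and the Latin full stop to a single PERIOD label are assumptions made for illustration.

```python
# Minimal sketch: turn punctuated text into (word, label) pairs and back.
# Label names and helper functions are hypothetical, not taken from the paper.

PUNCT_TO_LABEL = {
    "।": "PERIOD",       # Bangla dari (sentence-final mark)
    ".": "PERIOD",       # assumption: Latin full stop treated the same way
    ",": "COMMA",
    "?": "QUESTION",
    "!": "EXCLAMATION",
}
LABEL_TO_PUNCT = {
    "PERIOD": "।",
    "COMMA": ",",
    "QUESTION": "?",
    "EXCLAMATION": "!",
}


def to_words_and_labels(text):
    """Strip the four target marks and record which one followed each word."""
    words, labels = [], []
    for token in text.split():
        label = "O"  # "O" means no punctuation after this word
        # Peel trailing marks; the mark closest to the word wins.
        while token and token[-1] in PUNCT_TO_LABEL:
            label = PUNCT_TO_LABEL[token[-1]]
            token = token[:-1]
        if token:
            words.append(token)
            labels.append(label)
        elif words and label != "O":
            # A free-standing mark is attached to the preceding word.
            labels[-1] = label
    return words, labels


def restore(words, labels):
    """Re-insert punctuation from per-word labels (e.g. model predictions)."""
    pieces = [w + LABEL_TO_PUNCT.get(lab, "") for w, lab in zip(words, labels)]
    return " ".join(pieces)


sample = "তুমি কেমন আছো? আমি ভালো আছি।"
sample_words, sample_labels = to_words_and_labels(sample)
print(sample_words)
print(sample_labels)
print(restore(sample_words, sample_labels))
```

In a full pipeline, the unpunctuated words would be the model input and the labels the training targets; a token-classification model would then predict one label per word, and restore would rebuild the punctuated sentence from those predictions.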
