PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, and Azadeh Shakery

TL;DR
This paper introduces PersianPunc, a dataset of 17 million samples for Persian punctuation restoration, together with a lightweight ParsBERT-based model that reaches a macro F1 score of 91.33% on the test set. It also shows that large language models, while capable of the task, tend to over-correct and cost substantially more to run.
Contribution
The paper presents a new large-scale dataset and a lightweight BERT-based approach for Persian punctuation restoration, addressing a gap in Persian NLP resources and methods.
Findings
Achieved a macro F1 score of 91.33% on the test set.
Demonstrated the efficiency of the BERT-based model for real-time applications.
Compared the model against large language models, which tend to over-correct (introducing edits beyond punctuation insertion) and require substantially more computation.
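For readers unfamiliar with the headline metric, macro F1 is the unweighted mean of per-class F1 scores, so rare punctuation marks count as much as frequent ones. The sketch below is a minimal pure-Python illustration, not the authors' evaluation code. The label names (`O`, `COMMA`, `PERIOD`) are hypothetical, and whether the paper includes the "no punctuation" class in the average is not stated in this summary.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        # Count true positives, false positives, and false negatives for this class.
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels: one tag per token, "O" meaning no punctuation follows.
gold = ["O", "COMMA", "O", "PERIOD"]
pred = ["O", "O", "O", "PERIOD"]
print(round(macro_f1(gold, pred), 4))
```

Because every class carries equal weight, a model that misses all commas is penalized heavily even when commas are a small share of the tokens.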
Abstract
Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our…
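The token-level sequence labeling formulation mentioned in the abstract can be illustrated with a small sketch: each word receives a label naming the punctuation mark that follows it, or `O` if none does. This is an assumption-laden illustration, not the authors' preprocessing. The label set, the function name `to_labeled_tokens`, and whitespace tokenization are all hypothetical; the paper's actual classes and tokenizer are not given in this summary.

```python
# Hypothetical label set; the paper's actual classes are not listed here.
PUNCT_TO_LABEL = {
    "\u060c": "COMMA",     # Persian/Arabic comma
    ".": "PERIOD",
    "\u061f": "QUESTION",  # Persian/Arabic question mark
    ":": "COLON",
}


def to_labeled_tokens(text):
    """Turn punctuated text into (word, label) training pairs.

    Each word is labeled with the mark that immediately follows it,
    or "O" when no mark follows. Words are split on whitespace only.
    """
    pairs = []
    for raw in text.split():
        word, label = raw, "O"
        # Peel one trailing punctuation mark off the word, if present.
        if word[-1] in PUNCT_TO_LABEL:
            label = PUNCT_TO_LABEL[word[-1]]
            word = word[:-1]
        if word:
            pairs.append((word, label))
        elif pairs and label != "O":
            # A standalone mark is attached to the preceding word.
            pairs[-1] = (pairs[-1][0], label)
    return pairs


# Usage: the unpunctuated words become model input, the labels the targets.
sample = "\u0633\u0644\u0627\u0645\u060c \u062e\u0648\u0628\u06cc\u061f"
for word, label in to_labeled_tokens(sample):
    print(word, label)
```

A fine-tuned encoder then predicts one such label per token, which is why this formulation cannot alter the words themselves, in contrast to the over-correction the abstract attributes to large language models.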
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Phonetics and Phonology Research
