PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration

Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, and Azadeh Shakery

arXiv:2603.05314 · cs.CL · March 6, 2026

TL;DR

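This paper introduces PersianPunc, a large-scale dataset for Persian punctuation restoration, together with a fine-tuned ParsBERT model that reaches a macro F1 score of 91.33% on the test set. It also shows that large language models are a poor fit for this task: they tend to over-correct the input and need substantially more computation.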

Contribution

The paper presents a new large-scale dataset and a lightweight BERT-based approach for Persian punctuation restoration, addressing a gap in Persian NLP resources and methods.

Findings

01. Achieved a macro F1 score of 91.33% on the test set.

02. Demonstrated that the lightweight BERT-based model is efficient enough for real-time applications.

03. Compared the model with large language models, which over-correct (introducing edits beyond punctuation insertion) and require substantially more computation.
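
For readers unfamiliar with the headline metric, the following is a minimal sketch of how a macro F1 score is computed over token-level labels. The label names (`O`, `COMMA`, `PERIOD`) and the choice to include the no-punctuation class in the average are assumptions made for illustration; the listing above does not say which classes the paper averages over.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it did not belong
            fn[g] += 1  # missed the true label g

    scores = []
    for label in sorted(set(gold) | set(pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        scores.append(2 * precision * recall / pr_sum if pr_sum else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, one per token ("O" = no punctuation after the token).
gold = ["O", "O", "COMMA", "PERIOD"]
pred = ["O", "COMMA", "COMMA", "PERIOD"]
score = macro_f1(gold, pred)
```

Because every class carries equal weight, rare punctuation marks influence a macro score as much as frequent ones, which is why it is a common choice for imbalanced labeling tasks like this one.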

Abstract

Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our…
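
The token-level sequence labeling formulation in the abstract can be illustrated with a small sketch: each word receives a label naming the punctuation mark that follows it, or `O` if none does. The punctuation inventory, the label names, and both helper functions below are hypothetical and are not taken from the paper. The second helper shows one simple way to detect the over-correction the abstract describes, by checking that a restored sentence differs from its source only in punctuation.

```python
# Assumed punctuation inventory and label names (illustrative only).
PUNCT_TO_LABEL = {
    "،": "COMMA",        # Persian comma (U+060C)
    ".": "PERIOD",
    "؟": "QUESTION",     # Persian question mark (U+061F)
    "!": "EXCLAMATION",
    ":": "COLON",
}


def to_labeled_tokens(text):
    """Turn punctuated text into parallel lists of words and labels.

    Each label names the mark that directly follows the word, or "O" if none.
    """
    words, labels = [], []
    for raw in text.split():
        word, label = raw, "O"
        # Strip trailing marks; the mark closest to the word wins.
        while word and word[-1] in PUNCT_TO_LABEL:
            label = PUNCT_TO_LABEL[word[-1]]
            word = word[:-1]
        if word:
            words.append(word)
            labels.append(label)
        elif words:
            # A free-standing mark is attached to the preceding word.
            labels[-1] = label
    return words, labels


def only_punctuation_changed(source, restored):
    """True if `restored` has the same words as `source`, ignoring punctuation."""
    return to_labeled_tokens(source)[0] == to_labeled_tokens(restored)[0]


words, labels = to_labeled_tokens("سلام، حال شما چطور است؟")
# labels: ["COMMA", "O", "O", "O", "QUESTION"]
```

A token classifier trained on such pairs can only predict one label per input word, so by construction it cannot rewrite or drop words. A generative model has no such constraint, which is why a check like `only_punctuation_changed` is useful when comparing against large language models.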

Code & Models

Datasets

MohammadJRanjbar/PersianPunc
dataset · 8 downloads

Videos

PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration· underline

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Phonetics and Phonology Research