Joint Persian Word Segmentation Correction and Zero-Width Non-Joiner Recognition Using BERT
Ehsan Doostmohammadi, Minoo Nassajian, Adel Rahimi

TL;DR
This paper presents a joint approach using BERT for Persian word segmentation correction and ZWNJ recognition, reaching a macro-averaged F1-score of 92.40% on a corpus of 500 difficult sentences.
Contribution
It treats Persian word segmentation correction and ZWNJ recognition as a single joint sequence labeling problem, solved with BERT.
Findings
Achieved a macro-averaged F1-score of 92.40% on a carefully collected corpus of 500 sentences
Effectively handled complex and difficult Persian sentences
Demonstrated the effectiveness of BERT in joint segmentation and ZWNJ recognition
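For readers unfamiliar with the metric, a macro-averaged F1-score is the unweighted mean of the per-class F1-scores, so rare classes count as much as frequent ones. The sketch below shows the arithmetic on toy data. The label names (NONE, SPACE, ZWNJ) and the example sequences are assumptions for illustration only; they are not taken from the paper, and the set of classes the authors average over may differ.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all classes seen in gold or pred."""
    classes = sorted(set(gold) | set(pred))
    scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data with hypothetical labels, not results from the paper.
gold = ["NONE", "NONE", "SPACE", "ZWNJ"]
pred = ["NONE", "SPACE", "SPACE", "ZWNJ"]
score = macro_f1(gold, pred)  # per-class F1: NONE 2/3, SPACE 2/3, ZWNJ 1
```

Here one NONE character is mislabeled as SPACE, which lowers the F1 of both classes to 2/3 while ZWNJ stays at 1, giving a macro average of 7/9.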
Abstract
Words are properly segmented in the Persian writing system; in practice, however, these writing rules are often neglected, resulting in single words being written disjointedly and multiple words written without any white spaces between them. This paper addresses the problems of word segmentation and zero-width non-joiner (ZWNJ) recognition in Persian, which we approach jointly as a sequence labeling problem. We achieved a macro-averaged F1-score of 92.40% on a carefully collected corpus of 500 sentences with a high level of difficulty.
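One way to picture the joint sequence labeling formulation is to strip every space and ZWNJ from correctly written text and tag each remaining character with what should follow it. The sketch below is a minimal illustration under that assumption: the three-tag scheme (NONE, SPACE, ZWNJ) and the function names are hypothetical and are not confirmed to be the paper's tag set. A model such as BERT would predict the tags; this code only covers building and applying them.

```python
ZWNJ = "\u200c"  # zero-width non-joiner


def to_labels(gold_text):
    """Drop spaces and ZWNJs; tag each remaining character by what follows it."""
    chars, labels = [], []
    for ch in gold_text:
        if ch == " ":
            if labels:
                labels[-1] = "SPACE"
        elif ch == ZWNJ:
            if labels:
                labels[-1] = "ZWNJ"
        else:
            chars.append(ch)
            labels.append("NONE")
    return chars, labels


def apply_labels(chars, labels):
    """Rebuild text by inserting a space or ZWNJ after each tagged character."""
    out = []
    for ch, lab in zip(chars, labels):
        out.append(ch)
        if lab == "SPACE":
            out.append(" ")
        elif lab == "ZWNJ":
            out.append(ZWNJ)
    return "".join(out)


# "می‌روم به خانه": one ZWNJ inside the first word, then two spaces.
gold = "می" + ZWNJ + "روم" + " " + "به" + " " + "خانه"
chars, labels = to_labels(gold)
restored = apply_labels(chars, labels)
```

Because both boundary types share one tag per character, a single labeling pass corrects spacing and restores ZWNJs at the same time, which is the sense in which the two tasks are handled jointly.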
