Automatic Standardization of Colloquial Persian
Mohammad Sadegh Rasooli, Farzane Bakhtyari, Fatemeh Shafiei, Mahsa Ravanbakhsh, Chris Callison-Burch

TL;DR
This paper presents a sequence-to-sequence approach for converting colloquial Persian text to its standard form. It contributes a method for generating artificial training data and a publicly available evaluation dataset, and it reports gains on downstream machine translation.
Contribution
The paper introduces an algorithm for generating artificial parallel colloquial-to-standard data, a sequence-to-sequence standardization model trained on that data, and a new evaluation dataset of 1912 annotated sentences.
Findings
Higher intrinsic BLEU score than an off-the-shelf rule-based standardization model (62.8 vs 61.7; the unmodified colloquial text scores 46.4)
Improved English-to-Persian machine translation, with BLEU score gains of 1.4 and 0.8
Effective standardization enhances NLP applications for colloquial Persian
Abstract
The Iranian Persian language has two varieties: standard and colloquial. Most natural language processing tools for Persian assume that the text is in standard form; this assumption is wrong in many real applications, especially web content. This paper describes a simple and effective standardization approach based on sequence-to-sequence translation. We design an algorithm for generating artificial parallel colloquial-to-standard data for learning a sequence-to-sequence model. Moreover, we annotate a publicly available evaluation dataset consisting of 1912 sentences from a diverse set of domains. Our intrinsic evaluation shows a BLEU score of 62.8, compared with 61.7 for an off-the-shelf rule-based standardization model; the original, unstandardized text has a BLEU score of 46.4. We also show that our model improves English-to-Persian machine translation in scenarios for which the training…
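The abstract mentions generating artificial parallel colloquial-to-standard data but does not describe the algorithm here. As a rough illustration of the general idea only, the sketch below starts from standard sentences and rewrites some words into colloquial variants, yielding (colloquial, standard) training pairs. The substitution table, the function names, and the probability parameter p are all hypothetical and are not taken from the paper.

```python
import random

# Hypothetical standard -> colloquial word substitutions, for illustration
# only. This is NOT the lexicon or rule set used in the paper.
COLLOQUIAL_FORMS = {
    "خانه": ["خونه"],
    "نان": ["نون"],
    "را": ["رو"],
    "آن": ["اون"],
}


def colloquialize(sentence, rng, p=0.8):
    """Replace each known standard token with a colloquial variant
    with probability p; leave every other token unchanged."""
    out = []
    for token in sentence.split():
        variants = COLLOQUIAL_FORMS.get(token)
        if variants and rng.random() < p:
            out.append(rng.choice(variants))
        else:
            out.append(token)
    return " ".join(out)


def make_parallel(standard_sentences, seed=0, p=0.8):
    """Return (artificial colloquial source, standard target) pairs."""
    rng = random.Random(seed)
    return [(colloquialize(s, rng, p), s) for s in standard_sentences]


if __name__ == "__main__":
    for source, target in make_parallel(["من نان را به خانه بردم"], p=1.0):
        print(source, "->", target)
```

Pairs produced this way could serve as training data for any sequence-to-sequence model, with the colloquial side as input and the standard side as output. Word-level substitution is a deliberate simplification; a realistic generator would also need to handle morphology and word order.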
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
