Challenges in Persian Electronic Text Analysis
Behrang QasemiZadeh, Saeed Rahimi, Mehdi Safaee Ghalati

TL;DR
This paper discusses the unique challenges faced in analyzing Persian electronic texts, focusing on transcription and encoding issues crucial for developing accurate Farsi corpora.
Contribution
It highlights specific problems in Farsi text analysis related to transcription and encoding, emphasizing their importance in corpus development.
Findings
Identification of transcription challenges in Farsi texts
Highlighting encoding issues affecting text analysis
Emphasizing the importance of standardization in Farsi corpora
Abstract
Farsi, also known as Persian, is the official language of Iran and Tajikistan and one of the two main languages spoken in Afghanistan. Farsi enjoys a unified Arabic script as its writing system. In this paper we briefly introduce the writing standards of Farsi and highlight the problems one faces when analyzing Farsi electronic texts, especially with regard to the transcription and encoding of Farsi e-texts during the development of Farsi corpora. The points mentioned may sound simple, but they are crucial when developing and processing written corpora of Farsi.
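One encoding issue of the kind the abstract alludes to can be illustrated with a short sketch. It is not taken from the paper; it assumes a common situation in which Farsi e-texts contain the Arabic code points for Yeh (U+064A) and Kaf (U+0643) where the Persian forms (U+06CC and U+06A9) are intended, so that words which look the same do not compare equal. The function name `normalize_farsi` and the two-entry mapping are hypothetical, and a real corpus pipeline would likely need a larger table.

```python
import unicodedata

# Assumed mapping (illustrative, not from the paper): Arabic code points
# that often appear in Farsi e-texts in place of their Persian counterparts.
ARABIC_TO_PERSIAN = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}
_TABLE = str.maketrans(ARABIC_TO_PERSIAN)


def normalize_farsi(text: str) -> str:
    """Apply Unicode NFC, then map the Arabic letters above to Persian ones."""
    return unicodedata.normalize("NFC", text).translate(_TABLE)


# The same word typed with an Arabic Kaf and with a Persian Keheh.
with_arabic_kaf = "\u0643\u062A\u0627\u0628"
with_persian_kaf = "\u06A9\u062A\u0627\u0628"

# Before normalization the two strings differ; afterwards they match.
print(with_arabic_kaf == with_persian_kaf)
print(normalize_farsi(with_arabic_kaf) == normalize_farsi(with_persian_kaf))
```

Applying a step like this before tokenization or frequency counting keeps visually identical word forms from being counted as distinct types.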
Taxonomy
Topics: Advanced Text Analysis Techniques
