Challenges in Persian Electronic Text Analysis

Behrang QasemiZadeh; Saeed Rahimi; Mehdi Safaee Ghalati

arXiv:1404.4740·cs.CL·April 21, 2014·5 cites

TL;DR

This paper discusses the unique challenges faced in analyzing Persian electronic texts, focusing on transcription and encoding issues crucial for developing accurate Farsi corpora.

Contribution

It highlights specific problems in Farsi text analysis related to transcription and encoding, emphasizing their importance in corpus development.

Findings

01 Identification of transcription challenges in Farsi texts
02 Highlighting encoding issues affecting text analysis
03 Emphasizing the importance of standardization in Farsi corpora

Abstract

Farsi, also known as Persian, is the official language of Iran and Tajikistan and one of the two main languages spoken in Afghanistan. Farsi enjoys a unified Arabic script as its writing system. In this paper we briefly introduce the writing standards of Farsi and highlight problems one would face when analyzing Farsi electronic texts, especially problems with the transcription and encoding of Farsi e-texts during the development of Farsi corpora. The points mentioned may sound easy, but they are crucial when developing and processing written corpora of Farsi.
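
As an illustration of the kind of encoding issue the abstract alludes to, the sketch below folds two Arabic code points that commonly appear in Farsi e-texts into their Persian counterparts before any counting or matching is done. The mapping and the helper name `normalize_farsi` are assumptions made for this example and are not taken from the paper; a real corpus pipeline would likely need a longer, corpus-specific table.

```python
import unicodedata

# Assumed mapping (illustrative, not from the paper): Arabic code points that
# often stand in for visually similar Persian letters in electronic text.
ARABIC_TO_PERSIAN = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}

_TRANSLATION_TABLE = str.maketrans(ARABIC_TO_PERSIAN)


def normalize_farsi(text: str) -> str:
    """Return text in NFC form with the assumed Arabic variants replaced."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(_TRANSLATION_TABLE)


if __name__ == "__main__":
    # The same word typed with an Arabic KAF and with a Persian KEHEH.
    arabic_variant = "\u0643\u062A\u0627\u0628"
    persian_variant = "\u06A9\u062A\u0627\u0628"
    print(arabic_variant == persian_variant)
    print(normalize_farsi(arabic_variant) == persian_variant)
```

Without a step like this, two spellings that look identical on screen are treated as different tokens, which distorts frequency counts and search results in a corpus.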

Taxonomy

Topics: Advanced Text Analysis Techniques