The First Parallel Multilingual Corpus of Persian: Toward a Persian BLARK
Behrang Qasemizadeh, Saeed Rahimi, Behrooz Mahmoodi Bakhtiari

TL;DR
This paper introduces the first parallel multilingual corpus of Persian with over ten European languages and outlines initial steps toward developing a Basic Language Resources Kit (BLARK) for Persian.
Contribution
It presents the creation of a multilingual Persian corpus and proposes a morphosyntactic specification and POS categorization aligned with international standards.
Findings
A comprehensive Persian multilingual corpus has been compiled.
Morphosyntactic features of Persian are defined based on established guidelines.
Initial statistical analysis of the corpus is provided.
Abstract
This article introduces the first parallel corpus of Persian with more than 10 European languages and describes the first steps toward preparing a Basic Language Resources Kit (BLARK) for Persian. So far, we have proposed a morphosyntactic specification of Persian based on the EAGLES/MULTEXT guidelines and the specific resources of MULTEXT-East. The article introduces the Persian language, with emphasis on its orthography and morphosyntactic features, and then proposes a new part-of-speech categorization and orthography for Persian in digital environments. Finally, the corpus and related statistics are analyzed.
Taxonomy
Topics: Linguistics and language evolution
