Colloquial Persian POS (CPPOS) Corpus: A Novel Corpus for Colloquial Persian Part of Speech Tagging
Leyla Rabiei, Farzaneh Rahmani, Mohammad Khansari, Zeinab Rajabi, Moein Salimi

TL;DR
This paper introduces CPPOS, a new manually annotated corpus of colloquial Persian from social media. Deep learning taggers trained on it improve POS tagging accuracy by 14% over earlier resources.
Contribution
The creation of the first large-scale colloquial Persian POS corpus with manual annotation and a new tagging guideline, tailored for social media text.
Findings
Deep learning models trained on CPPOS outperform previous models.
Training on the corpus yields a 14% improvement in POS tagging accuracy.
Manual annotation ensures high-quality, domain-specific data.
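As a rough illustration of how a POS tagging accuracy figure like the one above is commonly computed, here is a minimal sketch of token-level accuracy. It is an assumption that the paper uses this metric; the summary does not say which one it uses, or whether the 14% is an absolute or relative gain. The function name `token_accuracy`, the tag labels, and the toy data are hypothetical and are not taken from CPPOS.

```python
from typing import Sequence


def token_accuracy(
    gold: Sequence[Sequence[str]], pred: Sequence[Sequence[str]]
) -> float:
    """Return the fraction of tokens whose predicted tag matches the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of sentences")

    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("each sentence pair must have the same number of tokens")
        # Count positions where the predicted tag equals the gold tag.
        correct += sum(g == p for g, p in zip(gold_sent, pred_sent))
        total += len(gold_sent)

    if total == 0:
        raise ValueError("cannot compute accuracy over zero tokens")
    return correct / total


# Hypothetical toy data: two sentences, five tokens, one tagging error.
gold_tags = [["PRON", "VERB", "NOUN"], ["NOUN", "ADJ"]]
pred_tags = [["PRON", "VERB", "ADJ"], ["NOUN", "ADJ"]]

accuracy = token_accuracy(gold_tags, pred_tags)
print(f"token accuracy: {accuracy:.2f}")
```

Comparing two taggers on the same held-out colloquial test set with a function like this would give the kind of accuracy difference the findings describe.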
Abstract
Introduction: Part-of-Speech (POS) Tagging, the process of classifying words into their respective parts of speech (e.g., verb or noun), is essential in various natural language processing applications. POS tagging is a crucial preprocessing task for applications such as machine translation, question answering, and sentiment analysis. However, existing corpora for POS tagging in Persian mainly consist of formal texts, such as daily news and newspapers. As a result, smart POS tools, machine learning models, and deep learning models trained on these corpora may not perform optimally for processing colloquial text in social network analysis. Method: This paper introduces a novel corpus, "Colloquial Persian POS" (CPPOS), specifically designed to support colloquial Persian text. The corpus includes formal and informal text collected from various domains such as political, social, and…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Web Data Mining and Analysis
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Bidirectional LSTM
