Informal Persian Universal Dependency Treebank
Roya Kabiri, Simin Karimi, Mihai Surdeanu

TL;DR
This paper introduces the first open-source Universal Dependency Treebank for informal Persian, highlighting the linguistic differences from formal Persian and evaluating parser performance across these variants.
Contribution
The paper develops a dedicated informal Persian treebank within the Universal Dependencies framework and analyzes how parsers trained on formal Persian treebanks perform on informal language data.
Findings
Parsers trained on formal Persian treebanks perform significantly worse on informal Persian, a drop attributed to unknown tokens and structures.
Dependency relations unique to informal Persian show the greatest performance decline.
The study emphasizes the importance of including informal language data in NLP tools.
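One way to make the "unknown tokens" finding concrete is to measure the out-of-vocabulary (OOV) rate of informal text against a formal training vocabulary. The sketch below is only an illustration: the function names `vocabulary` and `oov_rate` are hypothetical, and the transliterated word forms are toy examples, not data from the treebank or the paper's own analysis.

```python
from typing import Iterable, List, Set


def vocabulary(sentences: Iterable[List[str]]) -> Set[str]:
    """Collect the set of word forms seen in a tokenized corpus."""
    return {tok for sent in sentences for tok in sent}


def oov_rate(train_sents: Iterable[List[str]],
             eval_sents: Iterable[List[str]]) -> float:
    """Fraction of evaluation tokens never seen in the training corpus."""
    vocab = vocabulary(train_sents)
    tokens = [tok for sent in eval_sents for tok in sent]
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok not in vocab)
    return unknown / len(tokens)


# Toy, transliterated forms (assumption: purely illustrative).
formal = [["man", "be", "khaneh", "miravam"]]
informal = [["man", "be", "khune", "miram"]]

rate = oov_rate(formal, informal)
print(f"OOV rate: {rate:.2f}")
```

In this toy case two of the four informal tokens are unseen in the formal vocabulary, so the rate is 0.5. A real measurement would read the word forms from the treebank files instead of hand-written lists.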
Abstract
This paper presents the phonological, morphological, and syntactic distinctions between formal and informal Persian, showing that these two variants have fundamental differences that cannot be attributed solely to pronunciation discrepancies. Given that informal Persian exhibits particular characteristics, any computational model trained on formal Persian is unlikely to transfer well to informal Persian, necessitating the creation of dedicated treebanks for this variety. We thus detail the development of the open-source Informal Persian Universal Dependency Treebank, a new treebank annotated within the Universal Dependencies scheme. We then investigate the parsing of informal Persian by training two dependency parsers on existing formal treebanks and evaluating them on out-of-domain data, i.e. the development set of our informal treebank. Our results show that parsers experience a…
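The evaluation setup described in the abstract (parse informal development data with a parser trained on formal treebanks, then compare against gold annotations) can be sketched as follows. This is a minimal illustration, assuming the standard attachment scores UAS and LAS as metrics and assuming the predicted file shares the gold tokenization; the excerpt does not state which metrics or scripts the authors used, and the helper names `read_arcs` and `attachment_scores` are hypothetical. The example sentence uses placeholder tokens, not treebank data.

```python
from typing import List, Tuple


def read_arcs(conllu: str) -> List[Tuple[str, str]]:
    """Return (HEAD, DEPREL) for each syntactic word in a CoNLL-U string."""
    arcs = []
    for line in conllu.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in cols[0] or "." in cols[0]:
            continue
        arcs.append((cols[6], cols[7]))  # CoNLL-U columns 7 and 8
    return arcs


def attachment_scores(gold: str, pred: str) -> Tuple[float, float]:
    """Compute (UAS, LAS), assuming identical tokenization in both inputs."""
    g, p = read_arcs(gold), read_arcs(pred)
    if len(g) != len(p):
        raise ValueError("gold and predicted token counts differ")
    if not g:
        return 0.0, 0.0
    uas = sum(gh == ph for (gh, _), (ph, _) in zip(g, p)) / len(g)
    las = sum(ga == pa for ga, pa in zip(g, p)) / len(g)
    return uas, las


def row(idx: int, form: str, head: int, deprel: str) -> str:
    """Build one CoNLL-U line with unused columns left as underscores."""
    return "\t".join([str(idx), form, "_", "_", "_", "_",
                      str(head), deprel, "_", "_"])


# Placeholder three-token sentence (assumption: not from the treebank).
gold = "\n".join([row(1, "w1", 2, "nsubj"),
                  row(2, "w2", 0, "root"),
                  row(3, "w3", 2, "obj")])
# The prediction attaches token 3 to the right head with the wrong label.
pred = "\n".join([row(1, "w1", 2, "nsubj"),
                  row(2, "w2", 0, "root"),
                  row(3, "w3", 2, "obl")])

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.3f} LAS={las:.3f}")
```

Here every head is correct but one label is wrong, so UAS is 1.0 and LAS is 2/3. When the parser also performs its own tokenization, a token-aligning scorer such as the CoNLL 2018 shared task evaluation script is the safer choice.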
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
