Developing an Informal-Formal Persian Corpus
Vahide Tajalli, Fateme Kalantari, Mehrnoush Shamsfard

TL;DR
This paper presents the development of a large parallel corpus of 50,000 aligned informal and formal Persian sentence pairs, capturing lexical and syntactic variations for language processing.
Contribution
It introduces a methodology for creating a comprehensive Persian informal-formal corpus with detailed alignments and a large dictionary, aiding linguistic analysis and tool development.
Findings
50,000 sentence pairs with word/phrase-level alignments
Approximately 530,000 alignments in total
Dictionary of 49,397 word and phrase pairs
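With roughly 530,000 alignments over 50,000 pairs, that is about 10.6 alignments per sentence pair on average. As a minimal sketch of how such data might be represented, the snippet below models one aligned sentence pair and derives dictionary entries from its alignments. The class names, field names, span convention and the `extract_dictionary` helper are all hypothetical and are not the authors' format; the example sentence is illustrative and not taken from the corpus.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    # Hypothetical convention: half-open [start, end) token index ranges,
    # so one alignment can link single words or multi-word phrases.
    informal_span: tuple
    formal_span: tuple


@dataclass
class SentencePair:
    informal: list    # tokens of the informal sentence
    formal: list      # tokens of the formal sentence
    alignments: list  # list of Alignment objects


def extract_dictionary(pairs):
    """Count aligned (informal, formal) word/phrase pairs whose two sides differ."""
    counts = Counter()
    for pair in pairs:
        for a in pair.alignments:
            src = " ".join(pair.informal[a.informal_span[0]:a.informal_span[1]])
            tgt = " ".join(pair.formal[a.formal_span[0]:a.formal_span[1]])
            if src != tgt:  # identical forms carry no informal-formal mapping
                counts[(src, tgt)] += 1
    return counts


# Illustrative pair (assumed example, "I want bread"), aligned word by word.
example = SentencePair(
    informal=["نون", "میخوام"],
    formal=["نان", "می‌خواهم"],
    alignments=[
        Alignment((0, 1), (0, 1)),  # first informal token -> first formal token
        Alignment((1, 2), (1, 2)),  # second informal token -> second formal token
    ],
)

dictionary = extract_dictionary([example])
for (informal_form, formal_form), count in dictionary.items():
    print(informal_form, formal_form, count)
```

Storing spans rather than single indices is one way to let the same structure cover both word-level and phrase-level alignments.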
Abstract
Informal language is a style of spoken or written language frequently used in casual conversations, social media, weblogs, emails and text messages. In informal writing, a language undergoes lexical and/or syntactic changes, and these changes vary from language to language. Persian is one of the languages with many differences between its formal and informal styles of writing, so developing informal language processing tools for it seems necessary. One such tool, a converter between the two styles, needs a large aligned parallel corpus of colloquial-formal sentences; such a corpus can also help linguists extract a regulated grammar and orthography for colloquial Persian, as has been done for the formal language. In this paper we explain our methodology for building a parallel corpus of 50,000 sentence pairs with alignments at the word/phrase level. We attempted to cover almost all kinds of lexical and syntactic…
Taxonomy
Topics: Natural Language Processing Techniques
