Developing an Informal-Formal Persian Corpus

Vahide Tajalli; Fateme Kalantari; Mehrnoush Shamsfard

arXiv:2308.05336·cs.CL·August 11, 2023·1 cites

TL;DR

This paper presents the development of a large parallel corpus of 50,000 aligned informal and formal Persian sentence pairs, capturing lexical and syntactic variations for language processing.

Contribution

It introduces a methodology for creating a comprehensive Persian informal-formal corpus with detailed alignments and a large dictionary, aiding linguistic analysis and tool development.

Findings

01 50,000 sentence pairs with word/phrase-level alignments

02 Approximately 530,000 alignments in total

03 Dictionary of 49,397 word and phrase pairs
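
The figures above imply roughly 10.6 alignments per sentence pair on average (530,000 / 50,000). As an illustration of how word/phrase alignments and a pair dictionary relate, here is a minimal Python sketch. It assumes the dictionary is collected from the aligned spans, which this summary does not state. The `AlignedPair` schema, the `build_dictionary` helper, and the transliterated toy sentence are all hypothetical and are not taken from the corpus or the authors' tooling.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One informal-formal sentence pair (hypothetical schema, not the corpus format)."""
    informal: list[str]
    formal: list[str]
    # Each alignment links informal token indices to formal token indices.
    alignments: list[tuple[tuple[int, ...], tuple[int, ...]]]


def build_dictionary(pairs):
    """Count each aligned (informal phrase, formal phrase) pair whose two sides differ."""
    counts = Counter()
    for pair in pairs:
        for informal_idx, formal_idx in pair.alignments:
            informal_phrase = " ".join(pair.informal[i] for i in informal_idx)
            formal_phrase = " ".join(pair.formal[j] for j in formal_idx)
            if informal_phrase != formal_phrase:
                counts[(informal_phrase, formal_phrase)] += 1
    return counts


# Toy example in transliteration (illustrative only).
# The formal side reorders the verb and adds a preposition.
example = AlignedPair(
    informal=["mikham", "beram", "khune"],
    formal=["mikhaham", "be", "khane", "beravam"],
    alignments=[((0,), (0,)), ((1,), (3,)), ((2,), (1, 2))],
)

dictionary = build_dictionary([example])
for (informal_phrase, formal_phrase), count in dictionary.items():
    print(f"{informal_phrase} -> {formal_phrase} ({count})")
```

Storing alignments as index tuples rather than strings lets one entry cover both one-to-one word changes and one-to-many phrase changes, and it keeps reordering between the two styles explicit.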

Abstract

Informal language is a style of spoken or written language frequently used in casual conversations, social media, weblogs, emails and text messages. In informal writing, the language undergoes lexical and/or syntactic changes that vary from one language to another. Persian is one of the languages with many differences between its formal and informal styles of writing, so developing informal language processing tools for it seems necessary. A converter between the two styles needs a large aligned parallel corpus of colloquial-formal sentences, which can also help linguists extract a regulated grammar and orthography for colloquial Persian, as has been done for the formal language. In this paper we explain our methodology for building a parallel corpus of 50,000 sentence pairs with alignments at the word/phrase level. The sentences were selected to cover almost all kinds of lexical and syntactic…

Taxonomy

Topics: Natural Language Processing Techniques