PerPaDa: A Persian Paraphrase Dataset based on Implicit Crowdsourcing Data Collection
Salar Mohtaj, Fatemeh Tavakkoli, Habibollah Asghari

TL;DR
This paper introduces PerPaDa, a large Persian paraphrase dataset collected through implicit crowdsourcing from a plagiarism detection system, offering a less biased and more extensive resource for paraphrase identification in Persian.
Contribution
The paper presents a novel large-scale Persian paraphrase dataset collected via implicit crowdsourcing, improving data quality and reducing bias compared to existing datasets.
Findings
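The dataset contains 2446 paraphrase instances.
The collected data is larger and less biased than existing Persian paraphrase identification datasets.
Heuristic filtering of sentences that do not meet the proposed criteria improved data quality.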
Abstract
In this paper we introduce PerPaDa, a Persian paraphrase dataset collected from users' input in a plagiarism detection system. As an implicit crowdsourcing experience, we have gathered a large collection of original and paraphrased sentences from Hamtajoo, a Persian plagiarism detection system in which users try to conceal cases of text re-use in their documents by paraphrasing and re-submitting manuscripts for analysis. The compiled dataset contains 2446 instances of paraphrasing. To improve the overall quality of the collected data, some heuristics have been used to exclude sentences that don't meet the proposed criteria. The introduced corpus is much larger than the available datasets for the task of paraphrase identification in Persian. Moreover, there is less bias in the data compared to similar datasets, since the users did not try some fixed predefined rules…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
