Detecting Cross-Language Plagiarism using Open Knowledge Graphs
Johannes Stegmüller, Fabian Bauer-Marquart, Norman Meuschke, Terry Ruas, Moritz Schubotz, Bela Gipp

TL;DR
This paper introduces CL-OSA, a multilingual retrieval model that detects cross-language plagiarism by representing documents as entity vectors from Wikidata, avoiding machine translation and pre-training, and outperforming existing methods especially for distant language pairs.
Contribution
The paper presents CL-OSA, a novel knowledge graph-based model for cross-language plagiarism detection that is scalable, effective for distant languages, and does not rely on translation or pre-training.
Findings
CL-OSA outperforms state-of-the-art methods at retrieving candidate documents from five large, topically diverse test corpora.
It effectively detects sense-for-sense translations.
Its PlagDet score exceeds that of competing methods by more than a factor of two.
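For readers unfamiliar with the metric named above, the following is a minimal sketch of PlagDet as it is commonly defined in the PAN plagiarism detection evaluations: the F1 score of character-level precision and recall, divided by log2(1 + granularity), where granularity is at least 1 and grows when a single plagiarized passage is reported in several fragments. The function name and the input values are illustrative assumptions, not results from the paper.

```python
from math import log2


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """PlagDet = F1(precision, recall) / log2(1 + granularity)."""
    if granularity < 1:
        raise ValueError("granularity is defined to be >= 1")
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / log2(1 + granularity)


# Illustrative values only (not taken from the paper).
perfect = plagdet(1.0, 1.0, 1.0)      # every case found once, nothing spurious
fragmented = plagdet(1.0, 1.0, 3.0)   # same detections, each split into three parts
```

Because the denominator equals 1 when granularity is 1, PlagDet reduces to plain F1 for detectors that report each passage exactly once; fragmented detections are penalized.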
Abstract
Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Opposed to other methods, CL-OSA does not require computationally expensive machine translation, nor pre-training using comparable or parallel corpora. It reliably disambiguates homonyms and scales to allow its application to Web-scale document collections. We show that CL-OSA outperforms state-of-the-art methods for retrieving candidate documents from five large, topically diverse test corpora that include distant language pairs like Japanese-English. For identifying cross-language plagiarism at the character level, CL-OSA primarily improves the…
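The idea of comparing documents through language-independent entity vectors can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the entity identifiers are made-up placeholders rather than real Wikidata IDs, the entity-linking step that would map words in each language to entities is omitted, and cosine similarity is assumed as the comparison measure because the abstract does not name one.

```python
from collections import Counter
from math import sqrt

# Placeholder entity IDs (hypothetical). A real system would obtain them by
# linking the words of each document to knowledge graph entities.
doc_english = ["E1", "E2", "E3", "E1"]
doc_japanese = ["E1", "E2", "E3"]      # same topic, different language
doc_unrelated = ["E7", "E8", "E3"]     # shares only one entity


def entity_vector(entity_ids):
    """Represent a document as a sparse vector of entity counts."""
    return Counter(entity_ids)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


sim_translated = cosine_similarity(
    entity_vector(doc_english), entity_vector(doc_japanese)
)
sim_unrelated = cosine_similarity(
    entity_vector(doc_english), entity_vector(doc_unrelated)
)
```

Because both documents are mapped to the same entity identifiers regardless of their language, the comparison itself needs no translation step; in this toy example the translated pair scores higher than the unrelated pair.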
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
