Detecting Cross-Language Plagiarism using Open Knowledge Graphs

Johannes Stegmüller; Fabian Bauer-Marquart; Norman Meuschke; Terry Ruas; Moritz Schubotz; Bela Gipp

arXiv:2111.09749 · cs.CL · December 17, 2021 · 1 citation

TL;DR

This paper introduces CL-OSA, a multilingual retrieval model that detects cross-language plagiarism by representing documents as entity vectors from Wikidata, avoiding machine translation and pre-training, and outperforming existing methods especially for distant language pairs.

Contribution

The paper presents CL-OSA, a novel knowledge graph-based model for cross-language plagiarism detection that is scalable, effective for distant languages, and does not rely on translation or pre-training.

Findings

01 CL-OSA outperforms state-of-the-art methods at retrieving candidate documents from five large, topically diverse test corpora.

02 It is particularly effective at detecting sense-for-sense translations.

03 Its PlagDet score exceeds that of competing methods by more than a factor of two.

Abstract

Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Unlike other methods, CL-OSA requires neither computationally expensive machine translation nor pre-training on comparable or parallel corpora. It reliably disambiguates homonyms and scales to Web-scale document collections. We show that CL-OSA outperforms state-of-the-art methods for retrieving candidate documents from five large, topically diverse test corpora that include distant language pairs like Japanese-English. For identifying cross-language plagiarism at the character level, CL-OSA primarily improves the…
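
The idea of comparing documents through language-independent entity vectors can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicon, the placeholder entity IDs (E1, E2, ...; not real Wikidata identifiers), the function names, and the use of cosine similarity are all assumptions made for illustration. Real entity linking and the homonym disambiguation mentioned in the abstract are not modeled here.

```python
import math
from collections import Counter

# Hypothetical toy lexicon mapping surface forms to language-independent
# entity IDs. The IDs are placeholders, not real Wikidata identifiers.
TOY_LEXICON = {
    "en": {"dog": "E1", "cat": "E2", "river": "E4"},
    "de": {"hund": "E1", "katze": "E2", "fluss": "E4"},
}


def entity_vector(text: str, lang: str) -> Counter:
    """Count the entities found in `text` (naive whitespace tokenization)."""
    lexicon = TOY_LEXICON[lang]
    tokens = text.lower().split()
    return Counter(lexicon[t] for t in tokens if t in lexicon)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse entity-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query: Counter, candidates: dict) -> list:
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


suspicious = entity_vector("the dog chased the cat", "en")
candidates = {
    "d1": entity_vector("der hund jagte die katze", "de"),
    "d2": entity_vector("der fluss", "de"),
}
ranking = rank_candidates(suspicious, candidates)
```

In this sketch the English and German sentences share the same entity IDs, so the German candidate `d1` ranks first without any translation step; the comparison happens entirely in entity space.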

Code & Models

Repositories

gipplab/cl-osa (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification