LLMs Plagiarize: Ensuring Responsible Sourcing of Large Language Model Training Data Through Knowledge Graph Comparison
Devam Mondal, Carlo Lipizzi

TL;DR
This paper introduces a knowledge graph-based system for detecting whether a large language model was trained or fine-tuned on a given source document. Motivated by recent copyright allegations against LLM developers, it compares the source with an LLM continuation of it using both content and structural similarity.
Contribution
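The paper proposes a plagiarism detection method that builds RDF knowledge graphs from a source document and from an LLM continuation of it, then compares the two graphs with content and structural similarity measures. The method requires no access to LLM internals or training data.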
Findings
Detects source material reflected in LLM outputs.
Measures content similarity with cosine similarity and structural similarity with a normalized graph edit distance.
Does not rely on LLM internal metrics or training data.
Abstract
In light of recent legal allegations brought by publishers, newspapers, and other creators of copyrighted corpora against large language model developers who use their copyrighted materials for training or fine-tuning purposes, we propose a novel system, a variant of a plagiarism detection system, that assesses whether a knowledge source has been used in the training or fine-tuning of a large language model. Unlike current methods, we utilize an approach that uses Resource Description Framework (RDF) triples to create knowledge graphs from both a source document and an LLM continuation of that document. These graphs are then analyzed with respect to content using cosine similarity and with respect to structure using a normalized version of graph edit distance that shows the degree of isomorphism. Unlike traditional plagiarism systems that focus on content matching and keyword…
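The two comparisons described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration, not the authors' implementation: the triples are hand-written rather than extracted from text, content similarity is computed over simple token counts, and the structural score is a simplified edit distance that matches nodes by label (general graph edit distance searches over node mappings and is far more expensive). The normalization used, dividing by the combined size of both graphs, is likewise a hypothetical choice, since the abstract does not specify one.

```python
import math
from collections import Counter
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def content_similarity(a: List[Triple], b: List[Triple]) -> float:
    """Cosine similarity between token-count vectors built from the triples."""
    va = Counter(tok.lower() for t in a for part in t for tok in part.split())
    vb = Counter(tok.lower() for t in b for part in t for tok in part.split())
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def graph_parts(triples: List[Triple]) -> Tuple[Set[str], Set[Triple]]:
    """Return the node set and the labeled edge set of a triple graph."""
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    return nodes, set(triples)


def structural_similarity(a: List[Triple], b: List[Triple]) -> float:
    """1 minus a label-matched edit distance, normalized by total graph size.

    Assumption: nodes are matched by identical label, so the distance is the
    number of nodes and edges present in one graph but not the other.
    """
    na, ea = graph_parts(a)
    nb, eb = graph_parts(b)
    distance = len(na ^ nb) + len(ea ^ eb)
    worst = len(na) + len(nb) + len(ea) + len(eb)
    return 1.0 - distance / worst if worst else 1.0


# Hypothetical triples standing in for a source document and an LLM continuation.
source = [("Alice", "wrote", "Report"), ("Report", "cites", "Dataset")]
continuation = [("Alice", "wrote", "Report"), ("Report", "mentions", "Dataset")]

print(round(content_similarity(source, continuation), 3))
print(round(structural_similarity(source, continuation), 3))
```

In this toy case the two graphs share all nodes and differ in one edge label, so both scores are high but below 1. A real pipeline would need a triple extraction step and a decision threshold, neither of which is specified in the text above.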
Taxonomy
Topics: Advanced Graph Neural Networks · Topic Modeling · Natural Language Processing Techniques
