LLMs Plagiarize: Ensuring Responsible Sourcing of Large Language Model Training Data Through Knowledge Graph Comparison

Devam Mondal; Carlo Lipizzi

arXiv:2407.02659 · cs.CL · August 5, 2024 · 1 citation

TL;DR

This paper introduces a knowledge graph-based system that assesses whether a source document was used to train or fine-tune a large language model, motivated by recent copyright allegations against LLM developers. It compares a knowledge graph of the source with one built from an LLM continuation of that source.

Contribution

The paper proposes a plagiarism detection method that builds RDF knowledge graphs and compares them with content and structural similarity measures, without requiring access to LLM internals or training data.

Findings

01. Effective detection of source material in LLM outputs.

02. Utilizes content and structural similarity measures.

03. Does not rely on LLM internal metrics or training data.

Abstract

In light of recent legal allegations brought by publishers, newspapers, and other creators of copyrighted corpora against large language model developers who use their copyrighted materials for training or fine-tuning purposes, we propose a novel system, a variant of a plagiarism detection system, that assesses whether a knowledge source has been used in the training or fine-tuning of a large language model. Unlike current methods, we utilize an approach that uses Resource Description Framework (RDF) triples to create knowledge graphs from both a source document and an LLM continuation of that document. These graphs are then analyzed with respect to content using cosine similarity and with respect to structure using a normalized version of graph edit distance that shows the degree of isomorphism. Unlike traditional plagiarism systems that focus on content matching and keyword…
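The two comparisons named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the triples are hand-written rather than extracted from text, the function names are hypothetical, cosine similarity is computed over a simple bag of words, and the structural score assumes that node labels identify nodes, so edit distance reduces to counting node and edge insertions and deletions. The normalization by total graph size is also an assumption, since the abstract does not state the formula used.

```python
from collections import Counter
from math import sqrt

# A triple is (subject, predicate, object), as in RDF.
Triple = tuple[str, str, str]


def term_counts(triples: list[Triple]) -> Counter:
    """Bag of lowercase tokens over all subjects, predicates, and objects."""
    counts: Counter = Counter()
    for triple in triples:
        for term in triple:
            counts.update(term.lower().split())
    return counts


def cosine_similarity(a: list[Triple], b: list[Triple]) -> float:
    """Content similarity: cosine of the two token-count vectors."""
    va, vb = term_counts(a), term_counts(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def graph_parts(triples: list[Triple]) -> tuple[set[str], set[Triple]]:
    """Nodes are subjects and objects; each triple is one labeled edge."""
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    return nodes, set(triples)


def normalized_edit_distance(a: list[Triple], b: list[Triple]) -> float:
    """Structural distance in [0, 1]; 0 means identical labeled graphs.

    Assumption: nodes are matched by label, so the edit count is the
    number of nodes and edges present in exactly one of the two graphs.
    """
    nodes_a, edges_a = graph_parts(a)
    nodes_b, edges_b = graph_parts(b)
    edits = len(nodes_a ^ nodes_b) + len(edges_a ^ edges_b)
    size = len(nodes_a) + len(edges_a) + len(nodes_b) + len(edges_b)
    return edits / size if size else 0.0


# Hypothetical triples for a source document and an LLM continuation.
source = [("Alice", "wrote", "Report"), ("Report", "cites", "Bob")]
continuation = [("Alice", "wrote", "Report"), ("Report", "cites", "Carol")]

print(cosine_similarity(source, continuation))
print(normalized_edit_distance(source, continuation))
```

A high cosine score together with a low edit distance would suggest that the continuation reproduces both the content and the relational structure of the source. General graph edit distance over unlabeled graphs is much more expensive to compute than this label-matched simplification.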

Taxonomy

Topics: Advanced Graph Neural Networks · Topic Modeling · Natural Language Processing Techniques
