How Large Language Models are Transforming Machine-Paraphrased Plagiarism

Jan Philip Wahle; Terry Ruas; Frederic Kirstein; Bela Gipp

arXiv:2210.03568·cs.CL·February 9, 2024·6 cites

Open Access · 3 Repos · 1 Dataset

TL;DR

Large language models like GPT-3 and T5 can generate highly realistic paraphrases that challenge existing detection methods, raising concerns about academic integrity and the need for improved detection techniques.

Contribution

This study evaluates the capabilities of large autoregressive transformers in generating machine-paraphrased texts and assesses the effectiveness of current detection methods, including human judgment.

Findings

01

Used as a detector, GPT-3 achieves a 66% F1-score in paraphrase detection.

02

Human experts rate the quality of GPT-3 paraphrases as high as that of original texts.

03

Humans identify text paraphrased by large models as machine-paraphrased with only 53% mean accuracy.
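
The findings above report two different metrics: F1-score for the automated detector and accuracy for human raters. As a minimal sketch of how the two are computed from a confusion matrix, the Python below uses made-up counts; the function names and the numbers are illustrative assumptions and are not taken from the paper or its data.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class.

    Here the positive class is assumed to be "machine-paraphrased".
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return the fraction of all examples that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical confusion-matrix counts, chosen only for illustration.
tp, fp, fn, tn = 60, 40, 20, 80

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
acc = accuracy(tp, tn, fp, fn)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={acc:.3f}")
```

F1 ignores true negatives, while accuracy counts them, so the two figures are not directly comparable. On a balanced two-class task, accuracy near 50% is roughly what random guessing would give.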

Abstract

The recent success of large language models for text generation poses a severe threat to academic integrity, as plagiarists can generate realistic paraphrases indistinguishable from original work. However, the role of large autoregressive transformers in generating machine-paraphrased plagiarism and their detection is still developing in the literature. This work explores T5 and GPT-3 for machine-paraphrase generation on scientific articles from arXiv, student theses, and Wikipedia. We evaluate the detection performance of six automated solutions and one commercial plagiarism detection software and perform a human study with 105 participants regarding their detection performance and the quality of generated examples. Our results suggest that large models can rewrite text humans have difficulty identifying as machine-paraphrased (53% mean acc.). Human experts rate the quality of…

Code & Models

Repositories

Datasets

jpwahle/autoregressive-paraphrase-dataset
dataset· 44 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Cosine Annealing · Residual Connection · Weight Decay · Linear Warmup With Cosine Annealing · Adafactor