Are Neural Language Models Good Plagiarists? A Benchmark for Neural Paraphrase Detection

Jan Philip Wahle; Terry Ruas; Norman Meuschke; Bela Gipp

arXiv:2103.12450 · cs.CL · October 24, 2023

TL;DR

This paper introduces a benchmark dataset of paraphrased articles generated by modern language models to evaluate and improve paraphrase detection systems, addressing academic integrity concerns.

Contribution

It provides a large aligned dataset of original and paraphrased texts, analyzes their structure, and evaluates state-of-the-art detection systems, facilitating future research.

Findings

1. Benchmark dataset of paraphrased articles created
2. State-of-the-art systems evaluated on the dataset
3. Findings publicly available for research use

Abstract

The rise of language models such as BERT allows for high-quality text paraphrasing. This is a problem for academic integrity, as it is difficult to differentiate between original and machine-generated content. We propose a benchmark consisting of articles paraphrased using recent language models that rely on the Transformer architecture. Our contribution fosters future research on paraphrase detection systems, as it offers a large collection of aligned original and paraphrased documents, a study of its structure, and classification experiments with state-of-the-art systems. We make our findings publicly available.

Code & Models

Datasets

jpwahle/autoencoder-paraphrase-dataset
dataset · 66 downloads

Taxonomy

Methods

Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Residual Connection · Adam · Weight Decay · WordPiece · Dense Connections