Paraphrase Detection: Human vs. Machine Content

Jonas Becker; Jan Philip Wahle; Terry Ruas; Bela Gipp

arXiv:2303.13989 · cs.CL · March 27, 2023 · 1 citation

TL;DR

This study compares human- and machine-generated paraphrases and evaluates detection methods across datasets. It finds that human paraphrases are more complex and diverse, and it identifies the most effective detection techniques and the most challenging datasets.

Contribution

The paper provides a comprehensive analysis of paraphrase detection methods across multiple datasets, highlighting the differences between human and machine paraphrases and identifying the most effective approaches.

Findings

01 Human paraphrases are more difficult and diverse than machine-generated ones.

02 Transformers are the most effective detection method across datasets.

03 Four datasets are identified as most challenging for paraphrase detection.

Abstract

The growing prominence of large language models, such as GPT-4 and ChatGPT, has led to increased concerns over academic integrity due to the potential for machine-generated content and paraphrasing. Although studies have explored the detection of human- and machine-paraphrased content, the comparison between these types of content remains underexplored. In this paper, we conduct a comprehensive analysis of various datasets commonly employed for paraphrase detection tasks and evaluate an array of detection methods. Our findings highlight the strengths and limitations of different detection methods in terms of performance on individual datasets, revealing a lack of suitable machine-generated datasets that can be aligned with human expectations. Our main finding is that human-authored paraphrases exceed machine-generated ones in terms of difficulty, diversity, and similarity, implying that…
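To make the task in the abstract concrete: paraphrase detection can be framed as binary classification over pairs of texts. The following is a minimal sketch of a lexical-overlap baseline, not one of the methods evaluated in the paper. The function names (`tokens`, `jaccard`, `is_paraphrase`) and the 0.5 threshold are hypothetical choices made for illustration.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets: |A & B| / |A | B|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        # Two empty texts are treated as identical (an assumption of this sketch).
        return 1.0
    return len(ta & tb) / len(ta | tb)


def is_paraphrase(a: str, b: str, threshold: float = 0.5) -> bool:
    """Flag a pair as a paraphrase when token overlap reaches the threshold.

    The threshold is an illustrative value, not one taken from the paper.
    """
    return jaccard(a, b) >= threshold
```

A baseline like this only sees shared surface tokens, so a pair that preserves meaning while changing most of the wording would likely be missed. That is one intuition for why some paraphrase pairs are harder to detect than others, though the paper's own difficulty measure may be defined differently.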

Code & Models

Repositories

jonas-becker/pd-human-vs-machine-content
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections