Paraphrase Detection: Human vs. Machine Content
Jonas Becker, Jan Philip Wahle, Terry Ruas, Bela Gipp

TL;DR
This study compares human- and machine-generated paraphrases and evaluates detection methods across datasets. It finds that human paraphrases are more complex and diverse, and it identifies effective detection techniques and challenging datasets.
Contribution
The paper analyzes paraphrase detection methods across several datasets, highlights the differences between human and machine paraphrases, and identifies the most effective approaches.
Findings
Human paraphrases are more difficult and diverse than machine-generated ones.
Transformer-based models are the most effective detection methods across datasets.
Four datasets are identified as most challenging for paraphrase detection.
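To make the task concrete, the sketch below frames paraphrase detection as a binary decision over a sentence pair, using a token-overlap (Jaccard) score. This is an illustrative baseline only and is not taken from the paper: the function names, the tokenization rule, and the 0.5 threshold are all assumptions chosen for the example.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap: |A & B| / |A | B|, in the range [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def is_paraphrase(a: str, b: str, threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: flag the pair if overlap meets the threshold."""
    return jaccard(a, b) >= threshold
```

A purely lexical score like this drops as a paraphrase rewords more of the original, which gives one intuition (not a result from the paper) for why more diverse paraphrases may be harder to detect with surface-level methods.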
Abstract
The growing prominence of large language models, such as GPT-4 and ChatGPT, has led to increased concerns over academic integrity due to the potential for machine-generated content and paraphrasing. Although studies have explored the detection of human- and machine-paraphrased content, the comparison between these types of content remains underexplored. In this paper, we conduct a comprehensive analysis of various datasets commonly employed for paraphrase detection tasks and evaluate an array of detection methods. Our findings highlight the strengths and limitations of different detection methods in terms of performance on individual datasets, revealing a lack of suitable machine-generated datasets that can be aligned with human expectations. Our main finding is that human-authored paraphrases exceed machine-generated ones in terms of difficulty, diversity, and similarity, implying that…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections
