Deciphering Textual Authenticity: A Generalized Strategy through the Lens of Large Language Semantics for Detecting Human vs. Machine-Generated Text
Mazal Bethany, Brandon Wherry, Emet Bethany, Nishant Vishwamitra, Anthony Rios, Peyman Najafirad

TL;DR
This paper introduces T5LLMCipher, a novel detection system leveraging a pretrained T5 encoder and embedding sub-clustering, which significantly improves the generalization and accuracy of distinguishing human from machine-generated text across diverse models and domains.
Contribution
The paper presents a new detection approach that overcomes limitations of existing methods by effectively handling diverse generators and real-world scenarios.
Findings
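Achieves an average 19.6% increase in F1 score on machine-generated text from unseen generators and domains, compared with the top-performing existing approaches.
Correctly attributes the generator of a text with 93.6% accuracy.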
Outperforms state-of-the-art detection methods across multiple datasets.
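The findings above report two different metrics: an F1 score for detecting machine-generated text and an accuracy for attributing a text to its generator. The sketch below shows one plausible way the two relate, assuming each text carries a single source label. The label strings, the `HUMAN` constant, the function names, and the toy data are illustrative assumptions, not the paper's evaluation code or results.

```python
from typing import Sequence

HUMAN = "human"  # hypothetical label for human-written text


def attribution_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of texts whose exact source (human or a specific generator) is predicted."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def machine_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """F1 for the machine-generated class after collapsing all generators into one class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    true_machine = [t != HUMAN for t in y_true]
    pred_machine = [p != HUMAN for p in y_pred]
    tp = sum(t and p for t, p in zip(true_machine, pred_machine))
    fp = sum((not t) and p for t, p in zip(true_machine, pred_machine))
    fn = sum(t and (not p) for t, p in zip(true_machine, pred_machine))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels only; generator names are taken from the abstract's examples.
y_true = ["human", "gpt-4", "dolly", "gpt-4", "human", "dolly"]
y_pred = ["human", "dolly", "dolly", "gpt-4", "gpt-4", "human"]

print(attribution_accuracy(y_true, y_pred))  # exact-source accuracy
print(machine_f1(y_true, y_pred))            # binary human-vs-machine F1
```

In this toy data, a GPT-4 text predicted as Dolly counts as an attribution error but still counts as a correct machine-generated detection, which is why the two numbers can diverge.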
Abstract
With the recent proliferation of Large Language Models (LLMs), there has been an increasing demand for tools to detect machine-generated text. Effective detection of machine-generated text faces two pertinent problems. First, existing detectors are severely limited in generalizing to real-world scenarios, where machine-generated text is produced by a variety of generators, including but not limited to GPT-4 and Dolly, and spans diverse domains, ranging from academic manuscripts to social media posts. Second, existing detection methodologies treat texts produced by LLMs through a restrictive binary classification lens, neglecting the nuanced diversity of artifacts generated by different LLMs. In this work, we undertake a systematic study on the detection of machine-generated text in real-world scenarios. We first study the effectiveness of state-of-the-art approaches and find that they are…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Text Readability and Simplification
Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Absolute Position Encodings · Dropout · Adafactor · Adam
