RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns
Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Ziyang Luo, Di Wang, Min Yang, Lidia S. Chao, Derek F. Wong

TL;DR
RepreGuard is a novel detection method that leverages internal neural activation patterns of LLMs to distinguish between machine-generated and human-written texts, showing high accuracy and robustness across various scenarios.
Contribution
This paper introduces RepreGuard, a statistics-based detector built on the internal representations of LLMs, aimed at closing the robustness gap that existing detection methods show in out-of-distribution (OOD) scenarios.
Findings
Achieves 94.92% AUROC on both in-distribution (ID) and out-of-distribution (OOD) data
Remains robust across varying text lengths and under attacks
Outperforms all baseline detection methods
Abstract
Detecting content generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. Although existing detection methods perform well, their robustness in out-of-distribution (OOD) scenarios is still lacking. In this paper, we hypothesize that, compared to features used by existing detection methods, the internal representations of LLMs contain more comprehensive and raw features that can more effectively capture and distinguish the statistical pattern differences between LLM-generated texts (LGT) and human-written texts (HWT). We validated this hypothesis across different LLMs and observed significant differences in neural activation patterns when processing these two types of texts. Based on this, we propose RepreGuard, an efficient statistics-based detection method. Specifically, we first employ a surrogate model to collect representation…
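The abstract is cut off before it describes the procedure, so the following is only a minimal sketch of the general idea it states: if LGT and HWT produce different activation patterns, projecting a text's activations onto a direction that separates the two classes gives a usable detection score. This is a generic mean-difference probe, not RepreGuard's own algorithm. The random vectors below are synthetic stand-ins for surrogate-model activations, and every function name is hypothetical.

```python
import numpy as np

def mean_difference_direction(lgt_acts, hwt_acts):
    """Unit vector pointing from the mean HWT activation to the mean LGT activation."""
    direction = lgt_acts.mean(axis=0) - hwt_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def projection_scores(acts, direction):
    """One scalar score per text: the activation projected onto the direction."""
    return acts @ direction

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return float(wins / (pos.size * neg.size))

# ASSUMPTION: synthetic Gaussian vectors replace real hidden states; the
# "LGT" class is shifted along one coordinate to mimic a pattern difference.
rng = np.random.default_rng(0)
dim = 16
shift = np.zeros(dim)
shift[0] = 3.0
hwt_train = rng.normal(size=(200, dim))
lgt_train = rng.normal(size=(200, dim)) + shift
hwt_test = rng.normal(size=(200, dim))
lgt_test = rng.normal(size=(200, dim)) + shift

direction = mean_difference_direction(lgt_train, hwt_train)

# Illustrative decision threshold: midpoint of the two training-class mean scores.
threshold = 0.5 * (
    projection_scores(lgt_train, direction).mean()
    + projection_scores(hwt_train, direction).mean()
)

test_auroc = auroc(
    projection_scores(lgt_test, direction),
    projection_scores(hwt_test, direction),
)
print(f"threshold={threshold:.3f}  held-out AUROC={test_auroc:.3f}")
```

With real data, the synthetic arrays would be replaced by per-text activation vectors collected from a surrogate model; which layers and tokens to use is a design choice the truncated abstract does not specify.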
Taxonomy
Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
