ReMoDetect: Reward Models Recognize Aligned LLM's Generations
Hyunseok Lee, Jihoon Tack, Jinwoo Shin

TL;DR
ReMoDetect detects LLM-generated text by exploiting a trait shared by aligned LLMs: their outputs receive higher estimated human-preference scores than human-written text. It reports state-of-the-art results across multiple domains.
Contribution
The paper introduces novel training schemes for reward models that improve detection of aligned LLM-generated texts by exploiting their preference characteristics.
Findings
Reward models can distinguish aligned LLM texts from human texts.
Training with mixed human/LLM texts enhances detection accuracy.
The method achieves state-of-the-art results across six domains and twelve models.
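The first finding can be illustrated with a minimal sketch: if a reward model tends to score aligned-LLM text higher than human text, the reward score itself can serve as a detection statistic. The code below is not the authors' implementation. The function names `auroc` and `detect`, the threshold, and all scores are hypothetical and made up for illustration; `reward_fn` stands in for whatever reward model is used.

```python
from typing import Callable, List, Sequence


def auroc(human_scores: Sequence[float], llm_scores: Sequence[float]) -> float:
    """Probability that a random LLM text outscores a random human text.

    Ties count as half a win. A value of 0.5 means the scores carry no
    signal; a value of 1.0 means perfect separation.
    """
    wins = 0.0
    for llm in llm_scores:
        for human in human_scores:
            if llm > human:
                wins += 1.0
            elif llm == human:
                wins += 0.5
    return wins / (len(llm_scores) * len(human_scores))


def detect(
    texts: Sequence[str],
    reward_fn: Callable[[str], float],
    threshold: float,
) -> List[bool]:
    """Flag a text as LLM-generated when its reward score exceeds the threshold."""
    return [reward_fn(text) > threshold for text in texts]


# Made-up illustrative scores, NOT results from the paper.
human_scores = [0.1, 0.4, 0.35, 0.8]
llm_scores = [0.9, 0.7, 0.3, 0.95]
print(auroc(human_scores, llm_scores))
```

In practice the threshold would be chosen on held-out data, and `reward_fn` would wrap a trained reward model rather than a toy function.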
Abstract
The remarkable capabilities and easy accessibility of large language models (LLMs) have significantly increased societal risks (e.g., fake news generation), necessitating the development of LLM-generated text (LGT) detection methods for safe usage. However, detecting LGTs is challenging due to the vast number of LLMs, making it impractical to account for each LLM individually; hence, it is crucial to identify the common characteristics shared by these models. In this paper, we draw attention to a common feature of recent powerful LLMs, namely alignment training, i.e., training LLMs to generate human-preferable texts. Our key finding is that, because these aligned LLMs are trained to maximize human preferences, they generate texts with even higher estimated preferences than human-written texts; thus, such texts are easily detected by using the reward model (i.e., an LLM trained to…
Taxonomy
Topics: Natural Language Processing Techniques · Machine Learning in Healthcare · Topic Modeling
