ReMoDetect: Reward Models Recognize Aligned LLM's Generations

Hyunseok Lee; Jihoon Tack; Jinwoo Shin

arXiv:2405.17382 · cs.LG · November 8, 2024

TL;DR

ReMoDetect detects LLM-generated text by exploiting a trait shared by aligned LLMs: their outputs receive higher estimated human-preference scores than human-written text. It achieves state-of-the-art results across multiple domains.

Contribution

The paper introduces novel training schemes for reward models that improve detection of aligned LLM-generated texts by exploiting their preference characteristics.
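
The summary above does not state the training objective. As a rough illustration only, reward models are commonly trained with a Bradley-Terry pairwise loss, and the sketch below assumes that form, with the aligned-LLM text treated as the preferred sample and the human-written text as the rejected one. The function name and this pairing are assumptions made for illustration, not the authors' implementation.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected)).

    Hypothetical helper: in this sketch the "preferred" score would come from
    an aligned-LLM text and the "rejected" score from a human-written text.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Minimizing this loss widens the gap between the two scores, which is the property a score-based detector relies on.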

Findings

01. Reward models can distinguish aligned LLM texts from human texts.

02. Training with mixed human/LLM texts enhances detection accuracy.

03. Method achieves state-of-the-art results across six domains and twelve models.
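
The first finding can be sketched as a score-and-threshold detector. The example below is a minimal sketch that assumes each text has already been assigned a scalar reward score by some reward model. The function names, the threshold, and the toy scores are hypothetical and are not results from the paper.

```python
from typing import Sequence


def auroc(human_scores: Sequence[float], llm_scores: Sequence[float]) -> float:
    """Probability that a random LLM text outscores a random human text.

    Ties count as half a win. A value of 0.5 means the scores carry no signal.
    """
    wins = 0.0
    for llm in llm_scores:
        for human in human_scores:
            if llm > human:
                wins += 1.0
            elif llm == human:
                wins += 0.5
    return wins / (len(llm_scores) * len(human_scores))


def is_llm_generated(score: float, threshold: float) -> bool:
    """Flag a text as LLM-generated when its reward score exceeds the threshold."""
    return score > threshold


# Toy, made-up reward scores used only to exercise the functions.
toy_human_scores = [-1.2, 0.3, -0.5]
toy_llm_scores = [1.5, 0.3, 2.1]
toy_auroc = auroc(toy_human_scores, toy_llm_scores)
```

AUROC is threshold-free, so it measures how well the scores separate the two groups before any particular cutoff is chosen for `is_llm_generated`.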

Abstract

The remarkable capabilities and easy accessibility of large language models (LLMs) have significantly increased societal risks (e.g., fake news generation), necessitating the development of LLM-generated text (LGT) detection methods for safe usage. However, detecting LGTs is challenging due to the vast number of LLMs, making it impractical to account for each LLM individually; hence, it is crucial to identify the common characteristics shared by these models. In this paper, we draw attention to a common feature of recent powerful LLMs, namely the alignment training, i.e., training LLMs to generate human-preferable texts. Our key finding is that as these aligned LLMs are trained to maximize the human preferences, they generate texts with higher estimated preferences even than human-written texts; thus, such texts are easily detected by using the reward model (i.e., an LLM trained to…

Code & Models

Models

🤗 hyunseoki/ReMoDetect-deberta
model · 200 downloads · 2 likes

Videos

ReMoDetect: Reward Models Recognize Aligned LLM's Generations · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Machine Learning in Healthcare · Topic Modeling