TL;DR
This paper introduces IRM, a zero-shot method that uses Implicit Reward Models to detect LLM-generated text without additional training. It outperforms existing zero-shot and supervised detection methods on the DetectRL benchmark.
Contribution
The paper presents IRM, a novel zero-shot detection approach that derives implicit reward models from publicly available instruction-tuned and base models, eliminating the need for preference collection or task-specific fine-tuning.
Findings
IRM achieves superior detection performance on the DetectRL benchmark.
IRM outperforms existing zero-shot and supervised detection methods.
IRM does not require preference collection or additional training.
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their ability to generate human-like text has raised concerns about potential misuse. This underscores the need for reliable and effective methods to detect LLM-generated text. In this paper, we propose IRM, a novel zero-shot approach that leverages Implicit Reward Models for LLM-generated text detection. Such implicit reward models can be derived from publicly available instruction-tuned and base models. Prior reward-based detection relies on preference construction and task-specific fine-tuning. In comparison, IRM requires neither preference collection nor additional training. We evaluate IRM on the DetectRL benchmark and demonstrate that it achieves superior detection performance, outperforming existing zero-shot and supervised methods in LLM-generated text detection.
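The abstract does not give the scoring formula, so the following is only an illustrative sketch of how a reward could be derived from an instruction-tuned model and a base model. It assumes the log-likelihood-ratio form of implicit reward known from the DPO literature, r(x) = beta * [log p_tuned(x) - log p_base(x)], and it assumes that a higher score points to LLM-generated text. The function names, the length normalization, and the threshold are hypothetical and are not taken from the paper. The per-token log-probabilities are expected to come from scoring the same token sequence with both models.

```python
from typing import Sequence


def implicit_reward(
    tuned_logprobs: Sequence[float],
    base_logprobs: Sequence[float],
    beta: float = 1.0,
    length_normalize: bool = True,
) -> float:
    """Log-likelihood ratio between an instruction-tuned and a base model.

    Assumed (DPO-style) form, not confirmed by the abstract:
    beta * sum_t [log p_tuned(token_t) - log p_base(token_t)],
    optionally divided by the number of tokens.
    """
    if len(tuned_logprobs) != len(base_logprobs):
        raise ValueError("both models must score the same token sequence")
    if not tuned_logprobs:
        raise ValueError("empty token sequence")

    # Sum the per-token differences of log-probabilities.
    ratio = sum(t - b for t, b in zip(tuned_logprobs, base_logprobs))

    # Hypothetical choice: average over tokens so that texts of
    # different lengths are comparable.
    if length_normalize:
        ratio /= len(tuned_logprobs)
    return beta * ratio


def is_llm_generated(score: float, threshold: float = 0.0) -> bool:
    """Flag a text when its score exceeds a threshold.

    The direction (higher score means LLM-generated) and the default
    threshold are assumptions; in practice the threshold would be
    chosen on held-out data.
    """
    return score > threshold


# Toy per-token log-probabilities for one three-token text.
tuned = [-0.5, -1.0, -0.25]
base = [-1.0, -1.5, -0.75]
score = implicit_reward(tuned, base)
print(score, is_llm_generated(score))
```

Because the score needs only forward passes through two existing models, a sketch like this involves no preference data and no training, which matches the zero-shot setting the abstract describes.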