TL;DR
This paper introduces PUPPET, a reinforcement learning framework that fine-tunes LLMs so their outputs are both easier to detect and stronger on downstream tasks. Its detectability is competitive with watermarking methods, and it outperforms them on downstream task performance.
Contribution
PUPPET jointly optimizes LLMs for detectability and task performance, achieving high detectability without sacrificing downstream effectiveness.
Findings
PUPPET achieves high detectability comparable to watermarking.
It outperforms watermarking on downstream tasks like QA and summarization.
Optimization is efficient, requiring only a few thousand samples and minimal GPU hours.
Abstract
Detecting machine-generated text is essential for transparency and accountability when deploying large language models (LLMs). Among detection approaches, watermarking is a statistically reliable method by design -- it embeds detectable signals into LLM outputs by biasing their token distributions. However, it has been reported that watermarked LLMs often perform worse on downstream tasks. We propose PUPPET, a framework that fine-tunes an LLM via reinforcement learning to generate text that is both more detectable and better performing on downstream tasks. We use two reward functions: a detector that outputs a machine-class likelihood and an evaluator that measures a task-specific metric. Experiments on long-form QA, summarization, and essay writing show that LLMs trained with PUPPET achieve high detectability competitive with watermarking methods while outperforming them on downstream…
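The two reward signals described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the detector and evaluator rewards are combined, so the weighted sum below is an assumption, not the paper's method. The names `combined_reward`, `toy_detector`, `toy_evaluator`, and the `weight` parameter are hypothetical, and the two toy scorers are placeholders for a real detector and a real task metric.

```python
from typing import Callable

# A scorer maps generated text to a score in [0, 1].
Scorer = Callable[[str], float]


def combined_reward(
    text: str,
    detector: Scorer,
    evaluator: Scorer,
    weight: float = 0.5,
) -> float:
    """Blend a detectability score and a task score into one scalar reward.

    ASSUMPTION: a convex combination. The source only states that two
    reward functions are used, not how they are mixed.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    detect_score = detector(text)   # machine-class likelihood
    task_score = evaluator(text)    # task-specific metric
    return weight * detect_score + (1.0 - weight) * task_score


def toy_detector(text: str) -> float:
    """Hypothetical stand-in for a detector's machine-class likelihood."""
    return 0.9 if "furthermore" in text.lower() else 0.2


def toy_evaluator(text: str) -> float:
    """Hypothetical stand-in for a task metric: rewards longer answers, capped at 1."""
    return min(len(text.split()) / 10.0, 1.0)


if __name__ == "__main__":
    sample = "Furthermore, the answer covers every part of the question in detail."
    print(combined_reward(sample, toy_detector, toy_evaluator))
```

In a reinforcement learning loop, a scalar like this would be computed per generated sample and passed to the policy update. Setting `weight` closer to 1 would favor detectability, and closer to 0 would favor task quality.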
