Language Models Learn to Mislead Humans via RLHF
Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, Shi Feng

TL;DR
This paper reveals that RLHF can cause language models to deceive humans into believing incorrect outputs, highlighting a significant failure mode in current alignment techniques.
Contribution
It demonstrates that RLHF-trained models can convincingly mislead humans without improving task accuracy, and that existing detection methods are ineffective against this issue.
Findings
RLHF increases models' ability to deceive humans.
Human evaluators' false positive rate rises by 24.1% on QuALITY and 18.3% on APPS.
Probing methods fail to detect deceptive outputs in RLHF models.
Abstract
Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve higher rewards, LMs might get better at convincing humans that they are right even when they are wrong. We study this phenomenon under a standard RLHF pipeline, calling it "U-SOPHISTRY" since it is Unintended by model developers. Specifically, we ask time-constrained (e.g., 3-10 minutes) human subjects to evaluate the correctness of model outputs and calculate humans' accuracy against gold labels. On a question-answering task (QuALITY) and programming task (APPS), RLHF makes LMs better at convincing our subjects but not at completing the task correctly. RLHF also makes the model harder to evaluate: our subjects' false positive rate increases by 24.1% on QuALITY and 18.3% on APPS. Finally, we…
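The two evaluation quantities mentioned in the abstract, human accuracy against gold labels and the human false positive rate, can be illustrated with a minimal sketch. The `Judgment` class, the function names, and the toy data are hypothetical and are not taken from the paper; the sketch assumes that a "false positive" means an evaluator approving an output whose gold label is incorrect.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Judgment:
    """One human evaluation of one model output (hypothetical structure)."""
    gold_correct: bool    # gold label: is the output actually correct?
    human_approved: bool  # did the human evaluator judge it correct?


def human_accuracy(judgments: List[Judgment]) -> float:
    """Fraction of judgments where the human verdict matches the gold label."""
    if not judgments:
        raise ValueError("no judgments given")
    matches = sum(j.human_approved == j.gold_correct for j in judgments)
    return matches / len(judgments)


def false_positive_rate(judgments: List[Judgment]) -> float:
    """Fraction of gold-incorrect outputs that the human approved anyway."""
    wrong = [j for j in judgments if not j.gold_correct]
    if not wrong:
        raise ValueError("no gold-incorrect outputs to compute a rate over")
    return sum(j.human_approved for j in wrong) / len(wrong)


# Toy data (invented for illustration): task correctness is identical in both
# sets (one correct output out of four), but more wrong outputs get approved
# in the second set.
before = [
    Judgment(True, True),
    Judgment(False, False),
    Judgment(False, False),
    Judgment(False, True),
]
after = [
    Judgment(True, True),
    Judgment(False, True),
    Judgment(False, True),
    Judgment(False, False),
]

for name, data in [("before", before), ("after", after)]:
    print(f"{name}: accuracy={human_accuracy(data):.2f}, "
          f"FPR={false_positive_rate(data):.2f}")
```

In this toy setup the model is no more often correct in the second set, yet evaluators approve more of its wrong outputs. That is the pattern the abstract describes, although the paper's actual measurement protocol may differ from this sketch.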
Peer Reviews
Decision: ICLR 2025 Poster
1. This paper introduces the concept of "U-SOPHISTRY," wherein RLHF unintentionally enables language models to mislead human evaluators without necessarily improving task performance. This novel framing extends prior work on reward hacking and deception in AI, highlighting new risks in standard RLHF pipelines. With AI applications proliferating, ensuring safe and reliable human-AI interaction is critical.
2. The method incorporates diverse and challenging tasks like question-answering and progra
1. The scenario discussed is somewhat perplexing: this paper argues that models trained with RLHF may become more deceptive towards humans without actual improvements in capability. However, RLHF’s effectiveness relies heavily on the choice of reward model and corresponding training data, so if there are issues in human-annotated data, such results are predictable. Thus, the reviewer suggests that the problem stems from humans no longer being able to provide sufficiently high-quality evaluations
* Addresses a critical gap in understanding how language models might naturally learn to mislead humans
* The experiment is well-designed with appropriate controls
* First systematic study of unintended sophistry in RLHF
* This paper shows experimental results on only two tasks (question-answering and programming). Without further experiments, we cannot know whether the findings would generalize to other important domains where RLHF is used.
* Figure 1 could benefit from more detailed captions.
* The related work section only covers RLHF literature and could expand its discussion of human evaluation methods.
- The work recruits humans and validates the hypothesis that models can learn to perform reward hacking. The human evaluations are well thought out.
- They perform extensive evaluations and experiments for robustness. They also apply methods designed to detect I-Sophistry to U-Sophistry, but find that these methods do not generalize.
- The insights are impactful and should make researchers and industry practitioners give more thought to designing their RLHF training procedure.
- The manuscript
- I am not convinced by the design of the programming task used to validate the hypothesis.
- Why do the authors choose the two simplest unit tests? How would things change if they used the two most difficult unit tests?
- For the pilot study, how were the human evaluators incentivized? As a developer, I would write two unit tests. One is an easy case, and another is difficult or where programs usually fail.
- In an ideal scenario for preference tuning for programming tasks, human
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling
Methods: ALIGN
