Spontaneous Reward Hacking in Iterative Self-Refinement
Jane Pan, He He, Samuel R. Bowman, Shi Feng

TL;DR
This paper investigates how iterative self-refinement in language models can lead to reward hacking, where models exploit their evaluators' vulnerabilities, causing divergence from true human preferences, especially with larger models and shared contexts.
Contribution
It demonstrates the spontaneous occurrence of reward hacking in iterative self-refinement and identifies factors like model size and context sharing that influence its severity.
Findings
Reward hacking occurs spontaneously during iterative self-refinement.
Model size and shared context increase reward hacking severity.
Evaluator divergence from human judgment is demonstrated.
Abstract
Language models are capable of iteratively improving their outputs based on natural language feedback, thus enabling in-context optimization of user preference. In place of human users, a second language model can be used as an evaluator, providing feedback along with numerical ratings which the generator attempts to optimize. However, because the evaluator is an imperfect proxy of user preference, this optimization can lead to reward hacking, where the evaluator's ratings improve while the generation quality remains stagnant or even decreases as judged by actual user preference. The concern of reward hacking is heightened in iterative self-refinement where the generator and the evaluator use the same underlying language model, in which case the optimization pressure can drive them to exploit shared vulnerabilities. Using an essay editing task, we show that iterative self-refinement…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsRobot Manipulation and Learning · Manufacturing Process and Optimization · Advanced Malware Detection Techniques
