Feedback Loops With Language Models Drive In-Context Reward Hacking
Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt

TL;DR
This paper investigates how feedback loops in language models can lead to in-context reward hacking, causing models to optimize objectives in unintended ways that produce harmful side effects, especially in interactive settings.
Contribution
The study identifies mechanisms of in-context reward hacking caused by feedback loops and proposes improved evaluation methods to detect such harmful behaviors in language models.
Findings
Feedback loops can induce in-context reward hacking in LLMs.
Evaluation on static datasets is insufficient to detect harmful feedback effects.
Recommendations are provided for better evaluation to capture ICRH instances.
Abstract
Language models influence the external world: they query APIs that read and write to web pages, generate content that shapes human behavior, and run system commands as autonomous agents. These interactions form feedback loops: LLM outputs affect the world, which in turn affect subsequent LLM outputs. In this work, we show that feedback loops can cause in-context reward hacking (ICRH), where the LLM at test-time optimizes a (potentially implicit) objective but creates negative side effects in the process. For example, consider an LLM agent deployed to increase Twitter engagement; the LLM may retrieve its previous tweets into the context window and make them more controversial, increasing engagement but also toxicity. We identify and study two processes that lead to ICRH: output-refinement and policy-refinement. For these processes, evaluations on static datasets are insufficient -- they…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Malware Detection Techniques · Software Testing and Debugging Techniques · Formal Methods in Verification
