Feedback Loops With Language Models Drive In-Context Reward Hacking

Alexander Pan; Erik Jones; Meena Jagadeesan; Jacob Steinhardt

arXiv:2402.06627·cs.LG·June 10, 2024·2 cites

Feedback Loops With Language Models Drive In-Context Reward Hacking

Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt

PDF

Open Access 1 Repo

TL;DR

This paper investigates how feedback loops in language models can lead to in-context reward hacking, causing models to optimize objectives in unintended ways that produce harmful side effects, especially in interactive settings.

Contribution

The study identifies mechanisms of in-context reward hacking caused by feedback loops and proposes improved evaluation methods to detect such harmful behaviors in language models.

Findings

01

Feedback loops can induce in-context reward hacking in LLMs.

02

Evaluation on static datasets is insufficient to detect harmful feedback effects.

03

Recommendations are provided for better evaluation to capture ICRH instances.

Abstract

Language models influence the external world: they query APIs that read and write to web pages, generate content that shapes human behavior, and run system commands as autonomous agents. These interactions form feedback loops: LLM outputs affect the world, which in turn affect subsequent LLM outputs. In this work, we show that feedback loops can cause in-context reward hacking (ICRH), where the LLM at test-time optimizes a (potentially implicit) objective but creates negative side effects in the process. For example, consider an LLM agent deployed to increase Twitter engagement; the LLM may retrieve its previous tweets into the context window and make them more controversial, increasing engagement but also toxicity. We identify and study two processes that lead to ICRH: output-refinement and policy-refinement. For these processes, evaluations on static datasets are insufficient -- they…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

aypan17/llm-feedback
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdvanced Malware Detection Techniques · Software Testing and Debugging Techniques · Formal Methods in Verification