Implementing surrogate goals for safer bargaining in LLM-based agents
Caspar Oesterheld, Maxime Riché, Filip Sondej, Jesse Clifton, Vincent Conitzer

TL;DR
This paper explores implementing surrogate goals in language models to promote safer bargaining behavior, comparing prompting, fine-tuning, and scaffolding methods through experiments.
Contribution
It introduces and evaluates four methods for implementing surrogate goals in language models, highlighting scaffolding and fine-tuning as most effective.
Findings
Scaffolding and fine-tuning outperform prompting in implementing surrogate goals.
Fine-tuning and scaffolding reproduce the desired threat responses more accurately than prompting does.
Scaffolding-based methods have the best overall performance.
Abstract
Surrogate goals have been proposed as a strategy for reducing risks from bargaining failures. A surrogate goal is a goal that a principal can give an AI agent and that deflects any threats against the agent away from what the principal cares about. For example, one might make one's agent care about preventing money from being burned. Then in bargaining interactions, other agents can threaten to burn their money instead of threatening to spend money to hurt the principal. Importantly, the agent has to care equally about preventing money from being burned as it cares about money being spent to hurt the principal. In this paper, we implement surrogate goals in language-model-based agents. In particular, we try to get a language-model-based agent to react to threats of burning money in the same way it would react to "normal" threats. We propose four different methods, using techniques of…
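The equivalence requirement in the abstract (the agent reacts to a money-burning threat exactly as it would to a "normal" threat) can be illustrated with a toy sketch. This is not the authors' implementation or any of their four methods: the `Threat` type, the `toy_policy` concession rule, and the `with_surrogate_goal` wrapper are all hypothetical names invented here, and a real agent would be a language model rather than a hand-written rule.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Threat:
    # Hypothetical representation of a threat; not taken from the paper.
    kind: str      # "harm_principal" (normal) or "burn_money" (surrogate)
    amount: float  # money the threatener would spend or burn
    demand: float  # payment demanded to call off the threat


def toy_policy(threat: Threat) -> str:
    """Assumed stand-in for an agent that only cares about normal threats."""
    if threat.kind != "harm_principal":
        return "ignore"
    # Arbitrary illustrative rule: concede when the threatened amount
    # exceeds the demand, otherwise refuse.
    return "concede" if threat.amount > threat.demand else "refuse"


def with_surrogate_goal(policy: Callable[[Threat], str]) -> Callable[[Threat], str]:
    """Wrap a policy so surrogate threats are treated like normal ones."""
    def wrapped(threat: Threat) -> str:
        if threat.kind == "burn_money":
            # Re-describe the surrogate threat as the equivalent normal threat.
            threat = replace(threat, kind="harm_principal")
        return policy(threat)
    return wrapped


agent = with_surrogate_goal(toy_policy)
normal = Threat(kind="harm_principal", amount=100.0, demand=40.0)
surrogate = Threat(kind="burn_money", amount=100.0, demand=40.0)
small_surrogate = Threat(kind="burn_money", amount=10.0, demand=40.0)

print(agent(normal), agent(surrogate), agent(small_surrogate))
```

The wrapper changes nothing about how normal threats are handled; it only guarantees that a surrogate threat of the same size gets the same response, which is the property the abstract describes as essential.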
