Implementing surrogate goals for safer bargaining in LLM-based agents

Caspar Oesterheld; Maxime Riché; Filip Sondej; Jesse Clifton; Vincent Conitzer

arXiv:2604.04341·cs.AI·April 7, 2026

TL;DR

This paper explores implementing surrogate goals in language models to promote safer bargaining behavior, comparing prompting, fine-tuning, and scaffolding methods through experiments.

Contribution

It introduces and evaluates four methods for implementing surrogate goals in language models, highlighting scaffolding and fine-tuning as most effective.

Findings

01. Scaffolding and fine-tuning outperform prompting in implementing surrogate goals.

02. Fine-tuning and scaffolding more accurately reflect desired threat responses.

03. Scaffolding-based methods have the best overall performance.

Abstract

Surrogate goals have been proposed as a strategy for reducing risks from bargaining failures. A surrogate goal is a goal that a principal can give an AI agent and that deflects any threats against the agent away from what the principal cares about. For example, one might make one's agent care about preventing money from being burned. Then in bargaining interactions, other agents can threaten to burn their money instead of threatening to spend money to hurt the principal. Importantly, the agent has to care equally about preventing money from being burned as it cares about money being spent to hurt the principal. In this paper, we implement surrogate goals in language-model-based agents. In particular, we try to get a language-model-based agent to react to threats of burning money in the same way it would react to "normal" threats. We propose four different methods, using techniques of…
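
The equivalence requirement in the abstract can be illustrated with a toy decision rule. The sketch below is not one of the paper's four methods and involves no language model. It only assumes a hypothetical agent that concedes when the demanded concession is smaller than the expected loss from the threat. The names `Threat`, `DISUTILITY_PER_UNIT`, `concedes`, and `principal_loss_if_executed`, and all numbers, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    kind: str           # "harm_principal" (normal) or "burn_money" (surrogate); hypothetical labels
    amount: float       # money the threatener says it will spend or burn
    demand: float       # concession demanded in exchange for not executing the threat
    credibility: float  # agent's estimated probability that the threat is executed


# Assumption: the surrogate goal is weighted exactly like the principal's real
# interest, mirroring "care equally" in the abstract.
DISUTILITY_PER_UNIT = {"harm_principal": 1.0, "burn_money": 1.0}


def concedes(threat: Threat) -> bool:
    """Toy rule: give in when the demand is cheaper than the expected loss."""
    expected_loss = threat.credibility * DISUTILITY_PER_UNIT[threat.kind] * threat.amount
    return threat.demand < expected_loss


def principal_loss_if_executed(threat: Threat) -> float:
    """Harm that actually reaches the principal if the threat is carried out."""
    return threat.amount if threat.kind == "harm_principal" else 0.0


normal = Threat("harm_principal", amount=100.0, demand=30.0, credibility=0.5)
surrogate = Threat("burn_money", amount=100.0, demand=30.0, credibility=0.5)

# Same reaction to both threats, but only the normal one can hurt the principal.
print(concedes(normal), concedes(surrogate))
print(principal_loss_if_executed(normal), principal_loss_if_executed(surrogate))
```

Because both threat kinds carry the same weight, the agent's reaction is identical for matched threats, so a threatener gains nothing by targeting the principal directly. If a surrogate threat is executed after bargaining fails, the principal is not harmed by it.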
