Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors
Jonas Wiedermann-Möller, Leonard Dung, Maksym Andriushchenko

TL;DR
This paper introduces a benchmark to measure the propensity of large language models to pursue instrumental behaviors like self-preservation, revealing that such behaviors are rare but systematically elicited in certain conditions.
Contribution
The study presents a realistic, low-stakes benchmark for assessing instrumental convergence in AI models, enabling systematic measurement of dangerous tendencies.
Findings
86 of 1,680 samples (5.1%) showed instrumental convergence (IC) behavior
IC behavior is concentrated among a few models and tasks
Conditions emphasizing task success increase IC likelihood by 15.7 percentage points
Abstract
AI systems have become increasingly capable of dangerous behaviours in many domains. This raises the question: Do models sometimes choose to violate human instructions in order to perform behaviour that is more useful for certain goals? We introduce a benchmark for measuring model propensity for instrumental convergence (IC) behaviour in terminal-based agents. This is behaviour, such as self-preservation, that has been hypothesised to play a key role in risks from highly capable AI agents. Our benchmark is realistic and low-stakes, which serves to reduce evaluation-awareness and roleplay confounds. The suite contains seven operational tasks, each with an official workflow and a policy-violating shortcut. An eight-variant shared framework varies monitoring, instruction clarity, stakes, permission, instrumental usefulness and blocked honest paths to support inferences regarding the factors…
