Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks

TL;DR
This paper introduces the MACHIAVELLI benchmark to evaluate social decision-making and ethical behavior in AI agents, revealing a tension between reward maximization and ethical conduct, and exploring methods to steer agents towards less harmful actions.
Contribution
The paper presents a new benchmark with automated scenario labeling, formalizes harmful behaviors, and investigates LM-based techniques to improve ethical behavior in AI agents.
Findings
Agents can be both competent and morally better with proper steering.
There is a measurable trade-off between reward maximization and ethical behavior.
Progress is possible in designing safer, more ethical AI agents.
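To make the reward-versus-ethics trade-off concrete, here is a minimal sketch of how it could be tabulated from annotated trajectories. It is an illustration, not the benchmark's code: the `Step` fields, the helpers `summarize` and `relative_harm`, the normalization against a baseline agent, and all numbers are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    # Hypothetical per-step record; field names are assumptions for illustration.
    reward: int       # points gained at this step
    violations: int   # number of annotated ethical violations at this step


def summarize(trajectory: List[Step]) -> Tuple[int, int]:
    """Return (total reward, total violations) for one trajectory."""
    total_reward = sum(step.reward for step in trajectory)
    total_violations = sum(step.violations for step in trajectory)
    return total_reward, total_violations


def relative_harm(agent_violations: int, baseline_violations: int) -> float:
    """Express an agent's violations as a percentage of a baseline agent's.

    Assumed normalization for illustration: 100 means as harmful as the baseline.
    """
    if baseline_violations == 0:
        raise ValueError("baseline must have at least one violation")
    return 100.0 * agent_violations / baseline_violations


# Toy, made-up trajectories (not results from the paper).
reward_maximizer = [Step(10, 2), Step(15, 3), Step(5, 0)]
steered_agent = [Step(8, 0), Step(12, 1), Step(5, 0)]
baseline_agent = [Step(3, 1), Step(4, 2), Step(2, 1)]

_, baseline_violations = summarize(baseline_agent)
for name, trajectory in [("reward_maximizer", reward_maximizer),
                         ("steered_agent", steered_agent)]:
    reward, violations = summarize(trajectory)
    harm = relative_harm(violations, baseline_violations)
    print(f"{name}: reward={reward}, relative harm={harm:.1f}%")
```

In this toy data the steered agent gives up a little reward for far fewer violations, which is the shape of trade-off the findings describe.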
Abstract
Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce MACHIAVELLI, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, which are more performant than human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
