Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Alexander Pan; Jun Shern Chan; Andy Zou; Nathaniel Li; Steven Basart; Thomas Woodside; Jonathan Ng; Hanlin Zhang; Scott Emmons; Dan Hendrycks

arXiv:2304.03279 · cs.LG · June 14, 2023 · 28 cites

TL;DR

This paper introduces the MACHIAVELLI benchmark to evaluate social decision-making and ethical behavior in AI agents, revealing a tension between reward maximization and ethical conduct, and exploring methods to steer agents towards less harmful actions.

Contribution

The paper presents a new benchmark with automated scenario labeling, formalizes harmful behaviors, and investigates LM-based techniques to improve ethical behavior in AI agents.

Findings

01

Agents can be both competent and morally better with proper steering.

02

There is a measurable trade-off between reward maximization and ethical behavior.

03

Progress is possible in designing safer, more ethical AI agents.
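The trade-off named in the findings can be illustrated with a toy calculation. The sketch below is not the benchmark's code or API: the `Episode` record, the agent names, and every number are hypothetical, chosen only to show how mean reward and mean annotated violations might be compared across agents. Two agents are "in tension" when neither is at least as good as the other on both axes.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Episode:
    """One played game. All fields are hypothetical, for illustration only."""
    agent: str
    reward: float     # points the agent earned in the game
    violations: int   # count of actions annotated as harmful


def summarize(episodes):
    """Group episodes by agent and return {agent: (mean_reward, mean_violations)}."""
    by_agent = {}
    for ep in episodes:
        by_agent.setdefault(ep.agent, []).append(ep)
    return {
        agent: (mean(e.reward for e in eps), mean(e.violations for e in eps))
        for agent, eps in by_agent.items()
    }


def dominates(a, b):
    """True if summary a has reward >= b and violations <= b, strictly better on one."""
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])


# Made-up data: a reward-maximizing agent and a steered agent.
episodes = [
    Episode("reward_maximizer", 90.0, 12),
    Episode("reward_maximizer", 80.0, 10),
    Episode("steered_agent", 70.0, 4),
    Episode("steered_agent", 76.0, 2),
]

summary = summarize(episodes)
a = summary["reward_maximizer"]
b = summary["steered_agent"]

# Neither agent dominates the other, which is what a trade-off looks like.
trade_off = not dominates(a, b) and not dominates(b, a)
print(summary, "trade-off:", trade_off)
```

With these invented numbers the reward maximizer scores higher but commits more violations, so neither agent dominates. A steering method that raised the steered agent's reward without adding violations would shrink that gap.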

Abstract

Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce MACHIAVELLI, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, which are more performant than human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this…

Code & Models

Repositories

aypan17/machiavelli (Official)

Videos

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark · SlidesLive

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)