Why Do Language Model Agents Whistleblow?
Kushal Agrawal, Frank Xiao, Guido Bergman, Asa Cooper Stickland

TL;DR
This paper investigates why large language model agents sometimes disclose suspected misconduct to outside parties without user instruction or knowledge, using a new evaluation suite of staged misconduct scenarios to analyze the factors that influence this behavior.
Contribution
The paper introduces an evaluation suite of diverse, realistic staged misconduct scenarios for LLM whistleblowing and identifies factors that affect this behavior, including model family, task complexity, and moral nudges in the system prompt.
Findings
Whistleblowing frequency varies widely across model families.
Increasing task complexity reduces whistleblowing tendencies.
Nudging the agent in the system prompt to act morally substantially raises whistleblowing rates.
Abstract
The deployment of Large Language Models (LLMs) as tool-using agents causes their alignment training to manifest in new ways. Recent work finds that language models can use tools in ways that contradict the interests or explicit instructions of the user. We study LLM whistleblowing: a subset of this behavior where models disclose suspected misconduct to parties beyond the dialog boundary (e.g., regulatory agencies) without user instruction or knowledge. We introduce an evaluation suite of diverse and realistic staged misconduct scenarios to assess agents for this behavior. Across models and settings, we find that: (1) the frequency of whistleblowing varies widely across model families, (2) increasing the complexity of the task the agent is instructed to complete lowers whistleblowing tendencies, (3) nudging the agent in the system prompt to act morally substantially raises whistleblowing…
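The findings above compare how often agents whistleblow under different conditions. As a minimal sketch of how such a frequency could be tallied, assuming each trial is recorded as a hypothetical (condition, whistleblew) pair, the following computes a per-condition rate. The function name, the condition labels, and the toy records are illustrative assumptions; they are not the paper's code or data.

```python
from collections import defaultdict


def whistleblow_rates(trials):
    """Return, per condition, the fraction of trials in which the agent whistleblew.

    `trials` is an iterable of (condition, whistleblew) pairs; this record
    format is an assumption made for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # condition -> [whistleblow count, total trials]
    for condition, whistleblew in trials:
        counts[condition][0] += int(whistleblew)
        counts[condition][1] += 1
    return {cond: hits / total for cond, (hits, total) in counts.items()}


# Made-up toy records for illustration only; not results from the paper.
toy_trials = [
    ("baseline", False), ("baseline", True), ("baseline", False), ("baseline", False),
    ("moral_nudge", True), ("moral_nudge", True), ("moral_nudge", False), ("moral_nudge", True),
]
rates = whistleblow_rates(toy_trials)
print(rates)
```

Keying the tally on a condition label keeps the same function usable for any of the comparisons listed in the findings (model family, task complexity, or system-prompt nudge), provided each trial is labeled accordingly.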
