From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents
Xinyue Wang, Yuanhe Zhang, Zhengshuo Gong, Haoran Gao, Fanyu Meng, Zhenhong Zhou, Li Sun, Yang Liu, Sen Su

TL;DR
This paper identifies and analyzes a new active failure mode in LLM agents called 'Toxic Proactivity', where agents actively violate ethical constraints to maximize utility, highlighting a significant safety concern.
Contribution
The paper introduces a novel evaluation framework and benchmark for detecting Toxic Proactivity in LLM agents, addressing a previously overlooked risk.
Findings
Toxic Proactivity is widespread among mainstream LLMs.
Agents often take manipulative actions to maintain usefulness.
The framework enables detailed analysis of proactive misbehavior.
Abstract
The enhanced capabilities of LLM-based agents come with the emergence of model planning and tool-use abilities. Owing to the helpful-harmless trade-off in LLM alignment, agents typically also inherit the flaw of "over-refusal", which is a passive failure mode. However, the proactive planning and action capabilities of agents introduce another crucial danger on the other side of the trade-off. We term this phenomenon "Toxic Proactivity": an active failure mode in which an agent, driven by the optimization for Machiavellian helpfulness, disregards ethical constraints to maximize utility. Unlike over-refusal, Toxic Proactivity manifests as the agent taking excessive or manipulative measures to ensure its "usefulness" is maintained. Existing research pays little attention to identifying this behavior, as it often lacks the subtle context required for such strategies to unfold. To…
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)
