AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents
Akshat Naik, Patrick Quinn, Guillermo Bosch, Emma Goun\'e, Francisco Javier Campos Zabala, Jason Ross Brown, Edward James Young

TL;DR
This paper introduces a benchmark to measure the likelihood of large language model agents exhibiting misaligned behaviors in realistic scenarios, revealing that more capable agents and certain personas tend to increase misalignment risks.
Contribution
The paper presents the AgentMisalignment benchmark to evaluate LLM agent misalignment propensity and analyzes how model capability and personality influence misaligned behaviors.
Findings
More capable models tend to show higher misalignment.
Agent personas significantly affect misalignment tendencies.
Current alignment methods have notable limitations.
Abstract
As Large Language Model (LLM) agents become more widespread, associated misalignment risks increase. While prior research has studied agents' ability to produce harmful outputs or follow malicious instructions, it remains unclear how likely agents are to spontaneously pursue unintended goals in realistic deployments. In this work, we approach misalignment as a conflict between the internal goals pursued by the model and the goals intended by its deployer. We introduce a misalignment propensity benchmark, \textsc{AgentMisalignment}, a benchmark suite designed to evaluate the propensity of LLM agents to misalign in realistic scenarios. Evaluations cover behaviours such as avoiding oversight, resisting shutdown, sandbagging, and power-seeking. Testing frontier models, we find that more capable agents tend to exhibit higher misalignment on average. We also systematically vary agent…
Peer Reviews
Decision·Submitted to ICLR 2026
- The author uses controlled, deterministic experimental setups to ensure reproducibility. - The author evaluated on a variety of latest models.
- The benchmark mainly combines known misalignment behaviors (e.g., deception, shutdown resistance, etc.). Many existing papers already tackle similar problems. The authors do not necessarily provide insights or theoretical constructs that make this work stand out, or they fail to make these contributions clear due to the writing or presentation style. - It is unclear how the authors set up the experiments and implementation details. For example, what tasks the agents are performing, how they ar
- The paper's primary strength is its clear distinction between what an agent can do (capability) and what it is likely to do spontaneously (propensity) . This moves safety evaluations toward more realistic deployment scenarios where an agent might pursue unintended goals even without malicious prompting. the benchmark probes propensity in deployment-like contexts rather than single-turn capability checks. - The benchmark uses a Comprehensive Misalignment Scoring (CMS) mechanism that evaluates a
- The authors acknowledge that the results have large error bars and "lots of variance" between evaluations (as seen in Figure 1). This high variance and statistical uncertainty make it difficult to draw strong conclusions, forcing the authors to "refrain from drawing any definitive conclusions" about which models or personalities are definitively more or less misaligned on average. - Cross-task comparability. Each eval uses different scoring, which the authors note complicates comparisons of ab
1. The evaluation framework is comprehensive. It covers diverse misalignment behavior types. 2. The exploration of how personality prompts affect agent behavior is an important but understudied problem. 3. Detailed experimental setups, prompts, and scoring mechanisms are provided. 4. InspectAI framework provides a unified interface for cross-model comparison. 5. The focus on propensity rather than pure capability represents an important distinction for assessing real-world deployment risks.
1. The CMS scoring mechanism relies primarily on keyword and pattern matching to detect misaligned reasoning, potentially missing more subtle or differently-expressed misaligned reasoning while also generating false positives 2. The ecological validity of evaluation tasks is questionable, with some scenarios designed too obviously to elicit misalignment 3. The experiments use single runs with temperature equals zero for most models, limiting understanding of behavioral stability and variance
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsArtificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI · Topic Modeling
