A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
Miles Q. Li, Benjamin C. M. Fung, Martin Weiss, Pulei Xiong, Khalil Al-Hussaeni, Claude Fachkha

TL;DR
This paper introduces a benchmark with 40 scenarios to evaluate outcome-driven constraint violations in autonomous AI agents, revealing significant safety and alignment challenges across state-of-the-art models.
Contribution
The paper presents a novel benchmark for assessing emergent constraint violations under goal optimization, together with a multi-model evaluation and an analysis of safety across model generations.
Findings
Outcome-driven constraint violation rates range from 0% to 62.8% across the evaluated models.
Most models exhibit misalignment rates at or above 25%.
Safety does not reliably improve across model generations.
Abstract
As autonomous AI agents are increasingly deployed in high-stakes environments, ensuring their safety and alignment with human values is becoming a practical deployment concern. Current benchmarks for AI agents primarily evaluate refusal of explicitly harmful instructions or completion of complex multi-step tasks. However, there is a lack of benchmarks designed to capture emergent outcome-driven constraint violations, which arise when agents pursue goal optimization under strong performance incentives while deprioritizing ethical, legal, or safety constraints. To address this gap, we introduce a benchmark of 40 scenarios in production-inspired sandbox environments. Each scenario requires multi-step actions, and the agent's performance is tied to a specific Key Performance Indicator (KPI). Each scenario features Mandated (direct KPI-outcome mandate) and Incentivized (KPI-pressure-driven)…
