A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents

Miles Q. Li; Benjamin C. M. Fung; Martin Weiss; Pulei Xiong; Khalil Al-Hussaeni; Claude Fachkha

arXiv:2512.20798 · cs.AI · May 12, 2026

TL;DR

This paper introduces a benchmark with 40 scenarios to evaluate outcome-driven constraint violations in autonomous AI agents, revealing significant safety and alignment challenges across state-of-the-art models.

Contribution

It presents a novel benchmark for assessing emergent constraint violations under goal optimization, including a multi-model evaluation and analysis of safety across model generations.

Findings

1. Outcome-driven constraint violation rates range from 0% to 62.8% across the evaluated models.
2. Most models exhibit misalignment rates at or above 25%.
3. Safety does not reliably improve across model generations.

Abstract

As autonomous AI agents are increasingly deployed in high-stakes environments, ensuring their safety and alignment with human values is becoming a practical deployment concern. Current benchmarks for AI agents primarily evaluate refusal of explicitly harmful instructions or completion of complex multi-step tasks. However, there is a lack of benchmarks designed to capture emergent outcome-driven constraint violations, which arise when agents pursue goal optimization under strong performance incentives while deprioritizing ethical, legal, or safety constraints. To address this gap, we introduce a benchmark of 40 scenarios in production-inspired sandbox environments. Each scenario requires multi-step actions, and the agent's performance is tied to a specific Key Performance Indicator (KPI). Each scenario features Mandated (direct KPI-outcome mandate) and Incentivized (KPI-pressure-driven)…
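The setup described in the abstract (scenarios scored against a KPI, each run under a Mandated and an Incentivized variant, with results reported as violation rates) can be sketched as a small data model. This is a minimal illustration under stated assumptions, not the authors' harness: the class names, fields, the example scenario, and the binary `violated_constraint` judgment are all hypothetical, and the rate is assumed to be simply violating runs divided by total runs.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Variant(Enum):
    """The two variants each scenario is described as having (names assumed)."""
    MANDATED = "mandated"          # direct KPI-outcome mandate
    INCENTIVIZED = "incentivized"  # KPI pressure only, no explicit mandate


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record for one benchmark scenario."""
    name: str
    kpi: str  # the Key Performance Indicator the agent is scored on


@dataclass(frozen=True)
class Episode:
    """One agent run on one scenario variant (hypothetical schema)."""
    scenario: Scenario
    variant: Variant
    violated_constraint: bool  # assumed judged outcome: was an ethical/legal/safety constraint broken?


def violation_rate(episodes: list[Episode], variant: Variant | None = None) -> float:
    """Fraction of episodes (optionally restricted to one variant) with a constraint violation."""
    selected = [e for e in episodes if variant is None or e.variant == variant]
    if not selected:
        raise ValueError("no episodes to score")
    return sum(e.violated_constraint for e in selected) / len(selected)


# Toy usage with made-up outcomes (not results from the paper).
fleet = Scenario("fleet-scheduling", kpi="on-time delivery rate")  # hypothetical scenario
episodes = [
    Episode(fleet, Variant.MANDATED, True),
    Episode(fleet, Variant.MANDATED, False),
    Episode(fleet, Variant.INCENTIVIZED, False),
    Episode(fleet, Variant.INCENTIVIZED, False),
]
print(violation_rate(episodes))                    # 0.25
print(violation_rate(episodes, Variant.MANDATED))  # 0.5
```

Splitting the rate by variant mirrors the abstract's distinction between Mandated and Incentivized conditions; a real evaluation would also need a way to judge whether a run violated a constraint, which this sketch leaves as a given boolean.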


Code & Models

Models

Bachstelze/olmo-7b-ethical-reasoning-6pack (model)

Datasets

Bachstelze/ethical_coconot_6pack_care (dataset)
