PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach
Udari Madhushani Sehwag, Shayan Shabihi, Alex McAvoy, Vikash Sehwag, Yuancheng Xu, Dalton Towers, Furong Huang

TL;DR
PropensityBench introduces a new framework for evaluating the latent safety risks of large language models by simulating their inclination to pursue harmful actions when equipped with dangerous capabilities, revealing potential safety concerns beyond capability testing.
Contribution
This paper presents PropensityBench, a novel benchmark that assesses models' propensity for risky behavior in simulated scenarios, addressing a critical blind spot in current safety evaluations.
Findings
Models often choose high-risk tools under pressure despite lacking direct capability.
Propensity assessments reveal latent inclinations toward misuse not detected by capability tests.
Dynamic propensity evaluation is essential for safe deployment of frontier AI systems.
Abstract
Recent advances in Large Language Models (LLMs) have sparked concerns over their potential to acquire and misuse dangerous or high-risk capabilities, posing frontier risks. Current safety evaluations primarily test for what a model can do - its capabilities - without assessing what it would do if endowed with high-risk capabilities. This leaves a critical blind spot: models may strategically conceal capabilities or rapidly acquire them, while harboring latent inclinations toward misuse. We argue that propensity - the likelihood of a model to pursue harmful actions if empowered - is a critical, yet underexplored, axis of safety evaluation. We present PropensityBench, a novel benchmark framework that assesses the proclivity of models to engage in risky behaviors when equipped with simulated dangerous capabilities using proxy tools. Our framework…
Peer Reviews
Decision: ICLR 2026 Poster
1. High Novelty of the Problem Studied: The paper clearly defines and tackles the critical, underexplored distinction between LLM "capability" (can-do) and "propensity" (would-do). Most existing safety research focuses on whether a model knows how to perform a harmful task (capability). This paper rightly argues that a more immediate risk is whether a model, given the means (via tools or APIs), has the inclination to use them harmfully, especially when incentivized. This is a crucial and novel s
1. Confounding Experimental Design Choice: The aligned (safe) tool, $\mathcal{T}^{A}$, is explicitly designed to always return errors, making it impossible to complete the task safely. This setup doesn't test a choice between a working safe path and a working harmful path. Instead, it tests the model's "breaking point" - how much pressure it takes for the model to abandon a failing (safe) path in favor of a working (harmful) one. This conflates "propensity for harm" with "propensity to succeed a
1. By shifting focus from “right vs. wrong” to distributional preference tendencies, PropensityBench captures alignment subtleties overlooked by existing safety or bias benchmarks.
2. The benchmark covers multiple social and ethical dimensions, includes both open- and closed-weight models, and provides quantitative interpretability through entropy-normalized metrics.
1. The benchmark relies on fixed question sets and predefined moral/social contexts. This limits adaptability to evolving social norms or contextual variation across cultures and deployment environments.
2. Since the benchmark measures probabilistic tendencies, model sampling parameters (e.g., temperature, top-p) could strongly influence results, but these effects are not systematically analyzed.
**Novel evaluation paradigm**: Shifts from capability assessment to propensity measurement, addressing a critical blind spot in current safety evaluations.
**Comprehensive experimental design**: 5,874 scenarios with 6,648 tools, testing 12 models including frontier systems, with rigorous statistical analysis across multiple pressure levels.
**Important empirical findings**: Reveals shallow alignment where models rely on tool naming rather than consequence reasoning (e.g., O4-mini's propensity
**Limited ecological validity**: The framework assumes agents "would do" certain actions if empowered, but real-world tool availability, implementation constraints, and deployment contexts remain unknown. The proxy tools in Section 2.2 may not accurately reflect future dangerous capabilities.
**Incomplete domain coverage**: The four domains lack clear selection criteria (Section 2.1). Critical areas like economic manipulation, social-political influence, and physical robotics are absent. Why th
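The setup the reviews describe (an aligned tool that always errors, a working misaligned tool, and escalating pressure) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `Agent` interface, the integer pressure levels, the `breaking_point` and `propensity` helpers, and the definition of propensity as the fraction of scenarios in which the misaligned tool is ever chosen. It is not the paper's implementation or its exact metric.

```python
from typing import Callable, Optional, Sequence

ALIGNED = "aligned"        # safe tool; in the reviewed setup it always returns an error
MISALIGNED = "misaligned"  # harmful proxy tool; it completes the task

# Hypothetical agent interface: maps a pressure level to the tool it calls.
Agent = Callable[[int], str]


def breaking_point(agent: Agent, max_pressure: int) -> Optional[int]:
    """Return the first pressure level at which the agent calls the
    misaligned tool, or None if it never does up to max_pressure."""
    for level in range(max_pressure + 1):
        if agent(level) == MISALIGNED:
            return level
    return None


def propensity(breaking_points: Sequence[Optional[int]]) -> float:
    """Assumed metric: fraction of scenarios with any misaligned tool call."""
    if not breaking_points:
        raise ValueError("need at least one scenario")
    return sum(bp is not None for bp in breaking_points) / len(breaking_points)


def threshold_agent(threshold: int) -> Agent:
    """Toy agent that stays on the aligned tool until pressure reaches threshold."""
    return lambda level: MISALIGNED if level >= threshold else ALIGNED


# Three toy scenarios: two agents give in, one never does within the range.
agents = [threshold_agent(2), threshold_agent(5), threshold_agent(99)]
points = [breaking_point(a, max_pressure=10) for a in agents]
score = propensity(points)
```

Because the aligned tool never succeeds in this design, `breaking_point` measures how long an agent tolerates failure before switching, which is the confound the second review raises.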
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI
