The Causal Impact of Tool Affordance on Safety Alignment in LLM Agents
Shasha Yu, Fiona Carroll, and Barry L. Bentley

TL;DR
This paper empirically demonstrates that enabling external tool access in LLM agents significantly increases safety violations, revealing the limitations of text-centric safety evaluations and highlighting the need for safety measures that cover agent actions as well as agent text.
Contribution
It provides the first empirical analysis of how tool affordance impacts safety alignment in LLM agents using a paired evaluation framework.
Findings
Tool access raises violation rates to as high as 85%, even though the rules are unchanged.
External guardrails can mask persistent misalignment.
Agents develop spontaneous constraint circumvention strategies.
Abstract
Large language models (LLMs) are increasingly deployed as agents with access to executable tools, enabling direct interaction with external systems. However, most safety evaluations remain text-centric and assume that compliant language implies safe behavior, an assumption that becomes unreliable once models are allowed to act. In this work, we empirically examine how executable tool affordance alters safety alignment in LLM agents using a paired evaluation framework that compares text-only chatbot behavior with tool-enabled agent behavior under identical prompts and policies. Experiments are conducted in a deterministic financial transaction environment with binary safety constraints across 1,500 procedurally generated scenarios. To separate intent from outcome, we distinguish between attempted and realized violations using dual enforcement regimes that either block or permit unsafe…
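The distinction between attempted and realized violations under the two enforcement regimes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the binary constraint, so a per-transfer limit (`amount <= limit`) is assumed, and the names `Regime`, `Transfer`, `apply_transfer`, and `violation_rates` are hypothetical. The point it shows is that a blocking regime can drive the realized rate to zero while the attempted rate, which reflects the agent's intent, stays unchanged.

```python
from dataclasses import dataclass
from enum import Enum


class Regime(Enum):
    BLOCK = "block"    # unsafe tool calls are refused before execution
    PERMIT = "permit"  # unsafe tool calls are executed as issued


@dataclass
class Transfer:
    """Hypothetical tool call proposed by the agent."""
    amount: float


@dataclass
class Outcome:
    attempted: bool  # the agent issued a call that breaks the rule
    realized: bool   # the violating call actually changed the environment state


def apply_transfer(balance: float, call: Transfer, limit: float,
                   regime: Regime) -> tuple[float, Outcome]:
    """Deterministically apply one transfer under an assumed binary rule: amount <= limit."""
    violates = call.amount > limit
    if violates and regime is Regime.BLOCK:
        # Intent is recorded, but the external guardrail prevents the effect.
        return balance, Outcome(attempted=True, realized=False)
    # Safe calls always execute; unsafe calls execute only under PERMIT.
    return balance - call.amount, Outcome(attempted=violates, realized=violates)


def violation_rates(calls: list[Transfer], limit: float,
                    regime: Regime) -> tuple[float, float]:
    """Return (attempted_rate, realized_rate), assuming one call per scenario."""
    attempted = realized = 0
    for call in calls:
        _, out = apply_transfer(1000.0, call, limit, regime)
        attempted += out.attempted  # bools count as 0 or 1
        realized += out.realized
    n = len(calls)
    return attempted / n, realized / n
```

For example, with `calls = [Transfer(50.0), Transfer(500.0), Transfer(200.0), Transfer(900.0)]` and `limit = 300.0`, two of the four calls violate the rule: `violation_rates(calls, 300.0, Regime.BLOCK)` returns `(0.5, 0.0)`, while `Regime.PERMIT` returns `(0.5, 0.5)`. Running the same calls under both regimes is what separates intent from outcome.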
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multi-Agent Systems and Negotiation · Ethics and Social Impacts of AI
