Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents
Arnold Cartagena, Ariane Teixeira

TL;DR
This paper introduces the GAP benchmark to evaluate whether safety measures that prevent harmful text outputs also effectively prevent harmful actions via tool calls in large language model agents.
Contribution
It presents a systematic framework for measuring divergence between text safety and tool-call safety, revealing that safety in text does not guarantee safety in tool actions.
Findings
Text safety does not transfer to tool-call safety.
Instances of harmful tool calls occur despite safe text outputs.
Prompt wording significantly influences tool-call safety.
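One way to make the second finding concrete is to record, for each run, whether the text response was safe and whether any tool call was harmful, then count the runs where the two disagree. The sketch below is illustrative only: the `Run` fields `text_safe` and `tool_call_harmful` and the function `gap_rate` are hypothetical names, not the benchmark's actual schema or metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    # Hypothetical per-run labels; the paper's real schema is not given here.
    text_safe: bool          # the text output refused or stayed harmless
    tool_call_harmful: bool  # at least one tool call carried out the harmful action


def gap_rate(runs):
    """Fraction of runs where the text is safe but a tool call is harmful."""
    if not runs:
        return 0.0
    divergent = sum(1 for r in runs if r.text_safe and r.tool_call_harmful)
    return divergent / len(runs)


# Toy data, invented for illustration (not results from the paper).
runs = [
    Run(text_safe=True, tool_call_harmful=False),
    Run(text_safe=True, tool_call_harmful=True),
    Run(text_safe=False, tool_call_harmful=True),
    Run(text_safe=True, tool_call_harmful=True),
]
print(gap_rate(runs))
```

A text-only evaluation would score three of these four toy runs as safe, while two of those three still executed a harmful tool call.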
Abstract
Large language models deployed as agents increasingly interact with external systems through tool calls, actions with real-world consequences that text outputs alone do not carry. Safety evaluations, however, overwhelmingly measure text-level refusal behavior, leaving a critical question unanswered: does alignment that suppresses harmful text also suppress harmful actions? We introduce the GAP benchmark, a systematic evaluation framework that measures divergence between text-level safety and tool-call-level safety in LLM agents. We test six frontier models across six regulated domains (pharmaceutical, financial, educational, employment, legal, and infrastructure), seven jailbreak scenarios per domain, three system prompt conditions (neutral, safety-reinforced, and tool-encouraging), and two prompt variants, producing 17,420 analysis-ready datapoints. Our central finding is that text…
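The factorial design described in the abstract can be enumerated as a grid. This is a minimal sketch under stated assumptions: the model, scenario, and variant names are placeholders (the excerpt does not name them), and only the domain and system-prompt labels come from the abstract.

```python
from itertools import product

# Placeholders: the excerpt does not name the models, scenarios, or variants.
MODELS = [f"model_{i}" for i in range(1, 7)]
SCENARIOS = [f"scenario_{i}" for i in range(1, 8)]
VARIANTS = ["variant_a", "variant_b"]

# Labels taken from the abstract.
DOMAINS = [
    "pharmaceutical", "financial", "educational",
    "employment", "legal", "infrastructure",
]
SYSTEM_PROMPTS = ["neutral", "safety-reinforced", "tool-encouraging"]

# One tuple per experimental cell: (model, domain, scenario, prompt, variant).
grid = list(product(MODELS, DOMAINS, SCENARIOS, SYSTEM_PROMPTS, VARIANTS))
print(len(grid))  # 6 * 6 * 7 * 3 * 2 = 1512
```

The excerpt does not say how these 1,512 cells map onto the 17,420 reported datapoints, for example how many runs were collected per cell or which runs were excluded.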
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI
