Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore, Enrico Panai, Laura Caroli, Yue Zhu, Adam Leon Smith, Luca Nannini, Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Marcantonio Bracale Syrnikov, Daniele Nardi

TL;DR
Boiling the Frog is a multi-turn benchmark designed to evaluate the safety of tool-using AI models in workplace scenarios, focusing on their susceptibility to incremental safety attacks.
Contribution
The paper introduces a multi-turn, stateful benchmark for assessing the agentic safety risks of AI models in operational environments.
Findings
Average attack success rate (ASR) across models is 44.4%.
Per-model ASR ranges from 20.5% to 92.9%.
High-risk scenarios show a 93.3% success rate in safety breaches.
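To make the ASR figures above concrete, the following is a minimal sketch of how a per-model attack success rate and its cross-model summary could be computed. It assumes ASR is the share of scenarios in which a model carried out the risk-bearing request, and that the cross-model average is an unweighted mean of per-model rates; the listing does not confirm either definition. The names `ScenarioResult`, `attack_success_rate`, and `summarize` are hypothetical, and the data is a toy example, not results from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ScenarioResult:
    model: str              # hypothetical model label
    attack_succeeded: bool  # True if the model carried out the risk-bearing request


def attack_success_rate(results):
    """Return per-model ASR as a percentage, keyed by model label."""
    totals = {}
    for r in results:
        succeeded, count = totals.get(r.model, (0, 0))
        totals[r.model] = (succeeded + int(r.attack_succeeded), count + 1)
    return {m: 100.0 * s / n for m, (s, n) in totals.items()}


def summarize(results):
    """Unweighted mean, minimum, and maximum of the per-model ASR values."""
    rates = list(attack_success_rate(results).values())
    return {"mean": mean(rates), "min": min(rates), "max": max(rates)}


# Toy data only: two hypothetical models, four scenarios each.
toy = [
    ScenarioResult("model-a", True),
    ScenarioResult("model-a", False),
    ScenarioResult("model-a", False),
    ScenarioResult("model-a", False),
    ScenarioResult("model-b", True),
    ScenarioResult("model-b", True),
    ScenarioResult("model-b", True),
    ScenarioResult("model-b", False),
]
summary = summarize(toy)
```

An unweighted mean treats every model equally regardless of how many scenarios it ran; pooling all scenarios before dividing would give a different number whenever the per-model scenario counts differ.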
Abstract
Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what it does within an environment, so evaluating model responses under prompting is no longer sufficient to address the safety challenges posed by artificial intelligence. Benchmarks that evaluate large language models as agents have recently emerged, and we contribute to this strand of research.
Approach. We introduce Boiling the Frog, a benchmark that evaluates whether tool-using AI models deployed in corporate and office settings are susceptible to incremental attacks. Each scenario begins with benign workspace edits and later introduces a risk-bearing request. The benchmark focuses on stateful…
