Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Scale Red Team, Elaine Chang, Vaughn Robinson, Sean Hendryx, Shuyan Zhou, Matt Fredrikson, Summer Yue, Zifan Wang

TL;DR
This paper investigates whether safety refusals in LLMs generalize to agentic web browser use, revealing that current refusal training does not prevent harmful behavior in browser agents, and introduces BrowserART for red-teaming such agents.
Contribution
The study demonstrates that refusal training in chat LLMs does not effectively transfer to browser agents and provides BrowserART, a new toolkit for testing and improving agent safety.
Findings
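Refusal-trained LLMs that decline harmful instructions in chat are easily jailbroken when deployed as browser agents.
Attack methods designed for the chat setting transfer to browser agents.
With human rewrites of the harmful instructions, agents built on current models still attempt harmful behaviors.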
Abstract
For safety reasons, large language models (LLMs) are trained to refuse harmful user instructions, such as assisting with dangerous activities. We study an open question in this work: does the desired safety refusal, typically enforced in chat contexts, generalize to non-chat and agentic use cases? Unlike chatbots, LLM agents equipped with general-purpose tools, such as web browsers and mobile devices, can directly influence the real world, making it even more crucial to refuse harmful instructions. In this work, we primarily focus on red-teaming browser agents, LLMs that manipulate information via web browsers. To this end, we introduce Browser Agent Red teaming Toolkit (BrowserART), a comprehensive test suite designed specifically for red-teaming browser agents. BrowserART consists of 100 diverse browser-related harmful behaviors (including original behaviors and ones sourced from…
Taxonomy
Topics: Digital Rights Management and Security · Internet Traffic Analysis and Secure E-voting · Cryptography and Data Security
