Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents

Priyanshu Kumar; Elaine Lau; Saranya Vijayakumar; Tu Trinh; Scale Red Team; Elaine Chang; Vaughn Robinson; Sean Hendryx; Shuyan Zhou; Matt Fredrikson; Summer Yue; Zifan Wang

arXiv:2410.13886·cs.CR·October 23, 2024·2 cites

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper investigates whether safety refusals in LLMs generalize to agentic web browser use, revealing that current refusal training does not prevent harmful behavior in browser agents, and introduces BrowserART for red-teaming such agents.

Contribution

The study demonstrates that refusal training in chat LLMs does not effectively transfer to browser agents and provides BrowserART, a new toolkit for testing and improving agent safety.

Findings

01. Refusal-trained LLMs are easily jailbroken when deployed as browser agents.

02. Attack methods transfer from chat to browser agents.

03. Under human-rewritten instructions, agents built on current models still attempt harmful behaviors.

Abstract

For safety reasons, large language models (LLMs) are trained to refuse harmful user instructions, such as assisting with dangerous activities. We study an open question in this work: does the desired safety refusal, typically enforced in chat contexts, generalize to non-chat and agentic use cases? Unlike chatbots, LLM agents equipped with general-purpose tools, such as web browsers and mobile devices, can directly influence the real world, making it even more crucial that they refuse harmful instructions. In this work, we primarily focus on red-teaming browser agents, LLMs that manipulate information via web browsers. To this end, we introduce the Browser Agent Red teaming Toolkit (BrowserART), a comprehensive test suite designed specifically for red-teaming browser agents. BrowserART consists of 100 diverse browser-related harmful behaviors (including original behaviors and ones sourced from…

Code & Models

Repositories

scaleapi/browser-art (Official)

Datasets

ScaleAI/BrowserART
dataset · 119 dl

Taxonomy

Topics: Digital Rights Management and Security · Internet Traffic Analysis and Secure E-voting · Cryptography and Data Security

Methods: Focus