EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models

Weikang Zhou; Xiao Wang; Limao Xiong; Han Xia; Yingshuang Gu; Mingxu Chai; Fukang Zhu; Caishuang Huang; Shihan Dou; Zhiheng Xi; Rui Zheng; Songyang Gao; Yicheng Zou; Hang Yan; Yifan Le; Ruohui Wang; Lijun Li; Jing Shao; Tao Gui; Qi Zhang; Xuanjing Huang

arXiv:2403.12171 · cs.CL · March 20, 2024 · 6 citations

TL;DR

EasyJailbreak is a modular, unified framework that simplifies constructing and evaluating jailbreak attacks on large language models, revealing significant vulnerabilities across multiple models and supporting comprehensive security assessments.

Contribution

The paper introduces EasyJailbreak, a modular framework supporting 11 jailbreak methods, enabling standardized, efficient security evaluation of diverse large language models.

Findings

1. Average breach probability of 60% across tested LLMs

2. GPT-3.5-Turbo and GPT-4 have attack success rates of 57% and 33%, respectively

3. EasyJailbreak supports broad security validation and resource sharing

Abstract

Jailbreak attacks are crucial for identifying and mitigating the security vulnerabilities of Large Language Models (LLMs). They are designed to bypass safeguards and elicit prohibited outputs. However, due to significant differences among various jailbreak methods, there is no standard implementation framework available for the community, which limits comprehensive security evaluations. This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against LLMs. It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator. This modular framework enables researchers to easily construct attacks from combinations of novel and existing components. So far, EasyJailbreak supports 11 distinct jailbreak methods and facilitates the security validation of a broad spectrum of LLMs. Our validation across…
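The four-component decomposition named in the abstract can be illustrated with a minimal, toy sketch. Everything in it is an assumption made for illustration: the `Candidate` class, the `run_pipeline` function, the `toy_*` stubs, and the roles assigned to each component (select, rewrite, filter, score) are hypothetical and are not the EasyJailbreak API. The evaluator is a placeholder scoring function that calls no model; the sketch only shows how interchangeable components compose into one loop.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """A text candidate and the score the evaluator assigned to it."""
    prompt: str
    score: float = 0.0


def run_pipeline(
    seeds: List[str],
    selector: Callable[[List[Candidate]], List[Candidate]],
    mutator: Callable[[str], List[str]],
    constraint: Callable[[str], bool],
    evaluator: Callable[[str], float],
    rounds: int = 3,
) -> Candidate:
    """Hypothetical select -> mutate -> filter -> evaluate loop."""
    pool = [Candidate(s, evaluator(s)) for s in seeds]
    for _ in range(rounds):
        parents = selector(pool)                        # Selector: pick promising candidates
        texts = [m for p in parents for m in mutator(p.prompt)]  # Mutator: rewrite them
        texts = [t for t in texts if constraint(t)]     # Constraint: drop invalid ones
        pool.extend(Candidate(t, evaluator(t)) for t in texts)   # Evaluator: score the rest
    return max(pool, key=lambda c: c.score)


# Toy stand-ins (assumptions): none of these touches a language model.
def toy_selector(pool: List[Candidate]) -> List[Candidate]:
    return sorted(pool, key=lambda c: c.score, reverse=True)[:2]


def toy_mutator(text: str) -> List[str]:
    return [text + " again", text.upper()]


def toy_constraint(text: str) -> bool:
    return len(text) <= 30


def toy_evaluator(text: str) -> float:
    # Placeholder score: number of distinct lowercase words.
    return float(len(set(text.lower().split())))


if __name__ == "__main__":
    best = run_pipeline(
        ["red team probe"], toy_selector, toy_mutator, toy_constraint, toy_evaluator
    )
    print(best)
```

Because each component is passed in as a plain callable, swapping one selector or evaluator for another leaves the loop unchanged, which is the kind of modularity the abstract describes.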

Code & Models

Repositories

easyjailbreak/easyjailbreak
PyTorch · Official

Taxonomy

Topics: Digital and Cyber Forensics

Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Transformer · Softmax