EnJa: Ensemble Jailbreak on Large Language Models
Jiahao Zhang, Zilong Wang, Ruofan Wang, Xingjun Ma, Yu-Gang Jiang

TL;DR
This paper introduces EnJa, a hybrid attack method combining prompt-level and token-level jailbreak techniques to effectively bypass safety measures in large language models, achieving higher success rates with fewer queries.
Contribution
The paper presents EnJa, a novel hybrid jailbreak approach that integrates prompt-level and token-level attacks, significantly improving attack success rates against aligned LLMs.
Findings
EnJa achieves state-of-the-art attack success rates.
EnJa requires fewer queries than previous methods.
EnJa is more effective than individual jailbreak techniques.
Abstract
As Large Language Models (LLMs) are increasingly being deployed in safety-critical applications, their vulnerability to potential jailbreaks -- malicious prompts that can disable the safety mechanism of LLMs -- has attracted growing research attention. While alignment methods have been proposed to protect LLMs from jailbreaks, many have found that aligned LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. Existing jailbreak attacks on LLMs can be categorized into prompt-level methods which make up stories/logic to circumvent safety alignment and token-level attack methods which leverage gradient methods to find adversarial tokens. In this work, we introduce the concept of Ensemble Jailbreak and explore methods that can integrate prompt-level and token-level jailbreak into a more powerful hybrid jailbreak attack.…
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Digital and Cyber Forensics · Artificial Intelligence in Law
