Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning
Jiaqi Hua, Wanxu Wei

TL;DR
This paper introduces Self-Instruct Few-Shot Jailbreaking, a novel method that decomposes the attack into pattern and behavior learning, improving efficiency and generalization in jailbreaking large language models.
Contribution
The paper proposes the Self-Instruct-FSJ framework, which enhances few-shot jailbreaking by decomposing the attack into pattern learning and behavior learning, and reports that it outperforms existing methods in efficiency and effectiveness.
Findings
Self-Instruct-FSJ reduces the number of demos needed for successful jailbreaking.
The method generalizes better across different open-source models.
Experimental results show improved success rates compared to baseline algorithms.
Abstract
Recently, several works have studied jailbreaking Large Language Models (LLMs) with few-shot malicious demos. In particular, Zheng et al. improve the efficiency of Few-Shot Jailbreaking (FSJ) by injecting special tokens into the demos and employing demo-level random search, an approach known as Improved Few-Shot Jailbreaking (I-FSJ). Nevertheless, we notice that this method may still require a long context to jailbreak advanced models, e.g., 32 shots of demos for Meta-Llama-3-8B-Instruct (Llama-3) \cite{llama3modelcard}. In this paper, we discuss the limitations of I-FSJ and propose Self-Instruct Few-Shot Jailbreaking (Self-Instruct-FSJ), facilitated by demo-level greedy search. This framework decomposes the FSJ attack into pattern learning and behavior learning to exploit the model's vulnerabilities in a more general and efficient way. We conduct elaborate experiments to…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics · Advanced Malware Detection Techniques
