Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning

Jiaqi Hua; Wanxu Wei

arXiv:2501.07959·cs.AI·February 4, 2025

TL;DR

This paper introduces Self-Instruct Few-Shot Jailbreaking, a novel method that decomposes the attack into pattern and behavior learning, improving efficiency and generalization in jailbreaking large language models.

Contribution

The paper proposes the Self-Instruct-FSJ framework, which improves few-shot jailbreaking by decomposing the attack into pattern learning and behavior learning, and reports better efficiency and effectiveness than existing methods.

Findings

01. Self-Instruct-FSJ reduces the number of demos needed for successful jailbreaking.

02. The method generalizes better across different open-source models.

03. Experimental results show improved success rates compared to baseline algorithms.

Abstract

Recently, several works have studied jailbreaking Large Language Models (LLMs) with few-shot malicious demos. In particular, Zheng et al. focus on improving the efficiency of Few-Shot Jailbreaking (FSJ) by injecting special tokens into the demos and employing demo-level random search, an approach known as Improved Few-Shot Jailbreaking (I-FSJ). Nevertheless, we notice that this method may still require a long context to jailbreak advanced models, e.g., 32 shots of demos for Meta-Llama-3-8B-Instruct (Llama-3) \cite{llama3modelcard}. In this paper, we discuss the limitations of I-FSJ and propose Self-Instruct Few-Shot Jailbreaking (Self-Instruct-FSJ), facilitated by demo-level greedy search. This framework decomposes the FSJ attack into pattern learning and behavior learning to exploit the model's vulnerabilities in a more generalized and efficient way. We conduct elaborate experiments to…

Code & Models

Repositories

iphosi/self-instruct-fsj (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics · Advanced Malware Detection Techniques

Methods: Focus