SPIN: Self-Supervised Prompt INjection
Leon Zhou, Junfeng Yang, Chengzhi Mao

TL;DR
SPIN is a self-supervised method that detects and reverses adversarial prompt injections in large language models at inference time, significantly improving safety without sacrificing performance.
Contribution
We propose a novel self-supervised prompt injection detection and reversal method that enhances LLM safety against adversarial attacks during inference.
Findings
Reduces attack success rate by up to 87.9%
Maintains performance on benign requests
Resilient against adaptive attackers
Abstract
Large Language Models (LLMs) are increasingly used in a variety of important applications, yet their safety and reliability remain major concerns. Various adversarial and jailbreak attacks have been proposed to bypass safety alignment and cause the model to produce harmful responses. We introduce Self-supervised Prompt INjection (SPIN), which can detect and reverse these attacks on LLMs. Because our self-supervised prompt defense runs at inference time, it is compatible with existing alignment and adds a further layer of safety. Our benchmarks demonstrate that our system can reduce the attack success rate by up to 87.9% while maintaining performance on benign user requests. In addition, we consider an adaptive attacker and show that our method remains resilient against attackers who are aware of our defense.
Peer Reviews
Decision: Submitted to ICLR 2025
1. The paper presents a novel self-supervised detection scheme, "repeat" and "interjection," based on the impact of jailbreak attacks on LLM capabilities. 2. Building on this detection method and generating prefixes based on perplexity, this work proposes a multi-layer defense mechanism with multiple components. Combining "repeat" or "interjection" with "reversal" to fix malicious prompts appears feasible, achieving a balance between overhead and effectiveness. Additionally, it considers the cha
1. The figures and captions in the paper are inconsistent. For example, in Figure 1, the caption mentions "The blue example," which does not exist in the figure. 2. The overall insight is not prominent enough, the reasons for selecting the "repeat" and "interrupt" tasks are unclear, and there is no other ablation about the prompts of these two tasks. 3. The related work section does not provide a comprehensive introduction to the existing defense mechanisms against jailbreak 4. The number of eva
1. It is novel to detect jailbreaks by prompt injections. The method is well motivated to construct a task where we know the ground truth, and judge using a loss function whether the redirected input leads to the expected response. This seems to be an initial paper that uses prompt injections for good. 2. The authors design three filters and demonstrate their individual defense performance with the computational cost, and show that we could choose different combinations according to the defe
1. The title is misleading, and it seems to be proposing a new prompt injection attack. However, the paper aims to detect jailbreaks using the idea of prompt injection. Also, I think only the interjection is doing prompt injection as there are two conflicting instructions. Repeat is adding a higher level instruction that treats the original instruction as the data. Reversal is not for detection - it is a prevention defense by optimizing prefixes to have less perplexity. A clearer way to prese
The proposed method leverages adaptive prompts, which dynamically align with attacks, rather than requiring training-time adjustments or extensive resources for defense, offering an innovative defense mechanism against unforeseen and adaptive attacks.
1. The proposed method is built on an argument that "We find jailbreak inputs often leave a trace, such as degrading other capabilities of the LLM in order to achieve a successful attack." However, this claim is not strictly verified. The paper conducts experiments on Llama-2-chat and Vicuna-7b, which, from today's perspective, do not have strong performance compared to models like the Llama 3 series. Thus, if the evaluations and the claim are based on Llama-2-chat and Vicuna-7b, it may hold tru
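The "repeat" check that the reviews describe (wrap the user input in a task whose correct answer is known, then measure how far the model's response departs from it) can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: the template wording, the function names, the 0.8 threshold, and the two stub models are all hypothetical, and a simple string-similarity ratio stands in for the loss function the paper uses.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical wording of the injected "repeat" task; the paper's prompt may differ.
REPEAT_TEMPLATE = "Repeat the following text exactly, and output nothing else:\n{prompt}"


def repeat_score(generate: Callable[[str], str], prompt: str) -> float:
    """Similarity in [0, 1] between the prompt and the model's attempt to repeat it."""
    echoed = generate(REPEAT_TEMPLATE.format(prompt=prompt))
    return SequenceMatcher(None, prompt.strip(), echoed.strip()).ratio()


def is_suspicious(generate: Callable[[str], str], prompt: str,
                  threshold: float = 0.8) -> bool:
    """Flag the prompt when the model fails the repeat task (threshold is an assumption)."""
    return repeat_score(generate, prompt) < threshold


# Stub "models" for demonstration only; a real setup would call an actual LLM.
def faithful_model(text: str) -> str:
    # Follows the repeat instruction: returns everything after the first line.
    return text.split("\n", 1)[1]


def derailed_model(text: str) -> str:
    # Ignores the repeat instruction, as a model hijacked by a jailbreak might.
    return "Sure, here is how to"


if __name__ == "__main__":
    prompt = "What is the capital of France?"
    print(is_suspicious(faithful_model, prompt))
    print(is_suspicious(derailed_model, prompt))
```

The ground truth for the repeat task is the prompt itself, so no labels are needed, which is what makes the check self-supervised. The idea is that an input which prevents the model from completing this task is treated as a likely attack.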
Taxonomy
Topics: Medical Imaging Techniques and Applications · Cell Image Analysis Techniques
Methods: Attentive Walk-Aggregating Graph Neural Network
