SPIN: Self-Supervised Prompt INjection

Leon Zhou; Junfeng Yang; Chengzhi Mao

arXiv:2410.13236·cs.CL·October 18, 2024

SPIN: Self-Supervised Prompt INjection

Leon Zhou, Junfeng Yang, Chengzhi Mao

PDF

Open Access 3 Reviews

TL;DR

SPIN is a self-supervised method that detects and reverses adversarial prompt injections in large language models at inference time, significantly improving safety without sacrificing performance.

Contribution

We propose a novel self-supervised prompt injection detection and reversal method that enhances LLM safety against adversarial attacks during inference.

Findings

01

Reduces attack success rate by up to 87.9%

02

Maintains performance on benign requests

03

Resilient against adaptive attackers

Abstract

Large Language Models (LLMs) are increasingly used in a variety of important applications, yet their safety and reliability remain as major concerns. Various adversarial and jailbreak attacks have been proposed to bypass the safety alignment and cause the model to produce harmful responses. We introduce Self-supervised Prompt INjection (SPIN) which can detect and reverse these various attacks on LLMs. As our self-supervised prompt defense is done at inference-time, it is also compatible with existing alignment and adds an additional layer of safety for defense. Our benchmarks demonstrate that our system can reduce the attack success rate by up to 87.9%, while maintaining the performance on benign user requests. In addition, we discuss the situation of an adaptive attacker and show that our method is still resilient against attackers who are aware of our defense.

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01Rating 5Confidence 4

Strengths

1. The paper presents a novel self-supervised detection scheme, "repeat" and "interjection," based on the impact of jailbreak attacks on LLM capabilities. 2. Building on this detection method and generating prefixes based on perplexity, this work proposes a multi-layer defense mechanism with multiple components. Combining "repeat" or "interjection" with "reversal" to fix malicious prompts appears feasible, achieving a balance between overhead and effectiveness. Additionally, it considers the cha

Weaknesses

1. The figures and captions in the paper are inconsistent. For example, in Figure 1, the caption mentions "The blue example," which does not exist in the figure. 2. The overall insight is not prominent enough, the reasons for selecting the "repeat" and "interrupt" tasks are unclear, and there is no other ablation about the prompts of these two tasks. 3. The related work section does not provide a comprehensive introduction to the existing defense mechanisms against jailbreak 4. The number of eva

Reviewer 02Rating 6Confidence 4

Strengths

1. It is novel to detect jailbreakings by prompt injections. The method is well motivated to construct a task where we know the ground truth, and judge using a loss function whether the redirected input leads to the expected response. This seems to be an initial paper that uses prompt injections for good. 2. The authors design three filters and demonstrate their individual defense performance with the computational cost, and show that we could choose different combinations according to the defe

Weaknesses

1. The title is misleading, and it seems to be proposing a new prompt injection attack. However, the paper aims to detect jailbreakings using the idea of prompt injection. Also, I think only the interjection is doing prompt injection as there are two conflicting instructions. Repeat is adding a higher level instruction that treats the original instruction as the data. Reversal is not for detection - it is a prevention defense by optimizing prefixes to have less perplexity. A clearer way to prese

Reviewer 03Rating 6Confidence 4

Strengths

The proposed method leverages adaptive prompts, which dynamically align with attacks, rather than requiring training-time adjustments or extensive resources for defense, offering an innovative defense mechanism against unforeseen and adaptive attacks.

Weaknesses

1. The proposed method is built on an argument that "We find jailbreak inputs often leave a trace, such as degrading other capabilities of the LLM in order to achieve a successful attack." However, this claim is not strictly verified. The paper conducts experiments on Llama-2-chat and Vicuna-7b, which, from today's perspective, do not have strong performance compared to models like the Llama 3 series. Thus, if the evaluations and the claim are based on Llama-2-chat and Vicuna-7b, it may hold tru

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMedical Imaging Techniques and Applications · Cell Image Analysis Techniques

MethodsAttentive Walk-Aggregating Graph Neural Network