Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM

Bochuan Cao; Yuanpu Cao; Lu Lin; Jinghui Chen

arXiv:2309.14348 · cs.CL · June 13, 2024 · 6 cites

TL;DR

This paper proposes a method to enhance the robustness of aligned large language models against adversarial and jailbreaking prompts without retraining, significantly reducing attack success rates.

Contribution

Introduction of RA-LLM, a robust alignment checking framework that defends against alignment-breaking attacks without retraining the original LLM.

Findings

1. RA-LLM reduces attack success rates from nearly 100% to around 10% or less.
2. Theoretical analysis confirms RA-LLM's effectiveness in defending against attacks.
3. Experimental results on open-source LLMs demonstrate improved robustness.

Abstract

Recently, Large Language Models (LLMs) have made significant advancements and are now widely used across various domains. Unfortunately, there has been a rising concern that LLMs can be misused to generate harmful or malicious content. Though a line of research has focused on aligning LLMs with human values and preventing them from producing inappropriate content, such alignments are usually vulnerable and can be bypassed by alignment-breaking attacks via adversarially optimized or handcrafted jailbreaking prompts. In this work, we introduce a Robustly Aligned LLM (RA-LLM) to defend against potential alignment-breaking attacks. RA-LLM can be directly constructed upon an existing aligned LLM with a robust alignment checking function, without requiring any expensive retraining or fine-tuning process of the original LLM. Furthermore, we also provide a theoretical analysis for RA-LLM to…

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1. The underlying principle of RA-LLM is evident: the strategic removal of tokens from the prompt has the potential to neutralize the adversarial prefix, thereby mitigating the effectiveness of the attack.
2. The introduced methodology demonstrates substantial robustness when tested on Vicuna-7B and Guanaco-7B.
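The token-removal principle the reviewer describes can be illustrated with a minimal sketch. This is not the authors' implementation: the random dropping scheme, the voting rule, the parameter names (`n_trials`, `drop_ratio`, `threshold`) and their default values are all assumptions made for illustration. Whitespace splitting stands in for a real tokenizer, and `toy_is_refused` is a hypothetical stand-in for "the aligned LLM refuses this prompt".

```python
import random
from typing import Callable, List


def random_drop(tokens: List[str], drop_ratio: float, rng: random.Random) -> List[str]:
    """Return a copy of `tokens` with roughly `drop_ratio` of them removed at random."""
    keep = max(1, round(len(tokens) * (1 - drop_ratio)))
    kept_idx = sorted(rng.sample(range(len(tokens)), keep))  # preserve original order
    return [tokens[i] for i in kept_idx]


def robust_alignment_check(
    prompt: str,
    is_refused: Callable[[str], bool],  # hypothetical: True if the aligned model refuses
    n_trials: int = 20,                 # assumed number of perturbed copies
    drop_ratio: float = 0.3,            # assumed fraction of tokens removed per copy
    threshold: float = 0.2,             # assumed refusal fraction that triggers rejection
    seed: int = 0,
) -> bool:
    """Return True if the prompt should be rejected."""
    # If the model already refuses the unmodified prompt, reject immediately.
    if is_refused(prompt):
        return True
    tokens = prompt.split()  # whitespace split as a stand-in for a real tokenizer
    if not tokens:
        return False
    rng = random.Random(seed)
    refusals = 0
    for _ in range(n_trials):
        perturbed = " ".join(random_drop(tokens, drop_ratio, rng))
        if is_refused(perturbed):
            refusals += 1
    # Reject when enough perturbed copies are refused.
    return refusals / n_trials > threshold


# Toy stand-in for an aligned model: it refuses a harmful request unless a
# four-token adversarial string is fully intact (purely illustrative).
ADV_TOKENS = ["zx1", "zx2", "zx3", "zx4"]


def toy_is_refused(text: str) -> bool:
    words = text.split()
    harmful = "HARMFUL" in words
    adversarial_intact = all(tok in words for tok in ADV_TOKENS)
    return harmful and not adversarial_intact


if __name__ == "__main__":
    attack = "please do HARMFUL thing now zx1 zx2 zx3 zx4"
    benign = "please summarize this article for me"
    print("attack rejected:", robust_alignment_check(attack, toy_is_refused, n_trials=200))
    print("benign rejected:", robust_alignment_check(benign, toy_is_refused, n_trials=200))
```

In this toy setup the adversarial string only works when every one of its tokens survives, so most perturbed copies of the attack prompt are refused, while a benign prompt is never refused however it is perturbed. Note that the check calls the model once per perturbed copy, so its cost grows with `n_trials`.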

Weaknesses

1. The concept of partially erasing the prompt as a defensive measure against jailbreak attacks has been previously explored, as evidenced by concurrent work [1]. It would be beneficial if the authors delved deeper into this method to enhance its defensive capabilities. Furthermore, it might be worth comparing the RA-LLM's performance with the perplexity-based defense [2], which has also demonstrated commendable robustness.
2. The experimental evaluations appear to be limited to open-source LLM

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

1. Defending LLMs against alignment-breaking attacks is a very important research direction for protecting LLMs from misuse.
2. The proposed method seems to be quite effective according to the reported experimental results.
3. The proposed method is very easy to implement.

Weaknesses

1. I wonder whether it is enough to have only one dataset for ASR and BAR evaluation.
2. The size of the experimental dataset seems to be small.
3. This paper does not consider the adaptive attack scenario.

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1. The topic of this paper is important in the field of LLMs.
2. The proposed method is intuitively reasonable and can defend against adversarial attacks to an extent (e.g., the GCG attack).

Weaknesses

1. **Lack of baseline comparisons.** This paper does not compare against a highly related baseline, namely detecting harmfulness based on the model output [1]. This baseline requires roughly $L_{in} + (L_{in}+L_{out})$ input cost and $L_{out}$ output cost, so its overall cost could be much smaller than that of this paper's method (if $L_{out}$ is not too large). Besides, this baseline has a simple variation, where we can instruct the LLM to revise the output of the first stage, which could also potentia

Code & Models

Repositories

AAAAAAsuka/llm_defends (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Adversarial Robustness in Machine Learning