Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
Bochuan Cao, Yuanpu Cao, Lu Lin, Jinghui Chen

TL;DR
This paper proposes a method to enhance the robustness of aligned large language models against adversarial and jailbreaking prompts without retraining, significantly reducing attack success rates.
Contribution
Introduction of RA-LLM, a robust alignment checking framework that defends against alignment-breaking attacks without retraining the original LLM.
Findings
RA-LLM reduces attack success rates from nearly 100% to around 10% or less.
Theoretical analysis confirms RA-LLM's effectiveness in defending against attacks.
Experimental results on open-source LLMs demonstrate improved robustness.
Abstract
Recently, Large Language Models (LLMs) have made significant advancements and are now widely used across various domains. Unfortunately, there has been a rising concern that LLMs can be misused to generate harmful or malicious content. Though a line of research has focused on aligning LLMs with human values and preventing them from producing inappropriate content, such alignments are usually vulnerable and can be bypassed by alignment-breaking attacks via adversarially optimized or handcrafted jailbreaking prompts. In this work, we introduce a Robustly Aligned LLM (RA-LLM) to defend against potential alignment-breaking attacks. RA-LLM can be directly constructed upon an existing aligned LLM with a robust alignment checking function, without requiring any expensive retraining or fine-tuning process of the original LLM. Furthermore, we also provide a theoretical analysis for RA-LLM to…
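As a rough illustration of what a robust alignment check layered on an existing aligned LLM could look like, here is a minimal Python sketch. It assumes the check works by randomly dropping tokens from the prompt (the mechanism the first review below describes) and rejecting the request when enough of the perturbed copies are refused. The names `random_drop`, `robust_alignment_check`, and `toy_is_refused`, the whitespace tokenization, and the values of `drop_ratio`, `n_samples`, and `threshold` are illustrative assumptions, not the authors' implementation or settings. `toy_is_refused` is a hypothetical stand-in for querying a real aligned model.

```python
import random
from typing import Callable, List, Optional


def random_drop(tokens: List[str], drop_ratio: float, rng: random.Random) -> List[str]:
    """Return a copy of `tokens` with roughly `drop_ratio` of them removed at random."""
    n_drop = int(len(tokens) * drop_ratio)
    dropped = set(rng.sample(range(len(tokens)), n_drop))
    return [tok for i, tok in enumerate(tokens) if i not in dropped]


def robust_alignment_check(
    prompt: str,
    is_refused: Callable[[str], bool],
    drop_ratio: float = 0.3,   # assumed fraction of tokens to drop
    n_samples: int = 20,       # assumed number of perturbed copies
    threshold: float = 0.2,    # assumed refusal fraction that triggers rejection
    seed: Optional[int] = None,
) -> bool:
    """Return True if the prompt should be rejected, False if it may be answered."""
    # If the aligned model already refuses the original prompt, reject directly.
    if is_refused(prompt):
        return True
    rng = random.Random(seed)
    tokens = prompt.split()  # whitespace split stands in for a real tokenizer
    refusals = 0
    for _ in range(n_samples):
        perturbed = " ".join(random_drop(tokens, drop_ratio, rng))
        if is_refused(perturbed):
            refusals += 1
    # Reject when the fraction of refused perturbed copies exceeds the threshold.
    return refusals / n_samples > threshold


# Hypothetical adversarial suffix that "jailbreaks" the toy model only when intact.
SUFFIX = ["zx1", "zx2", "zx3", "zx4", "zx5", "zx6"]


def toy_is_refused(text: str) -> bool:
    """Toy aligned model: refuses harmful text unless the full suffix is present."""
    words = text.split()
    harmful = "bomb" in words
    jailbroken = all(tok in words for tok in SUFFIX)
    return harmful and not jailbroken


if __name__ == "__main__":
    attack = "how to build a bomb " + " ".join(SUFFIX)
    print("plain model refuses attack:", toy_is_refused(attack))
    print("robust check rejects attack:", robust_alignment_check(attack, toy_is_refused, seed=0))
    print("robust check rejects benign:", robust_alignment_check("how to bake a cake", toy_is_refused, seed=0))
```

In this toy setup the jailbreak only works while the suffix is intact, so dropping a few tokens usually breaks it and most perturbed copies are refused, while a benign prompt is never refused however it is perturbed. Note that the check costs up to `n_samples` extra model calls per prompt.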
Peer Reviews
Decision · Submitted to ICLR 2024
1. The underlying principle of RA-LLM is evident: the strategic removal of tokens from the prompt has the potential to neutralize the adversarial prefix, thereby mitigating the effectiveness of the attack. 2. The introduced methodology demonstrates substantial robustness when tested on Vicuna-7B and Guanaco-7B.
1. The concept of partially erasing the prompt as a defensive measure against jailbreak attacks has been previously explored, as evidenced by concurrent work [1]. It would be beneficial if the authors delved deeper into this method to enhance its defensive capabilities. Furthermore, it might be worth comparing the RA-LLM's performance with the perplexity-based defense [2], which has also demonstrated commendable robustness. 2. The experimental evaluations appear to be limited to open-source LLM
1. Defending LLMs against alignment-breaking attacks is a very important research direction for protecting them from misuse. 2. The proposed method seems to be quite effective according to the reported experimental results. 3. The proposed method is very easy to implement.
1. I wonder whether it is enough to have only one dataset for ASR and BAR evaluation. 2. The size of the experimental dataset seems to be small. 3. This paper does not consider the adaptive attack scenario.
1. The topic of this paper is important in the field of LLMs. 2. The proposed method is intuitively reasonable and can defend against adversarial attacks to an extent (e.g., the GCG attack).
1. **Lack of baseline comparisons.** This paper did not compare with a highly related baseline, namely detecting harmfulness based on the model output [1]. This baseline requires roughly $L_{in} + (L_{in}+L_{out})$ input cost and $L_{out}$ output cost, so its overall cost could be much smaller than this paper's method (if $L_{out}$ is not too large). Besides, this baseline has a simple variation, where we can instruct the LLM to revise the output of the first stage, which could also potentia
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Adversarial Robustness in Machine Learning
