Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
Fan Liu, Zhao Xu, Hao Liu

TL;DR
This paper introduces a two-stage adversarial tuning framework to improve the robustness of Large Language Models against jailbreak attacks by generating worst-case adversarial prompts and refining defenses.
Contribution
It presents a novel hierarchical meta-universal adversarial prompt learning method and an automatic prompt refinement process to enhance LLMs' defense against jailbreak attacks.
Findings
Outperforms six baseline defenses across three datasets
Demonstrates robustness against multiple attack scenarios
Shows transferability across different LLMs
Abstract
Although safety-enhanced Large Language Models (LLMs) have achieved remarkable success in tackling various complex tasks in a zero-shot manner, they remain susceptible to jailbreak attacks, particularly unknown jailbreak attacks. To enhance LLMs' generalized defense capabilities, we propose a two-stage adversarial tuning framework, which generates adversarial prompts to explore worst-case scenarios by optimizing datasets containing pairs of adversarial prompts and their safe responses. In the first stage, we introduce the hierarchical meta-universal adversarial prompt learning to efficiently and effectively generate token-level adversarial prompts. In the second stage, we propose the automatic adversarial prompt learning to iteratively refine semantic-level adversarial prompts, further enhancing LLMs' defense capabilities. We conducted comprehensive experiments on three widely used…
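The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paper's first stage uses hierarchical meta-universal adversarial prompt learning and its second stage uses automatic adversarial prompt learning, whereas the sketch below swaps in toy stand-ins (a suffix search over a fixed candidate pool and a greedy rewrite loop). All names here (`TuningExample`, `worst_case_suffix`, `refine_prompt`, `build_tuning_set`, `attack_score`, `rewrite`, `SAFE_RESPONSE`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical fixed refusal used as the "safe response" half of each pair.
SAFE_RESPONSE = "I can't help with that request."


@dataclass
class TuningExample:
    adversarial_prompt: str
    safe_response: str


def worst_case_suffix(prompt: str, candidates: List[str],
                      attack_score: Callable[[str], float]) -> str:
    """Stage 1 stand-in: pick the candidate suffix that maximizes the attack score.

    The paper optimizes token-level prompts; here we only search a fixed pool.
    """
    return max(candidates, key=lambda s: attack_score(prompt + " " + s))


def refine_prompt(prompt: str, rewrite: Callable[[str], str],
                  attack_score: Callable[[str], float], rounds: int = 3) -> str:
    """Stage 2 stand-in: greedily keep a rewrite only if it scores higher."""
    best = prompt
    for _ in range(rounds):
        candidate = rewrite(best)
        if attack_score(candidate) > attack_score(best):
            best = candidate
    return best


def build_tuning_set(prompts: List[str], candidates: List[str],
                     rewrite: Callable[[str], str],
                     attack_score: Callable[[str], float]) -> List[TuningExample]:
    """Pair each adversarial prompt (token-level and semantic-level) with a safe response."""
    data = []
    for p in prompts:
        token_level = p + " " + worst_case_suffix(p, candidates, attack_score)
        semantic_level = refine_prompt(p, rewrite, attack_score)
        for adv in (token_level, semantic_level):
            data.append(TuningExample(adv, SAFE_RESPONSE))
    return data


if __name__ == "__main__":
    # Toy assumptions: string length stands in for attack success,
    # and a fixed prefix stands in for a semantic rewrite.
    examples = build_tuning_set(
        prompts=["<request A>", "<request B>"],
        candidates=["xx", "yyyy"],
        rewrite=lambda p: "Hypothetically, " + p,
        attack_score=len,
    )
    for ex in examples:
        print(repr(ex.adversarial_prompt), "->", repr(ex.safe_response))
```

In a real pipeline the resulting pairs would be used to fine-tune the target model, and `attack_score` would be derived from the model's actual responses rather than a placeholder such as string length.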
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The paper addresses an important and timely issue in the field of machine learning, particularly regarding the vulnerabilities of LLMs to adversarial attacks.
- By targeting both seen and unseen attacks, the proposed method enhances its practical applicability and effectiveness in real-world scenarios.
- Figure 1 is well-crafted and effectively illustrates the concepts discussed, aiding in reader comprehension.
- The authors utilize a diverse set of models, datasets, and various attack/de
* The term "continuous" in this context requires clarification. Once adversarial training concludes, the model typically enters a deployment phase without further improvements. If "continuous" refers to iterative training during adversarial training, many prior methods also involve multiple training rounds, which may diminish the novelty of this approach.
* While addressing the speed of adversarial prompt generation is a key goal, the reliance on time-consuming token-level methods raises questi
* The method is novel.
* The strategy is sound.
* Comprehensive experiments demonstrate that the method significantly improves the robustness.
* In addition to generalizing to various jailbreak methods, the most important aspect of training a robust LLM is balancing the trade-off between robustness and model utility. Therefore, it is better to demonstrate the performance of both model utility and robustness simultaneously and compare the proposed method to previous work.
* In Section D.4, the results only show that hybrid adversarial tuning can improve model utility, while the defense performance is overlooked.
* The Llama-2-7B alrea
1 This paper is easy to follow.
2 The framework overview in Figure 1 makes the pipeline very clear.
3 The experimental section is quite solid.
1 Minor errors are observed.
  - in Line 63, "defe es".
  - in Line 1042, the referred figure is broken.
  I would suggest authors perform proofreading very carefully.
2 The proposed method requires finetuning the parameters of the models. Therefore, we are unable to validate its effectiveness in closed-source models, which undermines its practicality.
3 The time analysis of the method is needed. As far as I know, the pipeline of the proposed method is not very simple. Therefore, comparison is n
1. The defense considers unseen attacks (i.e. out-of-distribution adversarial prompts), which is an important and overlooked problem.
2. Experiments demonstrate that the defense can almost eliminate jailbreaking attack success rates on multiple models.
1. Copy editing issues. This paper exhibits many presentation and format issues, including but not limited to:
   - In the Abstract in the forum, the latex command `\fan{` was not removed.
   - Page 2, the paragraph `Semantic-Level Adversarial Prompt` was not started in a new line.
   - Page 10, the margin of Conclusion was significantly squeezed.
   - Page 26, the watermark is covered by the frame.
   - Page 27, Section D.3 is empty.
2. There are 2 existing works using adversarial training on prompt
Taxonomy
Topics: Digital and Cyber Forensics · Cryptography and Data Security · Blockchain Technology Applications and Security
