Boosting Jailbreak Attack with Momentum
Yihao Zhang, Zeming Wei

TL;DR
This paper introduces MAC, a momentum-accelerated attack method that improves the efficiency and success rate of jailbreak attacks on large language models by stabilizing and enhancing the adversarial prompt generation process.
Contribution
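The paper proposes MAC, which adds a momentum term to the gradient heuristic of the Greedy Coordinate Gradient (GCG) attack so that each adversarial prompt update reuses information from previous optimization iterations, improving attack success rate and efficiency against LLMs.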
Findings
MAC outperforms baseline attacks in success rate
MAC enhances optimization efficiency
MAC remains effective under defense mechanisms
Abstract
Large Language Models (LLMs) have achieved remarkable success across diverse tasks, yet they remain vulnerable to adversarial attacks, notably the well-known jailbreak attack. In particular, the Greedy Coordinate Gradient (GCG) attack has demonstrated efficacy in exploiting this vulnerability by optimizing adversarial prompts through a combination of gradient heuristics and greedy search. However, the efficiency of this attack has become a bottleneck in the attacking process. To mitigate this limitation, in this paper we rethink the generation of the adversarial prompts through an optimization lens, aiming to stabilize the optimization process and harness more heuristic insights from previous optimization iterations. Specifically, we propose the Momentum Accelerated GCG (MAC) attack, which integrates a momentum term into the gradient heuristic to…
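The idea of folding a momentum term into a gradient-guided greedy search can be illustrated with a small toy. The sketch below is not the authors' implementation: the update rule (an exponential blend with weight `mu`), the stand-in objective (count of mismatches against a hidden target sequence), the noisy synthetic gradient, and every function name are assumptions made for illustration only. No language model is involved. It shows only how a running average of past gradients can replace the raw gradient when ranking candidate token substitutions.

```python
import numpy as np

def update_momentum(momentum, grad, mu=0.4):
    """Blend the running momentum with the current gradient (assumed rule)."""
    return mu * momentum + (1.0 - mu) * grad

def top_k_tokens(scores, k):
    """Per position, return the k token ids with the lowest (most negative) score."""
    return np.argsort(scores, axis=1)[:, :k]

def toy_loss(suffix, target):
    """Stand-in objective: number of positions that differ from the target."""
    return int(np.sum(suffix != target))

def toy_grad(target, vocab_size, rng, noise=1.0):
    """Synthetic noisy gradient whose entries are lowest, on average, at the target tokens."""
    grad = rng.normal(0.0, noise, size=(target.size, vocab_size))
    grad[np.arange(target.size), target] -= 2.0
    return grad

def run_toy_attack(steps=50, suffix_len=8, vocab_size=20, k=4,
                   batch=16, mu=0.4, seed=0):
    rng = np.random.default_rng(seed)
    target = rng.integers(0, vocab_size, size=suffix_len)
    suffix = rng.integers(0, vocab_size, size=suffix_len)
    momentum = np.zeros((suffix_len, vocab_size))
    history = [toy_loss(suffix, target)]
    for _ in range(steps):
        # Smooth the noisy gradient with information from earlier steps.
        grad = toy_grad(target, vocab_size, rng)
        momentum = update_momentum(momentum, grad, mu)
        candidates = top_k_tokens(momentum, k)
        # Greedy search: try random single-token swaps, keep the best one
        # only if it improves on the current suffix.
        best, best_loss = suffix, history[-1]
        for _ in range(batch):
            pos = rng.integers(suffix_len)
            trial = suffix.copy()
            trial[pos] = candidates[pos, rng.integers(k)]
            loss = toy_loss(trial, target)
            if loss < best_loss:
                best, best_loss = trial, loss
        suffix = best
        history.append(best_loss)
    return history

if __name__ == "__main__":
    print(run_toy_attack())
```

Setting `mu=0.0` reduces the candidate ranking to the raw per-step gradient, which gives a simple way to compare the two variants on this toy. Note that the toy keeps the current suffix when no trial improves it, a simplification that makes the loss history non-increasing.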
Taxonomy
Topics: Advanced Malware Detection Techniques · Network Security and Intrusion Detection · Adversarial Robustness in Machine Learning
Methods: Random Search
