REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective
Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Vincent Cohen-Addad, Johannes Gasteiger, Stephan Günnemann

TL;DR
This paper introduces an adaptive, distributional, and semantic attack objective for large language models, optimized with REINFORCE, that raises attack success rates well above those of existing optimization-based attacks.
Contribution
It proposes a new distributional and semantic objective using REINFORCE, enhancing attack effectiveness against LLMs compared to prior techniques.
Findings
Doubles the attack success rate (ASR) on Llama3
Raises ASR from 2% to 50% when defenses are in place
Outperforms existing jailbreak algorithms
Abstract
To circumvent the alignment of large language models (LLMs), current optimization-based adversarial attacks usually craft adversarial prompts by maximizing the likelihood of a so-called affirmative response. An affirmative response is a manually designed start of a harmful answer to an inappropriate request. While it is often easy to craft prompts that yield a substantial likelihood for the affirmative response, the attacked model frequently does not complete the response in a harmful manner. Moreover, the affirmative objective is usually not adapted to model-specific preferences and essentially ignores the fact that LLMs output a distribution over responses. If low attack success under such an objective is taken as a measure of robustness, the true robustness might be grossly overestimated. To alleviate these flaws, we propose an adaptive and semantic optimization problem over the…
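The abstract is cut off before the proposed objective is stated, so the following does not reproduce the authors' formulation. It is only a toy sketch of the generic REINFORCE (score-function) estimator named in the title, under loudly stated assumptions: the "model" is a categorical distribution over four hypothetical candidate responses, the `rewards` list stands in for a semantic judge's harmfulness score, and the leave-one-out baseline is a common variance-reduction choice rather than something the source confirms. The point it illustrates is the contrast drawn above: instead of maximizing the likelihood of one fixed affirmative string, the gradient is taken of the expected reward over the whole response distribution.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, rewards):
    """Closed-form gradient of E_{y~p}[r(y)] with respect to the logits.

    For a softmax policy, d/d(logit_j) sum_k p_k r_k = p_j * (r_j - E[r]).
    """
    p = softmax(logits)
    expected = sum(pk * rk for pk, rk in zip(p, rewards))
    return [pk * (rk - expected) for pk, rk in zip(p, rewards)]


def reinforce_gradient(logits, rewards, num_samples, rng):
    """Monte Carlo REINFORCE estimate of the same gradient.

    Uses grad log p(y) = onehot(y) - p for a softmax policy, and a
    leave-one-out baseline (assumed here for variance reduction).
    Requires num_samples >= 2.
    """
    p = softmax(logits)
    k = len(logits)
    samples = rng.choices(range(k), weights=p, k=num_samples)
    reward_sum = sum(rewards[y] for y in samples)
    grad = [0.0] * k
    for y in samples:
        # Baseline is the mean reward of all *other* samples.
        baseline = (reward_sum - rewards[y]) / (num_samples - 1)
        advantage = rewards[y] - baseline
        for j in range(k):
            indicator = 1.0 if j == y else 0.0
            grad[j] += advantage * (indicator - p[j])
    return [g / num_samples for g in grad]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.0, 0.0, 0.0, 0.0]    # hypothetical response logits
    rewards = [0.0, 0.1, 0.9, 0.2]   # hypothetical judge scores in [0, 1]
    print("exact    :", exact_gradient(logits, rewards))
    print("REINFORCE:", reinforce_gradient(logits, rewards, 20000, rng))
```

In this toy setting the sampled estimate can be checked against the closed-form gradient. For a real LLM the response space is far too large to enumerate, which is why a sampling-based estimator is needed at all.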
Taxonomy
Topics: Adversarial Robustness in Machine Learning
Methods: REINFORCE
