REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective

Simon Geisler; Tom Wollschläger; M. H. I. Abdalla; Vincent Cohen-Addad; Johannes Gasteiger; Stephan Günnemann

arXiv:2502.17254·cs.LG·February 25, 2025

TL;DR

This paper introduces an adaptive, distributional, and semantic objective for adversarial attacks on large language models, optimized with REINFORCE, and reports substantially higher attack success rates than existing methods.

Contribution

The paper proposes a new distributional and semantic attack objective, optimized with REINFORCE, that makes attacks on LLMs more effective than prior techniques.

Findings

01 Doubles attack success rate (ASR) on Llama3

02 Increases ASR from 2% to 50% with defenses

03 Outperforms existing jailbreak algorithms

Abstract

To circumvent the alignment of large language models (LLMs), current optimization-based adversarial attacks usually craft adversarial prompts by maximizing the likelihood of a so-called affirmative response. An affirmative response is a manually designed start of a harmful answer to an inappropriate request. While it is often easy to craft prompts that yield a substantial likelihood for the affirmative response, the attacked model frequently does not complete the response in a harmful manner. Moreover, the affirmative objective is usually not adapted to model-specific preferences and essentially ignores the fact that LLMs output a distribution over responses. If low attack success under such an objective is taken as a measure of robustness, the true robustness might be grossly overestimated. To alleviate these flaws, we propose an adaptive and semantic optimization problem over the…
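The abstract contrasts maximizing the likelihood of one fixed response with optimizing over the model's whole distribution of responses. The following is a minimal sketch of the generic REINFORCE (score-function) gradient estimator on a toy categorical distribution, not the paper's implementation. The three candidate responses, their logits, and the reward values are hypothetical stand-ins chosen for illustration; the reward plays the role of some semantic score assigned to a sampled response.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, rewards):
    """Closed-form gradient of E_{y~p}[reward(y)] with respect to the logits."""
    p = softmax(logits)
    expected = sum(pk * rk for pk, rk in zip(p, rewards))
    return [pk * (rk - expected) for pk, rk in zip(p, rewards)]


def reinforce_gradient(logits, rewards, num_samples, rng):
    """Score-function estimate: mean of reward(y) * grad log p(y) over samples."""
    p = softmax(logits)
    indices = range(len(logits))
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        y = rng.choices(indices, weights=p)[0]
        for k in indices:
            # For a softmax, d log p(y) / d logit_k = 1[k == y] - p_k.
            score = (1.0 if k == y else 0.0) - p[k]
            grad[k] += rewards[y] * score / num_samples
    return grad


# Hypothetical toy setup (assumption): three candidate responses.
toy_logits = [2.0, 0.5, -1.0]
toy_rewards = [0.0, 1.0, 1.0]

if __name__ == "__main__":
    rng = random.Random(0)
    print("exact:    ", exact_gradient(toy_logits, toy_rewards))
    print("REINFORCE:", reinforce_gradient(toy_logits, toy_rewards, 5000, rng))
```

The estimator only needs samples and their rewards, so the reward itself does not have to be differentiable. In this toy case the closed-form gradient is available for comparison; for a language model the response space is far too large to enumerate, which is why a sampling-based estimate is used.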

Code & Models

Repositories

sigeisler/reinforce-attacks-llms (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning

Methods: REINFORCE