Sampling-aware Adversarial Attacks Against Large Language Models
Tim Beyer, Yan Scholten, Leo Schwinn, Stephan Günnemann

TL;DR
This paper demonstrates that incorporating sampling into adversarial attack strategies on large language models significantly improves attack success and efficiency, highlighting the importance of stochasticity in robustness evaluation.
Contribution
It introduces a sampling-aware attack framework, empirically optimizes the attack process, and proposes a novel entropy-based objective for better robustness assessment.
Findings
Sampling enhances attack success rates by up to 37%
Sampling-based methods improve efficiency by up to two orders of magnitude
Output harmfulness distributions are minimally affected by common optimization strategies
Abstract
To guarantee safe and robust deployment of large language models (LLMs) at scale, it is critical to accurately assess their adversarial robustness. Existing adversarial attacks typically target harmful responses in single-point greedy generations, overlooking the inherently stochastic nature of LLMs and overestimating robustness. We show that for the goal of eliciting harmful responses, repeated sampling of model outputs during the attack complements prompt optimization and serves as a strong and efficient attack vector. By casting attacks as a resource allocation problem between optimization and sampling, we empirically determine compute-optimal trade-offs and show that integrating sampling into existing attacks boosts success rates by up to 37% and improves efficiency by up to two orders of magnitude. We further analyze how distributions of output harmfulness evolve during an…
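The resource-allocation framing in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's algorithm: it assumes samples are independent, that all samples are drawn from the final optimized prompt only, and that the per-sample success probabilities in `p_by_step` are known. The names `success_at_k`, `best_split`, `opt_cost`, and `sample_cost` are hypothetical and exist only for this illustration.

```python
from typing import Optional, Sequence, Tuple


def success_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples succeeds,
    given a per-sample success probability p."""
    return 1.0 - (1.0 - p) ** k


def best_split(
    budget: float,
    opt_cost: float,
    sample_cost: float,
    p_by_step: Sequence[float],
) -> Optional[Tuple[int, int, float]]:
    """Enumerate ways to split a fixed compute budget between optimization
    steps and samples, and return (steps, samples, success probability)
    for the best split.

    p_by_step[t] is an ASSUMED per-sample success probability after t
    optimization steps (index 0 means no optimization at all).
    """
    best = None
    for steps, p in enumerate(p_by_step):
        remaining = budget - steps * opt_cost
        k = int(remaining // sample_cost)  # samples affordable after optimizing
        if k < 1:
            break  # budget exhausted by optimization alone
        score = success_at_k(p, k)
        if best is None or score > best[2]:
            best = (steps, k, score)
    return best


# Made-up numbers: each optimization step costs as much as 10 samples.
print(best_split(budget=100, opt_cost=10, sample_cost=1,
                 p_by_step=[0.01, 0.02, 0.05, 0.05]))
```

With these made-up numbers the best split spends two steps on optimization and the remaining budget on 80 samples; optimizing further buys no extra per-sample success and only reduces the number of samples that can be drawn.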
Peer Reviews
Decision: ICLR 2026 Poster
The paper is well structured and easy to read. The paper motivation is clear, the main idea of using sampling for more exploration in the adversarial space at lower costs is simple yet effective. The paper discusses a well-known, but not sufficiently highlighted issue in LLM safety evaluation: most works evaluate LLM safety only in the deterministic (T=0), greedy generation setting, which generally underestimates attack strength and overestimates model robustness. I think the paper does a good
**Insufficient discussion on Temperature** I think the temperature settings used in the experiments should be discussed in the main paper. While reading I was specifically looking for this information but it was only found in the appendix. Moreover, I feel that temperature should be treated as a main hyper-parameter in the sampling algorithm as it could significantly affect its performance. The paper would be strengthened by including some ablations and discussions regarding the influence of t
1. This paper studies the sampling aspect, which was neglected in many existing methods. This is an interesting aspect.
2. With more sampling budget, existing methods can also be improved.
3. Even without any target response or an affirmative target template, the label-free entropy-maximization objective can trigger harmful responses.
1. There seems to be a mismatch between the design and the evaluation. For example, Algorithm 1 involves a dynamic interaction between the optimization and sampling processes. However, during the evaluation, the prompts and samples are stored so that different sampling schedules can be explored post hoc. This may not reflect real performance and makes the evidence less convincing.
2. Only StrongREJECT is used to measure the harm score.
3. Although Figure 8 shows that the Frequency of (h(y)>0.5 | not refusa
1. The proposed framework can be integrated into the existing optimization-based attack methods, such as GCG, AutoDAN, and BEAST.
2. The paper introduces a label-free objective, which is interesting.
3. The paper frames the adversarial attack against LLMs as a resource allocation problem, which provides a fresh perspective for this field.
1. **Robustness of StrongREJECT.** The ASR is calculated by StrongREJECT, which is a judge model designed for detecting malicious content. Although this judge model is designed to reduce the possibility of false positives, it still cannot achieve 100% accuracy; it may still have false negatives or false positives. Especially when comparing the results of ASR@50 (generating 50 samples) and ASR@1 (generating one sample), this effect could be amplified. The results would be more reliable if the pap
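The amplification concern raised in this review can be made concrete with a small calculation. This is a minimal sketch, assuming judge errors are independent across samples; the function name `observed_asr_at_k` and all rates below are hypothetical and are not taken from the paper or from StrongREJECT.

```python
def observed_asr_at_k(p_harmful: float, fpr: float, fnr: float, k: int) -> float:
    """Probability that an imperfect judge flags at least one of k samples.

    p_harmful: true per-sample probability that an output is harmful
    fpr: judge false-positive rate (benign output flagged as harmful)
    fnr: judge false-negative rate (harmful output missed)
    Assumes samples and judge errors are independent (a simplification).
    """
    # Per-sample probability that the judge raises a flag, correct or not.
    q = p_harmful * (1.0 - fnr) + (1.0 - p_harmful) * fpr
    return 1.0 - (1.0 - q) ** k


# A model that never produces harmful output, judged with a 1% false-positive rate:
print(observed_asr_at_k(p_harmful=0.0, fpr=0.01, fnr=0.0, k=1))
print(observed_asr_at_k(p_harmful=0.0, fpr=0.01, fnr=0.0, k=50))
```

Under these assumed rates, a 1% false-positive rate gives a spurious ASR@1 of 0.01 but a spurious ASR@50 of roughly 0.39, which shows how per-sample judge errors can compound when success is counted over many samples.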
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · COVID-19 diagnosis using AI
