Towards the Worst-case Robustness of Large Language Models
Huanran Chen, Yinpeng Dong, Zeming Wei, Hang Su, Jun Zhu

TL;DR
This paper investigates the worst-case robustness of large language models against adversarial attacks, providing theoretical bounds and certifying robustness levels for certain defenses.
Contribution
It introduces a tight lower bound for randomized smoothing defenses and certifies robustness against any attack with specific perturbation limits.
Findings
Most current defenses have nearly 0% worst-case robustness.
A new theoretical lower bound for stochastic defenses is proposed.
Robustness of smoothing with a uniform kernel is certified against attacks with specific perturbation limits.
Abstract
Recent studies have revealed the vulnerability of large language models to adversarial attacks, where adversaries craft specific input sequences to induce harmful, violent, private, or incorrect outputs. In this work, we study their worst-case robustness, i.e., whether an adversarial example exists that leads to such undesirable outputs. We upper bound the worst-case robustness using stronger white-box attacks, indicating that most current deterministic defenses achieve nearly 0% worst-case robustness. We propose a general tight lower bound for randomized smoothing using fractional knapsack solvers or 0-1 knapsack solvers, and use them to bound the worst-case robustness of all stochastic defenses. Based on these solvers, we provide theoretical lower bounds for several previous empirical defenses. For example, we certify the robustness of a specific case, smoothing using a uniform…
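The knapsack view mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' solver; it is a generic fractional-knapsack greedy under assumed toy conditions: the smoothing distribution has finitely many outcomes, `p_clean[i]` is the probability of outcome i under the clean input, `p_adv[i]` is its probability under a perturbed input, and `p_a` is the probability that the smoothed model behaves safely on the clean input. The function name and all parameters are hypothetical. The worst-case base model places its "safe" mass on outcomes where the perturbed probability is smallest relative to the clean one, which is exactly a fractional knapsack filled in ascending ratio order.

```python
from typing import Sequence


def smoothing_lower_bound(
    p_clean: Sequence[float], p_adv: Sequence[float], p_a: float
) -> float:
    """Minimize sum_i f_i * p_adv[i] subject to
    sum_i f_i * p_clean[i] = p_a and 0 <= f_i <= 1 (hypothetical helper)."""
    # Treat each outcome as an item with weight p_clean[i] and cost p_adv[i].
    # Outcomes with zero clean mass contribute no weight, so they sort last.
    items = sorted(
        zip(p_clean, p_adv),
        key=lambda wv: wv[1] / wv[0] if wv[0] > 0 else float("inf"),
    )
    remaining = p_a  # clean probability mass still to be allocated
    bound = 0.0
    for weight, cost in items:
        if remaining <= 0:
            break
        if weight <= 0:
            continue  # adds cost without covering any clean mass
        take = min(1.0, remaining / weight)  # fraction of this item used
        bound += take * cost
        remaining -= take * weight
    return bound


if __name__ == "__main__":
    # Toy numbers, chosen only for illustration.
    print(smoothing_lower_bound([0.5, 0.3, 0.2], [0.1, 0.3, 0.6], 0.6))
```

With the toy values above, the bound is 0.2 up to floating-point error: the first outcome is taken whole (clean mass 0.5, perturbed mass 0.1) and one third of the second outcome covers the remaining 0.1 of clean mass. If the returned value stays above a chosen threshold for every admissible perturbation, the smoothed model would be certified at that input under these assumptions.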
Peer Reviews
Decision · Submitted to ICLR 2026
S1. Strong theoretical framework.
S2. Comprehensive empirical evaluation.
S3. The paper's writing is clear and easy to follow.
W1. The gap between theoretical guarantees and practical robustness remains large.
W2. The proposed framework cannot handle insertion/deletion attacks or long heuristic prompts.
W3. While the theoretical analysis is strong, the paper doesn't propose new defense mechanisms that achieve better worst-case robustness.
Originality: This paper proposes (as far as I am aware) a novel reduction of stochastic defense certification to fractional knapsack optimization, simplifying a problem and allowing it to certify robustness results.
Quality: The paper supports its theoretical claims with detailed proofs and explanations, and demonstrates the proposed methods experimentally.
Significance: Adversarial attacks on LLMs are a significant problem, and certifying the robustness of defenses is important for understandin
Clarity: W1) Lou et al. (2023) is not a particularly helpful first citation for explaining stochastic defenses, as it covers discrete diffusion processes. It seems to me that the idea is that $z \sim p(z|x)$ is a discrete diffusion process, but the concept can use much more explanation in the introduction of the paper. It only seems to be explained in "Results on randomized defenses." and in section 4.1, making it much harder to understand the intended connection to the knapsack problem reductio
I appreciate the effort of the authors in trying to establish certified robustness guarantees for generative LLMs (although I personally think the proposed "CR-guarantee" is incorrect).
1. I think the authors cannot call their proposed method a "certified robust method". Specifically, when people say an ML model is "certified robust", it means that the model would not change its prediction for **ANY** possible input samples and their corresponding perturbed versions under a reasonable perturbation budget (see examples for vision models [r1, r2] and language models [r2]). However, according to Lines 138-139 of the paper, the "certified robustness guarantee" proposed in this paper
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Randomized Smoothing
