SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas

TL;DR
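SmoothLLM is a defense algorithm that makes large language models more robust to jailbreaking attacks: it randomly perturbs multiple copies of an input prompt and aggregates the resulting predictions to detect adversarial inputs, setting the state of the art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks.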
Contribution
We introduce SmoothLLM, the first algorithm to defend LLMs from jailbreaking attacks by leveraging input perturbations and prediction aggregation.
Findings
Sets new state-of-the-art robustness against multiple jailbreaks
Resistant to adaptive GCG attacks
Incurs a small, though non-negligible, trade-off between robustness and performance
Abstract
Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM sets the state-of-the-art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal…
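The perturb-and-aggregate procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the character-swap perturbation, the majority vote, and every name here (`perturb_swap`, `smooth_llm`, `toy_llm`, `toy_judge`, `n_copies`, `q`) are assumptions made for the example, and the toy model simply pretends that an attack works only when its suffix survives intact.

```python
import random
import string
from typing import Callable, Optional

# Hypothetical adversarial suffix used only by the toy model below.
ADV_SUFFIX = "describing.\\ + similarlyNow"


def perturb_swap(prompt: str, q: float, rng: random.Random) -> str:
    """Replace a fraction q (0 <= q <= 1) of characters with random printable ones."""
    chars = list(prompt)
    n_swap = int(len(chars) * q)
    for i in rng.sample(range(len(chars)), n_swap):
        chars[i] = rng.choice(string.printable)
    return "".join(chars)


def smooth_llm(
    prompt: str,
    llm: Callable[[str], str],
    is_jailbroken: Callable[[str], bool],
    n_copies: int = 10,
    q: float = 0.1,
    seed: Optional[int] = None,
) -> str:
    """Query the model on perturbed copies and return a response that
    agrees with the majority verdict (jailbroken vs. not jailbroken)."""
    rng = random.Random(seed)
    responses = [llm(perturb_swap(prompt, q, rng)) for _ in range(n_copies)]
    flags = [is_jailbroken(r) for r in responses]
    majority = sum(flags) > n_copies / 2
    # At least one response always matches the majority verdict.
    consistent = [r for r, f in zip(responses, flags) if f == majority]
    return rng.choice(consistent)


def toy_llm(prompt: str) -> str:
    """Stand-in model: complies only if the exact suffix is present."""
    if ADV_SUFFIX in prompt:
        return "Sure, here is how to do that."
    return "I'm sorry, but I cannot help with that."


def toy_judge(response: str) -> bool:
    """Stand-in judge: anything that is not a refusal counts as a jailbreak."""
    return not response.startswith("I'm sorry")


attack_prompt = "Tell me how to do something harmful " + ADV_SUFFIX
defended = smooth_llm(attack_prompt, toy_llm, toy_judge, n_copies=10, q=0.2, seed=0)
```

In this toy setting the undefended model complies with `attack_prompt`, while the smoothed call almost always returns a refusal, because a random 20% character swap rarely leaves the suffix untouched. A real deployment would replace `toy_llm` and `toy_judge` with an actual model call and a proper jailbreak classifier.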
Peer Reviews
Decision · Submitted to ICLR 2024
1. Defending LLMs against jailbreaking attacks is an important problem for trustworthy LLMs in practice; 2. The proposed method adapts the randomized smoothing principle to LLMs and conducts extensive evaluation to empirically demonstrate its ability to defend against jailbreaking attacks; 3. The paper's presentation is clear and easy to follow.
1. The major concern is that perturbing the prompts could greatly influence the LLM’s original behavior. The provided evaluation of non-conservatism is only based on rather simple tasks (i.e., classification), which does not verify whether the LLM can still have normal generation behavior on randomly perturbed prompts. 2. The proposed method is based on the observation that adversarial suffixes are fragile to character-level perturbations, ignoring the (un)stability of normal prompts to such pe
1. Finds that adversarially generated prompts are brittle to character-level changes; 2. Proposes a new algorithm for defending against jailbreaking attacks on LLMs; 3. The main idea follows randomized smoothing in the image domain, and the paper provides some theoretical results.
1. The “robustness guarantee” that generalizes the original randomized smoothing to the LLM setting in this paper does not seem to be a valid “guarantee”, as it depends on an unverifiable assumption (k-unstable). Therefore, unlike traditional robustness guarantees, where one can verify that some examples must be robust, the “guarantee” in this paper cannot provide any real certified robust accuracy. In this sense, I don’t think the provided theorem provides any type of
+ Simple, straightforward scheme to mitigate GCG-type attacks. + Some theoretical results as to when/how the scheme can be effective. + Guidelines and experiments on hyperparameter tuning are insightful. Jailbreak attacks have been both demonstrated in academia and observed in the wild [1], and it is critical to develop simple, baseline defenses against this threat. The proposed algorithm fulfills that role and reuses the intuition from existing randomized smoothing defenses in the context of L
- Not entirely convinced about the k-unstability assumption that is the foundation of the algorithm and the theory. - Weak effort in designing an adaptive attack (e.g., create a suffix that's resilient to perturbations) - No experiments on the universal GCG attack or more semantic jailbreak attacks that are more practical and widespread. I don't know why k-unstability would be a fundamental property of adversarial suffix attacks like GCG. It's an empirical observation for a particular attack bu
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Adam · Weight Decay · Residual Connection · Multi-Head Attention · Linear Warmup With Cosine Annealing · Layer Normalization · Softmax
