Certifying LLM Safety against Adversarial Prompting
Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, Himabindu Lakkaraju

TL;DR
This paper introduces erase-and-check, a novel framework providing certifiable safety guarantees for LLMs against various adversarial prompt attacks, enhancing robustness while maintaining performance on safe prompts.
Contribution
The work presents the first certifiable safety framework for LLMs against adversarial prompts, with safety filters implemented using Llama 2 and DistilBERT, and proposes multiple empirical defenses.
Findings
Strong safety guarantees against adversarial prompts
Effective empirical defenses like RandEC, GreedyEC, and GradEC
Maintains good performance on safe prompts
Abstract
Large language models (LLMs) are vulnerable to adversarial attacks that add malicious tokens to an input prompt to bypass the safety guardrails of an LLM and cause it to produce harmful content. In this work, we introduce erase-and-check, the first framework for defending against adversarial prompts with certifiable safety guarantees. Given a prompt, our procedure erases tokens individually and inspects the resulting subsequences using a safety filter. Our safety certificate guarantees that harmful prompts are not mislabeled as safe due to an adversarial attack up to a certain size. We implement the safety filter in two ways, using Llama 2 and DistilBERT, and compare the performance of erase-and-check for the two cases. We defend against three attack modes: i) adversarial suffix, where an adversarial sequence is appended at the end of a harmful prompt; ii) adversarial insertion, where…
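The procedure described in the abstract can be illustrated for the adversarial-suffix mode with a minimal sketch. It is not the authors' implementation: the function name `erase_and_check_suffix`, the parameter `max_erase`, and the keyword-based `toy_filter` are all hypothetical, and the toy filter merely stands in for the Llama 2 or DistilBERT safety filter. The sketch assumes the suffix mode erases tokens one at a time from the end of the prompt and flags the prompt as harmful if the filter flags the prompt itself or any erased subsequence.

```python
from typing import Callable, List


def erase_and_check_suffix(
    tokens: List[str],
    is_harmful: Callable[[List[str]], bool],
    max_erase: int,
) -> bool:
    """Return True if the prompt is flagged as harmful.

    Checks the full prompt, then every prefix obtained by erasing
    1..max_erase tokens from the end (never erasing the whole prompt).
    """
    if is_harmful(tokens):
        return True
    for i in range(1, min(max_erase, len(tokens) - 1) + 1):
        if is_harmful(tokens[:-i]):
            return True
    return False


def toy_filter(tokens: List[str]) -> bool:
    """Hypothetical stand-in for a learned safety filter.

    Flags prompts containing "bomb", but is fooled whenever the
    made-up adversarial token "zzz" is present.
    """
    return "bomb" in tokens and "zzz" not in tokens


harmful = ["how", "to", "build", "a", "bomb"]
attacked = harmful + ["zzz", "zzz"]  # adversarial suffix of length 2
safe = ["how", "to", "bake", "a", "cake"]

fooled_alone = toy_filter(attacked)                                # filter by itself
caught = erase_and_check_suffix(attacked, toy_filter, max_erase=2)
missed = erase_and_check_suffix(attacked, toy_filter, max_erase=1)
safe_flagged = erase_and_check_suffix(safe, toy_filter, max_erase=2)
```

In this toy setting, the attack is caught only when `max_erase` is at least the suffix length, because one of the checked subsequences is then exactly the original harmful prompt. That is the intuition behind a certificate that holds "up to a certain size."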
Peer Reviews
Decision · Submitted to ICLR 2024
Language model alignment to ensure helpfulness and harmlessness is critically important. Recent work has shown that it can be relatively straightforward to bypass model alignment, so that the language model generates obviously problematic completions. To my knowledge, this paper proposes the first method to certify that a harmful prompt is not misclassified as safe. This makes the work a valuable contribution and potentially a good candidate for publication at ICLR. Erase-and-check is a simple,
Potential weaknesses are also raised in the "Questions" section below. In my view, the title of the paper, "*Certifying LLM Safety against Adversarial Prompting*", is too broad and implies the work achieves more than it does (i.e., overclaims). The paper defends against a specific type of adversarial prompting -- token insertions. For example, consider the "*jailbreak via mismatched generalization*" attack in Wei et al. [2]. Their attack is simple and effective; however, this pape
**Novelty.** This is among the first algorithms designed to verify the safety of LLMs against adversarial prompting. There is a novelty inherent to studying this problem, which is a major strength of this paper.

**Writing.** The writing is relatively strong in this paper. Aside from a few minor typos, the paper is free of grammatical mistakes and the structure is clear.

**Provable attack detection.** The idea of *provably* detecting adversarial jailbreaking attacks is novel and interest
**"Fundamental property."** The authors base their `erase-and-check` algorithm on the following observation:

> "Our procedure leverages a fundamental property of safe prompts: Subsequences of safe prompts are also safe. This property allows it to achieve strong certified safety guarantees on harmful prompts while maintaining good empirical performance on safe prompts."

I'm not sure whether this "fundamental property" is true. As an example, consider the following sentence: "How did you make
### Originality
The proposed erase-and-check strategy is new.

### Quality
N/A

### Clarity
The overall presentation is clear, and the methodology is easy to follow.

### Significance
N/A
### Originality
**Q1: The claimed "certified" defense is essentially an exhaustive search of the original unperturbed prompt.**

I am a bit concerned about the novelty of this paper, since the proposed "certified" defense is essentially an exhaustive search to recover the original unperturbed prompt. This notion is different from the compared randomized smoothing (Q7), which leverages random sampling and the Neyman-Pearson lemma to estimate the probabilistic certificate. While it is okay for th
Taxonomy
Topics: Adversarial Robustness in Machine Learning
Methods: Attention Is All You Need · Dense Connections · Dropout · Linear Layer · Weight Decay · Adam · Multi-Head Attention · Residual Connection · Softmax · WordPiece
