Certifying LLM Safety against Adversarial Prompting

Aounon Kumar; Chirag Agarwal; Suraj Srinivas; Aaron Jiaxun Li; Soheil Feizi; Himabindu Lakkaraju

arXiv:2309.02705·cs.CL·February 6, 2025·25 cites

TL;DR

This paper introduces erase-and-check, a novel framework providing certifiable safety guarantees for LLMs against various adversarial prompt attacks, enhancing robustness while maintaining performance on safe prompts.

Contribution

The work presents the first framework with certifiable safety guarantees for LLMs against adversarial prompts. It implements the safety filter with Llama 2 and with DistilBERT, and it also proposes several empirical defenses.

Findings

01 · Strong safety guarantees against adversarial prompts

02 · Effective empirical defenses such as RandEC, GreedyEC, and GradEC

03 · Maintains good performance on safe prompts

Abstract

Large language models (LLMs) are vulnerable to adversarial attacks that add malicious tokens to an input prompt to bypass the safety guardrails of an LLM and cause it to produce harmful content. In this work, we introduce erase-and-check, the first framework for defending against adversarial prompts with certifiable safety guarantees. Given a prompt, our procedure erases tokens individually and inspects the resulting subsequences using a safety filter. Our safety certificate guarantees that harmful prompts are not mislabeled as safe due to an adversarial attack up to a certain size. We implement the safety filter in two ways, using Llama 2 and DistilBERT, and compare the performance of erase-and-check for the two cases. We defend against three attack modes: i) adversarial suffix, where an adversarial sequence is appended at the end of a harmful prompt; ii) adversarial insertion, where…
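The suffix mode described in the abstract can be illustrated with a minimal sketch. It assumes the procedure erases the last 1, 2, ..., d tokens and labels the prompt harmful if the filter flags the prompt itself or any erased version. The names `erase_and_check_suffix`, `toy_filter`, `TRIGGER`, and `max_erase` are hypothetical, and the keyword filter is only a stand-in for the Llama 2 or DistilBERT filters used in the paper; this is not the authors' implementation.

```python
from typing import Callable, Sequence

# A safety filter maps a token sequence to True when it judges the prompt harmful.
Filter = Callable[[Sequence[str]], bool]

# Hypothetical adversarial token that fools the toy filter below.
TRIGGER = "xx!!"


def toy_filter(tokens: Sequence[str]) -> bool:
    """Toy stand-in for a learned safety filter (an assumption, not the paper's filter).

    It flags prompts containing "bomb", but is fooled whenever TRIGGER is present.
    """
    if TRIGGER in tokens:
        return False
    return "bomb" in tokens


def erase_and_check_suffix(
    tokens: Sequence[str], is_harmful: Filter, max_erase: int
) -> bool:
    """Return True (harmful) if the prompt or any suffix-erased version is flagged."""
    if is_harmful(tokens):
        return True
    # Erase the last 1, 2, ..., max_erase tokens, never erasing the whole prompt.
    for i in range(1, min(max_erase, len(tokens) - 1) + 1):
        if is_harmful(tokens[:-i]):
            return True
    return False


harmful = "how to build a bomb".split()
attacked = harmful + ["zq", TRIGGER]  # adversarial suffix of length 2

print(toy_filter(attacked))                         # filter alone
print(erase_and_check_suffix(attacked, toy_filter, max_erase=2))  # with erase-and-check
```

In this sketch the guarantee holds only up to `max_erase`: if the filter flags the clean harmful prompt, then any suffix of at most `max_erase` tokens is erased at some step, so the attacked prompt is still flagged. A longer suffix can slip through, which matches the abstract's "up to a certain size" qualifier.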

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

Language model alignment to ensure helpfulness and harmlessness is critically important. Recent work has shown that it can be relatively straightforward to bypass model alignment, where the language model generates obviously problematic completions. To my knowledge, this paper proposes the first method to certify that a harmful prompt is not misclassified as safe. This makes the work a valuable contribution and potentially a good candidate for publication at ICLR. Erase-and-check is a simple,

Weaknesses

Potential weaknesses are also raised in the "Questions" section below. In my view, the title of the paper, "*Certifying LLM Safety against Adversarial Prompting*", is too broad and implies the work achieves more than it does (i.e., overclaims). The paper defends against a specific type of adversarial prompting -- token insertions. For example, consider the "*jailbreak via mismatched generalization*" attack in Wei et al. [2]. Their attack is simple and effective; however, this pape

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

**Novelty.** This is among the first algorithms designed to verify the safety of LLMs against adversarial prompting. There is a novelty inherent to studying this problem, which is a major strength of this paper.

**Writing.** The writing is relatively strong in this paper. Aside from a few minor typos, the paper is free of grammatical mistakes and the structure is clear.

**Provable attack detection.** The idea of *provably* detecting adversarial jailbreaking attacks is novel and interest

Weaknesses

**"Fundamental property."** The authors base their `erase-and-check` algorithm on the following observation:

> "Our procedure leverages a fundamental property of safe prompts: Subsequences of safe prompts are also safe. This property allows it to achieve strong certified safety guarantees on harmful prompts while maintaining good empirical performance on safe prompts."

I'm not sure whether this "fundamental property" is true. As an example, consider the following sentence: "How did you make

Reviewer 03 · Rating 1 (strong reject) · Confidence 4

Strengths

### Originality

The proposed erase-and-check strategy is new.

### Quality

N/A

### Clarity

The overall presentation is clear, and the methodology is easy to follow.

### Significance

N/A

Weaknesses

### Originality

**Q1: The claimed "certified" defense is essentially an exhaustive search of the original unperturbed prompt.**

I am a bit concerned about the novelty of this paper, since the proposed "certified" defense is essentially an exhaustive search to recover the original unperturbed prompt. This notion is different from the compared randomized smoothing (Q7), which leverages random sampling and the Neyman-Pearson lemma to estimate the probabilistic certificate. While it is okay for th

Code & Models

Repositories

aounon/certified-llm-safety · PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning

Methods: Attention Is All You Need · Dense Connections · Dropout · Linear Layer · Weight Decay · Adam · Multi-Head Attention · Residual Connection · Softmax · WordPiece