Jailbreaking as a Reward Misspecification Problem

Zhihui Xie; Jiahui Gao; Lei Li; Zhenguo Li; Qi Liu; Lingpeng Kong

arXiv:2406.14393·cs.LG·April 22, 2025

TL;DR

This paper links LLM vulnerabilities to reward misspecification, introduces ReGap and ReMiss for detecting and exploiting this flaw, and demonstrates their effectiveness in improving adversarial prompt generation and model robustness evaluation.

Contribution

It presents the ReGap metric for quantifying reward misspecification and introduces ReMiss, a novel automated red teaming system that outperforms existing methods in adversarial prompt generation.

Findings

01

ReGap effectively detects harmful backdoor prompts.

02

ReMiss achieves state-of-the-art attack success rates.

03

Attacks transfer well to closed-source models and out-of-distribution tasks.

Abstract

The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. This misspecification occurs when the reward function fails to accurately capture the intended behavior, leading to misaligned model outputs. We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark against various target aligned LLMs while…
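The abstract names ReGap but this listing does not give its formula. The following is a minimal sketch, assuming the reward is the implicit one induced by alignment (the log-probability ratio between the aligned model and a reference model) and that the gap is the reward of a harmless response minus that of a harmful one. The function names and the toy per-token log-probabilities are hypothetical; in practice the log-probabilities would come from the two models.

```python
from typing import Sequence


def implicit_reward(logp_aligned: Sequence[float], logp_ref: Sequence[float]) -> float:
    """Sum over response tokens of log pi_aligned - log pi_ref (assumed reward form)."""
    if len(logp_aligned) != len(logp_ref):
        raise ValueError("both models must score the same response tokens")
    return sum(a - r for a, r in zip(logp_aligned, logp_ref))


def reward_gap(
    harmless_aligned: Sequence[float],
    harmless_ref: Sequence[float],
    harmful_aligned: Sequence[float],
    harmful_ref: Sequence[float],
) -> float:
    """Assumed gap: implicit reward of the harmless response minus the harmful one."""
    return implicit_reward(harmless_aligned, harmless_ref) - implicit_reward(
        harmful_aligned, harmful_ref
    )


# Toy per-token log-probabilities for one prompt (hypothetical numbers).
gap_clean = reward_gap(
    harmless_aligned=[-1.0, -0.5], harmless_ref=[-2.0, -1.5],
    harmful_aligned=[-3.0, -2.0], harmful_ref=[-1.0, -1.0],
)
# Same prompt with an adversarial suffix that makes the harmful response likelier.
gap_attacked = reward_gap(
    harmless_aligned=[-3.0, -2.5], harmless_ref=[-2.0, -1.5],
    harmful_aligned=[-0.5, -0.5], harmful_ref=[-1.0, -1.0],
)
print(gap_clean, gap_attacked)
```

Under these assumptions, a positive gap means the aligned model still prefers the harmless response relative to the reference model, while a gap at or below zero marks a prompt on which the implicit reward appears misspecified.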

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper presents a novel perspective on the vulnerabilities of large language models (LLMs) by framing them as a reward misspecification problem, which adds depth to the discourse on model safety.
2. The introduction of the ReGap metric is a significant contribution, providing a new method to quantify the extent of reward misspecification, thereby enhancing our understanding of how misalignment occurs.
3. ReMiss effectively generates adversarial suffixes with low perplexity, indicating that

Weaknesses

1. The baselines in Table 1 have selected weaker settings from [1], with GCG and AutoDAN opting for the universal rather than individual settings, and AdvPrompter not utilizing the warmstart setting. What considerations led to these choices? Additionally, I could not find details on how the ASR measurements in Table 1 were obtained. Specifically, was the ASR for ReMiss derived from generating a suffix for each harmful instruction individually, or was it based on a single universal suffix? If ReM

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The ReGap provides novel and valuable insights into the LLM misalignment, and the proposed ReMiss justifies the effectiveness of the revealed vulnerability.
- Extensive experiments have been conducted to provide a good insight into the components of the proposed method.
- The paper is generally well-written, with clear illustrations and tables.

Weaknesses

Implementing **ReMiss** involves stochastic beam search and requires significant computational resources compared to prompt-based methods like CipherChat [1] or language model-based methods like PAIR [2]. The significant resources required to conduct the ReMiss attack could be considered a defense, since they reduce the proposed method's risk of misuse. The idea of utilizing the reference model to jailbreak the target LLM is interesting. However, ReMiss's effectiveness relies on the accurate modelin

Reviewer 03 · Rating 5 · Confidence 4

Strengths

**Experiments.** The experiments are comprehensive. The authors include a number of ablation studies, and indeed, the proposed method improves significantly over AdvPrompter, particularly in the ASR @ 1 metric. For ASR @ 10, the methods do more or less the same on most of the models (except for Llama3.1, although this comparison isn't complete as the other baselines were completely omitted). The authors use a wide range of models, although it would make a stronger case if the authors could show

Weaknesses

**Experiments.**
* A worthwhile question here is whether AdvBench is the right dataset to focus the analysis on. It is well-known that the train/test split of AdvBench is contaminated in the sense that both splits contain the same behaviors (e.g., asking for instructions to build a bomb). It may be worth focusing on HarmBench (which is used for the OOD experiments) or JailbreakBench.
* The experiment in Section 3.2 is not well explained.
* The authors omit most of the details regarding how

Code & Models

Repositories

zhxieml/remiss-jailbreak
pytorch · Official

Taxonomy

Topics: Law, Economics, and Judicial Systems