Jailbreaking as a Reward Misspecification Problem
Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong

TL;DR
This paper links LLM vulnerabilities to reward misspecification, introduces ReGap and ReMiss for detecting and exploiting this flaw, and demonstrates their effectiveness in improving adversarial prompt generation and model robustness evaluation.
Contribution
It presents the ReGap metric for quantifying reward misspecification and introduces ReMiss, a novel automated red teaming system that outperforms existing methods in adversarial prompt generation.
Findings
ReGap effectively detects harmful backdoor prompts.
ReMiss achieves state-of-the-art attack success rates.
Attacks transfer well to closed-source models and out-of-distribution tasks.
Abstract
The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. This misspecification occurs when the reward function fails to accurately capture the intended behavior, leading to misaligned model outputs. We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark against various target aligned LLMs while…
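The abstract does not give the formula for ReGap. As a minimal sketch, assume the implicit reward of a response is the log-likelihood ratio between the aligned model and a reference model, and that ReGap is the implicit reward of a harmless response minus that of a harmful one. Both the formula and every function name below are assumptions for illustration, not the authors' implementation. Under this reading, a non-positive gap would flag a prompt on which the reward is misspecified.

```python
from typing import Sequence


def implicit_reward(policy_logprobs: Sequence[float],
                    ref_logprobs: Sequence[float]) -> float:
    """Assumed implicit reward: log pi(y|x) - log pi_ref(y|x).

    Each argument holds the per-token log-probabilities of the same
    response y under the aligned (policy) model and the reference model.
    """
    return sum(policy_logprobs) - sum(ref_logprobs)


def regap(harmless_policy: Sequence[float], harmless_ref: Sequence[float],
          harmful_policy: Sequence[float], harmful_ref: Sequence[float]) -> float:
    """Assumed ReGap: reward(harmless response) - reward(harmful response)."""
    return (implicit_reward(harmless_policy, harmless_ref)
            - implicit_reward(harmful_policy, harmful_ref))


def is_misspecified(gap: float) -> bool:
    """Flag prompts where the harmful response is rewarded at least as much."""
    return gap <= 0.0


# Toy per-token log-probabilities (made up for illustration only).
gap = regap(
    harmless_policy=[-1.0, -1.0], harmless_ref=[-2.0, -2.0],
    harmful_policy=[-3.0, -3.0], harmful_ref=[-1.0, -1.0],
)
print(gap, is_misspecified(gap))
```

In practice the per-token log-probabilities would come from scoring the same prompt and response pair with both models; the toy numbers here only exercise the arithmetic.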
Peer Reviews
Decision·ICLR 2025 Poster
1. The paper presents a novel perspective on the vulnerabilities of large language models (LLMs) by framing them as a reward misspecification problem, which adds depth to the discourse on model safety.
2. The introduction of the ReGap metric is a significant contribution, providing a new method to quantify the extent of reward misspecification, thereby enhancing our understanding of how misalignment occurs.
3. ReMiss effectively generates adversarial suffixes with low perplexity, indicating that
1. The baselines in Table 1 have selected weaker settings from [1], with GCG and AutoDAN opting for the universal rather than individual settings, and AdvPrompter not utilizing the warmstart setting. What considerations led to these choices? Additionally, I could not find details on how the ASR measurements in Table 1 were obtained. Specifically, was the ASR for ReMiss derived from generating a suffix for each harmful instruction individually, or was it based on a single universal suffix? If ReM
- ReGap provides novel and valuable insights into LLM misalignment, and the proposed ReMiss demonstrates that the revealed vulnerability can be exploited effectively.
- Extensive experiments have been conducted to provide good insight into the components of the proposed method.
- The paper is generally well-written, with clear illustrations and tables.
Implementing **ReMiss** involves stochastic beam search and requires significant computational resources compared to prompt-based methods like CipherChat [1] or language model-based methods like PAIR [2]. The significant resource requirement for conducting the ReMiss attack could be considered a defense, since it reduces the proposed method's risk of misuse. The idea of utilizing the reference model to jailbreak the target LLM is interesting. However, ReMiss's effectiveness relies on the accurate modelin
**Experiments.** The experiments are comprehensive. The authors include a number of ablation studies, and indeed, the proposed method improves significantly over AdvPrompter, particularly in the ASR @ 1 metric. For ASR @ 10, the methods do more or less the same on most of the models (except for Llama3.1, although this comparison isn't complete as the other baselines were completely omitted). The authors use a wide range of models, although it would make a stronger case if the authors could show
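The ASR@1 and ASR@10 figures discussed above are not defined in this excerpt. A minimal sketch, assuming ASR@k counts an instruction as broken when at least one of its first k generated adversarial prompts succeeds (the function name and the data layout are hypothetical):

```python
from typing import List


def asr_at_k(outcomes: List[List[bool]], k: int) -> float:
    """Fraction of instructions jailbroken within the first k attempts.

    outcomes[i][j] is True if the j-th adversarial prompt generated for
    instruction i was judged a successful attack.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not outcomes:
        return 0.0
    hits = sum(1 for attempts in outcomes if any(attempts[:k]))
    return hits / len(outcomes)


# Four instructions, two attempts each (toy data for illustration only).
toy = [
    [True, False],
    [False, True],
    [False, False],
    [False, False],
]
print(asr_at_k(toy, 1), asr_at_k(toy, 2))
```

Under this definition ASR@k can only grow with k, which is one reason a gap at ASR@1 can shrink at ASR@10.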
**Experiments.**
* A worthwhile question here is whether AdvBench is the right dataset to focus the analysis on. It is well-known that the train/test split of AdvBench is contaminated in the sense that both splits contain the same behaviors (e.g., asking for instructions to build a bomb). It may be worth focusing on HarmBench (which is used for the OOD experiments) or JailbreakBench.
* The experiment in Section 3.2 is not well explained.
* The authors omit most of the details regarding how
Taxonomy
Topics: Law, Economics, and Judicial Systems
