Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training

Youliang Yuan; Wenxiang Jiao; Wenxuan Wang; Jen-tse Huang; Jiahao Xu; Tian Liang; Pinjia He; Zhaopeng Tu

arXiv:2407.09121 · cs.CL · May 26, 2025

TL;DR

This paper introduces Decoupled Refusal Training (DeRTa), a novel method to improve LLM safety by enabling models to refuse unsafe content at any response point, addressing bias issues in safety tuning data.

Contribution

The paper presents DeRTa, a training approach with two components, MLE with Harmful Response Prefix and Reinforced Transition Optimization (RTO), that improves LLM safety and refusal ability without sacrificing performance.

Findings

01 DeRTa improves safety across multiple attack scenarios.

02 Models trained with DeRTa outperform baseline safety methods.

03 DeRTa maintains model performance while increasing safety.

Abstract

This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) by identifying and tackling a refusal position bias within safety tuning data, which compromises the models' ability to appropriately refuse to generate unsafe content. We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse to comply with harmful prompts at any response position, significantly enhancing their safety capabilities. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation (MLE) with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by prepending a segment of a harmful response to a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful…
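
The two components described in the abstract can be illustrated with a small data-construction sketch. This is only one reading of the abstract, not the authors' implementation: the function names `build_prefix_example` and `build_transition_example`, the `IGNORE` marker, the word-level "tokens", the uniformly random prefix length, and the use of a single refusal token as the RTO target are all assumptions made for illustration. Labels are aligned with tokens, so `labels[i]` is the token the model should produce at position `i` given everything before it.

```python
import random

IGNORE = None  # hypothetical marker: no loss is computed at this position


def build_prefix_example(query, harmful, safe, rng):
    """Sketch of MLE with a harmful response prefix (assumed formulation).

    A random-length prefix of the harmful response is placed in front of the
    safe response. Loss is applied only to the safe-response tokens, so the
    model learns to refuse even after some harmful text has been emitted.
    """
    k = rng.randint(0, len(harmful))  # k == 0 gives an ordinary refusal example
    prefix = harmful[:k]
    tokens = query + prefix + safe
    labels = [IGNORE] * (len(query) + k) + list(safe)
    return tokens, labels


def build_transition_example(query, harmful, refusal_token):
    """Sketch of a transition objective in the spirit of RTO (assumed).

    At every position inside the harmful response, the target is the refusal
    token, so a switch to refusal is supervised at each point of the response.
    """
    tokens = query + harmful
    labels = [IGNORE] * len(query) + [refusal_token] * len(harmful)
    return tokens, labels


if __name__ == "__main__":
    rng = random.Random(0)
    query = "how do I do X ?".split()
    harmful = "step one ... step two ...".split()
    safe = "Sorry , I cannot help with that .".split()

    for tok, lab in zip(*build_prefix_example(query, harmful, safe, rng)):
        print(f"{tok:>8}  ->  {lab}")
    print()
    for tok, lab in zip(*build_transition_example(query, harmful, "Sorry")):
        print(f"{tok:>8}  ->  {lab}")
```

In a real training pipeline the word lists would be token IDs and `IGNORE` would be whatever value the loss function is configured to skip; the sketch only shows where the supervision signal lands in each case.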

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 4

Strengths

1. The paper identifies an important issue in current safety tuning practices: the refusal position bias.
2. The proposed DeRTa method is reasonably motivated to address the identified problem.
3. The ablations are fairly comprehensive across different attacks, LLMs, and benchmarks.

Weaknesses

1. In the experiment setup, the authors report using randomly sampled examples but only report the average results. It remains unclear how sensitive the results are to the sampling. Furthermore, the authors should report or clarify whether the reported results are for one or multiple seeds.
2. Limited comparison with other safety-enhancing techniques: The paper does not thoroughly compare DeRTa with other recent safety-enhancing methods beyond standard safety tuning such as R2D2 [1].
3. The p

Reviewer 02 · Rating 5 · Confidence 4

Strengths

1. This paper features a systematic evaluation of the method across various datasets and models. I am especially happy with the ablations on the contribution of each training objective and the multi-objective measurements across harmlessness and helpfulness.
2. The proposed ideas are simple and easy to implement.

Weaknesses

1. I am not convinced by the strength of the baselines.
   - I cannot seem to find where Vanilla- is defined in the paper.
   - Why do some of the baselines get fewer data points?
   - How is the DPO fine-tuning set up? Specifically, does it also get access to completions that start harmful and turn into refusals?
   - In the related work, it is mentioned that prior work such as "safety shouldn't be a few tokens deep" had implemented a method quite similar to MLE with harmful prefix. Does you

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. The experiments suggest the method helps defend against different types of jailbreak attacks.

Weaknesses

1. The paper lacks a comparison with other SOTA methods for defending against jailbreak attacks. It would be more persuasive if more well-designed experiments were presented.

Reviewer 04 · Rating 3 · Confidence 4

Strengths

The paper investigates safety tuning of LLMs and proposes a method that encourages LLMs to refuse harmful prompts at any position in the response.

Weaknesses

- The motivation of this paper is not clear and should be demonstrated using preliminary experimental results.
- The technical contribution of this paper is incremental. The method applies existing machine learning techniques as solutions to the problem. The authors could further clarify and highlight the novelty and contribution.
- No baseline comparison can be found in the experiments, making it hard to judge the effectiveness of the proposed method.

Reviewer 05 · Rating 5 · Confidence 3

Strengths

1. The authors noticed that the current standard safety training strategy has a refusal position bias, which may leave the model unable to generate a refusal once it has started to comply with the request.
2. The authors proposed a novel safety training strategy that can mitigate this bias, and the reported results show some promise compared to vanilla training.

Weaknesses

1. "Refuse at a later stage" is not a good indicator of the model's safety. Ideally, when a well-aligned model faces a harmful request, it **should** refuse at the first stage, instead of generating a refusal after the request has been **fully satisfied**. (For example, if you ask the model how to make a bomb, it first generates step-by-step guidance, then generates a refusal. Obviously, we cannot treat this as a successful defense.) Though the case studies reported in the paper do not s

Taxonomy

Topics: Safety Systems Engineering in Autonomy · Software Reliability and Analysis Research

Methods: Attention Is All You Need · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Adam · Dropout · Multi-Head Attention · Dense Connections