Robust LLM safeguarding via refusal feature adversarial training

Lei Yu; Virginie Do; Karen Hambardzumyan; Nicola Cancedda

arXiv:2409.20089 · cs.LG · March 21, 2025 · 2 cites

Open Access · 3 Reviews

TL;DR

This paper introduces ReFAT, a new adversarial training method for LLMs that enhances robustness against attacks by simulating refusal feature ablation, reducing computational costs and improving safety.

Contribution

The paper shows that adversarial attacks share a universal mechanism, ablating the refusal feature in the residual stream, and proposes ReFAT, an efficient adversarial training algorithm that improves LLM safety against such attacks.

Findings

01

ReFAT significantly boosts LLM robustness against diverse attacks.

02

ReFAT requires fewer computational resources than existing methods.

03

Refusal feature ablation approximates worst-case safety violations.

Abstract

Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. Defending against such attacks remains challenging due to the opacity of jailbreaking mechanisms and the high computational cost of training LLMs robustly. We demonstrate that adversarial attacks share a universal mechanism for circumventing LLM safeguards that works by ablating a dimension in the residual stream embedding space called the refusal feature. We further show that the operation of refusal feature ablation (RFA) approximates the worst-case perturbation of offsetting model safety. Based on these findings, we propose Refusal Feature Adversarial Training (ReFAT), a novel algorithm that efficiently performs LLM adversarial training by simulating the effect of input-level attacks via RFA. Experimental results show that ReFAT significantly improves the robustness of three popular…
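The abstract describes refusal feature ablation only in words, so the following is a minimal sketch of one plausible reading, not the paper's implementation. It assumes the refusal feature is the difference between mean residual-stream activations on harmful and harmless inputs (consistent with Reviewer 02's description of computing it from two input sets), and that ablation means projecting that direction out of an activation. The function names, the projection form, and the toy numbers are all illustrative assumptions; the paper's exact equations are not reproduced on this page.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def refusal_feature(harmful_acts, harmless_acts):
    """Assumed definition: mean(harmful) - mean(harmless) at one layer."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return [a - b for a, b in zip(mu_harmful, mu_harmless)]


def ablate(h, r):
    """Assumed ablation: remove the component of h along r / ||r||."""
    norm = math.sqrt(dot(r, r))
    if norm == 0.0:
        return list(h)  # no direction to remove
    r_hat = [x / norm for x in r]
    coeff = dot(h, r_hat)
    return [a - coeff * b for a, b in zip(h, r_hat)]


# Toy 3-dimensional "activations" (made-up numbers, not model outputs).
harmful = [[4.0, 1.0, 0.0], [6.0, -1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]

r = refusal_feature(harmful, harmless)
h_ablated = ablate(harmful[0], r)
print("refusal feature:", r)
print("ablated activation:", h_ablated)
print("residual component along r:", dot(h_ablated, r))
```

In this toy setup the harmful and harmless activations differ only along the first axis, so ablation leaves the other coordinates untouched. In an adversarial training loop along the lines the abstract describes, a step like `ablate` would be applied to hidden states of harmful inputs during the forward pass while the model is still trained to refuse; that loop is not shown here.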

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The method is innovative: it extends Arditi's refusal feature, further demonstrates the similarity between the adversarial attack and RFA mechanisms through causal theory, and proposes a more efficient adversarial training method on that basis.
2. The proposed method can be objectively evaluated, including attack success rate, generation performance, and efficiency, and its limitations are also analyzed.
3. The paper is well organized, first analyzing the general mechanism

Weaknesses

1. Using Llama-3-8B-Instruct-generated XSTest responses as the gold-standard answers for supervised fine-tuning may be a problem in the experimental design, especially since Llama-3-8B-Instruct is itself a subject of the subsequent experiments. This design could violate basic principles of experimental independence.
2. The conclusion that rejecting direction leads to a performance degradation is not well supported by the experiments. The end of

Reviewer 02 · Rating 6 · Confidence 2

Strengths

1. The suggested method seems to be novel and reasonable. ReFAT dynamically computes the RF using two sets of inputs (harmful and harmless) and then ablates the RF for harmful inputs. This simulates the effect of adversarial attacks, training the model to make safety determinations without relying on the most salient features of input maliciousness.
2. The exploration of refusal features is interesting. The intervention mechanism seems to be reasonable while more discussion should be added to

Weaknesses

1. The authors rely on the previous findings on refusal features by Arditi and discuss the background in Sec. 3.1. However, the content is not self-contained enough: I cannot understand the key heuristics behind Eq. 3, or the physical meaning of the equation that follows it. It seems that Eq. 3 takes r_HH to somewhat measure the smoothness of the feature space, transforming h^l into a more informative space for intervention. I am quite curious about hope it is de

Reviewer 03 · Rating 3 · Confidence 4

Strengths

+ Proposed a new adversarial training method for safeguarding LLMs with improved efficiency.
+ Experiments show improvements over existing adversarial training solutions.

Weaknesses

+ The idea of manipulating the refusal features in the activation space is also introduced in other works on steering vector or activation steering such as the following ([1][2][3]). The authors might also want to discuss and comment on the differences. Specifically, [1][3] applied this type of activation engineering to the jailbreaking tasks and it is also recommended to compare with [1] Wang, Haoran, and Kai Shu. "Backdoor activation attack: Attack large language models using activation steer

Taxonomy

Topics: Adversarial Robustness in Machine Learning · VLSI and Analog Circuit Testing · Fault Detection and Control Systems