No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks

Chak Tou Leong; Yi Cheng; Kaishuai Xu; Jian Wang; Hanlin Wang; Wenjie Li

arXiv:2405.16229·cs.CL·May 28, 2024

TL;DR

This paper investigates how different attack strategies compromise LLM safety. It finds that their mechanisms differ significantly across the stages of the safeguarding process, which underscores the need for diverse defenses.

Contribution

It provides a detailed analysis of how attacks compromise LLM safety, highlighting the divergence between the Explicit Harmful Attack (EHA) and the Identity-Shifting Attack (ISA) across the stages of the safeguarding process.

Findings

01

EHA aggressively targets the stage at which harmful instructions are recognized.

02

Both EHA and ISA disrupt later safeguarding stages.

03

Attack mechanisms differ dramatically between EHA and ISA.

Abstract

The existing safety alignment of Large Language Models (LLMs) has been found to be fragile and can be easily attacked through different strategies, such as fine-tuning on a few harmful examples or manipulating the prefix of the generation results. However, the attack mechanisms of these strategies are still underexplored. In this paper, we ask the following question: while these approaches can all significantly compromise safety, do their attack mechanisms exhibit strong similarities? To answer this question, we break down the safeguarding process of an LLM when it encounters harmful instructions into three stages: (1) recognizing harmful instructions, (2) generating an initial refusing tone, and (3) completing the refusal response. Accordingly, we investigate whether and how different attack strategies could influence each stage of this safeguarding process. We utilize…

Taxonomy

Topics: Advanced Malware Detection Techniques

Methods: Activation Patching