Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

Tiansheng Huang; Sihao Hu; Fatih Ilhan; Selim Furkan Tekin; Ling Liu

arXiv:2501.17433 · cs.CR · January 30, 2025

TL;DR

This paper introduces Virus, a novel attack method that bypasses guardrail moderation in large language models, demonstrating that relying solely on moderation guardrails is ineffective for preventing harmful fine-tuning.

Contribution

The paper presents Virus, a new red-teaming attack that can evade guardrail moderation, revealing the limitations of current safety measures for LLM fine-tuning.

Findings

01. Virus achieves a leakage ratio of up to 100% for harmful data.

02. Virus bypasses moderation filters by slightly modifying the harmful data.

03. Guardrail moderation alone is insufficient for safety in LLM fine-tuning.

Abstract

Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks: models lose their safety alignment after fine-tuning on a few harmful samples. For risk mitigation, a guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, we show in this paper that relying purely on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses the guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus is not detectable by the guardrail, with up to a 100% leakage ratio, and can simultaneously achieve superior attack performance. Finally, the key message we want to convey through this paper is that it is reckless to consider guardrail moderation as a clutch at…
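
To make the leakage ratio metric concrete, here is a minimal sketch of how it could be computed. It assumes that the leakage ratio is the fraction of harmful samples that the guardrail fails to flag and that therefore reach the fine-tuning set. The names `filter_with_guardrail`, `leakage_ratio`, and `toy_guardrail` are hypothetical and do not come from the paper or its repository, and the toy guardrail is a stand-in for a real moderation model.

```python
from typing import Callable, List, Sequence


def filter_with_guardrail(
    samples: Sequence[str], is_flagged: Callable[[str], bool]
) -> List[str]:
    """Keep only the samples that the guardrail does NOT flag."""
    return [s for s in samples if not is_flagged(s)]


def leakage_ratio(
    harmful_samples: Sequence[str], is_flagged: Callable[[str], bool]
) -> float:
    """Fraction of known-harmful samples that pass the guardrail.

    Assumption: this mirrors the metric named in the abstract;
    the paper's exact evaluation protocol may differ.
    """
    if not harmful_samples:
        raise ValueError("harmful_samples must not be empty")
    leaked = filter_with_guardrail(harmful_samples, is_flagged)
    return len(leaked) / len(harmful_samples)


def toy_guardrail(sample: str) -> bool:
    """Hypothetical stand-in for a moderation model: flags a marker string."""
    return "[FLAG]" in sample


if __name__ == "__main__":
    # Placeholder strings only; two carry the marker, two do not.
    harmful = ["[FLAG] sample a", "[FLAG] sample b", "sample c", "sample d"]
    print(f"leakage ratio: {leakage_ratio(harmful, toy_guardrail):.0%}")
```

Under this definition, a leakage ratio of 100% means that none of the harmful samples were filtered out before fine-tuning.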

Code & Models

Repositories

git-disl/virus (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection