Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

Tiansheng Huang; Sihao Hu; Fatih Ilhan; Selim Furkan Tekin; Ling Liu

arXiv:2409.18169 · cs.CR · April 27, 2026 · 2 cites

TL;DR

This survey reviews harmful fine-tuning attacks on large language models, covering threat models, defenses, and evaluation methods, and outlines future research directions.

Contribution

It systematically formulates the threat model, reviews existing attacks and defenses, and provides evaluation guidelines and a curated list of relevant papers.

Findings

01. Comprehensive overview of harmful fine-tuning attack variants.

02. Analysis of defense strategies and their effectiveness.

03. Guidelines for evaluating harmful fine-tuning methods.

Abstract

Recent research demonstrates that the nascent fine-tuning-as-a-service business model raises serious safety concerns: fine-tuning on even a small amount of harmful data uploaded by users can compromise the safety alignment of the model. The attack, known as the harmful fine-tuning attack, has generated broad research interest in both academia and industry. In this paper, we first systematically formulate the threat model and basic assumptions of harmful fine-tuning. Then, we provide a comprehensive review of harmful fine-tuning from three fundamental perspectives: attack setting, defense design, and evaluation methodology. First, we present the threat model of the problem and introduce the harmful fine-tuning attack and its variants. Next, we systematically survey representative attacks, defense methods, and mechanical analysis of adverse effects in the existing literature. Finally, we introduce…
