TL;DR
This survey reviews harmful fine-tuning attacks on large language models, analyzing threat models, defenses, and evaluation methods, and outlining future research directions.
Contribution
It systematically formulates the threat model, reviews existing attacks and defenses, and provides evaluation guidelines and a curated list of relevant papers.
Findings
Comprehensive overview of harmful fine-tuning attack variants.
Analysis of defense strategies and their effectiveness.
Guidelines for evaluating harmful fine-tuning methods.
Abstract
Recent research demonstrates that the nascent fine-tuning-as-a-service business model raises serious safety concerns: fine-tuning on a few harmful samples uploaded by users can compromise the safety alignment of the model. The attack, known as the harmful fine-tuning attack, has generated broad research interest in both academia and industry. In this paper, we first systematically formulate the threat model and basic assumptions of harmful fine-tuning. Then, we provide a comprehensive review of harmful fine-tuning from three fundamental perspectives: attack setting, defense design, and evaluation methodology. First, we present the threat model of the problem and introduce the harmful fine-tuning attack and its variants. Next, we systematically survey representative attacks, defense methods, and mechanical analysis of adverse effects in the existing literature. Finally, we introduce…
