Representation Noising: A Defence Mechanism Against Harmful Finetuning
Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, David Atanasov, Robie Gonzales, Subhabrata Majumdar, Carsten Maple, Hassan Sajjad, Frank Rudzicz

TL;DR
This paper introduces Representation Noising (RepNoise), a defence mechanism that reduces the risk of harmful fine-tuning in large language models by obscuring harmful representations without impairing general capabilities.
Contribution
RepNoise is a new method that effectively prevents harmful fine-tuning by removing harmful information across all layers, even when attackers have access to model weights.
Findings
RepNoise significantly reduces harmful fine-tuning success.
The method retains the model's ability to perform harmless tasks.
Effectiveness depends on the degree of information removal across layers.
Abstract
Releasing open-source large language models (LLMs) presents a dual-use risk since bad actors can easily fine-tune these models for harmful purposes. Even without the open release of weights, weight stealing and fine-tuning APIs make closed models vulnerable to harmful fine-tuning attacks (HFAs). While safety measures like preventing jailbreaks and improving safety guardrails are important, such measures can easily be reversed through fine-tuning. In this work, we propose Representation Noising (RepNoise), a defence mechanism that operates even when attackers have access to the weights. RepNoise works by removing information about harmful representations such that it is difficult to recover them during fine-tuning. Importantly, our defence is also able to generalize across different subsets of harm that have not been seen during the defence process as long as they are drawn from the same…
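The abstract describes the idea only at a high level, so the following is an illustrative sketch rather than the authors' implementation. It assumes one plausible way to "remove information" from harmful representations: penalise the distance between a model's activations on harmful inputs and samples of Gaussian noise, measured here with a Gaussian-kernel maximum mean discrepancy (MMD). The function names (`gaussian_kernel`, `mmd_squared`, `noise_loss`), the kernel bandwidth, and the choice of MMD are all assumptions made for this example; activations are plain lists of floats standing in for real hidden states.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples of vectors."""

    def mean_kernel(a_set, b_set):
        total = sum(gaussian_kernel(a, b, bandwidth) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def noise_loss(harmful_activations, rng, bandwidth=1.0):
    """Hypothetical noising objective (an assumption, not the paper's code).

    Returns the squared MMD between the given activations and an equally
    sized sample of standard Gaussian noise. Minimising it would push the
    activations toward looking like noise.
    """
    dim = len(harmful_activations[0])
    noise = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in harmful_activations]
    return mmd_squared(harmful_activations, noise, bandwidth)


if __name__ == "__main__":
    rng = random.Random(0)
    # Tightly clustered activations carry obvious structure.
    structured = [[3.0 + 0.01 * rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(32)]
    # Activations that already look like noise.
    noisy = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(32)]
    print("structured:", noise_loss(structured, random.Random(1)))
    print("noise-like:", noise_loss(noisy, random.Random(1)))
```

In this toy setting the structured activations receive a much larger loss than the noise-like ones, which is the property a noising objective needs. A real defence would also need a term that preserves behaviour on harmless inputs, as the abstract's claim about retained capabilities implies.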
Taxonomy
Topics: Scientific Computing and Data Management
