Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning
Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong

TL;DR
This paper introduces Vulnerability-Aware Alignment (VAA), a method that estimates which alignment examples are most prone to being forgotten during harmful fine-tuning and balances learning across vulnerable and invulnerable groups, reducing harmful outputs while maintaining downstream task performance.
Contribution
VAA is the first approach to explicitly estimate data vulnerability, partition data into groups, and use group DRO with adversarial sampling to improve safety during fine-tuning.
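The partitioning step can be illustrated with a minimal sketch. This summary does not say how vulnerability is estimated, so the per-example `forget_counts` (how often an aligned example was forgotten across some fine-tuning runs) and the `threshold` are hypothetical stand-ins, not the authors' procedure.

```python
def partition_by_vulnerability(forget_counts, threshold):
    """Split example indices into (vulnerable, invulnerable) groups.

    forget_counts: hypothetical per-example vulnerability scores.
    threshold: hypothetical cutoff; scores at or above it count as vulnerable.
    """
    vulnerable = [i for i, c in enumerate(forget_counts) if c >= threshold]
    invulnerable = [i for i, c in enumerate(forget_counts) if c < threshold]
    return vulnerable, invulnerable


# Made-up scores for five alignment examples.
forget_counts = [0, 3, 1, 0, 5]
vulnerable, invulnerable = partition_by_vulnerability(forget_counts, threshold=1)
```

Every index lands in exactly one group, which is what a group-level objective needs downstream.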
Findings
VAA significantly reduces harmful scores across tasks.
VAA maintains or improves downstream task performance.
VAA outperforms existing baselines in mitigating harmful fine-tuning.
Abstract
Harmful fine-tuning (HFT), performed directly on open-source LLMs or through Fine-tuning-as-a-Service, breaks safety alignment and poses significant threats. Existing methods aim to mitigate HFT risks by learning robust representation on alignment data or making harmful data unlearnable, but they treat each data sample equally, leaving data vulnerability patterns understudied. In this work, we reveal that certain subsets of alignment data are consistently more prone to forgetting during HFT across different fine-tuning tasks. Inspired by these findings, we propose Vulnerability-Aware Alignment (VAA), which estimates data vulnerability, partitions data into "vulnerable" and "invulnerable" groups, and encourages balanced learning using a group distributionally robust optimization (Group DRO) framework. Specifically, VAA learns an adversarial sampler that samples examples from the…
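The Group DRO idea named in the abstract can be sketched with the standard exponentiated-gradient update on group weights, where the group with the higher loss is sampled more often. This is a generic Group DRO illustration under assumed values, not the VAA sampler itself: the function names, `step_size`, and the fixed `group_losses` are all hypothetical.

```python
import math
import random


def update_group_weights(weights, group_losses, step_size=0.5):
    """One exponentiated-gradient step: upweight groups with higher loss."""
    scaled = [w * math.exp(step_size * loss)
              for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]  # renormalize to a distribution


def sample_group(weights, rng):
    """Draw a group index in proportion to the current weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Index 0 = "vulnerable" group, index 1 = "invulnerable" group.
weights = [0.5, 0.5]
group_losses = [2.0, 0.5]  # hypothetical losses, held fixed for illustration
for _ in range(3):
    weights = update_group_weights(weights, group_losses)

rng = random.Random(0)  # seeded for reproducibility
group = sample_group(weights, rng)
```

In an actual training loop the group losses would be recomputed from the model at each step, so the weights would shift as the harder group catches up.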
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing
