GradShield: Alignment Preserving Finetuning

Zhanhao Hu; Xiao Huang; Patrick Mendoza; Emad A. Alghamdi; Basel Alomair; Raluca Ada Popa; David Wagner

arXiv:2605.14194·cs.CL·May 15, 2026

TL;DR

GradShield is a filtering technique that improves LLM safety during finetuning by removing harmful data points based on a computed harmfulness score, outperforming baseline methods.

Contribution

Introduces GradShield, a filtering method that computes a Finetuning Implicit Harmfulness Score (FIHS) for each data point and applies adaptive thresholding to improve LLM safety without sacrificing utility.

Findings

01 GradShield keeps the attack success rate below 6%.

02 It outperforms baseline methods in both safety and utility.

03 It is effective across multiple finetuning tasks.

Abstract

Finetuning poses a significant risk of safety misalignment for Large Language Models (LLMs), as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a model towards misaligned behaviors. To address this, we introduce GradShield, a principled filtering method that safeguards LLMs during finetuning by identifying and removing harmful data points before they corrupt the model's alignment. It computes a Finetuning Implicit Harmfulness Score (FIHS) for each data point and applies an adaptive thresholding algorithm to decide which points to remove. We apply GradShield to multiple utility finetuning tasks across varying levels of harmful data and evaluate the safety and utility performance of the resulting LLMs using various metrics. The results show that GradShield outperforms all baseline methods, consistently…
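The score-then-threshold pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not define how FIHS is computed or how the adaptive threshold is chosen, so the sketch takes per-example scores as given inputs and substitutes a common robust outlier rule (median plus `k` times the median absolute deviation) for the thresholding step. The names `adaptive_threshold`, `filter_dataset`, and the parameter `k` are hypothetical and exist only for illustration.

```python
from statistics import median


def adaptive_threshold(scores, k=3.0):
    """Stand-in threshold: median + k * MAD of the scores.

    ASSUMPTION: this robust-outlier rule is a placeholder. The abstract
    does not specify GradShield's actual thresholding algorithm.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    med = median(scores)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = median([abs(s - med) for s in scores])
    return med + k * mad


def filter_dataset(examples, scores, k=3.0):
    """Split examples into kept and removed sets by comparing each score to the threshold.

    ASSUMPTION: `scores` holds one precomputed harmfulness score per example
    (higher means more harmful). How FIHS is computed is not shown here.
    """
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")
    tau = adaptive_threshold(scores, k)
    kept = [ex for ex, s in zip(examples, scores) if s <= tau]
    removed = [ex for ex, s in zip(examples, scores) if s > tau]
    return kept, removed, tau


# Toy usage with made-up scores: one example scores far above the rest.
examples = ["e0", "e1", "e2", "e3", "e4", "e5"]
scores = [0.10, 0.20, 0.15, 0.12, 0.18, 5.00]
kept, removed, tau = filter_dataset(examples, scores)
print("kept:", kept)
print("removed:", removed)
```

Deriving the threshold from the score distribution itself, rather than fixing a constant, lets the cutoff move with the dataset. That is the general idea behind an adaptive threshold, though the specific rule used in the paper may differ.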
