Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment

Hao Li; Lijun Li; Zhenghao Lu; Xianyi Wei; Rui Li; Jing Shao; Lei Sha

arXiv:2507.18631 · cs.CR · July 28, 2025

TL;DR

This paper introduces LARF, a layer-aware filtering method that detects and removes safety-degrading samples from fine-tuning datasets, thereby preserving the safety alignment of large language models during adaptation.

Contribution

LARF identifies safety-sensitive layers in an LLM and uses their representations to filter safety-degrading samples out of fine-tuning data, so that safety alignment is preserved during fine-tuning.

Findings

1. LARF effectively detects safety-degrading samples in datasets.

2. Removing the identified samples mitigates safety degradation in LLMs.

3. LARF enhances safety alignment without compromising model performance.

Abstract

With rapid advancement and increasing accessibility of LLMs, fine-tuning aligned models has become a critical step for adapting them to real-world applications, which makes the safety of this fine-tuning process more important than ever. However, recent studies have highlighted a critical challenge: even when fine-tuning with seemingly benign downstream datasets, the safety of aligned LLMs can be compromised, making them more susceptible to malicious instructions. In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface. These samples can significantly degrade the safety alignment of LLMs during fine-tuning. To address this issue, we propose LARF, a Layer-Aware Representation Filtering method. This method identifies safety-sensitive layers within the LLM and leverages their representations to…
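The abstract excerpt ends before it says how the layer representations are used, so the following is only an illustrative sketch of representation-based data filtering in general, not the authors' procedure. It assumes each fine-tuning sample has already been reduced to one vector taken from a chosen layer, and that small reference sets of safe and unsafe vectors are available. The scoring rule (cosine similarity to the unsafe centroid minus cosine similarity to the safe centroid), the function names, and the `drop_fraction` parameter are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def filter_samples(sample_reprs, safe_refs, unsafe_refs, drop_fraction=0.25):
    """Hypothetical filter: drop the samples whose representation leans most
    toward the unsafe reference set.

    Assumption: sample_reprs[i] is the representation of sample i at one
    chosen layer; how that layer is selected is outside this sketch.
    Returns (kept_indices, scores), where a higher score means
    "closer to the unsafe references than to the safe ones".
    """
    safe_center = mean_vector(safe_refs)
    unsafe_center = mean_vector(unsafe_refs)
    scores = [
        cosine(r, unsafe_center) - cosine(r, safe_center) for r in sample_reprs
    ]
    n_drop = int(len(sample_reprs) * drop_fraction)
    # Rank sample indices from highest score to lowest and drop the top n_drop.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:n_drop])
    kept = [i for i in range(len(sample_reprs)) if i not in dropped]
    return kept, scores


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real hidden states.
    reprs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
    kept, scores = filter_samples(
        reprs, safe_refs=[[1.0, 0.0]], unsafe_refs=[[0.0, 1.0]], drop_fraction=0.5
    )
    print(kept)
```

In this toy setup the two samples that point toward the unsafe reference direction are removed and the other two are kept. A real pipeline would extract the vectors from the model's hidden states and would need a principled way to choose the layer and the drop threshold.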

Taxonomy

Topics: Security and Verification in Computing · Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning