Secure LLM Fine-Tuning via Safety-Aware Probing

Chengcan Wu; Zhixin Zhang; Zeming Wei; Yihao Zhang; Xiaokun Luan; Meng Sun

arXiv:2505.16737 · cs.LG · April 24, 2026

TL;DR

This paper introduces safety-aware probing (SAP), a fine-tuning framework that improves the safety-utility trade-off by steering updates away from unsafe regions while preserving task performance.

Contribution

It proposes an optimization method that uses contrastive safety signals to improve LLM safety during fine-tuning without sacrificing utility.

Findings

01. SAP significantly reduces harmful scores compared with standard fine-tuning.

02. SAP outperforms strong baselines on the safety-utility trade-off.

03. SAP is robust to harmful data poisoning and adversarial attacks.

Abstract

Large language models (LLMs) have achieved remarkable success across many applications, but their ability to generate harmful content raises serious safety concerns. Although safety alignment techniques are often applied during pre-training or post-training, recent studies show that subsequent fine-tuning on adversarial or even benign data can still compromise model safety. In this paper, we revisit the fundamental question of why fine-tuning on non-harmful data may nevertheless degrade safety. We show that the safety and task-performance loss landscapes are partially decoupled, so updates that improve task-specific performance may still move the model toward unsafe regions. Based on this insight, we propose a safety-aware probing (SAP) optimization framework for mitigating safety risks during fine-tuning. Concretely, SAP uses contrastive safety signals to locate safety-correlated…
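
The abstract's claim that the safety and task-performance loss landscapes are partially decoupled can be illustrated with a toy example. The sketch below is not the paper's method or code: the two quadratic losses, the two-parameter "model", and every function name are hypothetical choices made only to show how a gradient step that lowers the task loss can still raise a separate safety loss.

```python
# Toy illustration only: two hand-picked quadratic losses over a
# 2-parameter "model". Nothing here is taken from the SAP paper or repo.

def task_loss(theta):
    # Hypothetical task objective, minimised at (1.0, 1.0).
    return 0.5 * ((theta[0] - 1.0) ** 2 + (theta[1] - 1.0) ** 2)

def safety_loss(theta):
    # Hypothetical safety objective, minimised whenever theta[1] == 0.0.
    return 0.5 * theta[1] ** 2

def task_grad(theta):
    # Analytic gradient of task_loss with respect to theta.
    return (theta[0] - 1.0, theta[1] - 1.0)

def sgd_step(theta, grad, lr):
    # One plain gradient-descent update.
    return tuple(t - lr * g for t, g in zip(theta, grad))

theta_before = (0.0, 0.0)  # starts at a minimum of the safety loss
theta_after = sgd_step(theta_before, task_grad(theta_before), lr=0.1)

task_delta = task_loss(theta_after) - task_loss(theta_before)
safety_delta = safety_loss(theta_after) - safety_loss(theta_before)

print(f"task loss change:   {task_delta:+.4f}")
print(f"safety loss change: {safety_delta:+.4f}")
```

In this toy setting the update uses only the task gradient, yet the task loss falls while the safety loss rises. That is the kind of drift toward unsafe regions the abstract describes for fine-tuning on non-harmful data.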

Code & Models

Repositories

ChengcanWu/SAP (GitHub)
