TL;DR
This paper introduces safety-aware probing (SAP), an optimization framework that improves the safety-utility balance of LLM fine-tuning by steering updates away from unsafe regions while maintaining task performance.
Contribution
It proposes an optimization method, driven by contrastive safety signals, that improves LLM safety during fine-tuning without sacrificing utility.
Findings
SAP reduces harmful scores significantly compared to standard fine-tuning.
SAP outperforms strong baselines on the safety-utility trade-off.
SAP demonstrates robustness against harmful data poisoning and adversarial attacks.
Abstract
Large language models (LLMs) have achieved remarkable success across many applications, but their ability to generate harmful content raises serious safety concerns. Although safety alignment techniques are often applied during pre-training or post-training, recent studies show that subsequent fine-tuning on adversarial or even benign data can still compromise model safety. In this paper, we revisit the fundamental question of why fine-tuning on non-harmful data may nevertheless degrade safety. We show that the safety and task-performance loss landscapes are partially decoupled, so updates that improve task-specific performance may still move the model toward unsafe regions. Based on this insight, we propose a safety-aware probing (SAP) optimization framework for mitigating safety risks during fine-tuning. Concretely, SAP uses contrastive safety signals to locate safety-correlated…
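The abstract's central claim, that an update can improve task performance while moving the model toward unsafe regions, can be illustrated with a toy example. The sketch below is not the SAP algorithm, whose details are truncated above. It assumes two made-up quadratic losses in two dimensions (the optima `TASK_OPT` and `SAFE_OPT` are hypothetical values) and shows that a plain gradient step on the task loss raises the safety loss. For contrast, it also applies a generic gradient-projection heuristic, swapped in purely for illustration, that removes the part of the task gradient conflicting with the safety gradient.

```python
import numpy as np

# Hypothetical 2-D toy problem: both optima are made-up values, not from the paper.
TASK_OPT = np.array([1.0, 1.0])    # minimizer of the toy task loss
SAFE_OPT = np.array([-1.0, 0.5])   # minimizer of the toy safety loss


def task_loss(w):
    return 0.5 * float(np.sum((w - TASK_OPT) ** 2))


def safety_loss(w):
    return 0.5 * float(np.sum((w - SAFE_OPT) ** 2))


def task_grad(w):
    return w - TASK_OPT


def safety_grad(w):
    return w - SAFE_OPT


def plain_step(w, lr):
    """Standard gradient descent on the task loss only."""
    return w - lr * task_grad(w)


def projected_step(w, lr):
    """Generic gradient projection (an illustrative stand-in, not SAP).

    To first order, a step along -g_t changes the safety loss by
    -lr * (g_t . g_s). A negative dot product therefore means the task
    update would increase the safety loss, so the component of g_t
    along g_s is removed before stepping.
    """
    g_t, g_s = task_grad(w), safety_grad(w)
    conflict = float(g_t @ g_s)
    if conflict < 0.0:
        g_t = g_t - (conflict / float(g_s @ g_s)) * g_s
    return w - lr * g_t


w0 = np.zeros(2)
w_plain = plain_step(w0, lr=0.1)
w_proj = projected_step(w0, lr=0.1)

for name, w in [("start", w0), ("plain", w_plain), ("projected", w_proj)]:
    print(f"{name:>9}: task={task_loss(w):.3f}  safety={safety_loss(w):.3f}")
```

In this toy setting the plain step lowers the task loss while raising the safety loss, which is the decoupling the abstract describes. The projected step still lowers the task loss and raises the safety loss by less than the plain step does. The projection only cancels the first-order conflict, so a small second-order increase in the safety loss remains.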
