Multi-Level Safety Continual Projection for Fine-Tuned Large Language Models without Retraining
Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng

TL;DR
This paper introduces MSCP, a training-free method that improves the safety of fine-tuned large language models by aligning safety representations across multiple levels, reducing harmful outputs while maintaining utility.
Contribution
The paper presents a multi-level safety projection technique that implicitly aligns safety features and enables continual safety defense without retraining the large language model.
Findings
Significantly reduces harmfulness scores and attack success rates.
Preserves model utility while enhancing safety.
Demonstrates effectiveness across multiple fine-tuned LLMs.
Abstract
While fine-tuning services drive the rapid expansion of task capabilities in large language models (LLMs), they are often accompanied by the degradation and reorganization of safety-aligned representations, making models more prone to deviating from human preferences and exposing them to emerging jailbreak risks. Existing post-fine-tuning defense methods predominantly rely on single-scale safety correction mechanisms, which struggle to achieve a robust balance among safety, model utility, and continual adaptability. We propose Multi-Level Safety Continual Projection (MSCP), a training-free post-fine-tuning safety enhancement method that implicitly aligns global and localized safety activations through coordinated multi-level representations to isolate sparse neuron clusters governing safety-sensitive behaviors. It then applies composable safety-direction projections without retraining,…
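To make the phrase "safety-direction projection" more concrete, here is a generic sketch of what moving a representation along a direction means in vector terms. It is not the MSCP algorithm: the abstract does not specify how directions are estimated or where projections are applied. The difference-of-means estimate, the toy activation values, and all function names below are assumptions chosen purely for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]


def mean_vector(rows):
    """Column-wise mean of a list of vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means_direction(group_a, group_b):
    """Hypothetical direction estimate: unit vector from mean(group_b) to mean(group_a).

    Assumption for illustration only; the source does not state how MSCP
    derives its safety directions.
    """
    mean_a = mean_vector(group_a)
    mean_b = mean_vector(group_b)
    return unit([a - b for a, b in zip(mean_a, mean_b)])


def component_along(h, direction):
    """Scalar component of h along a unit direction."""
    return dot(h, direction)


def move_along(h, direction, target):
    """Shift h along a unit direction so its component equals target.

    Everything orthogonal to the direction is left unchanged.
    """
    shift = target - component_along(h, direction)
    return [x + shift * d for x, d in zip(h, direction)]


# Toy 3-dimensional "activations" (made-up numbers, not real model data).
safe_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
unsafe_acts = [[-1.0, 0.0, 2.0], [-3.0, 0.0, 2.0]]

direction = difference_of_means_direction(safe_acts, unsafe_acts)
h = [-2.0, 5.0, 1.0]
h_new = move_along(h, direction, target=2.0)

print(direction)  # [1.0, 0.0, 0.0]
print(h_new)      # [2.0, 5.0, 1.0]
```

In this toy case only the first coordinate changes, because the estimated direction lies along the first axis. Applying `move_along` in sequence with several mutually orthogonal directions would adjust each component independently, which is one simple way such operations could compose.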
