Safety-Potential Pruning for Enhancing Safety Prompts Against VLM Jailbreaking Without Retraining
Chongxin Li, Hanzhang Wang, Lian Duan

TL;DR
This paper introduces Safety-Potential Pruning, a one-shot pruning method that enhances safety prompts in vision-language models by activating safety pathways, significantly reducing jailbreak success rates without retraining.
Contribution
The paper proposes a novel pruning framework that exposes and amplifies safety-related pathways in VLMs, improving jailbreak resistance without additional training.
Findings
Reduces attack success rates by up to 22% across three jailbreak benchmarks
Maintains strong performance on benign tasks
Applicable across three representative VLM architectures
Abstract
Safety prompts constitute an interpretable layer of defense against jailbreak attacks in vision-language models (VLMs); however, their efficacy is constrained by the models' latent structural responsiveness. We observe that such prompts consistently engage a sparse set of parameters that remain largely quiescent during benign use. This finding motivates the Safety Subnetwork Hypothesis: VLMs embed structurally distinct pathways capable of enforcing safety, but these pathways remain dormant without explicit stimulation. To expose and amplify these pathways, we introduce Safety-Potential Pruning, a one-shot pruning framework that amplifies safety-relevant activations by removing weights that are less responsive to safety prompts without additional retraining. Across three representative VLM architectures and three jailbreak benchmarks, our method reduces attack success rates by up to 22%…
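The pruning step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how responsiveness to safety prompts is scored, so the score used here (weight magnitude times the L2 norm of the matching input activation, in the style of activation-aware pruning methods such as Wanda) is an assumption, not the authors' criterion. The function name `safety_potential_prune`, the per-row sparsity scheme, and the random stand-in activations are likewise hypothetical.

```python
import numpy as np

def safety_potential_prune(weight, safety_acts, sparsity=0.5):
    """One-shot prune of a single linear layer (illustrative sketch only).

    weight:      (out_features, in_features) weight matrix.
    safety_acts: (n_samples, in_features) inputs to this layer, recorded
                 while the model processes safety-prompted inputs.
    sparsity:    fraction of weights zeroed in each output row.
    """
    # ASSUMPTION: responsiveness score = |w_ij| * ||x_j||_2 over safety-prompt inputs.
    act_norm = np.linalg.norm(safety_acts, axis=0)  # shape (in_features,)
    score = np.abs(weight) * act_norm               # broadcasts across rows
    k = int(weight.shape[1] * sparsity)             # weights removed per row
    pruned = weight.copy()                          # leave the input untouched
    if k > 0:
        # Column indices of the k lowest-scoring weights in each row.
        drop = np.argpartition(score, k - 1, axis=1)[:, :k]
        np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned

# Usage with random stand-in data (not real VLM weights or activations).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(16, 8))
W_pruned = safety_potential_prune(W, X, sparsity=0.25)
```

No retraining follows the mask, which matches the one-shot setting the abstract describes. In a real model the activations would be collected layer by layer from forward passes on safety-prompted inputs, a step this sketch leaves out.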
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
