SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge
Adeel Yousaf, Joseph Fioresi, James Beetham, Amrit Singh Bedi, Mubarak Shah

TL;DR
SaFeR-CLIP is a fine-tuning framework that improves the safety of vision-language models by redirecting unsafe concepts to their semantically closest safe counterparts, keeping representational change small so that pre-trained performance is preserved. The paper also introduces a new safety benchmark, NSFW-Caps.
Contribution
We propose SaFeR-CLIP, a proximity-aware fine-tuning method that maintains model performance while improving safety, and introduce NSFW-Caps, a benchmark for safety evaluation under distributional shifts.
Findings
Recovered up to 8.0% in zero-shot accuracy compared to prior methods.
Successfully balanced safety and performance in vision-language models.
Provided a new benchmark, NSFW-Caps, for safety testing.
Abstract
Improving the safety of vision-language models like CLIP via fine-tuning often comes at a steep price, causing significant drops in their generalization performance. We find this trade-off stems from rigid alignment strategies that force unsafe concepts toward single, predefined safe targets, disrupting the model's learned semantic structure. To address this, we propose a proximity-aware approach: redirecting unsafe concepts to their semantically closest safe alternatives to minimize representational change. We introduce SaFeR-CLIP, a fine-tuning framework that applies this principle of minimal intervention. SaFeR-CLIP successfully reconciles safety and performance, recovering up to 8.0% in zero-shot accuracy over prior methods while maintaining robust safety. To support more rigorous evaluation, we also contribute NSFW-Caps, a new benchmark of 1,000 highly-aligned pairs for testing…
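The proximity-aware idea in the abstract can be illustrated with a toy target-selection step. The following is a minimal sketch, not the authors' implementation: it assumes unsafe and safe concepts are already available as embedding vectors and picks, for an unsafe embedding, the safe embedding with the highest cosine similarity. The function names (`cosine`, `nearest_safe_target`) and the three-dimensional vectors are hypothetical, and the fine-tuning loss that SaFeR-CLIP applies to these targets is not described in this summary, so it is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_safe_target(unsafe_vec, safe_vecs):
    """Return (index, similarity) of the safe embedding closest to unsafe_vec.

    Hypothetical helper: choosing the closest safe embedding as the
    redirection target is the 'minimal intervention' idea in toy form.
    """
    sims = [cosine(unsafe_vec, s) for s in safe_vecs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]


# Toy embeddings (made up for illustration only).
safe = [
    [1.0, 0.0, 0.0],  # a single predefined safe target, as in rigid alignment
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
unsafe = [0.5, 0.9, 0.1]

idx, sim = nearest_safe_target(unsafe, safe)
fixed_sim = cosine(unsafe, safe[0])  # similarity to the fixed target

print(f"nearest safe index: {idx}, similarity: {sim:.3f}")
print(f"similarity to fixed target 0: {fixed_sim:.3f}")
```

In this toy setup the nearest safe embedding is far more similar to the unsafe one than the fixed target is, so redirecting toward it would require a smaller move in embedding space. That is the intuition behind preferring the closest safe alternative over a single predefined target.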
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
