SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge

Adeel Yousaf; Joseph Fioresi; James Beetham; Amrit Singh Bedi; Mubarak Shah

arXiv:2511.16743 · cs.CV · November 24, 2025



TL;DR

SaFeR-CLIP is a fine-tuning framework that improves vision-language model safety by redirecting unsafe concepts to their semantically closest safe counterparts, which keeps representational change small and preserves generalization performance. The paper also introduces NSFW-Caps, a new safety benchmark.

Contribution

We propose SaFeR-CLIP, a proximity-aware fine-tuning method that maintains model performance while improving safety, and introduce NSFW-Caps, a benchmark for safety evaluation under distributional shifts.

Findings

1. Recovered up to 8.0% in zero-shot accuracy compared to prior methods.
2. Successfully balanced safety and performance in vision-language models.
3. Provided a new benchmark, NSFW-Caps, for safety testing.

Abstract

Improving the safety of vision-language models like CLIP via fine-tuning often comes at a steep price, causing significant drops in their generalization performance. We find this trade-off stems from rigid alignment strategies that force unsafe concepts toward single, predefined safe targets, disrupting the model's learned semantic structure. To address this, we propose a proximity-aware approach: redirecting unsafe concepts to their semantically closest safe alternatives to minimize representational change. We introduce SaFeR-CLIP, a fine-tuning framework that applies this principle of minimal intervention. SaFeR-CLIP successfully reconciles safety and performance, recovering up to 8.0% in zero-shot accuracy over prior methods while maintaining robust safety. To support more rigorous evaluation, we also contribute NSFW-Caps, a new benchmark of 1,000 highly-aligned pairs for testing…
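The proximity-aware idea in the abstract (redirect each unsafe concept to its semantically closest safe alternative instead of a single fixed target) can be illustrated with a nearest-neighbor lookup over embeddings. The following is only a minimal sketch under stated assumptions: the function names, the use of cosine similarity as the closeness measure, and the toy 3-dimensional vectors are all hypothetical and are not taken from the paper, which may select targets differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_safe_target(unsafe_embedding, safe_embeddings):
    """Return (index, similarity) of the safe embedding closest to the unsafe one.

    Assumption: "semantically closest" is approximated here by the highest
    cosine similarity in embedding space.
    """
    if not safe_embeddings:
        raise ValueError("at least one safe candidate is required")
    best_index, best_sim = -1, -math.inf
    for i, safe in enumerate(safe_embeddings):
        sim = cosine_similarity(unsafe_embedding, safe)
        if sim > best_sim:
            best_index, best_sim = i, sim
    return best_index, best_sim


# Toy embeddings (hypothetical values, for illustration only).
unsafe = [0.9, 0.1, 0.4]
safe_candidates = [
    [0.0, 1.0, 0.0],   # unrelated safe concept
    [0.8, 0.0, 0.6],   # semantically nearby safe concept
    [-1.0, 0.0, 0.0],  # opposite direction
]

index, similarity = nearest_safe_target(unsafe, safe_candidates)
print(f"closest safe candidate: {index} (cosine similarity {similarity:.3f})")
```

In this toy setup the second candidate is selected, so a fine-tuning objective built on it would pull the unsafe embedding toward a nearby point rather than toward an arbitrary predefined target. That is the intuition behind minimizing representational change.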



Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling